In addition to @muru's answer pointing out that find doesn't support \s (so you need to use [:space:] instead):
- Instead of A-Za-z, you can use the POSIX character class [:alpha:], and instead of A-Za-z0-9 you can use [:alnum:]. And [:digit:] instead of 0-9. So your find command using extended regexps could be written as:
find . -type f -regextype posix-extended ! -regex '^\.\/[[:alpha:]’'\'']+[[:digit:]]{2,4}[-[:alnum:],&_#\(\)[:space:]\.\’’'\'']{0,99}[\.][[:alnum:]]{3,4}'
This is shorter (except for [[:digit:]]) and, IMO, easier to read.
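To show that the range and POSIX-class spellings select the same names, here is a minimal sketch using grep -E. The pattern is deliberately simplified and the sample names are made up; neither comes from your question.

```shell
# Hypothetical sample names, one per line (not taken from the question).
names='Smith2019.pdf
smith 2019.pdf
Jones07_scan.jpeg
x1.txt'

# Simplified pattern written with explicit ranges: letters, 2-4 digits,
# optional filler, a dot, and a 3-4 character extension.
ranges=$(printf '%s\n' "$names" |
  grep -E '^[A-Za-z]+[0-9]{2,4}[A-Za-z0-9_]*\.[A-Za-z0-9]{3,4}$')

# The same pattern written with POSIX character classes.
classes=$(printf '%s\n' "$names" |
  grep -E '^[[:alpha:]]+[[:digit:]]{2,4}[[:alnum:]_]*\.[[:alnum:]]{3,4}$')

printf '%s\n' "$classes"
```

Both variables should end up holding Smith2019.pdf and Jones07_scan.jpeg for this ASCII-only input. With non-ASCII names the two spellings can differ, because the classes follow the current locale.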
- If you want to use perl regular expressions, then consider using perl to do the matching.
It's been a very long time since I last used a Mac so I don't know if OSX still comes with perl pre-installed. You may have to install it, maybe with Homebrew. Anyway, as with all answers on this site, this answer is written not only for you but also for everyone who reads it afterwards.
For example:
find . -type f -print0 |
perl -0ne 'chomp; print "$_\n" unless m=^\./[[:alpha:]’'\'']+\d{2,4}[-[:alnum:],&_#\(\)\s\.’’'\'']{0,99}\.[[:alnum:]]{3,4}='
The chomp function is used to remove the line-ending character(s) at the end of each filename - NUL in this case, because of the -0 option. The default line-ending on Unix (including OSX) is \n but it's not uncommon to use \r\n (Windows) or just \r (ancient MacOS, before OSX).
Note that find's -print0 predicate outputs each filename separated by NUL characters, and perl's -0 option reads NUL-separated input, but the one-liner is written to output a newline between each filename. This is kind of broken (i.e. the output won't be correct if any of the file or directory names contain a newline), and is only done here so that the output is human-readable. If you intend to pipe the output into another program, you'd be better off using NUL separators all the way through the pipeline. For example:
find . -type f -print0 |
perl -0lne "print unless m=^\./[[:alpha:]’']+\d{2,4}[-[:alnum:],&_#\(\)\s\.’’']{0,99}\.[[:alnum:]]{3,4}="
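To illustrate why newline-separated output is unreliable, here's a rough sketch that creates a scratch directory with one awkwardly-named file and counts the records both ways. The directory and filenames are invented for the demo.

```shell
# Scratch directory holding one ordinary file and one file whose
# name contains a newline (both names are made up for this demo).
tmpdir=$(mktemp -d)
touch "$tmpdir/plain.txt" "$tmpdir/two
lines.txt"

# Newline-separated output: the odd filename is counted as two records.
by_newline=$(find "$tmpdir" -type f -print | wc -l)

# NUL-separated output: count the NUL bytes, one per filename.
by_nul=$(find "$tmpdir" -type f -print0 | tr -cd '\000' | wc -c)

echo "newline-separated records: $by_newline"
echo "NUL-separated records: $by_nul"

rm -rf "$tmpdir"
```

There are only two files, but the newline-separated count comes out as three, while the NUL-separated count is correct.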
BTW, piping find ... -print0's output into GNU grep -zvP and a perl-ish regex would also work. GNU grep's -P option uses Perl Compatible Regular Expressions or PCRE. Most (all?) other versions of grep, including the FreeBSD grep used in OSX, do not support -P. The output would be NUL-separated so if you want to make it more human-readable, pipe into tr '\000' '\n' after grep.
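A hedged sketch of that grep variant. It assumes GNU grep built with PCRE support (it won't work with the stock macOS grep), it uses a simplified ASCII-only version of the regex, and the test filenames are made up.

```shell
# Requires GNU grep built with PCRE support (-P); BSD/macOS grep lacks it.
tmpdir=$(mktemp -d)
touch "$tmpdir/Smith2019 report.pdf" "$tmpdir/notes.txt"

# -z: NUL-separated records, -v: invert the match, -P: Perl-style regex.
# tr converts the NUL separators to newlines for display only.
non_compliant=$(
  cd "$tmpdir" &&
  find . -type f -print0 |
  grep -zvP '^\./[[:alpha:]]+\d{2,4}[-\w,&#()\s.]{0,99}\.[[:alnum:]]{3,4}' |
  tr '\000' '\n'
)

printf '%s\n' "$non_compliant"
rm -rf "$tmpdir"
```

Only ./notes.txt should be listed, because it has no digits after the leading letters. Drop the tr stage if the output is going to another program that reads NUL-separated input, such as xargs -0.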
- Using perl to do the regex matching also allows you to use \w instead of [:alpha:] and \w\d instead of [:alnum:], which is shorter and IMO even easier to read and understand. Unlike find, in perlre you can use \w, \d, \s etc inside a bracketed expression.
Note that the exact meanings of \w and \d can vary depending on the current locale and unicode settings.
See perlre and search for "Which character set modifier is in effect?" and perlunicode for details. perlrecharclass is also worth reading.
find . -type f -print0 |
perl -0ne 'chomp; print "$_\n" unless m=^\./[\w’'\'']+\d{2,4}[-\w\d,&_#\(\)\s\.’’'\'']{0,99}\.[\w\d]{3,4}='
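Also worth noting: \w matches digits and underscores (_) as well as letters (in ASCII terms it's [A-Za-z0-9_]), so the \d in [\w\d] is redundant. You already had _ in the regex for fields 2 & 3, but the above uses \w for fields 1 & 4 too, so those fields now accept underscores as well, and field 1 also accepts digits. This may not be what you want.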
- For a fancier version, it's easier to use a stand-alone script (but the following could be written as a one-liner - without all the comments and newlines - if you're a masochist who enjoys editing extremely long lines in your shell):
Note: I'm using ^.*/ instead of ^\./ in this version so that it matches any path, not just ./.
#!/usr/bin/perl -0n
BEGIN {
    # First, add the three regexes to an array (so it's easier to
    # understand and edit them independently of each other).
    #
    # The regexes are pre-compiled using perl's qr() quoting
    # operator (see `perldoc -f qr`)
    my @fields;

    # Using posix character classes only
    # push @fields, qr(^.*/[[:alpha:]’']+[[:digit:]]{2,4});
    # push @fields, qr=[-[:alnum:],&_#\(\)[:space:]\.’’']{0,99}=;
    # push @fields, qr(\.[[:alnum:]]{3,4});

    # Using perl's \w, \d, and \s
    # note that \w also matches underscores.
    push @fields, qr(^.*/[\w’']+\d{2,4});
    push @fields, qr=[-\w\d,&#\(\)\s\.’’']{0,99}=;
    push @fields, qr(\.[\w\d]{3,4});

    # then merge them into one scalar variable
    $re = join("", @fields);
};
chomp;
print "$_\n" unless /$re/;
Save that somewhere in your $PATH (e.g. ~/bin or /usr/local/bin) as, e.g., match-non-compliant-filenames.pl and make it executable with chmod +x. Then you could run the find command like so:
find . -type f -print0 | match-non-compliant-filenames.pl
If you're going to use a stand-alone perl script to do the filename matching, then it's worth noting that perl has a File::Find library module which performs the same basic function as find. It can be used to create a completely stand-alone script that doesn't need to have input piped in from find.
File::Find doesn't have all of find's predicates built-in, but you can use any perl code to determine which filenames match and which don't, and what to do with them (e.g. print the filenames or delete or rename them or add them to an array for later use in the script).
And perl's Getopt::Std can be used to process short, single-letter command-line options (alternatively, use Getopt::Long to process both short and long options).
All three of these modules are part of perl's core library and are included with perl.
For an example of what's possible:
#!/usr/bin/perl
use v5.14;
use File::Find;
use Getopt::Std;
my @fields;
my $re;
# default to using NUL as the output filename separator
my $sep = "\0";
# -l option changes output separator to newline
# -d enables printing some debugging info
getopts('ld', \my %opts);
$sep = "\n" if $opts{l};
printf "sep = 0x%02x\n", ord($sep) if $opts{d};
# This script will search in all directories specified on the
# command line. If none are specified, default to searching in
# the current directory (same as what find does)
push @ARGV, '.' unless @ARGV;
print "ARGV = ", join(", ", @ARGV), "\n" if $opts{d};
# First, add the three regexes to an array called @fields. (so
# it's easier to understand and edit them independently)
#
# The regexes are pre-compiled using perl's qr() quoting operator
# (see `perldoc -f qr`)
# Note that in this script, there's no need to prefix the
# first field with ^\./ or ^.*/ because the wanted() function called
# by File::Find is written to only match against the file's basename,
# not the full pathname.
# Using posix character classes only
# push @fields, qr([[:alpha:]’']+[[:digit:]]{2,4});
# push @fields, qr=[-[:alnum:],&_#\(\)[:space:]\.’’']{0,99}=;
# push @fields, qr(\.[[:alnum:]]{3,4});
# Using perl's \w, \d, and \s
# note that \w also matches underscores.
push @fields, qr([\w’']+\d{2,4});
push @fields, qr=[-\w\d,&#\(\)\s\.’’']{0,99}=;
push @fields, qr(\.[\w\d]{3,4});
# then merge them into one scalar variable
$re = join("", @fields);
print "re = '$re'\n" if $opts{d};
# find the non-compliant files with a depth-first search
File::Find::finddepth({wanted => \&wanted}, @ARGV);
sub wanted {
    # skip anything that isn't a regular file
    return unless -f;
    print "$File::Find::name$sep" unless m/$re/;
};
Save it somewhere in your $PATH as, e.g., find-non-compliant-filenames.pl, and make it executable with chmod +x find-non-compliant-filenames.pl.
Sample output (using -d which you wouldn't normally use):
$ find-non-compliant-filenames.pl -l -d testfldr/
sep = 0x0a
ARGV = testfldr/
re = '(?^u:[\w’']+\d{2,4})(?^u:[-\w\d,&_#\(\)\s\.’’']{0,99})(?^u:\.[\w\d]{3,4})'
- Whatever you decide to use (find alone, find + perl, perl alone, find + grep, find + awk, or something else entirely), if you use the same command or very similar commands frequently, it's worth the effort to write your own custom tools to perform your custom tasks. That's a large part of the point of shell (and perl and awk etc) scripting. Unix becomes far more useful & powerful when you realise that you can and should be a tool maker as well as a tool user.