4

Can someone explain the issue I've described below, and perhaps suggest a resolution?

I'm trying to process a set of filenames using find. My platform is macOS (Ventura 13.7.7), and the shell is zsh ver 5.9. Potentially complicating this is the fact that the version of find I'm using is find (GNU findutils) 4.10.0 sourced by MacPorts; i.e. not the ancient version of find supplied by Apple.

I've "evolved" a file-naming convention based on the following three fields, plus the extension:

  1. Subject name & sequential number, which is required
  2. An optional "free text" description of the file
  3. A "time of creation" field - which is also optional, and for now is only a year
  4. The filename extension; e.g. .txt, .jpg, .mp4, etc.

The collection in my actual folder has thousands of files, and has been assembled over several years. As the collection has grown in size, I concluded that some sort of consistency check on the file names was needed. Since find has a -regex option, and is rather efficient, I'm trying to use it to identify those files which do not conform to the file-naming convention outlined above. However, I've found (so far) that there are apparently differences between the documentation, and how the actual code performs:

I've created a folder (testfldr), and some "test files" for purposes of this question:

% tree ./
./
├── DeborahKerr01-Black Narcissus-1947.jpg
├── Eisenhower01-byIrving Penn-1952.jpg
├── MaureenO'Sullivan03-by George Hurrell-1933.jpg
├── MoviePoster01-Lost Horizon-1937.jpg
└── Truman02.jpg

1 directory, 5 files

Following are the results using find with two supposedly equivalent specifications for white space:

% find . -regextype posix-extended ! -regex '^\.\/[A-Za-z’'\'']+[0-9]{2,4}[-A-Za-z0-9,&_#\(\)\s\.\’’'\'']{0,99}[\.][A-Za-z0-9]{3,4}' 
#                                                  ______________________  ___________________________________  ___________________ 
                                                       Field 1                     Fields 2 and 3                   Field 4
.
./MaureenO'Sullivan03-by George Hurrell-1933.jpg
./Eisenhower01-byIrving Penn-1952.jpg
./MoviePoster01-Lost Horizon-1937.jpg
./DeborahKerr01-Black Narcissus-1947.jpg 

% find . -regextype posix-extended ! -regex '^\.\/[A-Za-z’'\'']+[0-9]{2,4}[-A-Za-z0-9,&_#\(\)[:space:]\.\’’'\'']{0,99}[\.][A-Za-z0-9]{3,4}'
.

Note that the first run of find pegged 4 of the 5 files as "non-compliant", whereas the second run of find found no non-compliant files. As I've defined the fields, the second run is correct, while the first run is in error. The difference is the first run uses \s for whitespace, while the second run uses [:space:].

According to what I've read, and seen in an online regex tool I've used, \s is a "correct" specifier for whitespace (and equivalent to [:space:]) in the current POSIX implementation of regex.

What am I missing? Is this a bug?

6
  • regex101 doesn't even list POSIX regexes in the menu, as far as I can see, and that boost page says "The POSIX standard defines no escape sequences for POSIX-Extended regular expressions" Commented Oct 21 at 6:01
  • @ilkkachu: two things, FWIW - 1. regex101 says this re POSIX; 2. if you note, the boost library docs, they declare that \s is equivalent to [:space:]. Commented 2 days ago
  • 3
    I don't know what they have been smoking, but POSIX ERE are definitely not PCRE. (The latter should be compatible in that valid standard EREs should mean the same if interpreted as PCRE, as far as I've seen, but PCRE has a lot of extensions.) And, ok, I didn't quote it, but see the part after the bullets too: "The POSIX standard defines no escape sequences for POSIX-Extended regular expressions, except that: [three bullets, none of which are e.g. \s] However, that's rather restrictive, so the following standard-compatible extensions are also supported by Boost.Regex:" Commented 2 days ago
  • 1
    @ilkkachu "valid standard EREs should mean the same if interpreted as PCRE". Not exactly. POSIX ERE and BRE use leftmost-longest semantics, whereas PCRE uses leftmost, most-repetition semantics. It is more correct to say that they share a common subset. Commented yesterday
  • @DannyNiu, interesting. I can see e.g. echo aaaaa | sed -Ee 's/(aa|aaaaa)/x/' prints x, i.e. it replaces all the a's, using the second branch that gives the longest match; while echo aaaaa | perl -pe 's/(aa|aaaaa)/x/' prints xaaa, preferring the leftmost branch, even though it results in a shorter match. grep -oE vs. grep -oP (on GNU) give a similar result. That's a bit of an odd difference, IMO. I'm not exactly sure what you mean by "most-repetition", though? Can you elaborate on that? Commented yesterday

3 Answers

5

If you have zsh, you don't need find (and BTW, FreeBSD's find is not old; it's just a different implementation from the GNU one, and it enables extended regexps with -E, as grep and sed do).

find . -regextype posix-extended ! -regex \
   '^\.\/[A-Za-z’'\'']+[0-9]{2,4}[-A-Za-z0-9,&_#\(\)\s\.\’’'\'']{0,99}[\.][A-Za-z0-9]{3,4}' 

Which should probably have been:

find . -regextype posix-extended -mindepth 1 -maxdepth 1 ! -regex \
   '^\./[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz’'\'']+[0123456789]{2,4}[-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,&_#()[:space:].’'\'']{0,99}\.[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789]{3,4}' \
   -printf '%P\n'

That is:

  • Don't use ranges such as A-Z a-z 0-9 which outside of the C/POSIX locale usually match on a wide range of characters, not just the ASCII ones you'd expect.
  • Don't use \ inside bracket expressions, or before things that are not regexp operators such as \/.
  • Restrict the depth to 1.
  • Avoid printing depth 0 component (. here).

In zsh, that could just be written:

set -o extendedglob # best in ~/.zshrc
print -rC1 -- ^[A-Za-z’\']##[0-9](#c2,4)[-A-Za-z0-9[:space:]"&_#()’'"](#c0,99).[A-Za-z0-9](#c3,4)(ND)

Where

  • ^ is the equivalent of find's !
  • (#cx,y) the equivalent of ERE {x,y}.

With the additional benefits that:

  • a-z/A-Z/0-9 there do what you generally expect.
  • you get a sorted list
  • it works even if there are bytes that can't be decoded into characters.
5
  • Just a few thoughts/comments : 1. zsh is a good suggestion IMHO... I had no idea! 2. I generally use the "Unicode" alphabet in my filenames (e.g. é ç); does zsh a-z include those characters? 3. However... from strictly a file system perspective, I wonder if it might be "better" to convert to ASCII? Oh - please don't feel compelled to modify your answer; this comment is for my own benefit as much as anything else :) Commented 2 days ago
  • 1
    @Seamus, zsh's [a-z] only matches the 26 ASCII lower case letters. Elsewhere, YMMV. Some will include ç or 🆕, some will include multi-character collating elements (which would throw off your count) some won't. Most won't include ZzŹźŻżŽžᶻᷦẐẑẒẓẔẕℤℨ⒵ⓏⓩZz𝐙𝐳𝑍𝑧𝒁𝒛𝒵𝓏𝓩𝔃𝔷𝕫𝖅𝖟𝖹𝗓𝗭𝘇𝘡𝘻𝙕𝙯𝚉𝚣🄩🅉🅩🆉 as they sort after z. Commented 2 days ago
  • 1
    For alphanumeric characters (of any script, not just latin), you can use [[:alnum:]]. For latin alphabet letters only, including ź, æ and ç, best is probably to use perl and its \p{latin} match by character property. Not available in PCRE, so not in zsh's [[ -pcre-match ]] unfortunately. Commented 2 days ago
  • 1
    Using [[=a=][=b=]...[=z=]] in some regexp and glob engines (not zsh globs, but will work with its [[ =~ ]] if rematchpcre is not enabled and the system's ERE support it), could be another approach. Commented 2 days ago
  • 1
    That would miss some ligatures though such as ᴁ used in French or the German ß, and include symbols which may not be classified as alnum such as degree celsius symbols. Commented 2 days ago
4

In addition to @muru's answer pointing out that find doesn't support \s (so you need to use [:space:] instead):

  1. Instead of A-Za-z, you can use the posix character class [:alpha:], and instead of A-Za-z0-9 you can use [:alnum:]. And [:digit:] instead of 0-9. So your find command using extended regexps could be written as:
find . -type f -regextype posix-extended ! -regex '^\.\/[[:alpha:]’'\'']+[[:digit:]]{2,4}[-[:alnum:],&_#\(\)[:space:]\.\’’'\'']{0,99}[\.][[:alnum:]]{3,4}'

This is shorter (except for [[:digit:]]) and, IMO, easier to read.
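
To try this without touching the real collection, here is a minimal sketch that rebuilds the question's test folder in a temporary directory and lists the names that do not match. It assumes GNU find (for -regextype and -printf). The file no-sequence-number.jpg is a made-up non-compliant name added for illustration, and the typographic apostrophe is left out of the bracket expressions to keep the pattern short.

```shell
#!/bin/sh
# Sketch only: assumes GNU find. Rebuild the question's test files in a
# scratch directory, plus one hypothetical name that breaks the convention.
testdir=$(mktemp -d)
touch "$testdir/DeborahKerr01-Black Narcissus-1947.jpg" \
      "$testdir/Eisenhower01-byIrving Penn-1952.jpg" \
      "$testdir/MaureenO'Sullivan03-by George Hurrell-1933.jpg" \
      "$testdir/MoviePoster01-Lost Horizon-1937.jpg" \
      "$testdir/Truman02.jpg" \
      "$testdir/no-sequence-number.jpg"

# Field 1: letters or apostrophes, then 2-4 digits.
# Fields 2 and 3: up to 99 characters of free text (may be empty).
# Field 4: a dot and a 3-4 character extension.
# find's -regex must match the whole path, hence the leading '.*/'.
pattern='.*/[[:alpha:]'\'']+[[:digit:]]{2,4}[-[:alnum:],&_#()[:space:].'\'']{0,99}\.[[:alnum:]]{3,4}'

# %P prints each path relative to the starting directory.
non_compliant=$(find "$testdir" -type f -regextype posix-extended \
                  ! -regex "$pattern" -printf '%P\n')

printf '%s\n' "$non_compliant"   # prints: no-sequence-number.jpg

rm -rf "$testdir"
```

Because none of the bracket expressions can match a slash, the pattern can only ever succeed against the last path component, so the random name of the temporary directory does not affect the result.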


  2. If you want to use perl regular expressions, then consider using perl to do the matching.

It's been a very long time since I last used a Mac so I don't know if OSX still comes with perl pre-installed. You may have to install it, maybe with Homebrew. Anyway, as with all answers on this site, this answer is written not only for you but also for everyone who reads it afterwards.

For example:

find . -type f -print0 |
  perl -0ne 'chomp; print "$_\n" unless m=^\./[[:alpha:]’'\'']+\d{2,4}[-[:alnum:],&_#\(\)\s\.’'\'']{0,99}\.[[:alnum:]]{3,4}\z='

The chomp function is used to remove the line-ending character(s) at the end of each filename - NUL in this case, because of the -0 option. The default line-ending on Unix (including OSX) is \n but it's not uncommon to use \r\n (Windows) or just \r (ancient MacOS, before OSX).

Note that find's -print0 predicate outputs each filename separated by NUL characters, and perl's -0 option reads NUL-separated input, but the one-liner is written to output a newline between each filename. This is kind of broken (i.e. the output won't be correct if any of the file or directory names contain a newline), and is only done here so that the output is human-readable. If you intend to pipe the output into another program, you'd be better off using NUL separators all the way through the pipeline. For example:

find . -type f -print0 |
  perl -0lne 'print unless m=^\./[[:alpha:]’'\'']+\d{2,4}[-[:alnum:],&_#\(\)\s\.’'\'']{0,99}\.[[:alnum:]]{3,4}\z='
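
As a small illustration of why the separator matters, the sketch below creates two files in a scratch directory, one of which has a newline in its name, and counts the records seen with each separator. The directory and file names are made up for the demo; -print0 and xargs -0 are assumed to be available (they are in both GNU and BSD tools).

```shell
#!/bin/sh
# Sketch only: two files, one with an embedded newline in its name.
demo_dir=$(mktemp -d)
nl_name=$(printf 'two\nlines.jpg')
touch "$demo_dir/plain.jpg" "$demo_dir/$nl_name"

# Newline-separated output: the second name is split across two lines,
# so a line-oriented consumer sees three records for two files.
newline_records=$(( $(find "$demo_dir" -type f -print | wc -l) ))

# NUL-separated output: exactly one NUL terminator per file.
nul_records=$(( $(find "$demo_dir" -type f -print0 | tr -cd '\000' | wc -c) ))

printf 'newline-separated records: %s\n' "$newline_records"   # 3
printf 'NUL-separated records:     %s\n' "$nul_records"       # 2

# A NUL-aware consumer such as xargs -0 receives each name intact.
find "$demo_dir" -type f -print0 |
  xargs -0 sh -c 'printf "%d names received\n" "$#"' sh       # 2 names received

rm -rf "$demo_dir"
```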

BTW, piping find ... -print0's output into GNU grep -zvP and a perl-ish regex would also work. GNU grep's -P option uses Perl Compatible Regular Expressions or PCRE. Most (all?) other versions of grep, including the FreeBSD grep used in OSX, do not support -P. The output would be NUL-separated so if you want to make it more human-readable, pipe into tr '\000' '\n' after grep.


  3. Using perl to do the regex matching also allows you to use \w instead of [:alpha:] and \w\d instead of [:alnum:], which is shorter and IMO even easier to read and understand. Unlike find, in perlre you can use \w, \d, \s etc inside a bracketed expression.

Note that the exact meanings of \w and \d can vary depending on the current locale and unicode settings.

See perlre and search for "Which character set modifier is in effect?" and perlunicode for details. perlrecharclass is also worth reading.

find . -type f -print0 |
  perl -0ne 'chomp; print "$_\n" unless m=^\./[\w’'\'']+\d{2,4}[-\w\d,&_#\(\)\s\.’'\'']{0,99}\.[\w\d]{3,4}\z='

Also worth noting: \w matches digits and underscores (_) as well as letters (e.g. ASCII [A-Za-z]), so it is closer to [[:alnum:]_] than to [[:alpha:]], and the \d in [\w\d] is redundant. You already had _ in the regex for fields 2 & 3, but the above uses \w for fields 1 & 4 too. This may not be what you want.


  4. For a fancier version, it's easier to use a stand-alone script (but the following could be written as a one-liner - without all the comments and newlines - if you're a masochist who enjoys editing extremely long lines in your shell):

Note: I'm using ^.*/ instead of ^\./ in this version so that it matches any path, not just ./.

#!/usr/bin/perl -0n

BEGIN {
  # First, add the three regexes to an array. (so it's easier to
  # understand and edit them independently of each other)
  #
  # The regexes are pre-compiled using perl's qr() quoting
  # operator (see `perldoc -f qr`)
  my @fields;

  # Using posix character classes only
  # push @fields, qr(^.*/[[:alpha:]’']+[[:digit:]]{2,4});
  # push @fields, qr=[-[:alnum:],&_#\(\)[:space:]\.’’']{0,99}=;
  # push @fields, qr(\.[[:alnum:]]{3,4});

  # Using perl's \w, \d, and \s
  # note that \w also matches underscores.
  push @fields, qr(^.*/[\w’']+\d{2,4});
  push @fields, qr=[-\w\d,&#\(\)\s\.’’']{0,99}=;
  push @fields, qr(\.[\w\d]{3,4});

  # then merge them into one scalar variable
  $re = join("", @fields);
};

chomp;
print "$_\n" unless /$re\z/;

Save that somewhere in your $PATH (e.g. ~/bin or /usr/local/bin) as, e.g., match-non-compliant-filenames.pl and make it executable with chmod +x. Then you could run the find command like so:

find . -type f -print0 | match-non-compliant-filenames.pl

  5. If you're going to use a stand-alone perl script to do the filename matching, then it's worth noting that perl has a File::Find library module which performs the same basic function as find. It can be used to create a completely stand-alone script that doesn't need to have input piped in from find.

    File::Find doesn't have all of find's predicates built-in, but you can use any perl code to determine which filenames match and which don't, and what to do with them (e.g. print the filenames or delete or rename them or add them to an array for later use in the script).

    And perl's Getopt::Std can be used to process short, single-letter command-line options (alternatively, use Getopt::Long to process both short and long options).

    All three of these modules are part of perl's core library and are included with perl.

For an example of what's possible:

#!/usr/bin/perl

use v5.14;
use File::Find;
use Getopt::Std;

my @fields;
my $re;

# default to using NUL as the output filename separator
my $sep = "\0";

# -l option changes output separator to newline
# -d enables printing some debugging info
getopts('ld', \my %opts);

$sep = "\n" if $opts{l};

printf "sep = 0x%02x\n", ord($sep) if $opts{d};

# This script will search in all directories specified on the
# command line. If none are specified, default to searching in
# the current directory (same as what find does)
push @ARGV, '.' unless @ARGV;
print "ARGV = ", join(", ", @ARGV), "\n" if $opts{d};

# First, add the three regexes to an array called @fields. (so
# it's easier to understand and edit them independently)
#
# The regexes are pre-compiled using perl's qr() quoting operator
# (see `perldoc -f qr`)

# Note that in this script, there's no need to prefix the
# first field with ^\./ or ^.*/ because the wanted() function called
# by File::Find is written to only match against the file's basename,
# not the full pathname.

# Using posix character classes only
# push @fields, qr([[:alpha:]’']+[[:digit:]]{2,4});
# push @fields, qr=[-[:alnum:],&_#\(\)[:space:]\.’’']{0,99}=;
# push @fields, qr(\.[[:alnum:]]{3,4});

# Using perl's \w, \d, and \s
# note that \w also matches underscores.
push @fields, qr([\w’']+\d{2,4});
push @fields, qr=[-\w\d,&#\(\)\s\.’’']{0,99}=;
push @fields, qr(\.[\w\d]{3,4});

# then merge them into one scalar variable
$re = join("", @fields);
print "re = '$re'\n" if $opts{d};

# find the non-compliant files with a depth-first search
File::Find::finddepth({wanted => \&wanted}, @ARGV);

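sub wanted {
  # skip anything that isn't a regular file
  return unless -f;

  # anchor the match so the whole basename has to fit the convention
  print "$File::Find::name$sep" unless m/\A$re\z/;
};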

Save in your $PATH as, e.g., find-non-compliant-filenames.pl, make executable with chmod +x find-non-compliant-filenames.pl.

Sample output (using -d which you wouldn't normally use):

$ find-non-compliant-filenames.pl -l -d testfldr/
sep = 0x0a
ARGV = testfldr/
re = '(?^u:[\w’']+\d{2,4})(?^u:[-\w\d,&_#\(\)\s\.’’']{0,99})(?^u:\.[\w\d]{3,4})'

  6. Whatever you decide to use (find alone, find + perl, perl alone, find + grep, find + awk, or something else entirely), if you use the same command or very similar commands frequently, it's worth the effort to write your own custom tools to perform your custom tasks. That's a large part of the point of shell (and perl and awk etc) scripting. Unix becomes far more useful & powerful when you realise that you can and should be a tool maker as well as a tool user.
3
  • 1
    Note that in UTF-8 locales, ’ is a character encoded on 3 bytes: 0xe2 0x80 0x99. By default, perl works at byte level, so [’] would be like [\xe2\x80\x99] and match any of those 3 bytes, whether they're found in the encoding of ’ or any other character. Commented 2 days ago
  • 1
    i'll look at this tomorrow sometime, if i get a chance - it's too late (it's 3am here) to do anything about it now. there's a few things i want to fix in the perl REs anyway - ( and ) don't need to be escaped in a perl bracketed expr. and i somehow doubled the ’. if you have any suggestions in the meantime, feel free to comment. Commented 2 days ago
  • Good answer +1 ... I learned some things. & appreciate yr comments in 6. I actually started this project w/ a simple shell script w/ the idea to develop a tool for managing this "graphics archive". I did not realize how "fragmented" regex usage has become. I used Perl exclusively many years ago, but I'm not likely to "go back" at this stage :) Commented 2 days ago
3

Two things:

  1. \s is not supported by GNU find. There are \w and \W for word and non-word, but no \s. See the documentation. Also see "Why does my regular expression work in X but not in Y?" for why an online tool, or the documentation for some other tool, isn't going to be of help here.
  2. And even though \w is supported, you can't use it inside [...]. The \ is literal inside [...], so [\w] matches either \ or w.
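
A quick way to see the second point for yourself is a sketch like the following, which uses grep -E as a stand-in for any POSIX-style ERE matcher. The helper name count_matches is made up for this demo, and the expected results assume a matcher that follows the POSIX bracket-expression rules, as GNU grep does.

```shell
#!/bin/sh
# count_matches PATTERN LINE
# Prints 1 if LINE matches the ERE PATTERN, otherwise 0.
# (grep -c exits non-zero on zero matches, hence the '|| true'.)
count_matches() {
  printf '%s\n' "$2" | grep -cE -- "$1" || true
}

# Inside [...] the backslash is an ordinary character, so [\s] means
# "a backslash or the letter s", not "whitespace".
bracket_vs_space=$(count_matches '^a[\s]b$' 'a b')        # 0: space is not matched
bracket_vs_s=$(count_matches '^a[\s]b$' 'asb')            # 1: literal s is matched
bracket_vs_backslash=$(count_matches '^a[\s]b$' 'a\b')    # 1: literal backslash is matched

# The POSIX character class is the portable way to say "whitespace".
class_vs_space=$(count_matches '^a[[:space:]]b$' 'a b')   # 1

printf '%s %s %s %s\n' "$bracket_vs_space" "$bracket_vs_s" \
  "$bracket_vs_backslash" "$class_vs_space"               # prints: 0 1 1 1
```

This also explains the question's first run: Truman02.jpg was the only name with no space in it, so it was the only one the [...\s...] version could match.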
4
  • sorry, but the relevance of yr "thing #2" escapes me... i didn't ask about w. Commented 2 days ago
  • 1
    \s works with GNU find on GNU systems at least, just not inside [...] where POSIX requires it to match on backslash or s. Commented 2 days ago
  • @StéphaneChazelas Interesting. Is it an undocumented extension? Commented 2 days ago
  • 2
    @Seamus no, but the answer explained that \w and \W are supported, unlike \s, so it seems relevant to clarify that they cannot be used inside a character class. Maybe not so much for you, but for the next person with the same question. Commented 2 days ago
