I'm standardizing the names of several files at once, so I wrote a set of substitutions for perl-rename:

perl-rename 'y/A-Z/a-z/; s/ã|á|â/a/g; s/é|ê/e/g; s/í/i/g; s/õ|ó/o/g; s/ú/u/g; s/ç/c/g; s/(?<=\d-)*\s/_/g; s/_+/_/g; s/(?<=\d)_/-/' *

It's working exactly as expected:

2024-12-01 certidão de matrícula -> 2024-12-01-certidao_de_matricula

However, I thought it could be simplified a bit, so I came up with this:

perl-rename 'y/A-Z/a-z/; y/ãáâéêíõóúç/aaaeeioouc/; s/(?<=\d-)*\s/_/g; s/_+/_/g; s/(?<=\d)_/-/' *

The results, however, did not meet my expectations. For instance:

2024-12-01 certidão de matrícula -> 2024-12-01-certidaao_de_matraccula

Why doesn't the second command work? It should be doing a direct transliteration of each accented character, yet I can't even make sense of the outcome. Thanks in advance.

  • 2
    Oh well, Perl and Unicode is a sad story. I didn't try it, but in your case running export PERL_UNICODE=AS and prepending use utf8; to the rename argument (use utf8; y/A-Z/a-z/; y...) should work. Whenever you see duplicated characters or unexpected, unrelated replacements involving UTF-8 characters, check whether Unicode is being mishandled. Look at the byte representations to see for yourself why exactly those undesired matches happen: you'll find that the characters share identical single bytes. Commented Dec 24, 2024 at 20:11
  • 1
    Here's an excellent answer on SO explaining some Unicode quirks and the specifics of how Perl interacts with UTF-8: stackoverflow.com/a/6163129/14401160. Commented Dec 24, 2024 at 20:17

1 Answer

That happens because Perl doesn't know it's supposed to treat the filenames (or the script itself) as UTF-8. Instead, it looks at the individual bytes, so you get partial replacements and duplicates. For example, consider this:

% echo 'ä:ö:ä' | perl -pe 'y/ä/x/'
xx:x�:xx

Here, the input is the byte sequence \xc3\xa4:\xc3\xb6:\xc3\xa4, and y/ä/x/ is taken as y/\xc3\xa4/x/ (with the right-hand side x implicitly repeated to match the length of the left-hand side). As a result, the UTF-8 ä turns into xx because both of its bytes get replaced individually, and the UTF-8 ö breaks because only its first byte is replaced.

The s/// command works because it looks for the whole byte sequence to be replaced, so it doesn't matter whether that sequence is interpreted as one character or two bytes.

You'd fix that by adding use utf8 or -Mutf8 to tell Perl the source is in UTF-8, and the -C option to tell it stdin/stdout are in UTF-8.

% echo 'ä:ö:ä' | perl -Mutf8 -C -pe 'y/ä/x/'
x:ö:x

With the rename script, you probably don't get to use -C, so use the PERL_UNICODE environment variable instead, e.g. with export PERL_UNICODE=AS (A for the command-line arguments, S for the standard streams), and include use utf8 in your rename command and hope it works.

  • 1
    Strangely enough, simply exporting the environment variable was sufficient to make it work. Anyway, I'll run some more tests just to make sure. Thanks for the explanation! Commented Dec 24, 2024 at 23:06
