If you need to support non-ASCII characters and non-English text, you also need to consider double-width and zero-width characters (such as combining marks), and the plethora of spacing characters found in Unicode, some of which, such as the non-breaking space, must not be used as line-break points.
Here, I'd use perl and the Unicode::LineBreak module, which implements Unicode's line-breaking algorithm (or Text::LineFold, which comes with it and is geared towards email messages; or Text::Wrap, shipped with perl, which supports TABs and combining marks but not double-width characters; or Text::WrapI18N).
#! /bin/sh -
perl -C -MUnicode::LineBreak -ne '
BEGIN {$lb = Unicode::LineBreak->new(ColMax => 72, Format => "TRIM")}
print for $lb->break($_)' < "$1" > "$2"
Example:
$ cat file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.
Лорем ипсум долор сит амет, консектетур адиписцинг элит, сед до еусимо темпор инцидидант ют лаборе эт долоре магна аликуа.
Après mure reflexion et plusieurs heures de tergiversation, elle dit : « Ce n'est pas pour moi »
72|
$ perl -C -Mcharnames='()' -pe 's/\P{ascii}/"\\N{".charnames::viacode(ord$&) ."}"/ge' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.
\N{CYRILLIC CAPITAL LETTER EL}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EM} \N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER PE}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER EM} \N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER} \N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER TE}, \N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER ER} \N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER PE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER TSE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER GHE} \N{CYRILLIC SMALL LETTER E}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER TE}, \N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER DE} \N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER O} \N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER O} \N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER PE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER} \N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER TSE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER YU}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER IE} \N{CYRILLIC SMALL LETTER E}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER IE} \N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER GHE}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER A} \N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER A}.
Apre\N{COMBINING GRAVE ACCENT}s mure reflexion et plusieurs heures de tergiversation, elle dit\N{NO-BREAK SPACE}: \N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}\N{NO-BREAK SPACE}Ce n'est pas pour moi\N{NO-BREAK SPACE}\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}
72|
$ perl -C -MUnicode::LineBreak -ne 'BEGIN{$lb = Unicode::LineBreak->new(ColMax => 72, Format => "TRIM")} print for $lb->break($_)' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.
Лорем ипсум долор сит амет, консектетур адиписцинг элит, сед до еусимо
темпор инцидидант ют лаборе эт долоре магна аликуа.
Après mure reflexion et plusieurs heures de tergiversation, elle dit :
« Ce n'est pas pour moi »
72|
If the input contains TAB characters, you may want to feed it to expand first (specifying where the tab stops are expected to be if not 8 columns apart). Beware that not all expand implementations support zero-width or double-width characters, though IIRC BSD implementations generally do. See also col -b if the input contains backspace characters (as sometimes used for bold or underline).
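A minimal sketch of that preprocessing step, assuming a POSIX expand with its -t option. detab is a hypothetical helper name and the tab width of 4 in the usage comment is only an illustration.

```sh
#! /bin/sh -
# detab [WIDTH]: hypothetical helper that expands TABs found on stdin,
# with tab stops every WIDTH columns (8, the usual default, if omitted).
detab() {
  expand -t "${1:-8}"
}

# Possible use in the script above. The redirections are applied to the
# whole command group rather than to each side of the pipe:
#
#   { detab 4 | perl -C -MUnicode::LineBreak -ne '...'; } < "$1" > "$2"
```

Whether multi-byte, zero-width or double-width characters are counted correctly still depends on the expand implementation in use.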
Also note the use of redirection in < "$1" > "$2". The advantages (over using perl -ne '...' -- "$1" > "$2") are:
- If $1 can't be opened for reading, then the command is aborted even before $2 is opened for writing (and possibly truncated).
- As noted at Security implications of running perl -ne '...' *, perl -n can't be used for arbitrary file paths as it handles some names specially.
Both also apply to awk, though the effect of the second is less dramatic there (where a file called - or OFS=\n would be a problem but probably could not lead to arbitrary command execution).
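The first point can be illustrated with a small sketch. filter_file is a hypothetical helper, not something from the script above: it runs any filter command with its input and output redirected by the shell, so the file names are never interpreted by perl or awk, and the output file is only opened once the input file has been opened successfully.

```sh
#! /bin/sh -
# filter_file SRC DST CMD [ARGS...]: hypothetical helper running CMD with
# stdin coming from SRC and stdout going to DST.
# The shell performs redirections left to right, so if SRC cannot be
# opened, DST is neither created nor truncated and CMD is not run.
filter_file() {
  src=$1 dst=$2
  shift 2
  "$@" < "$src" > "$dst"
}

# Possible use with the one-liner from above:
#
#   filter_file "$1" "$2" perl -C -MUnicode::LineBreak -ne '...'
```

With a missing input file, the helper returns a non-zero exit status and whatever was already in the output file is left untouched.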
$1 > $2 - no, use "$1" > "$2" instead as I mentioned. I fixed it in your question. Please see mywiki.wooledge.org/Quotes and run your shell scripts through shellcheck.net until you're more familiar with shell programming, as it can catch some issues like that.