
The following awk code wraps lines at column 72:

awk -v maxLen=72 '
    {
        out = sep = ""
        for ( i=1; i<=NF; i++ ) {
            nextOut = out sep $i
            if ( length(nextOut) > maxLen ) {
                print out
                out = $i
            }
            else {
                out = nextOut
                sep = FS
            }
        }
        print out
    }
' "$1" > "$2"

input.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.

output.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.

The problem, however, is that it only works for ASCII text. If the text uses, e.g., Cyrillic letters, the lines come out much shorter.

input.txt:

Лорем ипсум долор сит амет, консектетур адиписцинг элит, сед до еусимо темпор инцидидант ют лаборе эт долоре магна аликуа.

output.txt:

Лорем ипсум долор сит амет, консектетур
адиписцинг элит, сед до еусимо темпор
инцидидант ют лаборе эт долоре магна
аликуа.

If I understand correctly, this is because awk counts bytes, not characters. But how can this be fixed?

Technical note: I use the awk supplied with macOS.

  • Aside: Regarding $1 > $2 - no, use "$1" > "$2" instead as I mentioned. I fixed it in your question. Please see mywiki.wooledge.org/Quotes and run your shell scripts through shellcheck.net until you're more familiar with shell programming as it can catch some issues like that. Commented Sep 2, 2024 at 13:19
  • FYI POSIX requires awk to work on characters, not bytes, so if the awk you're using doesn't do that then it's not POSIX compliant. The default (BSD) awk on macOS is notoriously buggy, though (e.g. see unix.stackexchange.com/q/356234/133219 and unix.stackexchange.com/a/588743/133219), so you should avoid it anyway and install GNU awk. Commented Sep 2, 2024 at 13:33

2 Answers

As you say, your version of awk seems to count bytes, not characters. To fix this, use a character-aware implementation such as GNU Awk or The One True Awk (as updated for the second edition of The AWK Programming Language).

GNU Awk produces

Лорем ипсум долор сит амет, консектетур адиписцинг элит, сед до еусимо
темпор инцидидант ют лаборе эт долоре магна аликуа.

with your example input in a UTF-8 locale.
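
A quick way to tell which kind of awk you have is to measure a string whose byte count and character count differ. This is only a minimal sketch; the helper name awk_counts is made up for illustration, and the awk command to test is passed as an optional argument:

```shell
#!/bin/sh -
# "Лорем" is 5 characters but 10 bytes in UTF-8 (each letter takes 2 bytes).
# awk_counts is a hypothetical helper; pass the awk command to test as $1.
awk_counts() {
    printf '%s\n' 'Лорем' | "${1:-awk}" '{ print length($0) }'
}

awk_counts          # the default awk
# awk_counts gawk   # another implementation, if installed
```

In a UTF-8 locale, a character-aware awk should print 5 here and a byte-counting one 10.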

On macOS, both of these implementations can be installed using Homebrew, albeit one at a time (they conflict with each other):

brew install gawk

installs GNU Awk, whereas

brew install awk

installs The One True Awk.
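
If installing another awk is not an option, one possible workaround is to keep the byte-counting awk and count the characters yourself. The sketch below is not part of the answer above: it assumes the input is valid UTF-8, forces LC_ALL=C so that length() and substr() work on bytes, and counts only the bytes that are not UTF-8 continuation bytes (0x80 to 0xBF). The names wrap_utf8 and ulen are made up for this example, and like the original script it counts code points, not display columns.

```shell
#!/bin/sh -
# wrap_utf8 [maxLen]: wrap stdin at maxLen UTF-8 characters (default 72).
# Hypothetical helper; assumes valid UTF-8 input.
wrap_utf8() {
    LC_ALL=C awk -v maxLen="${1:-72}" '
    BEGIN {
        # Remember the 64 byte values used as UTF-8 continuation bytes.
        for (b = 128; b <= 191; b++)
            cont[sprintf("%c", b)] = 1
    }
    # Number of UTF-8 characters in s: every byte except continuation bytes.
    function ulen(s,    i, n) {
        n = 0
        for (i = length(s); i > 0; i--)
            if (!(substr(s, i, 1) in cont))
                n++
        return n
    }
    {
        out = ""
        outLen = 0
        for (i = 1; i <= NF; i++) {
            w = ulen($i)
            if (i == 1) {
                out = $i
                outLen = w
            }
            else if (outLen + 1 + w > maxLen) {
                print out
                out = $i
                outLen = w
            }
            else {
                out = out " " $i
                outLen += 1 + w
            }
        }
        print out
    }'
}
```

It would be called as wrap_utf8 72 < "$1" > "$2".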


If you need to support non-ASCII characters and non-English text, you also need to consider double-width and zero-width characters (such as combining marks), as well as the plethora of spacing characters found in Unicode, some of which, such as the non-breaking space, must not be used as line-break points.

Here, I'd use perl and the Unicode::LineBreak module which implements Unicode's line-breaking algorithm (or Text::LineFold that comes with it geared towards email messages; or Text::Wrap, shipped with perl, supporting TAB / combining marks but not double-width characters; or Text::WrapI18N).

#! /bin/sh -
perl -C -MUnicode::LineBreak -ne '
  BEGIN {$lb = Unicode::LineBreak->new(ColMax => 72, Format => TRIM)}
  print for $lb->break($_)' < "$1" > "$2"

Example:

$ cat file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.
Лорем ипсум долор сит амет, консектетур адиписцинг элит, сед до еусимо темпор инцидидант ют лаборе эт долоре магна аликуа.
Après mure reflexion et plusieurs heures de tergiversation, elle dit : « Ce n'est pas pour moi »
                                                                     72|
$ perl -C -Mcharnames='()' -pe 's/\P{ascii}/"\\N{".charnames::viacode(ord$&) ."}"/ge' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo tempor incididunt ut labore et dolore magna aliqua.
\N{CYRILLIC CAPITAL LETTER EL}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EM} \N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER PE}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER EM} \N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER} \N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER TE}, \N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER ER} \N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER PE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER TSE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER GHE} \N{CYRILLIC SMALL LETTER E}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER TE}, \N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER DE} \N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER O} \N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER O} \N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER PE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER} \N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER TSE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC 
SMALL LETTER A}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER YU}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER IE} \N{CYRILLIC SMALL LETTER E}\N{CYRILLIC SMALL LETTER TE} \N{CYRILLIC SMALL LETTER DE}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER IE} \N{CYRILLIC SMALL LETTER EM}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER GHE}\N{CYRILLIC SMALL LETTER EN}\N{CYRILLIC SMALL LETTER A} \N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER I}\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER A}.
Apre\N{COMBINING GRAVE ACCENT}s mure reflexion et plusieurs heures de tergiversation, elle dit\N{NO-BREAK SPACE}: \N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}\N{NO-BREAK SPACE}Ce n'est pas pour moi\N{NO-BREAK SPACE}\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}
                                                                     72|
$ perl -C -MUnicode::LineBreak -ne 'BEGIN{$lb = Unicode::LineBreak->new(ColMax => 72, Format => TRIM)} print for $lb->break($_)' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do euismo
tempor incididunt ut labore et dolore magna aliqua.
Лорем ипсум долор сит амет, консектетур адиписцинг элит, сед до еусимо
темпор инцидидант ют лаборе эт долоре магна аликуа.
Après mure reflexion et plusieurs heures de tergiversation, elle dit :
« Ce n'est pas pour moi »
                                                                     72|

If the input contains TAB characters, you may want to feed it to expand first (specifying where the tab stops are if they are not 8 columns apart). Beware that not all expand implementations support zero-width or double-width characters, though IIRC BSD implementations generally do. See also col -b if the input contains backspace characters (as sometimes used for bold or underline).
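
For instance, assuming tab stops every 4 columns, the input could be expanded before it reaches the wrapper. The detab name below is only illustrative:

```shell
#!/bin/sh -
# Replace each TAB with spaces up to the next tab stop (every $1 columns,
# 8 by default), so that the wrapper sees real column positions.
detab() {
    expand -t "${1:-8}"
}

# Usage sketch, with the perl command shown earlier as the wrapper:
# detab 4 < "$1" | perl -C -MUnicode::LineBreak -ne '...' > "$2"
```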

Also note the use of redirection in < "$1" > "$2". The advantages (over using perl -ne '...' -- "$1" > "$2") are:

  • if $1 can't be opened for reading, then the command is aborted even before $2 is opened for writing (and possibly truncated).
  • As noted at Security implications of running perl -ne '...' *, perl -n can't be used for arbitrary file paths as it handles some names specially.

Both points also apply to awk, though the effect of the second is less dramatic there: a file called - or OFS=\n would be a problem, but probably cannot lead to arbitrary command execution.
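
The first point can be seen with a small experiment. This is only a sketch: by_redirect and by_name are hypothetical names, and an awk one-liner that upper-cases its input stands in for the real wrapper.

```shell
#!/bin/sh -
# The shell opens "$1" first; if that fails, "$2" is never opened.
by_redirect() {
    awk '{ print toupper($0) }' < "$1" > "$2"
}

# The shell opens (and truncates) "$2" first; awk only finds out
# afterwards that "$1" cannot be read.
by_name() {
    awk '{ print toupper($0) }' "$1" > "$2"
}
```

With an existing out.txt and no file called missing.txt, by_redirect missing.txt out.txt fails and leaves out.txt alone, while by_name missing.txt out.txt fails only after emptying it.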
