regex to find text plus trailing 10+ spaces; sum of two matches' lengths is part of matching condition

Question

I'm trying to clean files that are copy/pasted versions of my Cygwin (mintty) terminal running bash¹. Usually, the input and output are separated by a linefeed ('\n'), as expected. However, when I look at the pasted text in my text editor/IDE (Notepad++, in my case), the command is sometimes followed by enough spaces to reach the terminal width, and then by the output on the same line. (The terminal width can be found by running tput cols in the original terminal.) I can't tell the difference when looking at my terminal, but it's evident when looking at my programmer's notebook. I do know about other terminal logging solutions³, but my question is about the copy/paste situation.

TL;DR

Suppose I have the following file, a log of the terminal input/output where the terminal width is 80 characters. The middle two commands (a command being a line starting with $) don't show the problem and have a comment with OK after them.

$ tput cols                                                                     80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"                                             Cédric, Žemaičių Naumiesčio

$

I would like to change it (either by checking each match of a bad line one-by-one and manually making the necessary edits, by editing the file in-place, or by redirecting the edited version to a separate file), so that I get

$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio

$

I imagine that raku or perl might have some solution to count the initial characters plus spaces to see if they match the column width (here it's 80). Could anyone show me such a raku/perl solution? Any sed or awk solutions would also be nice, as they would allow me to use the search-and-replace inside vim with a similar regex style. I have what I feel to be a clunky solution using bash with the constructs =~, ${BASH_REMATCH[n]}, ${#some_string} and $(( ... )) arithmetic. This solution is further down the question. I hope that there's something more elegant than my attempt, specifically a one-liner or small function/script.

By the way, this isn't so important that it needs to be fail-proof; I'm trying to check for initial-characters + 10-or-more-spaces having a length of 80 (or whatever the terminal width happens to be) and a non-space character after the last of the spaces. Edit: I wasn't clear on this before, but note that lengths that are multiples of 80, including the 10-or-more last spaces, also count as needing a split. However, the count does need to be of characters (UTF-8 encoded), not of bytes.

A More Thorough Example of the Problem

Though I apologize for the overflow requiring horizontal scrolling, I need it to show an example of the problem. One such example is in the contents of the file, terminal_logfile_woes.log.

$ tput cols                                                                     80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list                   # '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\

$ find ./the_dir_with_thirteen_files/ -type f | wc -l                           13

$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l                             0

$ a_short_string="abc"  # OK

$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : ("  # OK

$ echo "${a_quite_long_string}"                                                 abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (

$ whoami  # OK
bballdave025

$ echo "你好。我不知道。"                                                       你好。我不知道。

$ echo I want to put in               more=15 spaces.  # OK
I want to put in more=15 spaces.

$ echo "Cédric,Žemaičių Naumiesčio"                                            Cédric,Žemaičių Naumiesčio

$

_{A quick note about the find . -type f -iname "*no_file_with_this*" ... 0 line. I've entered it here in the editor, checking that there are 161 character glyphs including the 0. (I press right-arrow 161 times from the very beginning and end up at the right side of the zero.) However, when I copy it back and forth from here, from my Cygwin terminal, from Notepad++, from vim, from my word processor, etc., I sometimes end up with 167 total glyphs, sometimes with 153 total glyphs, sometimes with 162 total glyphs, and a few times with other numbers. If you can't get that line right, either check that your version of the logfile has 161 glyphs or don't worry about that line. As far as I can tell, when I count it from the "posted" question window (not the preview), it always has 161 glyphs. There should be 29 spaces between the l of wc -l and the 0 at the end; 131 glyphs through wc -l, 29 spaces, and the last 0 makes 161.}

_{Maybe I'm doing my counting wrong. Curiously enough, my function worked with an older version that, when pasted in the editor, had only 153 glyphs. It doesn't work with the current version. Perhaps it's best to leave the condition as some of you have suggested: "change sequences of 10 or more spaces to a new line if the line length is > 80 characters" (as stated by @cas).}

The count needs to be of UTF-8 encoded characters, since that's what my terminal is set up to use. I can use a solution that will count correctly for everything but not-necessarily-monospaced-in-monospace-font characters, such as CJK characters (archived Wikipedia page as I see it) (though I prefer calling them CJKV characters {archived}). My clunky solution, below, does not handle these correctly. A good test is that last string, Cédric,Žemaičių Naumiesčio, which should have 26 characters. The following two commands both give the wrong count, because they are counting bytes, not UTF-8 characters encoded as bytes.

$ printf %s "Cédric,Žemaičių Naumiesčio" | wc -c
31

$ printf %s "Cédric,Žemaičių Naumiesčio" | LC_ALL=C grep -o . | wc -l
31

Trying the last command without the line count, i.e. printf %s "Cédric,Žemaičių Naumiesčio" | LC_ALL=C grep -o ., might help to illustrate the issue.

The next command does give the correct count.

$ printf %s "Cédric,Žemaičių Naumiesčio." | LC_ALL=C.UTF-8 grep -o . | wc -l
27

Again, it might be illustrative to see the output without the | wc -l, i.e. printf %s "Cédric,Žemaičių Naumiesčio." | LC_ALL=C.UTF-8 grep -o . and to compare it with the LC_ALL=C version.

(You can also try the string, 你好。我不知道。—the count should be 8. However, the different width of CJKV characters in many fonts means that the general principle will not work, as the number of UTF-8 encoded characters isn't 80 when the characters go to the next line. If I really needed to keep everything the same width—a truly monospaced Unicode Plane 0 font—I could use Unifont {archived} for my terminal font.)
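The same byte-versus-character distinction shows up in bash itself, which matters because the ${#some_string} construct used in the function further down relies on it. What follows is only a minimal sketch, assuming bash and a shell that is running under a UTF-8 locale; the helper name byte_len is made up for this illustration.

```shell
#!/usr/bin/env bash
# byte_len is a hypothetical helper name, not part of any tool.
# Inside the function the locale is forced to C, so ${#1} counts bytes.
byte_len() {
  local LC_ALL=C
  printf '%s\n' "${#1}"
}

s='Cédric,Žemaičių Naumiesčio'

# In the caller's locale, ${#s} counts characters only if that locale is UTF-8.
printf 'characters (current locale): %s\n' "${#s}"
printf 'bytes (C locale):            %s\n' "$(byte_len "$s")"
```

Under a UTF-8 locale the first number should be the character count and the second the byte count that wc -c reports. If both numbers come out the same for this string, the shell is probably not running in a UTF-8 locale.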

My Clunky Solution

Because it was easier for providing an example for this question, I created the following function from the terminal prompt, though I could just as easily have created a script for it. It's based on this answer (archived) from @glenn-jackman here on U&L SE.

$ find_spaces_not_linefeed() {
  use="Usage:\n% find_spaces_not_linefeed LOG [WIDTH] [DO_CHECK_LENGTH]"
  if [ $# -eq 0 ]; then echo -e "Path to logfile required\n${use}"; return 1; fi
  if [ $# -ge 1 ]; then input_logfile_name=$1; fi
  terminal_width=80; if [ $# -ge 2 ]; then terminal_width=$2; fi
  do_check=0; if [ $# -ge 3 ]; then do_check=$3; fi
  change_count=0 
  while IFS= read line; do
    this_str="${line}"
    if [[ $this_str =~ (^.+)([ ]{10,})([^ ].*$) ]]; then
      beg=${BASH_REMATCH[1]} spaces=${BASH_REMATCH[2]} end=${BASH_REMATCH[3]}
      len_beg="${#beg}" len_spaces="${#spaces}" len_end="${#end}"
      if [ $do_check -ne 0 ]; then
        echo; echo "# FOR CHECKING #"; echo -n "len_beg: ${len_beg} "
        echo -n "len_spaces: ${len_spaces} "; echo "len_end: ${len_end}"
      fi ##endof: if [ $do_check -ne 0 ]
      is_a_match=0  # guilty until proven innocent
      test_value=$(( (len_beg + len_spaces) % terminal_width ))
      test ${test_value} -eq 0 && is_a_match=1
      if [ $is_a_match -eq 1 ]; then
        change_count=$(( change_count + 1 ))
        if [ $do_check -eq 0 ]; then echo; fi
        echo "  CHANGE"; echo "${this_str}"; echo "  TO"
        echo "${beg}"  # Will put in a linefeed between the two
        echo "${end}"
        echo "##### WITH a linefeed and not spaces."
      fi ##endof:  if [ $is_a_match -eq 1 ]
    fi ##endof:  if <bash_rematch regex matches>
  done < "${input_logfile_name}"
  test $change_count -eq 0 && echo "No matches found. No changes needed."
} ##endof:  find_spaces_not_linefeed
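For comparison, here is a trimmed-down sketch of the same check that prints the repaired text instead of a report. It is an illustration under stated assumptions, not a replacement for the function above: the name split_merged_lines is made up, it reads the log on standard input, and it relies on ${#...} counting characters, which only holds when bash runs under a UTF-8 locale. It uses read -r so that backslashes in the log survive.

```shell
#!/usr/bin/env bash
# split_merged_lines is a hypothetical name; WIDTH defaults to 80.
# Reads the log on stdin and writes the repaired text to stdout.
split_merged_lines() {
  local width=${1:-80} line
  # prefix ending in a non-space, 10+ spaces, then the rest starting with a non-space
  local re='^(.*[^ ])( {10,})([^ ].*)$'
  while IFS= read -r line || [[ -n $line ]]; do   # -r keeps backslashes intact
    if [[ $line =~ $re ]] &&
       (( (${#BASH_REMATCH[1]} + ${#BASH_REMATCH[2]}) % width == 0 )); then
      # prefix + spaces fill a whole number of terminal rows: split here
      printf '%s\n%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[3]}"
    else
      printf '%s\n' "$line"
    fi
  done
}

# Usage (fixed.log is just an example output name):
#   split_merged_lines 80 < terminal_logfile_woes.log > fixed.log
```

Like the original, this looks at only one run of 10+ spaces per line, so a line with several such runs may not be split where expected.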

_{I have a different use variable, with a more-detailed string, defined as a heredoc².}

When I run it on terminal_logfile_woes.log, I get the following, where I've included my $PS1 (prompt info) to differentiate my terminal and the logged terminal's I/O.

bballdave025@MY_MACHINE ~/logfile_problems
$ find_spaces_not_linefeed terminal_logfile_woes.log 80

  CHANGE
$ tput cols                                                                     80
  TO
$ tput cols
80
##### WITH a linefeed and not spaces.

  CHANGE
$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list                   # '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'
  TO
$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'
##### WITH a linefeed and not spaces.

  CHANGE
$ find ./the_dir_with_thirteen_files/ -type f | wc -l                           13
  TO
$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13
##### WITH a linefeed and not spaces.

  CHANGE
$ find . -type f -iname "*no_file_with_this*" | grep -oP "[x00-x20x7F-xFF]" | sed 's#(.)#'"'"'1'"'"', #g; s#, $##g;' | wc -l                                    0
  TO
$ find . -type f -iname "*no_file_with_this*" | grep -oP "[x00-x20x7F-xFF]" | sed 's#(.)#'"'"'1'"'"', #g; s#, $##g;' | wc -l
0
##### WITH a linefeed and not spaces.

  CHANGE
$ echo "${a_quite_long_string}"                                                 abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (
  TO
$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (
##### WITH a linefeed and not spaces.

  CHANGE
$ echo "Cédric,Žemaičių Naumiesčio"                                             Cédric, Žemaičių Naumiesčio
  TO
$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio
##### WITH a linefeed and not spaces.

bballdave025@MY_MACHINE ~/logfile_problems
$

I haven't attempted a search and replace, because I'm very unsure how everything in terminal input and output (e.g. $, #, many single and double quotes, all the types of brackets) could be consistently escaped in a search and replace.

Edit: I apologize for not being clearer with my expected output.

The output above is sort of a pre-version, basically stopping just short of doing the search and replace. A lot of the answers now—1755528589 a.k.a. 2025-08-18T164949+0000— have already done the search and replace. The even-better expected output, with the search and replace performed, is

$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\

$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13

$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l
0

$ a_short_string="abc"  # OK

$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : ("  # OK

$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (

$ whoami  # OK
bballdave025

$ echo "你好。我不知道。"                                                       你好。我不知道。

$ echo I want to put in               more=15 spaces.  # OK
I want to put in more=15 spaces.

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric,Žemaičių Naumiesčio

$

Note that I included the CJK(V) characters as an example of something that I don't need fixed, but I appreciate those of you who have looked into the solution to that issue as well.

A lot of you have gotten to my expected output, even without my being overly clear. It will be quite a task to choose an accepted answer. I'll give it about a day, since some of you might change your answers given my clarifications.

Notes

[1]

_{My System}

_{$ uname -a
CYGWIN_NT-10.0-19045 MY_MACHINE 3.6.3-1.x86_64 2025-06-05 11:45 UTC x86_64 Cygwin

$ bash --version | head -n 1
GNU bash, version 5.2.21(1)-release (x86_64-pc-cygwin)}

[2]

_{Instead of the line with the use="Usage... code in the version of the function posted above, my function actually has its use variable defined in a heredoc.
IFS= read -r -d '' use <<'EndOfUsage'
Usage:
% find_spaces_not_linefeed LOGFILENAME [TERMINAL_WIDTH] [DO_CHECK_FOR_LENGTH]

LOGFILENAME The name of the file to which you pasted the terminal
I/O that was copied from a Cygwin (mintty) terminal. If
there is no path included, it is assumed the file is in the
working directory. This is the only required argument.
TERMINAL_WIDTH Defaults to 80. If you still have the Cygwin (mintty)
terminal (from which you copy/pasted) up, you can find this
with `tput cols'. If you're using a new terminal to run
this `find_spaces_not_linefeed', there is no guarantee.
DO_CHECK_FOR_LENGTH A boolean that defaults to 0. Any other (`!= "0"') val
will give lengths of anything in a command before 10 or more
spaces, the number of spaces (if 10 or more) and the length
of anything (presumably the output from the command) after
the 10-plus spaces.

`find_spaces_not_linefeed' is meant to correct instances where copying from
a Cygwin (mintty) terminal and pasting into a text editor/IDE sometimes
doesn't give a linefeed between a command and its output, but instead adds
spaces until a multiple of the terminal width is reached
( `[ $(( length % $(tput cols) )) -eq 0 ]' is true ),
then prints the output of the command. It doesn't deal with edge cases,
including the fact that there could be instances where the length of the
command would make the number of spaces less than 10 or even 0.
EndOfUsage

This could probably also serve as a good recap of the reasons for my question.}

[3]

_{I do use what I call "script scrubbers", most based on a 2008 Perl function (archived) by @repellent in response to @mrslopenk's question on PerlMonks. (I can share a more robust extension of the code here on Unix.SE (archived) or more (archived), similar and/or more unique ways (archived) to automatically log and read terminal I/O (archived), and then some (archived) ... here's a new good source (archived) I want to try.) Still, sometimes a good ol' Select All and Copy/Paste is a good way to grab the contents of the Cygwin terminal window. As you see here, Copy/Paste has a few problems, just as all the other solutions have their own problems.}

_{Edit: Something I hadn't clarified (sorry @cas), the solutions are used with the output of the script command.}

how are these files created? using the cursor to cut-n-paste? some sort of terminal logging tool? — markp-fuso
– markp-fuso, Commented Aug 12 at 15:14
will each file's first line consist of a tput cols call? are we supposed to use the result of the tput cols call to determine where to 'split' lines? — markp-fuso
– markp-fuso, Commented Aug 12 at 15:15
can you include an example where the 'first' line is 78 characters wide (terminal is 80 wide); would need to see if there's just a single space between the 'first' and 'second' lines, or some other number of spaces — markp-fuso
– markp-fuso, Commented Aug 12 at 15:29
for the appended lines, have you tried looking at the raw bytes (eg, od -c) to see if there's a non-printing character that delimits the 'first' and 'second' lines? — markp-fuso
– markp-fuso, Commented Aug 12 at 15:36
Please edit your question to provide the expected output for your more thorough sample input so we can test a potential solution using it and get a simple pass/fail result. — Ed Morton
– Ed Morton, Commented Aug 13 at 13:20

markp-fuso · Accepted Answer · 2025-08-14 16:46:34Z

Assumptions:

lines that have been merged are separated by at least 10 spaces
the 2nd half of a merged line (ie, the second line) does not contain 10+ spaces
the total length of a merged line is at least 80 characters (objective is to keep from splitting a single (non-merged) line that contains 10+ spaces; could be problematic for a) a non-merged line with 10+ spaces and longer than 80 characters or b) a merged line with a lot of Chinese (?) characters such that total merged line length < 80 characters - see the end of this answer for info on a custom awk library of functions for dealing with variable display width characters)

Approach:

use the rev binary to reverse all lines (characterwise)
search for first occurrence of 10+ spaces
if total length of the input line >= 80 then we have a merged line and we'll split it based on the location of the (first match of) 10+ spaces
use the rev binary to reverse all lines back to their original characterwise ordering

One GNU awk solution:

rev filename | 
awk -v width=80 '
match($0,/[ ]{10,}/) { if (length() >= width) {               # if 10+ spaces and length >= 80
                                                              # then this is a merged line so
                          print substr($0,RSTART+RLENGTH)     # print 1st line
                          print substr($0,1,RSTART-1)         # print 2nd line
                          next                                # go to next line of input
                       }
                     }
1                                                             # else print nonmerged lines
' |
rev

Taking for a test drive ...

filename == OP's first data set:

$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio

$

filename == OP's second data set:

$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\

$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13

$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l
0

$ a_short_string="abc"  # OK

$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : ("  # OK

$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (

$ whoami  # OK
bballdave025

$ echo "你好。我不知道。"
你好。我不知道。

$ echo I want to put in               more=15 spaces.  # OK
I want to put in more=15 spaces.

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric,Žemaičių Naumiesčio

$
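The "split at the last run of 10+ spaces" idea can also be sketched without rev, using only bash parameter expansion. This is a rough illustration under the same assumptions as the awk script (at least 10 separating spaces, line length at least the terminal width); the function name split_at_last_run is made up, and ${#line} counts characters only when bash runs under a UTF-8 locale.

```shell
#!/usr/bin/env bash
# split_at_last_run is a hypothetical name; WIDTH defaults to 80.
# Reads the log on stdin and writes the repaired text to stdout.
split_at_last_run() {
  local width=${1:-80} line head tail ten
  printf -v ten '%10s' ''            # a string of exactly ten spaces
  while IFS= read -r line || [[ -n $line ]]; do
    tail=${line##*"$ten"}            # text after the last run of 10+ spaces
    if [[ $line == *"$ten"* && -n $tail ]] && (( ${#line} >= width )); then
      head=${line%"$tail"}           # everything before the tail, spaces included
      head=${head%"${head##*[! ]}"}  # drop the trailing spaces
      printf '%s\n%s\n' "$head" "$tail"
    else
      printf '%s\n' "$line"          # not a merged line: pass through unchanged
    fi
  done
}
```

Removing the longest prefix that ends in ten spaces leaves exactly the text after the last such run, which is what reversing the line and taking the first match achieves in the awk version.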

Addendum

A web search on determining display/column widths of characters brought up a unix.stackexchange.com Q&A with this answer being of particular interest.

That answer references a custom awk library that provides awk functions for processing variable (display) width characters.

Replacing the length() call (above) with wcscolumns($0) should provide a bit more stability to the above awk script as far as dealing with variable (display) width characters.

Thanks for the comment; now I understand why your awk was checking the length! I deleted my answer since, as you correctly pointed out, I had missed one of the requirement. — terdon
– terdon ♦, Commented Aug 13 at 15:14

cas · Accepted Answer · 2025-08-18 17:57:34Z

I'd say you're on the right track using perl (as awk and sed don't really handle unicode characters properly), but it seems the core of your problem is trailing spaces on each line. You can fix that with sed or perl, e.g.

sed -e 's/ *$//'

perl -lpe 's/\s+$//'

You can use the -i option for in-place edit with either perl or sed.

If you're already using perl to process the text, just incorporate the perl s/// operation near the start of your main text processing loop. My while(<<>>) loops tend to start with chomp and a lot of miscellaneous cleanup regexps like that. e.g.

while(<<>>) {
  chomp;
  s/^\s*|\s*$//g;  # strip leading and trailing spaces
  s/\s*#.*//;      # delete # comments
  next if (m/^$/); # skip empty lines

  # real processing starts here
...
}

It should be obvious, but don't use the s/// line for deleting # comments if the input text contains #s that don't mark the beginning of a comment. I mostly use it for config files or lists of filenames or words or whatever that I might want to comment out. And don't use it for markdown or program code or anything else that often has non-comment # characters. There are perl modules for safely stripping comments if I need that.

Example perl one-liner to change sequences of 10 or more spaces to a new line if the line length is > 80 characters:

$ perl -CASD -lpE '
    # If the first arg is entirely numeric, grab it and treat
    # it as the text width.  Default to 80, otherwise:
    BEGIN {$tw = ($ARGV[0] =~ /^\d+$/) ? shift : 80};

    next unless (length > $tw);
    $count += s/\h{10,}/\n/;

    END {
      # print message to stderr so it doesn't interfere with
      # redirection. or with -i (in-place-edit) if that is used.
      print STDERR "No matches found. No changes needed." unless $count;
    }' example2.log 
$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio

$

This is far from perfect. For one thing, it makes no attempt to distinguish between 10+ intentional spaces in a line and 10+ unwanted added spaces - they're treated equally. This is partly because there's no easy way to know which are intentional and which are not.

Also, correctly handling unicode is hard - much harder than most people think. I've done the bare minimum with -CASD to make sure that all I/O is treated as UTF-8, which at least makes it count line lengths correctly, and -E rather than -e to enable all the optional modern perl features (which aren't enabled by default so that decades worth of old perl code doesn't break), including unicode_strings. I'm sure Stéphane, one of the all-things-unicode experts here, will come along and point out how and why what I've done is wrong/not good enough. I am certainly not even remotely close to being an expert on unicode.

But apart from line lengths, I don't think unicode characters are a big part of the problem you're trying to solve.

FYI: the difference between using -CASD and not using it (although, for this example, only -CS (stdin, stdout, stderr are assumed to be utf-8) or even -CI (only stdin) would be needed):

$ printf %s "Cédric,Žemaičių Naumiesčio" | perl -CASD -lne 'print length'
26

$ printf %s "Cédric,Žemaičių Naumiesčio" | perl -lne 'print length'
31

BTW, the entire END {...} block is optional. I only added it because it's roughly what your function does. If you delete it, you won't need $count += before the s/\h{10,}/\n/ operation.

One more thing: have you considered using script to generate the transcript logs rather than manually copy-pasting them? It would trade one set of problems for another, but (as you noted with your various links) there are viable solutions for dealing with script output, and it's got to be easier and less hassle than manual copy-pasting.

Regarding "awk and sed don't really handle unicode characters properly" - could you provide an example where GNU awk doesn't handle Unicode properly? It may be worth raising a bug report if that is the case, idk. — Ed Morton
– Ed Morton, Commented Aug 13 at 10:24
This has some great thoughts, but the extra spaces in the middle are important. I don't know if that's what you mean by leading and trailing spaces. As I have the files on my machine, I don't have leading and trailing spaces. I'm going to try the solution on my files, though. — bballdave025
– bballdave025, Commented Aug 18 at 14:46
@bballdave025 Leading spaces are any spaces before the first non-space character on a line. Trailing spaces are any spaces after the last non-space character. — cas
– cas, Commented Aug 18 at 14:57
your comment about "the extra spaces in the middle" made me look at your examples again. it seems that you could probably get away with a regex to change sequences of 10+ spaces to a newline IF the line length is > 80. I'll edit my answer and add an example of how to do that. — cas
– cas, Commented Aug 18 at 15:42
Okay, that will be great. I was coming here to look at your answer more closely to see what I might be missing and to ask where the leading and trailing spaces might be, but I think we've got it. — bballdave025
– bballdave025, Commented Aug 18 at 16:24

Hauke Laging · Accepted Answer · 2025-08-12 01:04:19Z

This does not handle the case that there are two or more slices of 10+ spaces:

awk -F ' {10,}' '
    NF>1 && length($0)>80 { print "# CHANGE"; print $1; print $2; next; };
    { print "# NO CHANGE"; print; }
    ' inputfile

This is the first one I saw and does a great, simple job of performing what I was looking for. I like the addition of "# NO CHANGE". I'm going to have a hard time figuring out an accepted answer; this one won't be it, but it's near the top of the list.
– bballdave025, Commented Aug 18 at 14:44
And I think that, no matter which one I pick, this is the one that would come in second : )
– bballdave025, Commented Aug 18 at 18:00

Ed Morton · Accepted Answer · 2025-08-13 13:15:32Z

This might be what you want, using GNU awk for gensub(), \s/\S shorthand and unicode support:

$ awk '{print gensub(/^(.{70})\s{10}(\S.*)/, "\\1\n\\2", 1)}' file
$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio

$

You didn't provide the expected output for the more thorough sample input you provided so idk if this is what you expect or not given that input but here is that awk script running against that input:

$ awk '{print gensub(/^(.{70})\s{10}(\S.*)/, "\\1\n\\2", 1)}' file2
$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\

$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13

$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l                             0

$ a_short_string="abc"  # OK

$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : ("  # OK

$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (

$ whoami  # OK
bballdave025

$ echo "你好。我不知道。"                                                       你好。我不知道。

$ echo I want to put in               more=15 spaces.  # OK
I want to put in more=15 spaces.

$ echo "Cédric,Žemaičių Naumiesčio"                                            Cédric,Žemaičių Naumiesčio

$

I don't think I've ever known about awk with gensub. That's a useful construct. — bballdave025
– bballdave025, Commented Aug 18 at 17:59

wobtax · Accepted Answer · 2025-08-13 15:08:43Z

Matching the lines

First, sed is able to match

initial-characters + 10-or-more-spaces having a length of 80 (or whatever the terminal width happens to be) and a non-space character after the last of the spaces

as follows. Suppose the column width is 80.

sed -nE '/^[^[:space:]].{69}[[:space:]]{10}[^[:space:]].*$/p' input.txt

Then this prints out just those lines:

$ tput cols                                                                     80
$ echo "Cédric,Žemaičių Naumiesčio"                                             Cédric, Žemaičių Naumiesčio

Explanation: -n suppresses printing, -E enables extended regex matching, and p prints manually. As long as sed runs in a UTF-8 locale, GNU sed matches . against whole characters rather than single bytes, so the counts also hold for non-ASCII text.

Editing them

We can take this in sections like (initial line + space)(next line), add a newline between section 1 and 2, and delete any trailing space:

sed -E 's/^([^[:space:]].{69}[[:space:]]{10})([^[:space:]].*)$/\1\n\2/g
                      s/[[:space:]]*$//g' input.txt

Result:

$ tput cols                                                                     
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"                                             
Cédric, Žemaičių Naumiesčio

$

Finally, if you need to use a variable, like $column_number, just insert it like:

length="$(( column_number - 11 ))"
sed -E 's/^([^[:space:]].{'"$length"'}[[:space:]]{10})([^[:space:]].*)$/\1\n\2/g
                      s/[[:space:]]*$//g' input.txt

Rats, this fails for two lines of the longer example. I'm not sure why, but it seems sed doesn't think they have enough space characters.
– wobtax, Commented Aug 13 at 15:23
I noticed the same, and there's the same issue in Ed Morton's answer, so the printout in the Q is likely broken wrt. that. Since the "Cédric" line is ok in the first printout, I'm guessing the other one broke when transferring it here, but it's hard to say.
– ilkkachu, Commented Aug 13 at 15:36
If it failed for you, wobtax, ilkkachu, and @Ed-Morton, my guess is that there's a problem with my text. A few I copied from real logs (including logs where I tried to reproduce the problem, or at least create similar situations), and some others I made myself. Let me double check.
– bballdave025, Commented Aug 18 at 15:42
It's my mistake. The line, as copied, has only 153 characters, while it should have 161. Let me change that and double-check it's right.
– bballdave025, Commented Aug 18 at 16:21
I like the use of variable column_number.
– bballdave025, Commented Aug 18 at 17:57

ilkkachu · Accepted Answer · 2025-08-13 15:33:35Z

There are already a few answers, but another possibility would be to use e.g. Perl's substr() to check just the part of the string around the folding position. It looks like in one case (the find . -type f ...) the command output is at position 160 instead of 80, so let's check all multiples of 80 up to the line length.

newlinefix.pl:

#!/usr/bin/perl -p

$width = 80;
for ($col = $width; $col < length $_; $col += $width) {
    # look for ten spaces before the fold and a non-space after
    if (substr($_, $col - 10, 11) =~ / {10}\S/) {
        substr($_, $col, 0) = "\n";  # insert newline
        s/ +\n/\n/;  # remove any now-trailing spaces
        last;
    }
}

and using it on your two printouts:

% perl -C newlinefix.pl < terminal-dump.txt
$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio

$

and

% perl -C newlinefix.pl < terminal-dump2.txt
$ tput cols
80

$ type wc  # OK
wc is hashed (/usr/bin/wc)

$ #  This one is fine. OK

$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\

$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13

$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l
0

$ a_short_string="abc"  # OK

$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : ("  # OK

$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (

$ whoami  # OK
bballdave025

$ echo "你好。我不知道。"                                                       你好。我不知道。

$ echo I want to put in               more=15 spaces.  # OK
I want to put in more=15 spaces.

$ echo "Cédric,Žemaičių Naumiesčio"                                            Cédric,Žemaičių Naumiesčio

$

With -C it should work on non-ASCII characters too, but it looks like the columns don't match in the second printout after I copypasted it from the post above. (They don't line up in my browser either.) YMMV.
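The same column-probing idea can be sketched in bash with substring expansion, for anyone who would rather stay in the shell. This is only an illustration, not a port that has been tested against the printouts above: the function name fold_fix is made up, and as with the Perl version the column arithmetic is in characters only when the shell runs under a UTF-8 locale.

```shell
#!/usr/bin/env bash
# fold_fix is a hypothetical name; WIDTH defaults to 80.
# Reads the log on stdin and writes the repaired text to stdout.
fold_fix() {
  local width=${1:-80} line col head ten
  printf -v ten '%10s' ''                      # a string of exactly ten spaces
  while IFS= read -r line || [[ -n $line ]]; do
    # probe every multiple of WIDTH that lies inside the line
    for (( col = width; col < ${#line}; col += width )); do
      # ten spaces right before the fold and a non-space right after it
      if [[ ${line:col-10:10} == "$ten" && ${line:col:1} != ' ' ]]; then
        head=${line:0:col}
        head=${head%"${head##*[! ]}"}          # remove the now-trailing spaces
        printf '%s\n%s\n' "$head" "${line:col}"
        continue 2                             # go on to the next input line
      fi
    done
    printf '%s\n' "$line"                      # no fold found: print unchanged
  done
}

# Usage:
#   fold_fix 80 < terminal-dump.txt
```

Because only the characters around each multiple of the width are inspected, a run of intentional spaces elsewhere in a command is left alone unless it happens to end exactly on a fold column.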

You correctly surmised that the one line was supposed to be a multiple of 80, which is something I wanted without explicitly stating. (The question has been edited.) — bballdave025
– bballdave025, Commented Aug 18 at 17:56
