I'm trying to clean files that are copy/pasted versions of my Cygwin (mintty) terminal running bash [1]. Usually, the input and output are separated by a linefeed ('\n'), as expected. However, when I look at the output in my text editor/IDE (Notepad++, in my case), the command is sometimes followed by enough spaces to get to the terminal width and then the output. (The terminal width can be found by running tput cols in the original terminal.) I can't tell the difference when looking at my terminal, but it's evident when looking at my programmer's notebook. I do know about other terminal logging solutions [3], but my question is about the copy/paste situation.
TL;DR
Suppose I have the following file, a log of the terminal input/output where the terminal width is 80 characters. The middle two commands (a command being a line starting with $) don't show the problem and have a comment with OK after them.
$ tput cols 80
$ type wc # OK
wc is hashed (/usr/bin/wc)
$ # This one is fine. OK
$ echo "Cédric,Žemaičių Naumiesčio" Cédric, Žemaičių Naumiesčio
$
I would like to change it (either by checking each match of a bad line one-by-one and manually making the necessary edits, by editing the file in-place, or by redirecting the edited version to a separate file), so that I get
$ tput cols
80
$ type wc # OK
wc is hashed (/usr/bin/wc)
$ # This one is fine. OK
$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio
$
I imagine that raku or perl might have some solution to count the initial characters plus spaces to see if they match the column width (here it's 80). Could anyone show me such a raku/perl solution? Any sed or awk solutions would also be nice, as they would allow me to use the search-and-replace inside vim, with the similar regex style. Does anyone know how to do this? I have what I feel to be a clunky solution using bash with the constructs =~, ${BASH_REMATCH[n]}, ${#some_string}, and $(( ... )) arithmetic. This solution is further down the question. I hope that there's something more elegant than my attempt, specifically a one-liner or small function/script.
By the way, this isn't so important that it needs to be fail-proof; I'm trying to check for initial-characters + 10-or-more-spaces having a length of 80 (or whatever the terminal width happens to be) and a non-space character after the last of the spaces. Edit: I wasn't clear on this before, but note that lengths that are multiples of 80, including the 10-or-more last spaces, also count as needing a split. However, the character count does need to be of UTF-8 encoded characters.
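The rule just described can be sketched directly in bash by looking only at the columns that are multiples of the width. This is a minimal sketch, not a definitive solution: the names split_padded_lines, width, and min_pad are made up for illustration, and ${#line} only counts UTF-8 characters (rather than bytes) when bash runs in a UTF-8 locale. Wide CJK characters are not handled.

```shell
#!/usr/bin/env bash
# Sketch: split a pasted line wherever space padding ends exactly on a
# multiple of the terminal width and is followed by a non-space character.
# split_padded_lines, width and min_pad are illustrative names.
split_padded_lines() {
  local width=${1:-80} min_pad=${2:-10}
  local line pad pos head
  printf -v pad '%*s' "$min_pad" ''        # a string of min_pad spaces
  while IFS= read -r line || [[ -n $line ]]; do
    pos=$width
    while (( pos < ${#line} )); do
      # Padding must fill the last min_pad columns before the boundary,
      # and the first column after the boundary must not be a space.
      if [[ ${line:pos-min_pad:min_pad} == "$pad" && ${line:pos:1} != ' ' ]]; then
        head=${line:0:pos}
        head=${head%"${head##*[! ]}"}      # drop the trailing padding
        printf '%s\n' "$head"
        line=${line:pos}                   # keep scanning the remainder
        pos=$width
      else
        (( pos += width ))
      fi
    done
    printf '%s\n' "$line"
  done
}
```

It reads stdin and writes stdout, so something like `split_padded_lines 80 < terminal_logfile_woes.log` would print the repaired log without touching the file. Because the test keys on column positions instead of a regex over the whole line, a run of spaces that does not end on a width boundary is left alone.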
A More Thorough Example of the Problem
Though I apologize for the overflow requiring horizontal scrolling, I need it to show an example of the problem. One such example is in the contents of the file terminal_logfile_woes.log.
$ tput cols 80
$ type wc # OK
wc is hashed (/usr/bin/wc)
$ # This one is fine. OK
$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list # '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\
$ find ./the_dir_with_thirteen_files/ -type f | wc -l 13
$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l 0
$ a_short_string="abc" # OK
$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : (" # OK
$ echo "${a_quite_long_string}" abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (
$ whoami # OK
bballdave025
$ echo "你好。我不知道。" 你好。我不知道。
$ echo I want to put in more=15 spaces. # OK
I want to put in more=15 spaces.
$ echo "Cédric,Žemaičių Naumiesčio" Cédric,Žemaičių Naumiesčio
$
A quick note about the find . -type f -iname "*no_file_with_this*" ... 0 line. I've entered it here in the editor, checking that there are 161 character glyphs including the 0. (I press right-arrow 161 times from the very beginning and end up at the right side of the zero.) However, when I copy it back and forth from here, from my Cygwin terminal, from Notepad++, from vim, from my word processor, etc., I sometimes end up with 167 total glyphs, sometimes with 153 total glyphs, sometimes with 162 total glyphs, and a few times with other numbers. If you can't get that line right, you'll either need to check that your version of the logfile has 161 glyphs or not worry about that line. As far as I can tell, when I count it from the "posted" question window (not the preview), it always has 161 glyphs. There should be 29 spaces between the l of wc -l and the 0 at the end; 131 glyphs through wc -l, 29 spaces, and the last 0 makes 161.
Maybe I'm doing my counting wrong. Curiously enough, my function worked with an older version that, when pasted in the editor, had only 153 glyphs. It doesn't work with the current version. Perhaps it's best to leave the condition as some of you have suggested: "change sequences of 10 or more spaces to a new line if the line length is > 80 characters" (as stated by @cas).
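The looser condition quoted from @cas can be sketched as a small sed wrapper. This is only an illustration under assumptions: the function name split_long_lines_on_space_runs is made up, sed -E is assumed to be available (GNU and BSD sed both accept it), and the length test counts characters only as well as sed's handling of the current locale allows.

```shell
#!/usr/bin/env bash
# Sketch of the looser rule: on lines longer than the width, turn every run
# of 10 or more spaces into a newline. The function name is illustrative.
split_long_lines_on_space_runs() {
  local width=${1:-80}
  # A backslash followed by a literal newline is the portable way to put a
  # newline into a sed replacement.
  sed -E "/^.{$((width + 1)),}/ s/ {10,}/\\
/g"
}
```

The trade-off is that this version never checks where the padding ends, so it does not depend on an exact glyph count, but it will also split a long line that legitimately contains 10 or more consecutive spaces.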
The count needs to be of UTF-8 encoded characters, since that's what my terminal is set up to use. I can use a solution that will count correctly for everything but not-necessarily-monospaced-in-monospace-font characters, such as CJK characters (archived Wikipedia page as I see it) (though I prefer calling them CJKV characters {archived}). My clunky solution, below, does not handle these correctly. A good test is that last string, Cédric,Žemaičių Naumiesčio, which should have 26 characters. The following two commands both give the wrong count, because they are counting bytes, not UTF-8 characters encoded as bytes.
$ printf %s "Cédric,Žemaičių Naumiesčio" | wc -c
31
$ printf %s "Cédric,Žemaičių Naumiesčio" | LC_ALL=C grep -o . | wc -l
31
Trying the last command without the line count, i.e. printf %s "Cédric,Žemaičių Naumiesčio" | LC_ALL=C grep -o . , might help to illustrate the issue.
The next command does give the correct count (27 rather than 26, because this version of the string ends with a period).
$ printf %s "Cédric,Žemaičių Naumiesčio." | LC_ALL=C.UTF-8 grep -o . | wc -l
27
Again, it might be illustrative to see the output without the | wc -l, i.e. printf %s "Cédric,Žemaičių Naumiesčio." | LC_ALL=C.UTF-8 grep -o . , and to compare it with the LC_ALL=C version.
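The byte-versus-character difference shown above can also be checked without relying on which locales are installed. The following is a minimal sketch with made-up function names (utf8_char_count, utf8_byte_count); it assumes the input is valid UTF-8 and counts code points, so a letter written with a separate combining mark would count as two.

```shell
#!/usr/bin/env bash
# Count UTF-8 characters without depending on the current locale: every
# character has exactly one byte that is NOT a continuation byte
# (0x80-0xBF, octal 200-277), so delete those bytes and count what is left.
utf8_char_count() {
  local n
  n=$(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c)
  echo $(( n ))          # arithmetic strips any padding wc may add
}

# Plain byte count, for comparison with the character count.
utf8_byte_count() {
  local n
  n=$(printf '%s' "$1" | wc -c)
  echo $(( n ))
}
```

For the test string Cédric,Žemaičių Naumiesčio, `utf8_char_count` and `utf8_byte_count` should agree with the 26 characters and 31 bytes discussed above, provided the accented letters are stored as single precomposed code points.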
(You can also try the string 你好。我不知道。; the count should be 8. However, the different width of CJKV characters in many fonts means that the general principle will not work, as the number of UTF-8 encoded characters isn't 80 when the characters go to the next line. If I really needed to keep everything the same width, with a truly monospaced Unicode Plane 0 font, I could use Unifont {archived} for my terminal font.)
My Clunky Solution
Because it was easier for providing an example for this question, I created the following function from the terminal prompt, though I could just as easily have created a script for it. It's based on this answer (archived) from @glenn-jackman here on U&L SE.
$ find_spaces_not_linefeed() {
    use="Usage:\n% find_spaces_not_linefeed LOG [WIDTH] [DO_CHECK_LENGTH]"
    # Bail out early when no logfile is given.
    if [ $# -eq 0 ]; then echo -e "Path to logfile required\n${use}"; return 1; fi
    input_logfile_name=$1
    terminal_width=80; if [ $# -ge 2 ]; then terminal_width=$2; fi
    do_check=0; if [ $# -ge 3 ]; then do_check=$3; fi
    change_count=0
    # Note: read is used without -r here, so backslashes in the log are
    # treated as escapes and dropped from the lines that get echoed.
    while IFS= read line; do
      this_str="${line}"
      if [[ $this_str =~ (^.+)([ ]{10,})([^ ].*$) ]]; then
        beg=${BASH_REMATCH[1]} spaces=${BASH_REMATCH[2]} end=${BASH_REMATCH[3]}
        len_beg="${#beg}" len_spaces="${#spaces}" len_end="${#end}"
        if [ $do_check -ne 0 ]; then
          echo; echo "# FOR CHECKING #"; echo -n "len_beg: ${len_beg} "
          echo -n "len_spaces: ${len_spaces} "; echo "len_end: ${len_end}"
        fi ##endof: if [ $do_check -ne 0 ]
        is_a_match=0 # guilty until proven innocent
        # Use the requested width rather than a hard-coded 80.
        test_value=$(( (len_beg + len_spaces) % terminal_width ))
        test ${test_value} -eq 0 && is_a_match=1
        if [ $is_a_match -eq 1 ]; then
          change_count=$(( change_count + 1 ))
          if [ $do_check -eq 0 ]; then echo; fi
          echo " CHANGE"; echo "${this_str}"; echo " TO"
          echo "${beg}" # Will put in a linefeed between the two
          echo "${end}"
          echo "##### WITH a linefeed and not spaces."
        fi ##endof: if [ $is_a_match -eq 1 ]
      fi ##endof: if <bash_rematch regex matches>
    done < "${input_logfile_name}"
    test $change_count -eq 0 && echo "No matches found. No changes needed."
  } ##endof: find_spaces_not_linefeed
I have a different use variable, with a more-detailed string, defined as a heredoc [2].
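The regex-plus-modulo test at the heart of the function above can be isolated into a small predicate. This is a hedged sketch, with is_padded_to_width as a made-up name; it differs from the function in two ways that are assumptions on my part: the first group is forced to end on a non-space so that the padding stays out of the command part, and bash arithmetic replaces test and bc. Like the original, it only examines one run of spaces per line.

```shell
#!/usr/bin/env bash
# Sketch: succeed (status 0) when a line looks like COMMAND + padding + OUTPUT
# with the padding ending exactly on a multiple of the terminal width.
# is_padded_to_width is an illustrative name.
is_padded_to_width() {
  local line=$1 width=${2:-80}
  # Group 1 ends on a non-space, group 2 is the run of 10 or more spaces,
  # group 3 starts with the first non-space after the padding.
  local re='^(.*[^ ])( {10,})([^ ].*)$'
  [[ $line =~ $re ]] || return 1
  local beg=${BASH_REMATCH[1]} spaces=${BASH_REMATCH[2]}
  (( (${#beg} + ${#spaces}) % width == 0 ))
}
```

A caller could use it as `if is_padded_to_width "$line" "$terminal_width"; then ...; fi`, which keeps the reporting logic separate from the decision about whether a line needs splitting.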
When I run it on terminal_logfile_woes.log, I get the following, where I've included my $PS1 (prompt info) to differentiate my terminal and the logged terminal's I/O.
bballdave025@MY_MACHINE ~/logfile_problems
$ find_spaces_not_linefeed terminal_logfile_woes.log 80
CHANGE
$ tput cols 80
TO
$ tput cols
80
##### WITH a linefeed and not spaces.
CHANGE
$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list # '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'
TO
$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'
##### WITH a linefeed and not spaces.
CHANGE
$ find ./the_dir_with_thirteen_files/ -type f | wc -l 13
TO
$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13
##### WITH a linefeed and not spaces.
CHANGE
$ find . -type f -iname "*no_file_with_this*" | grep -oP "[x00-x20x7F-xFF]" | sed 's#(.)#'"'"'1'"'"', #g; s#, $##g;' | wc -l 0
TO
$ find . -type f -iname "*no_file_with_this*" | grep -oP "[x00-x20x7F-xFF]" | sed 's#(.)#'"'"'1'"'"', #g; s#, $##g;' | wc -l
0
##### WITH a linefeed and not spaces.
CHANGE
$ echo "${a_quite_long_string}" abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (
TO
$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (
##### WITH a linefeed and not spaces.
CHANGE
$ echo "Cédric,Žemaičių Naumiesčio" Cédric, Žemaičių Naumiesčio
TO
$ echo "Cédric,Žemaičių Naumiesčio"
Cédric, Žemaičių Naumiesčio
##### WITH a linefeed and not spaces.
bballdave025@MY_MACHINE ~/logfile_problems
$
I haven't attempted a search and replace, because I'm very unsure how everything in terminal input and output (e.g. $, #, many single and double quotes, all the types of brackets) could be consistently escaped in a search and replace.
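One way around the escaping worry is to never put any of the log's text into a pattern at all: a filter that keys only on runs of spaces reads the log as data, so $, #, quotes, and brackets need no escaping. The sketch below shows a generic helper for applying any such stdin-to-stdout filter to a file while keeping a backup. The name apply_filter_in_place and the .bak suffix are assumptions for illustration, and mktemp and diff are assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch: run any stdin-to-stdout filter over a file, show what would change,
# and overwrite the file only if the filter succeeded. A .bak copy is kept.
# apply_filter_in_place is an illustrative name.
apply_filter_in_place() {
  local file=$1; shift
  local tmp
  tmp=$(mktemp) || return 1
  if ! "$@" < "$file" > "$tmp"; then
    rm -f "$tmp"                    # filter failed: leave the file untouched
    return 1
  fi
  diff "$file" "$tmp"               # review the changes; status 1 only means "differs"
  # Back up first, then write through the original path so its permissions
  # are preserved.
  cp -p "$file" "$file.bak" && cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, `apply_filter_in_place terminal_logfile_woes.log some_filter 80` would print the diff and leave the original in terminal_logfile_woes.log.bak, where some_filter stands for whichever filter command or function ends up being used.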
Edit: I apologize for not being clearer with my expected output.
The output above is sort of a pre-version, basically stopping just short of doing the search and replace. A lot of the answers now (1755528589, a.k.a. 2025-08-18T164949+0000) have already done the search and replace. The even-better expected output, with the search and replace performed, is
$ tput cols
80
$ type wc # OK
wc is hashed (/usr/bin/wc)
$ # This one is fine. OK
$ cat files_to_rectify_1754293729_2025-08-04T014849-0600.list
# '.' is '/cygdrive/c/David/FHTW-2025-All_-_move_2024_get_new/'\
$ find ./the_dir_with_thirteen_files/ -type f | wc -l
13
$ find . -type f -iname "*no_file_with_this*" | grep -oP "[\x00-\x20\x7F-\xFF]" | sed 's#\(.\)#'"'"'\1'"'"', #g; s#, $##g;' | wc -l
0
$ a_short_string="abc" # OK
$ a_quite_long_string="abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWon'tYouSingWithMe; This song ruined 'zed' for us Americans : (" # OK
$ echo "${a_quite_long_string}"
abcdefghijklmnopqrstuvwxyzNowIKnowMyABCsNextTimeWontYouSingWithMe; This song ruined 'zed' for us Americans : (
$ whoami # OK
bballdave025
$ echo "你好。我不知道。" 你好。我不知道。
$ echo I want to put in more=15 spaces. # OK
I want to put in more=15 spaces.
$ echo "Cédric,Žemaičių Naumiesčio"
Cédric,Žemaičių Naumiesčio
$
Note that I included the CJK(V) characters as an example of something that I don't need fixed, but I appreciate those of you who have looked into the solution to that issue as well.
A lot of you have gotten to my expected output, even without my being overly clear. It will be quite a task to choose an accepted answer. I'll give it about a day, since some of you might change your answers given my clarifications.
Notes
[1]
My System
$ uname -a
CYGWIN_NT-10.0-19045 MY_MACHINE 3.6.3-1.x86_64 2025-06-05 11:45 UTC x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 5.2.21(1)-release (x86_64-pc-cygwin)
[2]
Instead of the line with the use="Usage... code in the version of the function posted above, my function actually has its use variable defined in a heredoc.
IFS= read -r -d '' use <<'EndOfUsage'
Usage:
% find_spaces_not_linefeed LOGFILENAME [TERMINAL_WIDTH] [DO_CHECK_FOR_LENGTH]
  LOGFILENAME          The name of the file to which you pasted the terminal
                       I/O that was copied from a Cygwin (mintty) terminal. If
                       there is no path included, it is assumed the file is in the
                       working directory. This is the only required argument.
  TERMINAL_WIDTH       Defaults to 80. If you still have the Cygwin (mintty)
                       terminal (from which you copy/pasted) up, you can find this
                       with `tput cols'. If you're using a new terminal to run
                       this `find_spaces_not_linefeed', there is no guarantee.
  DO_CHECK_FOR_LENGTH  A boolean that defaults to 0. Any other (`!= "0"') val
                       will give lengths of anything in a command before 10 or more
                       spaces, the number of spaces (if 10 or more) and the length
                       of anything (presumably the output from the command) after
                       the 10-plus spaces.
`find_spaces_not_linefeed' is meant to correct instances where copying from
a Cygwin (mintty) terminal and pasting into a text editor/IDE sometimes
doesn't give a linefeed between a command and its output, but instead adds
spaces until a multiple of the terminal width is reached
( `[ $(( length % $(tput cols) )) -eq 0 ]' is true ),
then prints the output of the command. It doesn't deal with edge cases,
including the fact that there could be instances where the length of the
command would make the number of spaces less than 10 or even 0.
EndOfUsage
This could probably also serve as a good recap of the reasons for my question.
[3]
While I do use what I call "script scrubbers", most based on a 2008 Perl function (archived) by @repellent in response to @mrslopenk's question on PerlMonks, sometimes a good ol' Select All and Copy/Paste is a good way to grab the contents of the Cygwin terminal window. (I can share a more robust extension of the code here on Unix.SE (archived), more (archived) similar and/or more unique ways (archived) to automatically log and read terminal I/O (archived), and then some (archived) ... here's a new good source (archived) I want to try.) As you see here, Copy/Paste has a few problems, just as all the other solutions have their own problems.
Edit: Something I hadn't clarified (sorry @cas): the solutions are used with the output of the script command.