How to remove newline character within a double-quoted column in a tsv file?

Question

I need "THIS CONTAIN NEWLINE" in one row (newline from the column2 and column4 need to remove).

"column1"   "column2"          "column3"    "column4"           " column5"
"DATA"       "THIS                 "DATA"   "THIS                "DATA"
             CONTAIN NEWLINE"                 CONTAIN NEWLINE"

Expected output:

"column1"   "column2"                "column3"  "column4"                " column5"
"DATA"      "THIS CONTAIN NEWLINE"    "DATA"    "THIS CONTAIN NEWLINE"     "DATA"

the number of \t (tab) characters are consistent? for example there is exactly one \t before CONTAIN NEWLINE"? — goteguru
– goteguru, Commented Apr 4, 2018 at 13:20
It's not clear how you can differentiate between new data line and \n in the first column. — goteguru
– goteguru, Commented Apr 4, 2018 at 13:23
if the first column contains a\nb\nc\n how do you know the lines starting with b and c are not separate data lines with missing fields? Or there can be no \n in the first column? — goteguru
– goteguru, Commented Apr 4, 2018 at 13:29
Sorry, but the file format is still unclear to me. If this really is a "TSV" (Tab-Separated-Values) file, then it seems unlikely the columns in your sample data would look so neatly aligned. Are you sure this isn't a format with fixed-width columns (even if those widths are different across different files)? The typical tools for reading / writing files like that in Perl are unpack resp. printf. — haukex
– haukex, Commented Apr 4, 2018 at 19:01

Kusalananda · Accepted Answer · 2018-04-06 10:10:27Z

What makes this text-processing issue difficult or unusual is that the columns have to be processed individually.

This is a bit of a hack, but it seems to do the job:

#!/bin/sh

rm -f newfile
for column in 1 2 3 4 5; do
        cut -f "$column" file |
        perl -ne 'chomp;$nl = ((tr /"/"/ % 2 == 0) ? "\n" : " "); print $_, $nl' |
        sed -e 's/[[:blank:]]*$//' -e '/^[[:blank:]]*$/d' |
        { if [ -f newfile ]; then
                paste newfile -
          else
                cat
          fi
        } >newfile.tmp
        mv newfile.tmp newfile
done

The script assumes that the input file is called file and it will create a file called newfile (and using the filename newfile.tmp for temporary data). It further assumes that the columns are properly tab-delimited.

It extracts the tab-delimited columns, one by one, from the original file with cut. Each individual column is passed through a short Perl script:

chomp;
$nl = ( ( tr /"/"/ % 2 == 0 ) ? "\n" : " " );
print $_, $nl;

This counts the number of double quotes on each line and will output the line with an appended newline if the line contains an even number of double quotes. If the number of quotation marks is odd, it will instead append a space character at the end of the line (thus merging quoted strings that span lines). This is a hackish way of doing this.

The sed will do some cleanup, removing trailing blank space from the end of lines and deleting empty lines.

I then paste this new data as a new tab-delimited column in newfile (by first outputting to newfile.tmp and then renaming this file). The cat is run only for the very first column when newfile does not yet exist.

With the given input data, assuming that the columns are properly tab-delimited, this will produce the following tab-delimited file:

"column1"       "column2"       "column3"       "column4"       " column5"
"DATA"  "THIS CONTAIN NEWLINE"  "DATA"  "THIS CONTAIN NEWLINE"  "DATA"

With the tabs replaced by pipe-symbols (to show where the columns start and end):

$ tr '\t' '|' <newfile
"column1"|"column2"|"column3"|"column4"|" column5"
"DATA"|"THIS CONTAIN NEWLINE"|"DATA"|"THIS CONTAIN NEWLINE"|"DATA"

Praveen Kumar BS · Accepted Answer · 2022-06-01 08:25:42Z

0

co=$(awk '/^[[:space:]]/{print $0}' filename |perl -pne 's/"/\n/g' |sed -r -e 's/^\s+//g' -e '/^$/d'|awk '{if(!seen[$0]++)print }')

awk -v co="$co" 'NR==1{print }/DATA/{$2=$2" "co"\"";$4=$4" "co"\"";print $0}' filename

output

"column1"   "column2"          "column3"    "column4"           " column5"
"DATA" "THIS CONTAIN NEWLINE" "DATA" "THIS CONTAIN NEWLINE" "DATA"

edited Jun 1, 2022 at 8:25

answered Jun 1, 2022 at 8:14

Praveen Kumar BS

5,3212 gold badges12 silver badges16 bronze badges

Add a comment |

jubilatious1 · Accepted Answer · 2022-06-01 15:45:07Z

Using Raku (formerly known as Perl_6)

~$ raku -e 'my @a = [Z] lines.map(*.split: "\t");                 \
            @a.=map: *.join(" "); @a.=map: *.comb(/\" .+?  \"/);  \ 
            $_.join("\t").put for [Z] @a;'   file

[Nota bene: the file supplied by the OP more resembles a groff outputted table than a tsv file. If this was a tsv file, the internal \n in row2/column2 would push row2 data for columns 3,4,5 to the next line].

This solution assumes a substantially cleaned-up input file. Table text has been shortened ("INTERNAL\nNEWLINE" instead of "THIS\nCONTAIN NEWLINE") to visualize the columns better (the code still works on fields with internal whitespace, however). Finally, the third (overflow) line--containing the 'continuation' of columns 2 and 4--has inserted "" column placeholders as described generally by the OP. This ensures that each column has an entry per row, even if it's an empty string.

Briefly, in the first statement for each line, columns are split on \t tabs, then [Z] zip-reduced to transpose rows and columns. The result is assigned to @a array. In the second statement, each row (formerly column) is joined together into a single string, reconstituting the formerly 'broken' double-quoted data in columns 2 and 4.

In the third statement, each row (formerly column) is combed, i.e. broken around double-quoted text, returning 2 elements per row. This 5 x 2 row/column array is then again [Z] zip-reduced (i.e. transposed) to 2 x 5 row/column array, columns are joined via \t tabs, and output.

Sample Input (tab-separated columns):

"column1"   "column2"   "column3"   "column4"   "column5"
"DATA1" "INTERNAL   "DATA3" "INTERNAL   "DATA5"
""  NEWLINE"    ""  NEWLINE"    ""

Sample Output (tab-separated columns):

"column1"   "column2"   "column3"   "column4"   "column5"
"DATA1" "INTERNAL NEWLINE"  "DATA3" "INTERNAL NEWLINE"  "DATA5"

There are two simple ways to better visualize the output columns above: the first is to join on | pipe characters as @Kusalananda did. Replacing the last join("\t") by join("|") instead returns the following:

"column1"|"column2"|"column3"|"column4"|"column5"
"DATA1"|"INTERNAL NEWLINE"|"DATA3"|"INTERNAL NEWLINE"|"DATA5"

The second (and possibly more informative way is to call .raku or .perl on the output, which gives a 'data-dumper'-like return (i.e. a peek at Raku's internal representation of the data). Thus a last statement consisting of .raku.put for [Z] @a; returns the following (note escaped double-quotes):

("\"column1\"", "\"column2\"", "\"column3\"", "\"column4\"", "\"column5\"")
("\"DATA1\"", "\"INTERNAL NEWLINE\"", "\"DATA3\"", "\"INTERNAL NEWLINE\"", "\"DATA5\"")

https://raku.org

Stack Exchange Network

How to remove newline character within a double-quoted column in a tsv file?

3 Answers 3

You must log in to answer this question.

Hot Network Questions

How to remove newline character within a double-quoted column in a tsv file?

3 Answers 3

You must log in to answer this question.

Related

Hot Network Questions