How to remove double quotes within the double-quoted field values in .dat file

Question

I have a text file that has around 15 columns. The fields are separated by commas. One column, the description, is double-quoted and also contains some words that are themselves double-quoted. I need to retain the beginning and ending double quotes and remove only the inner double quotes.

Something like this:

"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"

I need output as:

"Hi there, we are from XYZ team, we have an Opportunity at our organization"

I don't want to go for Python programming. I was looking for an awk command or any other best option.

The file might have 100 lines of data, but the description column contains double-quoted words on only a few of those lines, not all 100.

Here is some sample data:

invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number

1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,279.6,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

I have to remove double quotes for "Material Included" in the line description.

Please note: I need the entire file with all columns retained; only the inner double quotes in the line description value should be removed. Only the line description field has such inner double-quoted values. So far we have seen only one inner double-quoted word per line description in the file, never more than one.

Could you please show a complete line from your input data, including all fields? Does the file also contain a header line? Also clarify whether you still want to output all other fields or whether you want to extract only the modified variant of this text field.
– Kusalananda ♦, Commented May 3, 2024 at 3:34
I used this: awk -F ',' -v OFS=',' '{ for (i=2; i<=NF; i+=2) gsub(/"/, "", $i) } 1' DATA_IN.dat > DATA_IN_final.dat. Now the line description field keeps its starting quote and the inner double quotes are removed, but the ending quote is also removed on a few of the lines, though not on all of them. Is there any way to add the ending quote to the lines that lack it?
– Mythri, Commented May 3, 2024 at 11:32
If you can have more than 1 quoted field per line then we'd need more information in the question to tell us how to distinguish quotes within fields from quotes around fields, because from what's currently shown in the question we couldn't tell if "foo","bar" is a single field that contains 2 quotes and a comma or 2 fields separated by a comma.
– Ed Morton, Commented May 3, 2024 at 21:07
I understand you've stated your preference, but you have to anticipate things like embedded newlines within fields, etc. Such considerations mean you should try to get your file into RFC 4180 format, via the use of a dedicated CSV parser. Best Regards.
– jubilatious1, Commented May 6, 2024 at 5:16

Kusalananda · Accepted Answer · 2024-05-03 23:23:44Z

Note: I'm not using the provided data from the question as the number of header fields does not seem to match up with the number of data fields. Instead I use printf to create a simple data set with the same quoting issue as described in the question.

Using Miller (mlr) as shown below, you will be able to convert the problematic embedded double quotes into properly CSV-encoded embedded double quotes. This includes doubling up each embedded double-quote character:

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"'
a,b,c
aaa,"bb "bb" bb","c"cc"
$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes cat
a,b,c
aaa,"bb ""bb"" bb","c""cc"

This would create a CSV document that any CSV-aware parser would be able to read correctly, preserving the embedded quotes.
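If Miller is not available, the same doubling can be approximated with plain awk. This is only a rough sketch under a much stronger assumption than Miller makes: each line holds at most one quoted field (so it would not handle the two quoted fields in the printf data above). The function name escape_inner_quotes and the one-quoted-field sample line are made up for illustration.

```shell
# Sketch: double the embedded quotes (CSV-style) with plain awk.
# Assumption: at most ONE quoted field per line.
# escape_inner_quotes is a hypothetical helper name.
escape_inner_quotes() {
    awk '
        match($0, /".*"/) {                              # first quote through last quote
            fld = substr($0, RSTART + 1, RLENGTH - 2)    # text between the outer quotes
            gsub(/"/, "\"\"", fld)                       # double every remaining quote
            $0 = substr($0, 1, RSTART) fld substr($0, RSTART + RLENGTH - 1)
        }
        { print }
    '
}

escaped=$(printf '%s\n' a,b,c 'aaa,"bb "bb" bb",ccc' | escape_inner_quotes)
printf '%s\n' "$escaped"
```

Lines without a pair of quotes, such as the header, pass through unchanged.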

To completely remove the embedded double quotes, you may use Miller like so:

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
a,b,c
aaa,bb bb bb,ccc

This uses mlr to iterate over all fields in all records and to remove any double-quote character found.

If a field needs quoting due to containing a comma, then Miller will quote it:

$ printf '%s\n' a,b,c 'aaa,"b,b "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
a,b,c
aaa,"b,b bb bb",ccc

The Miller command again, but by itself:

mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'

If you know the name of the field that contains the quotes that you wish to remove, e.g. line description, then you may simplify the command and remove the loop:

mlr --csv --lazy-quotes put '$["line description"] = gssub($["line description"], "\"", "")'

jubilatious1 · Accepted Answer · 2024-05-05 05:38:08Z

Using Raku (formerly known as Perl_6)

~$ raku -MText::CSV  -e 'my @a = csv(in => $*IN, sep => ",", escape_char => "", allow_loose_quotes => 1); csv(in => @a, out => $*OUT);'  < file

Raku is a programming language in the Perl-family that provides high-level support for Unicode. While the module ecosystem is still small, you do have Raku's powerful Text::CSV module available to you. To provide some provenance: Raku's Text::CSV module is primarily written by a longtime author/maintainer of Perl's Text::CSV_XS module (H. Merijn Brand, personal communication).

The answer above closely parallels the excellent Perl answer by @kos. The code returns the same canonical CSV result in accordance with RFC 4180.

The data is read into @a array,
The escape_char is set to empty-string,
Loose quotes are allowed (allow_loose_quotes).

Sample Input:

invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

Sample Output:

"invoice number","invoice date","vendor number","vendor site ID","supplier site CODE","invoice description","invoice currency code","invoice total amount","line number","line amount","line description","account code","business unit","business center",department,"issue code",project,"task number"
1686,2024-03-28,258,9845,NEWYORK,"CA Project: Content",USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research ""Material Included""  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

For further details, see the first and second links below.

https://raku.land/zef:Tux/Text::CSV
https://github.com/Tux/CSV/blob/master/doc/Text-CSV.md
https://raku.org

kos · Accepted Answer · 2024-05-05 11:47:21Z

Rather than removing the unescaped double quotes from the input (I guess it'd be better if they stayed), you could convert the malformed file into a proper and standard "double-quotes-escaped-double-quotes-characters" (I apologize) CSV, where doubling double quotes ("") is used as a means to escape them inside quoted text fields.

This can be done auto-magically for the whole file, without having to address specific rows / fields, using Perl's Text::CSV module. It is not installed by default on either Ubuntu or openSUSE Tumbleweed (sudo apt install libtext-csv-perl / sudo zypper in perl-Text-CSV), but it's a very standard module that should be available in most / all Linux distros, and it can be installed via CPAN on any system lacking it.

perl -M'Text::CSV qw(csv)' -e '
    csv(
        in => csv(
            in => "in",
            allow_loose_quotes => 1,
            escape_char => undef(),
        )
    )
'

What this does is:

It opens a file named "in" and reads it as CSV without interpreting any character as an escape character for the default quote character ("). This is the trick that lets the parser read " characters as regular characters inside the boundaries of a quoted text field.
Combined with allow_loose_quotes, this tells the parser not to complain when it reads a non-escaped quote character inside a text field, ultimately forcing it to read the contents of text fields verbatim.
An output CSV is then generated using standard options (which include quoting text fields and doubling double quotes inside text fields when needed) and printed to STDOUT.

% cat in
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number

1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10
% perl -M'Text::CSV qw(csv)' -e '
        csv(
                in => csv(
                        in => "in",
                        escape_char => undef(),
                        allow_loose_quotes => 1,
                )
        )
'
"invoice number","invoice date","vendor number","vendor site ID","supplier site CODE","invoice description","invoice currency code","invoice total amount","line number","line amount","line description","account code","business unit","business center",department,"issue code",project,"task number"

1686,2024-03-28,258,9845,NEWYORK,"CA Project: Content",USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research ""Material Included""  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

dbran · Accepted Answer · 2024-05-05 15:00:39Z

You could try using sed and its branching feature, which gives you more control over when to make a substitution:

#!/bin/sh

regex='"([^"]*)"([^"]*)"'
replace='"\1\2"'

sed -E ":x ; s/$regex/$replace/ ; tx" file.txt

Or directly from the command line:

$ sed -E ':x ; s/"([^"]*)"([^"]*)"/"\1\2"/ ; tx' file.txt

As long as there is a match, tx will cause the process flow to jump back to :x, resulting in the substitution command s/... being executed again. In this case it will remove one inner quote at a time until there is no match anymore. If the input contained three inner quotes, as in "ab"""c", the substitution would succeed three times and the output would be "abc".

Note, however, that this solution won't give the expected result if there is more than one quoted field on the same line. It would effectively keep only the very first and last quote, so for example "a "b" c" "d "e" f" would result in "a b c d e f".

For more info, check out the GNU manual: 6.4 Branching and Flow Control.
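As a quick check of the behaviour described above, here is a small sketch that runs the same loop on the two example strings from this answer. The function name strip_inner is made up for illustration, and the sed commands are written one per line, which should also be accepted by sed implementations that do not allow a semicolon after a label.

```shell
# Sketch: the sed loop from this answer, applied to the two example strings.
# strip_inner is a hypothetical helper name.
strip_inner() {
    sed -E ':x
s/"([^"]*)"([^"]*)"/"\1\2"/
tx'
}

# Three inner quotes are removed one per pass.
one=$(printf '%s\n' '"ab"""c"' | strip_inner)

# Two quoted fields on one line collapse into a single quoted string.
two=$(printf '%s\n' '"a "b" c" "d "e" f"' | strip_inner)

printf '%s\n' "$one" "$two"
```

The second result shows the caveat: only the first and last quote on the line survive.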

Ed Morton · Accepted Answer · 2024-05-05 12:42:52Z

If you can only have 1 quoted field at most per line then you could do the following using any awk:

$ awk '
    match($0,/".*"/) {                         # find the string from the first to the last `"` on this line
        fld = substr($0,RSTART+1,RLENGTH-2)    # save it in the variable `fld`
        gsub(/"/,"",fld)                       # remove all `"`s from it
        $0 = substr($0,1,RSTART) fld substr($0,RSTART+RLENGTH-1)   # piece `$0` back together replacing the original `fld` string with the modified one
    }
    { print }                                  # print $0
' file
"Hi there, we are from XYZ team, we have an Opportunity at our organization"
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

or this with any sed that interprets \n to mean newline (otherwise use \<literal newline> instead):

$ sed 's/"\(.*\)"/\n\1\n/; s/"//g; s/\n/"/g' file
"Hi there, we are from XYZ team, we have an Opportunity at our organization"
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

If you can have more than 1 quoted field per line then it's impossible to do this job robustly with any tool without additional information on how to identify quotes within fields vs around fields.

The above were run on this input file constructed from the sample lines in the question:

$ cat file
"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10
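
To illustrate the caveat about more than one quoted field per line, here is a small sketch that feeds the awk script a made-up line with two quoted fields. The line x,"foo","bar",y and the function name strip_inner_quotes are hypothetical and not taken from the question's data.

```shell
# Sketch: what the first-quote-to-last-quote approach does when a line
# holds TWO quoted fields. strip_inner_quotes is a hypothetical name.
strip_inner_quotes() {
    awk '
        match($0, /".*"/) {                              # spans BOTH quoted fields
            fld = substr($0, RSTART + 1, RLENGTH - 2)
            gsub(/"/, "", fld)                           # also removes the field-delimiting quotes
            $0 = substr($0, 1, RSTART) fld substr($0, RSTART + RLENGTH - 1)
        }
        { print }
    '
}

merged=$(printf '%s\n' 'x,"foo","bar",y' | strip_inner_quotes)
printf '%s\n' "$merged"
```

The two fields "foo" and "bar" come out as the single field "foo,bar", which is why the approach is only safe when at most one quoted field can occur per line.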

EDIT: if you really want to have the quoted string appear as a single field in $0 then here's one way you could make that happen, again assuming you only have 1 quoted string per record:

$ cat file
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

$ cat tst.awk
BEGIN { FS=OFS="," }
match($0,/".*"/) {
    fld = substr($0,RSTART,RLENGTH)
    $0 = substr($0,1,RSTART-1) "\"" substr($0,RSTART+RLENGTH)
    for ( i=1; i<=NF; i++ ) {
        if ( $i == "\"" ) {
            $i = fld
        }
    }
}
{
    for ( i=1; i<=NF; i++ ) {
        print i "\t" $i
    }
}

$ awk -f tst.awk file
1       1686
2       2024-03-28
3       258
4       9845
5       NEWYORK
6       CA Project: Content
7       USD
8       538
9       1
10      26
11
12      232130
13
14
15
16
17      2915
18      "Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork , and email. Communications with team website. Call, and communications."
19      230
20
21
22
23
24      295
25      10

Note that you cannot modify $0 as a whole after doing the above or awk will re-split $0 on commas again:

$ cat tst.awk
BEGIN { FS=OFS="," }
match($0,/".*"/) {
    fld = substr($0,RSTART,RLENGTH)
    $0 = substr($0,1,RSTART-1) "\"" substr($0,RSTART+RLENGTH)
    for ( i=1; i<=NF; i++ ) {
        if ( $i == "\"" ) {
            $i = fld
        }
    }
}
{
    $0 = $0    # even this will cause awk to re-split `$0` at `,`s
    for ( i=1; i<=NF; i++ ) {
        print i "\t" $i
    }
}

$ awk -f tst.awk file
1       1686
2       2024-03-28
3       258
4       9845
5       NEWYORK
6       CA Project: Content
7       USD
8       538
9       1
10      26
11
12      232130
13
14
15
16
17      2915
18      "Review new applications
19       and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork
20       and email. Communications with team website. Call
21       and communications."
22      230
23
24
25
26
27      295
28      10

It's important when using awk to understand the difference between these 2 ways of modifying the current record:

Modifying any field, e.g. $1, causes awk to reconstruct $0, replacing all FSs with OFSs but it does NOT re-split the record.
Modifying the record as a whole, i.e. $0, causes awk to re-split the record into fields separated by FSs but it does NOT reconstruct $0 replacing FSs with OFSs.

If you understand that then you'll understand this output, especially NF being 1 for the 4th output:

$ echo 'a,b,c' | awk -F',' -v OFS='@' '{$1=$1; print NF "\t" $0}'
3       a@b@c

$ echo 'a,b,c' | awk -F',' -v OFS='@' '{$0=$0; print NF "\t" $0}'
3       a,b,c

$ echo 'a,b,c' | awk -F',' -v OFS='@' '{$0=$0; $1=$1; print NF "\t" $0}'
3       a@b@c

$ echo 'a,b,c' | awk -F',' -v OFS='@' '{$1=$1; $0=$0; print NF "\t" $0}'
1       a@b@c

Thank you so much! The awk command worked for me. Can you please explain the code? I am trying to understand what each of those lines does. I would have played with the code, but right now I'm fully packed with a lot of development work. I tried to replace $0 with $11 (the line description column) but it did not work. Why?
– Mythri, Commented May 5, 2024 at 0:31
I added comments to the awk script. You seem to be missing the main point of my answer and comment: you can't use $11 because, given that "s and ,s can exist inside "-delimited fields, it is impossible for tools to tell where any field starts/stops using normal field splitting on ,s. So while awk will populate $11, it can be some substring inside a quoted field rather than a whole field.
– Ed Morton, Commented May 5, 2024 at 12:05
If you really want to have your quoted string exist as a single field in the record, I updated my answer to show one way to do it.
– Ed Morton, Commented May 5, 2024 at 12:07
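
The point made in these comments about $11 can be seen with a short sketch. The sample line is made up for illustration and is much shorter than the real data: it has five logical fields, one of them quoted and containing a comma.

```shell
# Sketch: plain comma splitting cuts through a quoted field.
# The sample line is hypothetical: 5 logical fields, the 4th is quoted.
line='1,26,,"Review new applications, and instruct",230'

# awk counts comma-separated pieces, not logical CSV fields.
nf=$(printf '%s\n' "$line" | awk -F',' '{ print NF }')

# Field 4 holds only the part of the quoted text before its first comma.
f4=$(printf '%s\n' "$line" | awk -F',' '{ print $4 }')

printf '%s\n' "NF=$nf" "field 4=$f4"
```

awk reports six fields rather than five, and the fourth one is only a fragment of the quoted text, so a field number cannot be relied on to address the whole description.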
