Find text between tab (\t) as a delimiter

Question

I thought this would be simple, but I can't work out how to do it.

Scenario

I have a single .csv file with the columns id_user, text and id_group, where the columns are delimited by tabs, like this:

"123456789"        "Here's the field of the text, also contains comma"        "10"
"987456321"        "Here's the field of the text, also contains comma"        "10"
"123654789"        "Here's the field of the text, also contains comma"        "11"
"987456123"        "Here's the field of the text, also contains comma"        "11"

How do I extract the text field?

Attempt

awk

I was looking for a way to tell awk which delimiter to use for print $n. If such an option existed, I imagine it would look like this:

$ awk -d '\t' '{print $2}' file.csv | sed -e 's/"//gp'

where -d would set the delimiter for print, and sed removes the double quotes.

See stackoverflow.com/questions/5374239/tab-separated-values-in-awk
– lese, Commented Sep 8, 2015 at 15:09
@Thor please update your answer with this option so I can accept it; this is what I was looking for.
– tachomi, Commented Sep 8, 2015 at 16:00
@tachomi: You could also use the double-quotes as the delimiter. See my answer.
– Thor, Commented Sep 9, 2015 at 9:11

Thor · Accepted Answer · 2015-09-09 09:10:34Z

TAB delimiter

cut

You do not need sed or awk; a simple cut will do, since TAB is cut's default delimiter:

cut -f2 infile

awk

If you want to use awk, supply the delimiter either with the -F option or as an FS= assignment placed after the program:

awk -F '\t' '{ print $2 }' infile

Or:

awk '{ print $2 }' FS='\t' infile

Output in all cases:

"Here's the field of the text, also contains comma"
"Here's the field of the text, also contains comma"
"Here's the field of the text, also contains comma"
"Here's the field of the text, also contains comma"
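The three commands above can be compared side by side with a minimal sketch. The sample file here is a hypothetical two-line stand-in for the question's data, built with printf so that the separators are guaranteed to be real tab characters.

```shell
# Hypothetical sample file; printf expands \t in the format into real tabs
# and reuses the format for each group of three arguments.
infile=$(mktemp)
printf '%s\t%s\t%s\n' \
    '"123456789"' "\"Here's the field of the text, also contains comma\"" '"10"' \
    '"123654789"' "\"Here's the field of the text, also contains comma\"" '"11"' \
    > "$infile"

by_cut=$(cut -f2 "$infile")                         # TAB is cut's default delimiter
by_awk_F=$(awk -F '\t' '{ print $2 }' "$infile")    # -F sets FS before the program runs
by_awk_FS=$(awk '{ print $2 }' FS='\t' "$infile")   # FS= operand, applied before infile is read

printf '%s\n' "$by_cut"
rm -f "$infile"
```

All three variables should hold the same two quoted lines. The commas inside the text do not matter, because only the tab is treated as a separator.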

Quote delimiter

If the double-quotes in the file are consistent, i.e. no embedded double-quotes in fields, you could use them as the delimiter and avoid having them in the output, e.g.:

cut

cut -d\" -f4 infile

awk

awk -F\" '{ print $4 }' infile

Output in both cases:

Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
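Why field 4 and not field 2 may not be obvious, so here is a small sketch that numbers every piece produced by splitting on the double quote. The sample lines are hypothetical; the second one deliberately contains embedded quotes to show the caveat mentioned above.

```shell
# Hypothetical one-line sample in the question's layout (real tabs via printf).
line=$(printf '%s\t%s\t%s' '"123456789"' "\"Here's the text, with a comma\"" '"10"')

# Number every piece that splitting on the double quote produces.
pieces=$(printf '%s\n' "$line" |
    awk -F '"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$pieces"

# Piece 4 is the text: 1 is the empty string before the first quote,
# 2 is the id, and 3 is the tab between the first two columns.
text=$(printf '%s\n' "$line" | cut -d '"' -f4)

# With embedded quotes the numbering shifts and field 4 is cut short.
truncated=$(printf '%s\t%s\t%s\n' '"1"' '"say "hi" now"' '"10"' | cut -d '"' -f4)
```

Here text holds the full sentence, while truncated holds only the part of the second sample up to its first embedded quote.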

heemayl · Accepted Answer · 2015-09-08 15:08:45Z

You can use grep with PCRE (-P):

grep -Po '\s"\K[^"]+(?="\s)' file.txt

\s" matches any whitespace character followed by a "; \K then drops what has been matched so far from the reported match
[^"]+ matches the desired portion between the two double quotes
(?="\s) is a zero-width positive lookahead ensuring that the matched portion is followed by a " and a whitespace character

Example:

$ grep -Po '\s"\K[^"]+(?="\s)' file.txt 
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
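The -P option is a GNU grep extension and is not available everywhere. As a swapped-in alternative (not part of the answer above), a plain sed substitution can capture the same text. This is only a sketch: it assumes the text is always the second tab-separated column and contains no embedded double quotes, and the sample file is hypothetical.

```shell
# Hypothetical one-line sample in the question's layout (real tabs via printf).
infile=$(mktemp)
printf '%s\t%s\t%s\n' '"123456789"' "\"Here's the text, with a comma\"" '"10"' > "$infile"

tab=$(printf '\t')   # literal TAB, since \t inside a sed regex is not portable

# Skip the first column, then capture what sits between the quotes of the second.
text=$(sed -n "s/^[^${tab}]*${tab}\"\([^\"]*\)\"${tab}.*/\1/p" "$infile")
printf '%s\n' "$text"
rm -f "$infile"
```

Because of -n and the p flag, lines that do not fit the pattern are skipped rather than printed unchanged.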

tachomi · Accepted Answer · 2015-09-08 16:06:22Z

To specify the tab as the delimiter:

$ awk -F '\t' '{print $2}' file.csv

To strip the unwanted double quotes:

$ awk -F '\t' '{print $2}' file.csv | sed 's/"//g'

Another option, using awk -F with the double quote as the delimiter:

$ awk -F '"' '{print $4}' file.csv
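The awk and sed pipeline can also be folded into a single awk call by deleting the quotes with gsub before printing. A minimal sketch, using a hypothetical one-line sample file:

```shell
# Hypothetical one-line sample in the question's layout (real tabs via printf).
infile=$(mktemp)
printf '%s\t%s\t%s\n' '"123456789"' "\"Here's the text, with a comma\"" '"10"' > "$infile"

# gsub deletes every double quote inside field 2 before it is printed.
text=$(awk -F '\t' '{ gsub(/"/, "", $2); print $2 }' "$infile")
printf '%s\n' "$text"
rm -f "$infile"
```

Like sed 's/"//g', this removes all double quotes in the field, including any embedded ones.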


Sobrique · Accepted Answer · 2015-09-08 16:34:31Z

I would use perl for this, because Text::CSV is really good for handling non-trivial CSV (e.g. involving quotes):

#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

open ( my $input, '<', "file.csv" ) or die $!;   
my $csv = Text::CSV -> new ( { binary => 1, 
                               sep_char => "\t", } );

while ( my $row = $csv -> getline ( $input ) ) {
    print $row -> [1],"\n";
}
close ( $input );

Prints:

Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma
Here's the field of the text, also contains comma

Dhiren Dash · Accepted Answer · 2015-09-08 17:33:51Z

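Your sed idea is right, although the p flag should be dropped: without -n it prints every changed line twice. For the delimiter, you can either use awk -F '\t' or the following: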

awk 'BEGIN{FS="\t"} {print $2}' file.csv | sed 's/"//g'

Or, if you do not want to use sed, pipe the output of the first awk into a second awk that uses '"' as the field delimiter and prints the second field:

awk 'BEGIN{FS="\t"} {print $2}' file.csv | awk -F "\"" '{print $2}'
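Both pipelines above assume the text has no embedded double quotes. If it might, one possible variation is to strip only the enclosing pair with two anchored sub calls. This is a sketch on a hypothetical sample line, not something the question's data requires:

```shell
# Hypothetical line whose text column contains embedded double quotes.
line=$(printf '%s\t%s\t%s' '"1"' '"say "hi" now"' '"10"')

# Remove only the leading and the trailing quote of field 2.
inner=$(printf '%s\n' "$line" |
    awk -F '\t' '{ f = $2; sub(/^"/, "", f); sub(/"$/, "", f); print f }')

# For comparison, the sed pipeline removes every quote in the field.
stripped=$(printf '%s\n' "$line" | awk 'BEGIN{FS="\t"} {print $2}' | sed 's/"//g')

printf '%s\n%s\n' "$inner" "$stripped"
```

Here inner keeps the quotes around hi, while stripped loses them.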
