Extract specific columns from delimited file using Awk

Question

Sorry if this is too basic. I have a csv file where the columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do: awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv. But what if I have to extract, say, columns 1 to 10, 20 to 25, and 30, 33? As an addendum, is there any way to extract directly with the header names rather than with column numbers?

Cliff · Accepted Answer · 2011-10-22 03:09:18Z

78

I don't know if it's possible to do ranges in awk. You could do a for loop, but you would have to add handling to filter out the columns you don't want. It's probably easier to do this:

awk -F, '{OFS=",";print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33}' infile.csv > outfile.csv

Something else to consider, which is faster and more concise:

cut -d "," -f1-10,20-25,30,33 infile.csv > outfile.csv

As to the second part of your question, I would probably write a script in Perl that knows how to handle header rows, parsing the column names from stdin or a file and then doing the filtering. It's probably a tool I would want to have for other things. I am not sure about doing it in a one-liner, although I am sure it can be done.

edited Oct 22, 2011 at 3:09

answered Oct 22, 2011 at 3:00

Cliff

3 Comments

user702432 Over a year ago

Many thanks. Cut is what I need, I guess. This wouldn't work with the headers, by any chance?

ObscureRobot Over a year ago

I was just about to suggest cut, but Cliff got here first.

Tom Morris Over a year ago

Note that in the general case of a CSV file with quoted strings, you can have non-delimiting commas in the data fields, which will cause the cut & awk solutions to fail.

Brian Canada · Accepted Answer · 2016-03-04 11:49:15Z

As mentioned by @Tom, the cut and awk approaches don't work for CSVs with quoted strings. An alternative is a Python module that provides the command-line tool csvfilter. It works like cut, but properly handles CSV column quoting:

csvfilter -f 1,3,5 in.csv > out.csv

If you have Python (and you should), you can install it like this:

pip install csvfilter

Please take note that the column indexing in csvfilter starts with 0 (unlike awk, which starts with $1). More info at https://github.com/codeinthehole/csvfilter/
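
To see the quoting problem concretely, here is a small sketch using made-up sample data (the `sample` function and its contents are purely illustrative). Field 2 of the data row is a quoted string that contains a comma, so a plain split on commas cuts it in half:

```shell
# Hypothetical sample data: the second field contains a comma inside quotes.
sample() {
  printf 'id,name,city\n1,"Smith, John",Boston\n'
}

# Naive comma splitting treats the quoted comma as a field delimiter.
sample | cut -d, -f2
sample | awk -F, '{ print $2 }'
```

Both commands print `name` for the header row and then `"Smith` for the data row, rather than the full quoted name. A CSV-aware tool is needed to get the whole field.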

shellter · Accepted Answer · 2011-10-22 03:05:34Z

4

Other languages have shortcuts for ranges of field numbers, but awk does not; you'll have to write your code as you fear ;-)

awk -F, 'BEGIN {OFS=","} { print $1, $2, $3, $4 ..... $30, $33}' infile.csv > outfile.csv

There is no direct function in awk to use field names as column specifiers.

I hope this helps.

answered Oct 22, 2011 at 3:05

shellter

1 Comment

user702432 Over a year ago

It does indeed. Thanks for the confirmation :-(

Raymond Hettinger · Accepted Answer · 2011-10-22 06:11:47Z

4

You can use a for-loop to address a field with $i:

ls -l | awk '{for(i=3 ; i<8 ; i++) {printf("%s\t", $i)} print ""}'
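
The same loop idea can be applied to the CSV case from the question. This is a minimal sketch, assuming a simple CSV with no quoted commas; the `select_cols` wrapper and the `keep()` helper are illustrative names, not part of any standard tool:

```shell
# Print columns 1-10, 20-25, 30 and 33 of a comma-separated stream.
# Assumes no quoted fields containing commas.
select_cols() {
  awk -F, -v OFS=, '
    # keep(i) is true for the column numbers we want.
    function keep(i) {
      return (i >= 1 && i <= 10) || (i >= 20 && i <= 25) || i == 30 || i == 33
    }
    {
      out = ""
      n = 0
      for (i = 1; i <= NF; i++) {
        if (keep(i)) {
          # Prepend the separator for every field after the first kept one.
          out = (n++ ? out OFS : "") $i
        }
      }
      print out
    }
  '
}

# Usage: feed it one generated line holding the numbers 1..35.
awk 'BEGIN { for (i = 1; i <= 35; i++) printf "%s%s", i, (i < 35 ? "," : "\n") }' | select_cols
```

Building the output string by hand avoids the trailing separator that a bare `printf("%s,", $i)` loop would leave at the end of each line.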

answered Oct 22, 2011 at 6:11

Raymond Hettinger

Ritesh · Accepted Answer · 2011-10-25 04:40:28Z

3

Others have answered your earlier question. For this:

As an addendum, is there any way to extract directly with the header names rather than with column numbers?

I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.

for (i = 1; i <= NF; i++) {   # fields run from 1 to NF; $NF would be the last field's value
    hash[$i] = i;
}

Then later on, use it:

j = hash["header1"];
print $j;
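
Putting these two fragments together, a minimal sketch of selecting by header name might look like the following. The `select_by_name` wrapper and the `cols` variable are illustrative names, and the script assumes a simple CSV without quoted commas:

```shell
select_by_name() {
  # $1: comma-separated list of header names to keep, in output order.
  awk -F, -v OFS=, -v cols="$1" '
    NR == 1 {
      # Map each header name to its field number.
      for (i = 1; i <= NF; i++) hash[$i] = i
      # Resolve the requested names to field numbers once.
      n = split(cols, want, ",")
      for (k = 1; k <= n; k++) {
        if (!(want[k] in hash)) {
          print "unknown column: " want[k] > "/dev/stderr"
          exit 1
        }
        idx[k] = hash[want[k]]
      }
    }
    {
      # Runs for the header row too, so the output keeps a header.
      out = ""
      for (k = 1; k <= n; k++) out = (k > 1 ? out OFS : "") $(idx[k])
      print out
    }
  '
}

# Usage with made-up data:
printf 'name,sex,height\narthur,m,181\nberta,f,163\n' | select_by_name name,height
```

Because the names are resolved against the first line, the columns can be requested in any order, for example `select_by_name height,name`.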

answered Oct 25, 2011 at 4:40

Ritesh

stefan.schroedl · Accepted Answer · 2015-04-04 07:52:40Z

2

Tabulator is a set of unix command line tools to work with csv files that have header lines. Here is an example to extract columns by name from a file test.csv:

name,sex,house_nr,height,shoe_size
arthur,m,42,181,11.5
berta,f,101,163,8.5
chris,m,1333,175,10
don,m,77,185,12.5
elisa,f,204,166,7

Then tblmap -k name,height test.csv produces

name,height
arthur,181
berta,163
chris,175
don,185
elisa,166

answered Apr 4, 2015 at 7:52

stefan.schroedl

Community · Accepted Answer · 2017-05-23 12:17:44Z

If Perl is an option:

perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'

-a autosplits line into @F fields array. Indices start at 0 (not 1 as in awk)
-F, field separator is ,

If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print (join ",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'

I provided more explanation within my answer here: parse csv file using gawk

Samar · Accepted Answer · 2016-10-25 18:39:27Z

1

This doesn't use awk, but the simplest way I found to get this done was csvtool. I had other use cases for csvtool as well, and it handles quotes and delimiters correctly when they appear within the column data itself.

csvtool format '%(2)\n' input.csv
csvtool format '%(2),%(3),%(4)\n' input.csv

Replacing 2 with the column number will effectively extract the column data you are looking for.

answered Oct 25, 2016 at 18:39

Samar

A. K. · Accepted Answer · 2024-06-06 13:48:17Z

You can pass the list of fields you want into awk from outside with -v.

For example, using GNU Awk 3.1.7, this code:

echo -e "one,two,three,four,five\none,two,three,four,five" | awk -F"," -v kfields="1_3_5" '
BEGIN {
   arrayMax=split(kfields, arrKeys, "_");
}
{
   outString="";
   for (idx = 1; idx <= arrayMax ; idx++) {
     outString=outString$arrKeys[idx];
   }
   print "outString:"outString;
   print "-----------";
}
'

This outputs only the fields you specify. For example, this value selects fields 1, 3, and 5:

kfields="1_3_5"

The output (the selected fields are concatenated with no separator):

outString:onethreefive
-----------
outString:onethreefive
-----------
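
A variation on the same idea, offered as a sketch rather than a drop-in tool: the field list is passed as a `cut`-style spec such as `1-3,5`, the ranges are expanded in `BEGIN`, and the selected fields are joined with `OFS` so the output stays comma-delimited. The `pick_fields` wrapper and the `spec` variable are hypothetical names, and quoted commas are not handled:

```shell
pick_fields() {
  # $1: field spec such as "1-3,5" (single numbers or closed ranges LO-HI).
  awk -F, -v OFS=, -v spec="$1" '
    BEGIN {
      nparts = split(spec, parts, ",")
      count = 0
      for (p = 1; p <= nparts; p++) {
        # A part is either "N" or "LO-HI".
        m = split(parts[p], bounds, "-")
        lo = bounds[1] + 0
        hi = (m == 2 ? bounds[2] : bounds[1]) + 0
        for (f = lo; f <= hi; f++) fields[++count] = f
      }
    }
    {
      out = ""
      for (k = 1; k <= count; k++) out = (k > 1 ? out OFS : "") $(fields[k])
      print out
    }
  '
}

# Usage:
printf 'one,two,three,four,five\n' | pick_fields 1-3,5
```

For the columns in the question, the call would be `pick_fields 1-10,20-25,30,33 < infile.csv > outfile.csv`.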
