Sorry if this is too basic. I have a csv file where the columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do: awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv. But what if I have to extract, say, columns 1 to 10, 20 to 25, and 30, 33? As an addendum, is there any way to extract directly with the header names rather than with column numbers?
9 Answers
I don't know if it's possible to do ranges in awk. You could do a for loop, but you would have to add handling to filter out the columns you don't want. It's probably easier to do this:
awk -F, '{OFS=",";print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33}' infile.csv > outfile.csv
something else to consider - and this faster and more concise:
cut -d "," -f1-10,20-25,30-33 infile.csv > outfile.csv
As to the second part of your question, I would probably write a script in perl that knows how to handle header rows, parsing the columns names from stdin or a file and then doing the filtering. It's probably a tool I would want to have for other things. I am not sure about doing in a one liner, although I am sure it can be done.
3 Comments
cut, but Cliff got here first.As mentioned by @Tom, the cut and awk approaches actually don't work for CSVs with quoted strings. An alternative is a module for python that provides the command line tool csvfilter. It works like cut, but properly handles CSV column quoting:
csvfilter -f 1,3,5 in.csv > out.csv
If you have python (and you should), you can install it simply like this:
pip install csvfilter
Please take note that the column indexing in csvfilter starts with 0 (unlike awk, which starts with $1). More info at https://github.com/codeinthehole/csvfilter/
Comments
Other languages have short cuts for ranges of field numbers, but not awk, you'll have to write your code as your fear ;-)
awk -F, 'BEGIN {OFS=","} { print $1, $2, $3, $4 ..... $30, $33}' infile.csv > outfile.csv
There is no direct function in awk to use field names as column specifiers.
I hope this helps.
1 Comment
Others have answered your earlier question. For this:
As an addendum, is there any way to extract directly with the header names rather than with column numbers?
I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.
for(i=0;i<$NF;i++){
hash[$i] = i;
}
Then later on, use it:
j = hash["header1"];
print $j;
Comments
Tabulator is a set of unix command line tools to work with csv files that have header lines. Here is an example to extract columns by name from a file test.csv:
name,sex,house_nr,height,shoe_size
arthur,m,42,181,11.5
berta,f,101,163,8.5
chris,m,1333,175,10
don,m,77,185,12.5
elisa,f,204,166,7
Then tblmap -k name,height test.csv produces
name,height
arthur,181
berta,163
chris,175
don,185
elisa,166
Comments
If Perl is an option:
perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'
-a autosplits line into @F fields array. Indices start at 0 (not 1 as in awk)
-F, field separator is ,
If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print (join ",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'
I provided more explanation within my answer here: parse csv file using gawk
Comments
Not using awk but the simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
csvtool format '%(2),%(3),%(4)\n' input.csv
Replacing 2 with the column number will effectively extract the column data you are looking for.
Comments
You can pass the columns whose values you want to use from outside of AWK.
Eg, using GNU Awk 3.1.7
This Code
echo -e "one,two,three,four,five\none,two,three,four,five" | awk -F"," -v kfields="1_3_5" '
BEGIN {
arrayMax=split(kfields, arrKeys, "_");
}
{
outString="";
for (idx = 1; idx <= arrayMax ; idx++) {
outString=outString$arrKeys[idx];
}
print "outString:"outString;
print "-----------";
}
'
This will output only the columns you specify eg these values mean only output fields 1 and 3 and 5.
kfields="1_3_5"
Eg the output
outString:onethreefive
-----------
outString:onethreefive
-----------