asked May 7, 2017 at 2:34

Deconstructing one line into two lines based on specific columns

I have a tsv file (batch_1.catalog.tags.tsv) consisting of over a million lines, each with 14 columns. I want to break each line into two.

The first line: starts with a greater than sign (>) followed by 8 of the 14 columns
The second line: only column 10

E.g.

>column3(a number) column4(numbers and letters) column5(a number) column6(- or +) column11(0 or 1) column12(0 or 1) column13(0 or 1) column14(0 or 1)       
column10(string with As,Ts,Gs,Cs, and sometimes Ns)

Here is an example line from the tsv file, the one whose third column (the ID) is 6:

0   1   6   gi|586799556|ref|NW_006530744.1|    141 +   consensus   0   1_33,14_43  CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC    0   0   0   0

This is what I would like:

>6 gi|586799556|ref|NW_006530744.1| 141 +  0 0 0 0        
CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC

However, I only want to do this to lines in the tsv file (batch_1.catalog.tags.tsv) that have a third-column number that matches the numbers in a different text file (whitelist.txt).

In the above example, the whitelist.txt file would contain the number 6, along with 8000+ other lines, each holding a different third-column number (i.e. ID).
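The transformation described above could be sketched as a single awk pass that loads the whitelist into memory and then filters the tsv. This is only a sketch under a few assumptions: whitelist.txt holds one ID per line with no trailing whitespace or Windows line endings, the tsv is uncompressed and strictly tab-delimited, and the function name `to_fasta` is made up for illustration.

```shell
# to_fasta WHITELIST TSV
# Hypothetical helper: prints a two-line record for every TSV line
# whose third column appears in WHITELIST (assumed: one ID per line).
to_fasta() {
    awk -F'\t' '
        # First file (the whitelist): remember each ID, then skip to the next line.
        NR == FNR { keep[$1]; next }
        # Second file (the tsv): act only on lines whose column 3 is whitelisted.
        ($3 in keep) {
            # Header line: ">" then columns 3, 4, 5, 6, 11, 12, 13, 14.
            printf ">%s %s %s %s %s %s %s %s\n", $3, $4, $5, $6, $11, $12, $13, $14
            # Sequence line: column 10 only.
            print $10
        }
    ' "$1" "$2"
}
```

It would be called as `to_fasta whitelist.txt batch_1.catalog.tags.tsv > cat.fa`. Because the tsv is read once and each ID lookup is a hash lookup, the run time should not grow with the number of whitelist entries the way a per-ID grep loop does.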

I have been trying an alternative approach. I was given the code below, which uses the whitelist to pull column 10 out of the gzipped tsv file. However, it ran for 10 hours and produced nothing (an empty cat.fa file).

cat whitelist.txt | while read line; do zgrep "^0    1       $line   " batch_1.catalog.tags.tsv.gz; done | cut -f 3,10 | sed -E -e's/^([0-9]+)       ([ACGTN]+)$/>\1Z\2/' | tr "Z" "\n" > cat.fa

Unix wizards, is this possible? Any help is greatly appreciated!

Age87