Revisions to Deconstructing one line into two lines based on specific columns

Stylistic changes to highlight code.

edit approved May 7, 2017 at 20:32

I have a .tsv file (batch_1.catalog.tags.tsv) consisting of 1,965,056 lines of 14 columns. I want to break some of these into two lines.

The first line: starts with a greater than sign (>) followed by 8 of the 14 columns
The second line: only column 10

E.g.

>column3(a number) column4(numbers and letters) column5(a number) column6(- or +) column11(0 or 1) column12(0 or 1) column13(0 or 1) column14(0 or 1)       
column10(string with As,Ts,Gs,Cs, and sometimes Ns)

Here is an example of the sixth line of the .tsv file, as specified by the third column:

0   1   6   gi|586799556|ref|NW_006530744.1|    141 +   consensus   0   1_33,14_43  CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC    0   0   0   0

This is what I would like:

>6 gi|586799556|ref|NW_006530744.1| 141 + 0 0 0 0
CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC
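
For illustration only, the column rearrangement described above could be sketched with awk as follows. This is a minimal sketch, assuming the input really is tab-separated with exactly the 14 columns shown; `to_record` is a hypothetical helper name, not something from the original post.

```shell
# Sketch: turn each 14-column, tab-separated row on stdin into a two-line record.
# to_record is a hypothetical helper name used only for this example.
to_record() {
    awk -F'\t' '{
        # Header line: ">" then columns 3, 4, 5, 6, 11, 12, 13, 14 (space-separated).
        print ">" $3, $4, $5, $6, $11, $12, $13, $14
        # Second line: the sequence stored in column 10.
        print $10
    }'
}
```

Piping the example row through `to_record` should give the two lines shown above. No whitelist filtering happens here; every input row is converted.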

However, I only want to do this to lines in the tsv file (batch_1.catalog.tags.tsv) that have a third-column number that matches the numbers in a different text file (whitelist.txt).

In the above example, the whitelist.txt file would contain the number 6, although there are 8000+ more lines with different third-column numbers (i.e. IDs). The whitelist.txt includes numbers of up to 6 digits.

I have been trying an alternative approach. I was given the code below for using the whitelist to pull out column 10 from the .tsv file. However, grep went on for 10 hours and didn't do anything (empty cat.fa file).

cat whitelist.txt | while read line; do zgrep "^0    1       $line   " batch_1.catalog.tags.tsv.gz; done | cut -f 3,10 | sed -E -e's/^([0-9]+)       ([ACGTN]+)$/>\1Z\2/' | tr "Z" "\n" > cat.fa
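
One likely reason the loop is so slow is that it decompresses and rescans the whole file once per whitelist entry, so 8000+ IDs mean 8000+ passes over nearly two million lines. The pattern would also match only if the gaps inside the quotes are literal tab characters. A single-pass alternative might look like the sketch below. It assumes the whitelist holds one ID per line with no extra whitespace and is not empty; `filter_whitelisted` is a hypothetical name.

```shell
# Sketch: filter_whitelisted WHITELIST TSV
# Reads the whitelist once, then streams the TSV once.
# filter_whitelisted is a hypothetical helper name used only for this example.
filter_whitelisted() {
    awk -F'\t' '
        # First file (whitelist): remember each ID as an array key.
        NR == FNR { keep[$1]; next }
        # Second file (TSV): convert only rows whose third column is whitelisted.
        $3 in keep {
            print ">" $3, $4, $5, $6, $11, $12, $13, $14
            print $10
        }
    ' "$1" "$2"
}
```

For the compressed file, something like `zcat batch_1.catalog.tags.tsv.gz | filter_whitelisted whitelist.txt - > cat.fa` should work, since awk reads standard input when given `-` as a file name. Records come out in the order of the TSV rows, not in the order of the whitelist.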

Both solutions below using awk or perl work perfectly. The IDs are also printed out in order although they were not in order in the whitelist. The perl solution prints the lines tab-delimited while awk prints them space-delimited.
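
On the delimiter difference: awk joins comma-separated `print` arguments with its output field separator `OFS`, which is a single space by default, and that would account for the space-delimited header. If a tab-delimited header is preferred, setting `OFS` is probably enough. A minimal sketch, assuming tab-separated input with the 14 columns described in the question (`to_record_tabs` is a hypothetical name):

```shell
# Sketch: same two-line record, but with a tab-delimited header line.
# to_record_tabs is a hypothetical helper name used only for this example.
to_record_tabs() {
    awk 'BEGIN { FS = OFS = "\t" } {
        # OFS is inserted wherever a comma separates print arguments.
        print ">" $3, $4, $5, $6, $11, $12, $13, $14
        # The sequence line has a single field, so OFS does not affect it.
        print $10
    }'
}
```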

added 172 characters in body

edited May 7, 2017 at 20:07

Age87

I have a tsv file (batch_1.catalog.tags.tsv) consisting of 1,965,056 lines of 14 columns. I want to break some of these into two lines.

The first line: starts with a greater than sign (>) followed by 8 of the 14 columns
The second line: only column 10

E.g.

>column3(a number) column4(numbers and letters) column5(a number) column6(- or +) column11(0 or 1) column12(0 or 1) column13(0 or 1) column14(0 or 1)       
column10(string with As,Ts,Gs,Cs, and sometimes Ns)

Here is an example of the sixth line of the tsv file, as specified by the third column:

0   1   6   gi|586799556|ref|NW_006530744.1|    141 +   consensus   0   1_33,14_43  CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC    0   0   0   0

This is what I would like:

>6 gi|586799556|ref|NW_006530744.1| 141 + 0 0 0 0
CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC

However, I only want to do this to lines in the tsv file (batch_1.catalog.tags.tsv) that have a third-column number that matches the numbers in a different text file (whitelist.txt).

In the above example, the whitelist.txt file would contain the number 6, although there are 8000+ more lines with different third-column numbers (i.e. IDs). The whitelist.txt includes numbers of up to 6 digits.

I have been trying an alternative approach. I was given the code below for using the whitelist to pull out column 10 from the tsv file. However, grep went on for 10 hours and didn't do anything (empty cat.fa file).

cat whitelist.txt | while read line; do zgrep "^0    1       $line   " batch_1.catalog.tags.tsv.gz; done | cut -f 3,10 | sed -E -e's/^([0-9]+)       ([ACGTN]+)$/>\1Z\2/' | tr "Z" "\n" > cat.fa

Both solutions below using awk or perl work perfectly. The IDs are also printed out in order although they were not in order in the whitelist. The perl solution prints the lines tab-delimited while awk prints them space-delimited.

added 9 characters in body

edited May 7, 2017 at 19:53

Age87

I have a tsv file (batch_1.catalog.tags.tsv) consisting of 1,965,056 lines of 14 columns. I want to break some of these into two lines.

The first line: starts with a greater than sign (>) followed by 8 of the 14 columns
The second line: only column 10

E.g.

>column3(a number) column4(numbers and letters) column5(a number) column6(- or +) column11(0 or 1) column12(0 or 1) column13(0 or 1) column14(0 or 1)       
column10(string with As,Ts,Gs,Cs, and sometimes Ns)

Here is an example of the sixth line of the tsv file, as specified by the third column:

0   1   6   gi|586799556|ref|NW_006530744.1|    141 +   consensus   0   1_33,14_43  CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC    0   0   0   0

This is what I would like:

>6 gi|586799556|ref|NW_006530744.1| 141 + 0 0 0 0
CGGGCGGTGGTGGCGCACGCCTTTAATCCCAGCACTTGGGAGGCAGAGGCAGGTGGATCTTTGTGAGTTCGAGGCCAGCCTGGGCTACCAAGTGAGCTCC

However, I only want to do this to lines in the tsv file (batch_1.catalog.tags.tsv) that have a third-column number that matches the numbers in a different text file (whitelist.txt).

In the above example, the whitelist.txt file would contain the number 6, although there are 8000+ more lines with different third-column numbers (i.e. IDs). The whitelist.txt includes numbers of up to 6 digits.

I have been trying an alternative approach. I was given the code below for using the whitelist to pull out column 10 from the tsv file. However, grep went on for 10 hours and didn't do anything (empty cat.fa file).

cat whitelist.txt | while read line; do zgrep "^0    1       $line   " batch_1.catalog.tags.tsv.gz; done | cut -f 3,10 | sed -E -e's/^([0-9]+)       ([ACGTN]+)$/>\1Z\2/' | tr "Z" "\n" > cat.fa

Unix wizards, is this possible? Any help is greatly appreciated!

added 55 characters in body

edited May 7, 2017 at 19:44

Age87

edited tags

edited May 7, 2017 at 7:28

Kusalananda ♦

asked May 7, 2017 at 2:34

Age87
