Revisions to Replace new lines with spaces using awk

added 668 characters in body

edited Feb 26, 2024 at 11:05

If you need the output lines also sorted then you'd need to use GNU awk for PROCINFO["sorted_in"]:

$ awk '{a[$0]} END{PROCINFO["sorted_in"]="@ind_str_asc"; for (i in a) printf "%s%s", i, (++n % 2 ? "\t" : RS) }' file1
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz
A3_R1.fastq.gz  A3_R2.fastq.gz

but, just like the solution that uses sort, that wouldn't produce the presumably expected order when the numbers in the input can have multiple digits because, for example, A11 would sort alphabetically before A2. You'd need to split each string up into separate alphabetic and numeric parts and sort each separately, or normalize them to always have the same number of alphabetic and numeric characters in each position, e.g. map A1_R1 into 000A0001_000R0001 or similar before sorting.
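
As a rough sketch of that normalization idea (not from the original answer: the function name pair_natural is made up, and it assumes the input lines contain no tabs), you could zero-pad every run of digits into a temporary sort key, sort on that key, then strip the key off again before pairing:

```shell
# Hypothetical helper: pair lines after sorting them in numeric-aware order.
# Assumes input lines contain no tab characters.
pair_natural() {
    awk '
    {
        key = ""
        rest = $0
        # Build a sort key in which every run of digits is zero-padded to 10 places
        while (match(rest, /[0-9]+/)) {
            key = key substr(rest, 1, RSTART - 1) sprintf("%010d", substr(rest, RSTART, RLENGTH))
            rest = substr(rest, RSTART + RLENGTH)
        }
        # Decorate: padded key, a tab, then the untouched original line
        print key rest "\t" $0
    }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 |   # sort on the padded key only
    cut -f2- |                                  # undecorate: drop the key
    awk '{ORS=(NR%2 ? "\t" : RS)} 1'            # join each pair of lines with a tab
}
```

Called as pair_natural file1, this should put an A2 pair ahead of an A11 pair. If you have GNU coreutils, sort -V (version sort) may be a simpler route to the same ordering.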

added 32 characters in body

edited Feb 26, 2024 at 10:50

Ed Morton

For the input you show where all the paired lines are next to each other all you need with any awk is:

$ awk '{ORS=(NR%2 ? "\t" : RS)} 1' file
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz
A3_R1.fastq.gz  A3_R2.fastq.gz
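
In case the ORS trick isn't obvious, here is the same one-liner spelled out with comments. The join_pairs wrapper name is just for illustration; the awk logic is unchanged:

```shell
# Hypothetical wrapper around the one-liner above, written out long-hand.
join_pairs() {
    awk '
    {
        # Odd-numbered records end with a tab, even-numbered ones with a newline (RS)
        ORS = (NR % 2 ? "\t" : RS)
        print    # same effect as the bare pattern "1" in the one-liner
    }' "$@"
}
```

Running join_pairs file should give the same output as the one-liner.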

Or, if they aren't paired already:

$ shuf file > file1
$ cat file1
A3_R2.fastq.gz
A2_R2.fastq.gz
A1_R1.fastq.gz
A3_R1.fastq.gz
A1_R2.fastq.gz
A2_R1.fastq.gz

and so need to be paired, then if you don't mind adding a call to sort:

$ awk '{ORS=(NR%2 ? "\t" : RS)} 1' <(sort file1)
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz
A3_R1.fastq.gz  A3_R2.fastq.gz

or to pair them within awk:

$ awk -F'_' -v OFS='\t' '$1 in a{print a[$1], $0; next} {a[$1]=$0}' file1
A3_R2.fastq.gz  A3_R1.fastq.gz
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R2.fastq.gz  A2_R1.fastq.gz

Note in that last script's output that the R2 line is printed before its R1 partner in some cases. If that's an issue then you can order them when printing:

$ awk -F'_' -v OFS='\t' '
    $1 in a { print (a[$1] < $0 ? a[$1] OFS $0 : $0 OFS a[$1]); next }
    { a[$1] = $0 }
' file1
A3_R1.fastq.gz  A3_R2.fastq.gz
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz

If your input file is actually many millions of lines long then adding delete a[$1]; before the next would speed up execution in most cases, but it's probably not worthwhile if it's just a few thousand (you're trading off the overhead of calling delete a[$1] for every pair against the overhead of having a large hash table a[]).
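
To make that last suggestion concrete, here is a sketch of the ordered-pairing script with the delete added and a comment on each step. The function name is made up for illustration; the awk code is otherwise the script shown above:

```shell
# Hypothetical wrapper: pair lines by the text before the first "_",
# printing the smaller string first and freeing each entry once it is paired.
pair_ordered() {
    awk -F'_' -v OFS='\t' '
    $1 in a {
        # The partner was seen earlier: print the two lines in string order
        print (a[$1] < $0 ? a[$1] OFS $0 : $0 OFS a[$1])
        delete a[$1]     # the pair is complete, so drop it from the array
        next
    }
    { a[$1] = $0 }       # first line with this prefix: remember it
    ' "$@"
}
```

Run as pair_ordered file1. Whether the delete pays for itself depends on the input size, as described above.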

deleted 70 characters in body

edited Feb 26, 2024 at 10:44

Ed Morton

For the input you show where all the paired lines are next to each other all you need with any awk is:

$ awk '{ORS=(NR%2 ? "\t" : RS)} 1' file
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz
A3_R1.fastq.gz  A3_R2.fastq.gz

Or, if they aren't paired already:

$ shuf file > file1
$ cat file1
A3_R2.fastq.gz
A2_R2.fastq.gz
A1_R1.fastq.gz
A3_R1.fastq.gz
A1_R2.fastq.gz
A2_R1.fastq.gz

and so need to be paired, then if you don't mind adding a call to sort:

$ awk '{ORS=(NR%2 ? "\t" : RS)} 1' <(sort file1)
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz
A3_R1.fastq.gz  A3_R2.fastq.gz

or to pair them within awk:

$ awk -F'_' -v OFS='\t' '$1 in a{print a[$1], $0; next} {a[$1]=$0}' file1
A3_R2.fastq.gz  A3_R1.fastq.gz
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R2.fastq.gz  A2_R1.fastq.gz

Note in that last script's output that the R2 line is printed before its R1 partner in some cases. If that's an issue then you can order them when printing:

$ awk -F'_' -v OFS='\t' '$1 in a{print (a[$1] < $0 ? a[$1] OFS $0 : $0 OFS a[$1]); next} {a[$1]=$0}' file1
A3_R1.fastq.gz  A3_R2.fastq.gz
A1_R1.fastq.gz  A1_R2.fastq.gz
A2_R1.fastq.gz  A2_R2.fastq.gz

If your input file is actually many millions of lines long then adding delete a[$1]; before the next would speed up execution in most cases, but it's probably not worthwhile if it's just a few thousand.

deleted 70 characters in body

edited Feb 26, 2024 at 10:36

Ed Morton

added 689 characters in body

edited Feb 26, 2024 at 10:30

Ed Morton

answered Feb 26, 2024 at 10:22

Ed Morton
