Revisions to Extracting names from file_a using information from 2 columns in file_b

deleted 198 characters in body

edited Jan 21, 2019 at 6:13

I want to extract gene names (usually column 10, the text after "Name=") from file_a. For each row of file_b, the first column must match the first column of file_a, and the gene name is extracted if the second column of file_b lies within the gene interval delimited by columns 4 and 5 of file_a. Because the first columns must match, I only get one gene per row of file_b, but in theory multiple adjacent rows of file_b could match the same gene (e.g. if the second row in file_b were MT 4065).

MT  4050    mt-nd2
groupIII    7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375    bnip4
groupVII    17978350    si:ch211-284e13.4

EXTRA (IF POSSIBLE): Some of the entries in file_b will not land within a gene but may be close to one, say within 100 units on either side. It would be nice to have separate code that lets you specify this proximity, as was attempted here: Extract names from File_B having overlapping intervals with File_A

Any help is VERY much appreciated!
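
For illustration, the proximity idea above could be sketched roughly as follows. This is only a sketch under assumptions: both files are taken to be tab-separated, the "Name=" attribute may sit in any column of file_a, and the function name genes_near_positions and the pad argument are made up for this example.

```shell
# Hypothetical helper: genes_near_positions PAD FILE_B FILE_A
# Prints chromosome, position, gene name and distance to the gene interval
# (0 when the position lies inside it) for every hit within PAD units.
genes_near_positions() {
    awk -F'\t' -v OFS='\t' -v pad="$1" '
        # First file (file_b): remember every position seen per chromosome
        FNR == NR { pos[$1] = pos[$1] " " $2; next }
        # Second file (file_a): only chromosomes present in file_b,
        # and only lines that carry a Name= attribute
        ($1 in pos) && match($0, /Name=[^;\t]+/) {
            name = substr($0, RSTART + 5, RLENGTH - 5)
            n = split(pos[$1], p, " ")
            for (i = 1; i <= n; i++) {
                x = p[i] + 0
                # Distance from x to the interval [$4, $5]
                dist = (x < $4) ? $4 - x : ((x > $5) ? x - $5 : 0)
                if (dist <= pad + 0) print $1, x, name, dist
            }
        }
    ' "$2" "$3"
}
```

It would be called as, for example, genes_near_positions 100 file_b.tsv file_a.tsv > output.tsv. With a non-zero pad, one position can be reported for more than one gene, and the output follows the order of file_a rather than file_b.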

edited tags

edited Jan 21, 2019 at 5:58

Age87

deleted 165 characters in body

edited Jan 21, 2019 at 5:41

Age87

One answer:

awk 'FNR==NR{a[$1]=$2;next} ($1 in a) && (a[$1]>=$4 && a[$1]<=$5){sub("Name=","",$10);print $1,a[$1],$10}' file_b.tsv file_a.tsv > output.tsv
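
Note that a[$1]=$2 in the one-liner above keeps only the last position seen for each chromosome, so a second file_b row such as MT 4065 would replace MT 4050 instead of being reported alongside it. A possible variant that keeps every position is sketched below. It assumes tab-separated input and looks for "Name=" anywhere on the line rather than only in column 10. The function name genes_at_positions is hypothetical.

```shell
# Hypothetical helper: genes_at_positions FILE_B FILE_A
# Prints chromosome, position and gene name for every file_b position
# that falls inside a file_a gene interval (columns 4 and 5).
genes_at_positions() {
    awk -F'\t' -v OFS='\t' '
        # First file (file_b): append each position to a per-chromosome list
        FNR == NR { pos[$1] = pos[$1] " " $2; next }
        # Second file (file_a): test every stored position for this chromosome
        ($1 in pos) && match($0, /Name=[^;\t]+/) {
            name = substr($0, RSTART + 5, RLENGTH - 5)
            n = split(pos[$1], p, " ")
            for (i = 1; i <= n; i++)
                if (p[i] + 0 >= $4 + 0 && p[i] + 0 <= $5 + 0)
                    print $1, p[i], name
        }
    ' "$1" "$2"
}
```

Usage would be genes_at_positions file_b.tsv file_a.tsv > output.tsv. Rows come out in file_a order, so the result may need sorting if the file_b order matters.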

added 169 characters in body

edited Jan 21, 2019 at 4:34

Age87

added 11 characters in body

edited Jan 21, 2019 at 1:14

Age87

asked Jan 21, 2019 at 1:08

Age87
