Age87

I want to extract gene names (usually column 10 of file_a, the text after "Name=") by matching the first column of file_b to the first column of file_a and keeping the gene name when the second column of file_b lies within the gene interval delineated by columns 4 and 5 of file_a. The first columns must match, so that I get only one gene per row of file_b, but in theory several adjacent rows of file_b could match the same gene (e.g. if the second row of file_b were MT 4065).

MT  4050    mt-nd2
groupIII    7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375    bnip4
groupVII    17978350    si:ch211-284e13.4

EXTRA (IF POSSIBLE): Some of the entries in file_b will not land within a gene but may be close to one, say within 100 units on either side. It would be nice to have separate code that lets you specify this proximity, as was attempted in "Extract names from File_B having overlapping intervals with File_A".

Any help is VERY much appreciated!

One answer:

awk 'BEGIN{OFS="\t"} FNR==NR{a[$1]=$2;next} ($1 in a) && (a[$1]>=$4 && a[$1]<=$5){sub("Name=","",$10);print $1,a[$1],$10}' file_b.tsv file_a.tsv > output.tsv
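
Note that `a[$1]=$2` keeps only one position per chromosome, so a second file_b row on the same chromosome (the MT 4065 case) would replace the first. Below is a minimal sketch of one possible variant that stores every position and adds an adjustable proximity window. The rows written to file_a.tsv are hypothetical stand-ins for the real annotation, which is not shown here, and the `win` variable is likewise an assumption; set `win=0` for strict containment.

```shell
# Hypothetical sample data: file_a mimics a whitespace-separated annotation
# with start/end in columns 4-5 and Name=... in column 10.
printf 'MT\tsrc\tgene\t3900\t4100\t.\t+\t.\tID=g1\tName=mt-nd2\n' > file_a.tsv
printf 'groupVI\tsrc\tgene\t11230000\t11231000\t.\t+\t.\tID=g2\tName=bnip4\n' >> file_a.tsv
printf 'MT\t4050\nMT\t4065\ngroupVI\t11231050\n' > file_b.tsv

awk -v win=100 'BEGIN { OFS = "\t" }
FNR == NR {                      # first file (file_b): remember every position
    n[$1]++
    pos[$1, n[$1]] = $2
    next
}
($1 in n) {                      # second file (file_a): one gene per line
    name = $10
    sub(/^Name=/, "", name)      # strip the Name= prefix from a copy of column 10
    for (i = 1; i <= n[$1]; i++) {
        p = pos[$1, i] + 0       # force a numeric comparison
        if (p >= $4 - win && p <= $5 + win)
            print $1, p, name
    }
}' file_b.tsv file_a.tsv > output.tsv
```

Rows come out in file_a order rather than file_b order, and a position that falls inside the window of two neighbouring genes is reported once for each of them.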
