Revisions to Extracting names from file_a using information from 2 columns in file_b

added 1 character in body

edited Jan 21, 2019 at 7:42

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) {
    # you can specify the "proximity" as required
    return val > lower - proximity && val < upper + proximity
}

BEGIN {
    OFS="\t"
}

$1 == id && within_range(pos, $4, $5, 100) {
    name = gensub(/.*Name=([^\t]*).*/, "\\1", 1)
    if (name ~ /[^[:space:]]+/)
        print id, pos, name
}

Then run

while read -r id pos
do
    awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT        4050      mt-nd2
groupIII  7332350   si:dkeyp-68b7.10
groupIV   5347350   zgc:153018
groupVI   11230375  bnip4
groupVII  17978350  si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.
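
Since the answer recommends Python, here is a minimal sketch of the same matching logic in that language. It is not the author's code. It assumes what the awk script implies: whitespace-separated fields in file_a with the ID in column 1 and the range bounds in columns 4 and 5, a `Name=` attribute terminated by a tab or the end of the line, and two columns (ID, position) in file_b. The function name `find_hits` and the choice to skip comment lines and lines without `Name=` are illustrative assumptions, not behaviour of the original script.

```python
import re

# Greedy ".*" mirrors the awk pattern: it captures the text after the
# last "Name=" on the line, up to the next tab or the end of the line.
NAME_RE = re.compile(r".*Name=([^\t]*)")


def within_range(val, lower, upper, proximity):
    """Same strict comparison as the awk function of the same name."""
    return lower - proximity < val < upper + proximity


def find_hits(a_lines, b_lines, proximity=100):
    """Return (id, pos, name) tuples, in file_b order, then file_a order.

    Hypothetical helper: a_lines and b_lines are iterables of text lines,
    for example open file objects.
    """
    annotations = []
    for line in a_lines:
        line = line.rstrip("\n")
        fields = line.split()
        match = NAME_RE.match(line)
        # Assumption: skip comments, short lines, and lines without Name=.
        if line.startswith("#") or len(fields) < 5 or match is None:
            continue
        name = match.group(1)
        if not name.strip():
            continue
        annotations.append((fields[0], int(fields[3]), int(fields[4]), name))

    hits = []
    for line in b_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        query_id, pos = fields[0], int(fields[1])
        for ann_id, start, end, name in annotations:
            if ann_id == query_id and within_range(pos, start, end, proximity):
                hits.append((query_id, pos, name))
    return hits
```

Unlike the shell loop, this reads file_a once instead of once per line of file_b. To use it, open both files and pass the file objects to `find_hits`, then write each tuple joined by tabs to the output file.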

deleted 4 characters in body

edited Jan 21, 2019 at 3:50

Niko Gambt

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) {
    # you can specify the "proximity" as required
    return val > lower - proximity && val < upper + proximity
}

BEGIN {
    OFS="\t"
}

$1 ~ id && within_range(pos, $4, $5, 100) {
    name = gensub(/.*Name=([^\t]*).*/, "\\1", 1)
    if (name ~ /[^[:space:]]+/)
        print id, pos, name
}

Then run

while read -r id pos
do
    awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT        4050      mt-nd2
groupIII  7332350   si:dkeyp-68b7.10
groupIV   5347350   zgc:153018
groupVI   11230375  bnip4
groupVII  17978350  si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.

answered Jan 21, 2019 at 3:21

Niko Gambt

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) {
    # you can specify the "proximity" as required
    return val > lower - proximity && val < upper + proximity
}

BEGIN {
    OFS="\t"
}

$1 ~ id && within_range(pos, $4, $5, 100) {
    name = gensub(/.*Name=([^\t]*).*/, "\\1", 1)
    if (name ~ /[^[:space:]]+/)
        print id, pos, name
}

Then run

while read -r id pos
do
    awk -v id="${id}" -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT        4050      mt-nd2
groupIII  7332350   si:dkeyp-68b7.10
groupIV   5347350   zgc:153018
groupVI   11230375  bnip4
groupVII  17978350  si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.
