How to compare a column of a file with a list and print the non-matching patterns with awk

Question

I have a data file A.tsv (field separator = \t) :

id  mutation
243 siti,toto,mumu
254     
267 lala,siti,sojo
289 lala

and a template file B.txt (field separator = not important because only one line and one column) :

lala,siti,mumu

I want to add a new column named mutation_not to A.tsv (written to a new file, C.tsv) that contains only the mutations from the mutation column of A.tsv that are not present in the list in B.txt.

C.tsv looks like this:

id  mutation    mutation_not
243 siti,toto,mumu  toto
254     
267 lala,siti,sojo  sojo
289 lala

I tried with exclude:

awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file2 file1

but it does not give the expected result. Do you have any ideas? Thanks

αғsнιη · Accepted Answer · 2021-06-15 15:37:33Z

awk ' BEGIN{OFS="\t"}
NR==FNR{ for(i=1; i<=NF; i++) muts[$i]; next }
FNR>1  { len=split($2, tmp, ",");
         for(i=1; i<=len; i++) buf= buf (tmp[i] in muts?"":(buf==""?"":",") tmp[i])
       }
{ print $0, (FNR==1?"mutation_not":buf); buf="" }' FS=',' fileB FS='\t' fileA

Prabhjot Singh · Accepted Answer · 2021-06-15 14:06:51Z

Using gawk:

awk 'BEGIN{OFS="\t"; }
NR==FNR{ar[$1]=$1;next}
FNR==1{$(NF+1) = "mutation_not"}
FNR>1{split($2,a,","); 
for(i in a) if (a[i] in ar) ; 
else ncol[$1] = (ncol[$1])? ncol[$1] "," a[i] : a[i]; 
$(NF+1) = ncol[$1]}1' RS="," B.txt  RS="\n" FS="\t" A.tsv

Assuming B.txt contains a single line of comma-separated values, the record separator (RS) is set to a comma for B.txt, so each value is read as its own record.

NR==FNR{ar[$1]=$1;next} creates an array ar indexed on the first field of each record of the first file.

FNR==1{$(NF+1) = "mutation_not"} appends one more column name to the header.

FNR>1{split($2,a,",")} splits the second field of A.tsv into an array a.

Each entry of a that is not present in B.txt is then appended to the ncol array, and $(NF+1) = ncol[$1] adds one more column holding the collected entries.
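The same steps can be sketched in Python for readers who want to check the logic outside awk. This is only an illustration under stated assumptions: the function name add_mutation_not and the in-memory sample rows are hypothetical and not part of the answer, and the rows are taken from the question's sample data.

```python
def add_mutation_not(table_lines, template_line):
    """Append a mutation_not column to tab-separated lines.

    table_lines: lines of A.tsv (header first), without trailing newlines.
    template_line: the single comma-separated line of B.txt.
    """
    # Lookup set of the template values (plays the role of the awk array ar).
    known = set(template_line.strip().split(","))
    out = []
    for n, line in enumerate(table_lines):
        if n == 0:
            # Header: add the new column name.
            out.append(line + "\tmutation_not")
            continue
        fields = line.split("\t")
        # Split the second field on commas; an empty field yields no mutations.
        mutations = fields[1].split(",") if len(fields) > 1 and fields[1] else []
        # Keep only the mutations that are absent from the template.
        missing = [m for m in mutations if m not in known]
        out.append(line + "\t" + ",".join(missing))
    return out


rows = ["id\tmutation", "243\tsiti,toto,mumu", "254\t",
        "267\tlala,siti,sojo", "289\tlala"]
for row in add_mutation_not(rows, "lala,siti,mumu"):
    print(row)
```

Unlike the awk loop over "for(i in a)", whose traversal order is unspecified, the list comprehension keeps the mutations in the order in which they appear in the column.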

guest_7 · Accepted Answer · 2021-06-17 03:48:21Z

We will form a set s2 out of the comma-separated elements of the file B.txt.

Then, for each line of A.tsv, we will convert the second field into a set and subtract the s2 set from it. This gives us the mutations present in A.tsv that are not found in B.txt. We then join the resulting elements and print them along with the original line.

python3 -c 'import sys
tsv,txt = sys.argv[1:]
fs,rs = "\t","\n"
ofs,dlm = fs,","

with open(txt) as fh, open(tsv) as f:
  s2 = {m for line in fh for m in line.rstrip(rs).split(dlm)}

  for nr,ln in enumerate(f,1):
    l = ln.rstrip(rs)
    if nr == 1: print(l,"mutation_not",sep=ofs)
    else:
      F = l.split(ofs)
      if len(F) < 2: print(l)
      else: print(l,dlm.join({*F[1].split(dlm)}-s2),sep=ofs)

' A.tsv B.txt

Result:

id  mutation    mutation_not
243 siti,toto,mumu  toto
254
267 lala,siti,sojo  sojo
289 lala
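One caveat, shown as a hedged sketch: Python sets are unordered, so when a row has several mutations missing from B.txt, the set subtraction may emit them in any order, and it also drops duplicates. The variant below keeps the original order by filtering a list instead. The helper name not_in_template and the sample row "zeta,siti,alpha" are assumptions made for illustration and do not come from the question's data.

```python
def not_in_template(mutations, template):
    """Return the comma-separated mutations absent from template, in input order."""
    known = set(template.split(","))
    # Skip empty items so that an empty mutation field yields an empty result.
    return ",".join(m for m in mutations.split(",") if m and m not in known)


# Set difference finds the same elements but does not guarantee their order,
# so the result is sorted here only to make the printout stable.
as_set = set("zeta,siti,alpha".split(",")) - set("lala,siti,mumu".split(","))
print(sorted(as_set))

# The list-based filter keeps the order seen in the mutation column.
print(not_in_template("zeta,siti,alpha", "lala,siti,mumu"))
```

Whether order matters depends on how C.tsv is consumed; for the sample data in the question, each row has at most one missing mutation, so both approaches print the same thing.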

This time we will use GNU sed to get the results:

sed -Ee '
  1{h;d;}
  2s/\tmutation$/&&_not/;t

  s/\t\S+$/&&,/;T;G
  s/\t/\n/2;ta

  :a
  s/\n([^,]+),(.*\n(.*,)?\1(,|$))/\n\2/;ta
  s/\n([^,\n]+),/\t\1\n/;ta

  s/\n.*//
' B.txt A.tsv

The idea is that the contents of B.txt are stored in the hold space (assuming the file is a single line). Each line of A.tsv then has the B.txt contents appended to it, and every mutation that is found in B.txt is ticked off. Once all mutations have been examined, the line is printed.
