edited Jun 17, 2021 at 3:48

We will form a set s2 out of the comma-separated elements of the file B.txt.

Then, for each line of A.tsv, we convert the second field into a set and subtract s2 from it. This gives us the mutations present in A.tsv that are not found in B.txt. We then join the resulting elements and print them along with the original line.

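python3 -c 'import sys
tsv,txt = sys.argv[1:]
fs,rs = "\t","\n"
ofs,dlm = fs,","

with open(txt) as fh, open(tsv) as f:
  # union of the comma-separated elements on every line of B.txt
  s2 = {m for x in fh for m in x.rstrip(rs).split(dlm)}

  for nr,ln in enumerate(f,1):
    l = ln.rstrip(rs)
    # header line: append the name of the new column
    if nr == 1: print(l,"mutation_not",sep=ofs)
    else:
      F = l.split(fs)
      # a line without a mutation field is printed unchanged;
      # otherwise the mutations missing from s2 are appended
      if len(F) < 2: print(l)
      else: print(l,dlm.join({*F[1].split(dlm)}-s2),sep=ofs)
' A.tsv B.txt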

Result:

id  mutation    mutation_not
243 siti,toto,mumu  toto
254
267 lala,siti,sojo  sojo
289 lala
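
Set subtraction does not preserve the order of the mutations within a field, so a row with several leftover mutations may list them in a different order than in A.tsv. A minimal sketch of an order-preserving variant follows. The helper name `mutations_not_in` is hypothetical, and the B.txt contents `lala,siti,mumu` are an assumption that is merely consistent with the result shown above.

```python
def mutations_not_in(field, known, dlm=","):
    """Return the elements of `field` that are absent from `known`, in input order."""
    return dlm.join(m for m in field.split(dlm) if m not in known)


# Assumed B.txt contents (not given in the answer; chosen to match the result above).
known = set("lala,siti,mumu".split(","))

for field in ("siti,toto,mumu", "lala,siti,sojo", "lala"):
    # Print the leftover mutations for each sample mutation field.
    print(repr(mutations_not_in(field, known)))
```

Unlike the set version, this keeps duplicates within a field, which may or may not be what is wanted.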

This time we will use the GNU sed editor to get the results:

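sed -Ee '
  # line 1 is B.txt: keep it in the hold space and print nothing
  1{h;d;}
  # header of A.tsv: append the mutation_not column name
  2s/\tmutation$/&&_not/;t

  # copy the mutation field as a work list (lines without one print as is),
  # then append B.txt and put the work list on its own line
  s/\t\S+$/&&,/;T;G
  s/\t/\n/2;ta

  :a
  # drop the head of the work list when it also occurs in B.txt
  s/\n([^,]+),(.*\n(.*,)?\1(,|$))/\n\2/;ta
  # otherwise move it to the end of the output line
  s/\n([^,\n]+),/\t\1\n/;ta

  # discard what remains of the work list and B.txt
  s/\n.*//
' B.txt A.tsv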

The idea is that the B.txt file is stored in the hold space (assuming it is a single line). The B.txt contents are appended to each line of A.tsv, and the mutations that are found in B.txt are ticked off. Once all mutations have been examined, the line is printed.
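
For readers less used to sed, the tick-off loop can be sketched roughly in Python. This is an illustration of the idea under stated assumptions, not a translation of every regex detail: the function name `tick_off` is hypothetical, B.txt is assumed to be a single line, and the header line is not handled. As in the sed script, each leftover mutation is appended with its own tab.

```python
def tick_off(line, b_line, fs="\t", dlm=","):
    """Walk the mutation field of one A.tsv line and keep what B.txt lacks."""
    fields = line.split(fs)
    if len(fields) < 2 or not fields[-1]:
        return line                    # no mutation field: print unchanged
    work = fields[-1].split(dlm)       # work list: a copy of the mutation field
    known = b_line.split(dlm)          # the single line held from B.txt
    out = line
    while work:
        head = work.pop(0)             # examine the head of the work list
        if head not in known:          # not ticked off: append it to the output
            out += fs + head
    return out


# Assumed B.txt contents, consistent with the result shown earlier.
print(tick_off("243\tsiti,toto,mumu", "lala,siti,mumu"))
```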


answered Jun 16, 2021 at 8:09

guest_7
