3

I have two files:

file1 (search):

1  
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT  
2  
GACGGAGGATGCAAGTGTTATCCGGAAT  
3  
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCC  
4  
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG 

file2 (patterns):

GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGC
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCG
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCC
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCCGTAG 

What I need is:

1  
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT  
2  
3  
4  
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG  

Which means, that I need a script that searches my search-lines in the patterns-file, and if the search-lines can be found fully or partially (as for the line following "2") then they should not be written, but all other lines of file 1.

I tried many of the grep and awk scripts found in that forum but none did what I need, e.g.

awk 'FN==NR {exclude[$0];next} !($0 in exclude)' file2 file1  

or

awk 'NR==FNR{a[$1]++;next} !($1 in a) {print $1} {next}' file2 file1

I also tried grep.

Anyway, all tried scripts just find the fully matching patterns but not the partially matching ones...

Has anyone have an idea??

7
  • 1
    This looks like fasta files. How big are the files? This matters for the solution. If there is too much data, you can't simply use the DNA as keys in an associative array. Commented Mar 12, 2018 at 9:53
  • One of the files is fasta format, the other one is text with each line having a different pattern. The file in fasta format is very large (130 million reads, 260 million lines), the pattern file contains 2000 different patterns. Commented Mar 12, 2018 at 10:02
  • Do the sequences in the fasta (search) file span multiple lines, or are they all short sequences on single lines? Commented Mar 12, 2018 at 10:05
  • The sequences in the fasta format are up to 260 characters (nucleotides) long. Commented Mar 12, 2018 at 10:08
  • 1
    to 1) the fasta file starts with > but the test file not just starts with the numbers (I could have been more correct here, sorry) ; to 2) just one sequence per line and no multiple lines Commented Mar 12, 2018 at 11:22

3 Answers 3

2

Using an awk script:

NR == FNR       { seq[++n] = $1; next }

{
    header = $0
    getline

    for (i = 1; i <= n; ++i) {
        if (match(seq[i], $0) > 0) {
            print header
            next
        }
    }

    print header
    print
}

Running it:

$ awk -f script.awk file2 file1
1
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT
2
3
4
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG

The script first reads the 2000 sequences from file2 into the array seq, and then reads the header from file1 followed by the sequence from file1 (using getline). It then goes through the array seq to find a sequence that contains the current sequence from file1. If such a sequence is found, the header is printed and the script continues with the next line in file1. Otherwise, both header and sequence are printed.

The script assumes that the file1 contains alternating header and sequence lines, and absolutely no multi-line sequence.

The following is the same script but using DRY principle ("Don't repeat yourself"):

NR == FNR       { seq[++n] = $1; next }

{
    header = $0
    getline

    found = 0
    for (i = 1; i <= n; ++i) {
        if (match(seq[i], $0) > 0) {
            found = 1
            break
        }
    }

    print header
    if (!found) print
}
0

awk + grep solution:

awk '/^[^>0-9]/{ 
         m = ""; cmd = sprintf("grep -m1 %s file2", $0);
         cmd | getline m; close(cmd);
         if (m) next; 
     }1' file1
  • cmd - crucial grep command
  • m - variable filled with possible matched item

The output:

1
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT
2
3
4
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG
11
  • This results in the same results which I got with all other scripts before: Commented Mar 12, 2018 at 10:27
  • GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT 2 GACGGAGGATGCAAGTGTTATCCGGAAT 3 4 GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG Commented Mar 12, 2018 at 10:28
  • @MSt, your block named What I need is: contains the output as this approach gives. Otherwise, you've posted non-actual data Commented Mar 12, 2018 at 10:28
  • ... the sequence after >2 is still in my output and needs to get removed as it is part of the pattern. Commented Mar 12, 2018 at 10:29
  • @MSt, still in my output - if your output differs from mine - then you did something wrong or posted non-actual data. The output I've posted is from a real test case Commented Mar 12, 2018 at 10:30
0
grep -vxf <(grep -of file_1 file_2 | sort -u) file_1

file_1

1
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT
2
GACGGAGGATGCAAGTGTTATCCGGAAT
3
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCC
4
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG

file_2

GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGC
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCG
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCC
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGCGTCCGTAG

Output

1
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT
2
3
4
GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG
2
  • I have no clue at all but I tried your and all other provided scripts on different computers but my output still looks like: <br/> 1 <br/> GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCGTAAAGTTTTTTTTT <br/> 2 <br/> GACGGAGGATGCAAGTGTTATCCGGAAT <br/> 3 <br/> 4 <br/> GACGGAGGATGCAAGTGTTATCCGGAATCACTGGGCCGTCCGTAG <br/>????????? Commented Mar 14, 2018 at 8:54
  • @MSt Maybe because of <br> tag? They don't present in your sample files in the question, so you and we use different input for testing. Try to remove all tags before checking. The contents of the files should be exactly the same as in the question, then the proposed solutions should work correctly. Commented Mar 14, 2018 at 16:18

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.