I have two data files, each made up of records of 4 lines. I need to extract every 4-line record from the second file whose first line partially matches the first line of a record in the first file.

Here is an example of input data:

input1.txt
@abcde:134/1
JDOIJDEJAKJ
content1
content2

input2.txt
@abcde:134/2
JKDJFLJSIEF
content3
content4
@abcde:135/2
KFJKDJFKLDJ
content5
content6

Here is what the output should look like:

output.txt
@abcde:134/2
JKDJFLJSIEF
content3
content4

Here is my attempt at writing code...

import sys

filename1 = sys.argv[1] #input1.txt
filename2 = sys.argv[2] #input2.txt

F = open(filename1, 'r')
R = open(filename2, 'r')

def output(input1, input2):
    for line in input1:
        if "@" in line:
            for line2 in input2:
                if line[:-1] in line2:
                    for i in range(4):
                        print next(input2)

output = output(F, R)
write(output)

I get invalid syntax for next() which I can't figure out, and I would be happy if someone could correct my code or give me tips on how to make this work.

===EDIT=== OK, I think I have managed to implement the solutions proposed in the comments below (thank you). I am now running the code on a Terminal session connected by ssh to a remote Ubuntu server. Here is what the code looks like now. (This time I am running python2.7)

filename1 = sys.argv[1] #input file 1
filename2 = sys.argv[2] #input file 2 (some lines of which will be in the output)

F = open(filename1, 'r')
R = open(filename2, 'r')

def output(input1, input2):
    for line in input1:
        input2.seek(0)
        if "@" in line:
            for line2 in input2:
                if line[:-2] in line2:
                    for i in range(4):
                        out = next(input2)
                        print out
                        return

output (F, R)

Then I run this command:

python fetch_reverse.py test1.fq test.fq > test2.fq

I don't get any warnings, but the output file is empty. What am I doing wrong?

  • Are you using python3? In that case print is a function and requires parentheses: print(next(reverse)). Note that this works even in python2. Commented Jan 13, 2014 at 9:03
  • Note that your function output() doesn't return anything, and that you then try to shadow its name in calling it. You will also need to store your results in some container, pass it back to the caller and rename the variable before this will work at all. Commented Jan 13, 2014 at 9:05
  • Another thing to note: You are looping over input1 once, but trying to loop over input2 each time you hit a match; you'll read all of input2 the first time "@" in line is true and then, as the filepointer is at the end of the file, will not read another line again. Your code needs to gather all matching @ lines from input1 first, then loop over input2 searching for matches, instead. Commented Jan 13, 2014 at 9:10
  • Next thing wrong: line will include a newline character, line[:-1] is the same line without the newline, that last digit is still going to be present. Commented Jan 13, 2014 at 9:11
  • @jonrsharpe Thanks, I'm not sure where to place return but will try; I corrected the name shadowing. Commented Jan 13, 2014 at 9:16
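
The two points raised in the comments above (an exhausted file pointer and the newline at the end of each line) can be reproduced in isolation. This is only a small sketch: it uses io.StringIO as a stand-in for a real file, and the variable names are made up for illustration.

```python
import io

# A file-like object can only be iterated once unless it is rewound.
handle = io.StringIO("@abcde:134/2\nJKDJFLJSIEF\n")
first_pass = list(handle)    # reads both lines
second_pass = list(handle)   # nothing left: the pointer is at the end
handle.seek(0)               # rewind to the start
third_pass = list(handle)    # both lines again

# Each line read from a file still carries its trailing newline.
line = "@abcde:134/1\n"
without_newline = line[:-1]   # removes only the newline, the digit stays
key = line.strip()[:-1]       # removes the newline, then the last digit

print(len(first_pass), len(second_pass), len(third_pass))
print(repr(without_newline))
print(repr(key))
```

Comparing on the value of key rather than on the whole line is what lets a header ending in /1 match one ending in /2.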

1 Answer

Separate reading the first file from reading the second: first gather all the lines you want to match (unless you are reading hundreds of thousands of lines to match). Store each of those lines, minus the digit at the end, in a set for fast lookup.

Then scan the other file for matching lines:

def output(input1, input2):
    with input1:  # automatically close when done
        # set of all header lines (starting with @), minus the trailing digit
        to_match = {line.strip()[:-1] for line in input1 if line.startswith('@')}

    with input2:
        for line in input2:
            if line.startswith('@') and line.strip()[:-1] in to_match:
                # print the matched header line itself
                print(line.strip())
                # then the remaining three lines of the record
                for i in range(3):
                    print(next(input2, '').strip())

You need to print the matched line too, then read the next three lines (line number 1 was already read).
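
To try the approach without real files, here is a self-contained sketch of the same idea written for Python 3. It is not the exact code above: the generator name matching_records is made up for illustration, the sample data is the example from the question, and io.StringIO stands in for the open file objects.

```python
import io


def matching_records(input1, input2):
    """Yield 4-line records from input2 whose header matches one in input1."""
    # Header lines of the first file, minus the trailing digit
    to_match = {line.strip()[:-1] for line in input1 if line.startswith('@')}

    for line in input2:
        if line.startswith('@') and line.strip()[:-1] in to_match:
            record = [line.strip()]
            # The header was already read; consume the next three lines
            for _ in range(3):
                record.append(next(input2, '').strip())
            yield record


# Sample data copied from the question
sample1 = io.StringIO("@abcde:134/1\nJDOIJDEJAKJ\ncontent1\ncontent2\n")
sample2 = io.StringIO(
    "@abcde:134/2\nJKDJFLJSIEF\ncontent3\ncontent4\n"
    "@abcde:135/2\nKFJKDJFKLDJ\ncontent5\ncontent6\n"
)

records = list(matching_records(sample1, sample2))
for record in records:
    print("\n".join(record))
```

With real data, the two StringIO objects would be replaced by files opened from sys.argv[1] and sys.argv[2], and the printed output redirected to a file as in the question.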

7 Comments

Thanks let me try! (actually, I am reading hundreds of thousands of lines...)
@biohazard: Then you really don't want to keep re-reading the input2 file over and over again. If the set doesn't fit in memory, use a database (like sqlite) instead.
Thanks! I tried your script but it says "AttributeError: 'file' object has no attribute 'strip'" for the last line.
@biohazard: right, closing parenthesis in the wrong place; corrected.
It worked with the test data set! Thank you so much!
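
For the case mentioned above where the set of keys does not fit in memory, one possible shape of the sqlite suggestion is sketched below using Python's standard sqlite3 module. The table name keys and the helpers build_key_db and has_key are hypothetical, chosen only for this example; ":memory:" is used so the sketch runs anywhere, and a file path would be passed instead to keep the keys on disk.

```python
import sqlite3


def build_key_db(header_lines, db_path=":memory:"):
    """Store each @ header, minus its trailing digit, in a sqlite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS keys (key TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO keys (key) VALUES (?)",
        ((line.strip()[:-1],) for line in header_lines if line.startswith('@')),
    )
    conn.commit()
    return conn


def has_key(conn, header_line):
    """Return True if the header (minus trailing digit) is in the table."""
    row = conn.execute(
        "SELECT 1 FROM keys WHERE key = ?", (header_line.strip()[:-1],)
    ).fetchone()
    return row is not None


conn = build_key_db(["@abcde:134/1\n", "JDOIJDEJAKJ\n"])
found = has_key(conn, "@abcde:134/2\n")
missing = has_key(conn, "@abcde:135/2\n")
print(found, missing)
```

The lookup in has_key would take the place of the `in to_match` test while scanning the second file.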