2

I have some large data files and want to extract certain pieces of data from each line, namely the ID codes. Each ID code has a | on its left and a space on its right. Is it possible to pull out just the IDs? I have two data files: one has 4 ID codes per line and the other has 23 per line.

At the moment I'm thinking something like copying each line from the data file, then subtract the strings from each other to get the desired ID code, but surely there must be an easier way! Help?

Here is an example of a line from the data file that I'm working with:

cluster8032:  WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327

From this line I want to output the following, each ID on a separate line:

Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
2
  • 'something like copying each line from the data file, then subtract the strings from each other' - can you show us your code? Commented Jul 25, 2012 at 13:51
  • Do you want to search for a particular cluster8032 number, or do you want every line to produce four (or twenty-three) lines of output? Commented Jul 25, 2012 at 14:03

2 Answers 2

5

Use the re (regular expression) module for this task. The following code shows how to extract the IDs from a string; it works for any number of IDs, as long as they are all structured the same way.

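import re
s = """cluster8032:  WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"""
# Raw string so that '\|' reaches the regex engine as an escaped pipe;
# the group captures everything after each '|' up to the next space.
results = re.findall(r'\|([^ ]*)', s)  # list of IDs extracted from the string
print('\n'.join(results))  # one ID per line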

Output:

Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327

To write the output to a file:

with open('out.txt', mode='w') as filehandle:
    filehandle.write('\n'.join(results))

For more information, see the re module documentation.
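Since the question is about large files, the findall approach above could be applied line by line instead of to a single string. The following is only a sketch under some assumptions: the names extract_ids and copy_ids and the file paths passed to them are hypothetical, and the pattern uses \S+ rather than [^ ]* so that the newline at the end of each line read from a file is not captured as part of the last ID.

```python
import re

# Variant of the pattern above, compiled once: capture the run of
# non-whitespace characters that follows each '|'.
ID_PATTERN = re.compile(r'\|(\S+)')

def extract_ids(lines):
    """Yield every ID found in an iterable of lines, in order."""
    for line in lines:
        yield from ID_PATTERN.findall(line)

def copy_ids(in_path, out_path):
    """Hypothetical helper: write one ID per line from in_path to out_path."""
    # Iterating over the file object reads one line at a time,
    # so the whole input never has to fit in memory.
    with open(in_path) as src, open(out_path, 'w') as dst:
        for id_code in extract_ids(src):
            dst.write(id_code + '\n')
```

Because the pattern does not care how many IDs a line holds, the same function should cover both the 4-per-line and the 23-per-line file.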


3 Comments

Your output doesn't match the output from the question. You need to use a non-greedy star followed by a space: '\|([^|]*?) '
Yup, I noticed that (I had misread the question). It has now been fixed, thanks. The code above functions correctly.
Yes, that's how you do it. I've appended that to my answer.
1

If all your lines have the given format, a simple split is enough:

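# line is one line read from the data file.
# Split on '|', skip the text before the first '|', then keep the
# first whitespace-delimited token of each remaining piece.
ids = [x.split()[0] for x in line.split("|")[1:]]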
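As a self-contained illustration of the split approach, here it is wrapped in a function and run on the sample line from the question. The function name ids_from_line is made up for this sketch, and the extra `if piece.strip()` guard is an addition that skips an empty piece (for example after a trailing '|') rather than raising an IndexError.

```python
def ids_from_line(line):
    """Return the token that follows each '|' in line."""
    # Every piece after the first '|' starts with an ID. split() with no
    # argument splits on any whitespace, so a trailing newline is dropped.
    return [piece.split()[0] for piece in line.split("|")[1:] if piece.strip()]

line = "cluster8032:  WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327\n"
print("\n".join(ids_from_line(line)))
```

This prints the same four IDs as the regex version, one per line.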
