Reading file between headers in python

Question

I have a large text file whose values are separated by headers starting with "#". If a header matches a given condition, I would like to read the file from that header until the next "#" header and skip the rest of the file.

To test this, I'm trying to read the following text file, named test234.txt:

# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr

The code I wrote is:

file_t = open('test234.txt')
cond = True
while cond:
    for line_ in file_t:
        print(line_)
        if file_t.read(1) == "#":
            cond = False
file_t.close()

But, the output I'm getting is:

# abcdefgh

fnrnf

rkfr

foiernfr

erfnr

something

jndjen kj

jkndjke

vcrvr

Instead I would like the output between two headers separated by "#" which is:

1fnrnf
mrkfr
nfoiernfr
nerfnr

How can I do that? Thanks!

EDIT: "Reading in file block by block using specified delimiter in python" covers reading a file in groups separated by headers, but I don't want to read every section. I only want to read the section under the header where a given condition is met, and to stop reading the file as soon as the next header marked by '#' is reached.

The line has a newline character at the end and print adds another one. Use print(line.rstrip()) to remove the trailing newline.
– Matthias, Commented Feb 26, 2018 at 15:52
Is your file using Windows line endings \r\n? If so, use the rstrip method.
– Alexander Ejbekov, Commented Feb 26, 2018 at 15:52
Yes, the file has the \n character, but I simply want the output between the 2 headers specified by "#".
– Light_B, Commented Feb 26, 2018 at 15:55
Possible duplicate of Reading in file block by block using specified delimiter in python
– Chris_Rands, Commented Feb 26, 2018 at 16:16

hiro protagonist · Accepted Answer · 2018-02-26 19:32:43Z

itertools.groupby can help:

from io import StringIO
from itertools import groupby

text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''


with StringIO(text) as file:
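    lines = (line.strip() for line in file)  # strips surrounding whitespace, including the trailing '\n'
    # group consecutive lines by whether they start with '#';
    # startswith() is also safe for empty lines, where x[0] would raise IndexError
    for key, group in groupby(lines, key=lambda x: x.startswith('#')):

        if key:
            # found a line that starts with '#'
            print('found header: {}'.format(next(group)))
        else:
            # group now contains all lines that do not start with '#'
            print('\n'.join(group))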

note that all of this is lazy: the file is read line by line, and the only place where all the lines between two headers are held in memory at once is the '\n'.join(group) call.

to read from your actual file, replace the line with StringIO(text) as file: by with open('test234.txt', 'r') as file:

the output for your test is:

found header: # abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
found header: # something
njndjen kj
ejkndjke
found header: #vcrvr
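
To make the grouping step easier to follow, here is a minimal sketch of what groupby yields for a few sample lines. The list sample and the helper is_header are hypothetical names used only for this illustration; they are not part of the code above.

```python
from itertools import groupby

# hypothetical sample data, standing in for the stripped lines of a file
sample = ['# a', 'x', 'y', '# b', 'z']


def is_header(line):
    """Return True for lines that start with '#'."""
    return line.startswith('#')


# groupby yields one (key, group) pair per run of consecutive lines
# that share the same key; each group is itself a lazy iterator,
# so it is turned into a list here only to make it printable.
runs = [(key, list(group)) for key, group in groupby(sample, key=is_header)]

for key, group in runs:
    print(key, group)
# True ['# a']
# False ['x', 'y']
# True ['# b']
# False ['z']
```

Each header forms its own group with key True, and the lines between two headers form one group with key False.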

UPDATE, as i misunderstood the question. here is a fresh attempt (reusing text from above):

from io import StringIO
from collections import deque
from itertools import takewhile

from_line = '# abcdefgh'
to_line = '# something'

with StringIO(text) as file:
    lines = (line.strip() for line in file)  # removing trailing '\n'

    # fast-forward past from_line (takewhile also consumes the from_line line itself)
    deque(takewhile(lambda x: x != from_line, lines), maxlen=0)

    for line in takewhile(lambda x: x != to_line, lines):
        print(line)

where i use itertools.takewhile to get an iterator over the lines until a condition is met (in your case, until the header to_line is reached).

the deque part is just the consume pattern suggested in the itertools recipes. it just fast-forwards to the point where the given condition does not hold anymore.
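
If the header that ends the section is not known in advance, the same idea can stop at any line starting with '#'. The sketch below is one possible variation, not part of the answer above: the function name read_section and its parameters are hypothetical, and it assumes the wanted header matches a whole stripped line exactly.

```python
from collections import deque
from io import StringIO
from itertools import takewhile


def read_section(lines, header):
    """Yield the lines after `header` up to the next line starting with '#'.

    `lines` is any iterable of lines, for example an open file.
    """
    stripped = (line.strip() for line in lines)

    # consume everything up to and including the wanted header
    deque(takewhile(lambda x: x != header, stripped), maxlen=0)

    # yield lazily until the next header line (or the end of the input)
    yield from takewhile(lambda x: not x.startswith('#'), stripped)


# same sample text as in the question
text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''

with StringIO(text) as file:
    section = list(read_section(file, '# abcdefgh'))

print('\n'.join(section))
```

With a real file, with open('test234.txt') as file: would take the place of the StringIO line. If the header never occurs, the generator simply yields nothing.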

Could you explain how the groupby is working? Is group iterating line by line in the for loop? Also, the original file I want to read is pretty big, so is the code reading the file line by line or all the lines at once? Thanks!
as mentioned: this is lazy. everything used here is a lazy iterator. so yes: the file is treated line-by-line and not read as a whole. groupby reads until the condition (first character in the line == '#'?) changes and returns the key (the value of the condition) and an iterator over the group (which is all the lines in between). the documentation is pretty helpful.
As a beginner, it took me some time to understand your solution. It removes the headers and groups all the data as one. Instead, I would only like to read the data between two given headers and skip the rest of the file, as specified in the question. Maybe you already meant that, but I'm not able to work it out. Also, the question linked by Chris reads the whole data in different sections separated by headers, whereas I only want to read one section of my data, specified by a given header, and skip everything else.
Thanks for the update. As a beginner, I also found the regex approach suggested by @accumulatorax in the other solution very simple to understand. Do you think what you suggested is faster and more efficient than using regex? I've developed my own solution building on accumulatorax's solution. I can post it for comparison?
if you wonder about speed, there is the timeit module. i'd say that regex is overkill if you know exactly what the header you are looking for looks like. regex is great if you know its structure only.
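
The timeit suggestion in the last comment could be tried along these lines. This is only a sketch: the two helper functions, the regex, the sample line and the repetition count are assumptions chosen for illustration, and the measured times depend on the machine, so no result is shown.

```python
import re
import timeit

HEADER = '# abcdefgh'                     # the exact header being looked for
PATTERN = re.compile(r'^#\s*abcdefgh$')   # a regex accepting the same header
line = '# abcdefgh'                       # hypothetical sample line


def by_comparison():
    """Check the line with a plain string comparison."""
    return line == HEADER


def by_regex():
    """Check the line with a precompiled regular expression."""
    return PATTERN.match(line) is not None


# time each check over the same number of calls
n = 100_000
t_compare = timeit.timeit(by_comparison, number=n)
t_regex = timeit.timeit(by_regex, number=n)

print('comparison: {:.4f} s'.format(t_compare))
print('regex:      {:.4f} s'.format(t_regex))
```

For a meaningful comparison, the timing would need to run on lines from the real file rather than on a single sample line.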

mujdecisy · Accepted Answer · 2018-02-26 16:25:41Z

Learn and use regex. It will help you with all kinds of document-processing tasks.

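import re  # regular expression library

with open('test234.txt') as f:  # open the file as a text stream
    lines = f.readlines()       # reads all lines into a list

p = re.compile('^#.*')          # pattern matching lines that start with '#'

for l in lines:
    if p.match(l) is None:      # keep only the non-header lines
        print(l.rstrip('\n'))   # drop the trailing newline before printing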

edited Feb 26, 2018 at 16:25

answered Feb 26, 2018 at 16:07


4 Comments

Light_B Over a year ago

Could you add some comments for a beginner like me to understand more? What is re.compile doing there?

mujdecisy Over a year ago

Regular expressions let you find patterns (which you describe yourself) in strings. ^#.* means that you are looking for strings that start with the # mark. Check that out. For some more info.

Light_B Over a year ago

Will it work if I don't want to read all the lines at once since the file is pretty large?

mujdecisy Over a year ago

Of course. You can do it inside the "with" block, in a while loop using the readline() function.
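
The line-by-line variant mentioned in the last comment might look like the sketch below. It iterates over the file object directly instead of calling readline() in a while loop, which also reads one line at a time, and it stops at the next header as asked in the question. The names section_lines, header_re and wanted are hypothetical, and the sketch assumes the wanted header matches a whole line exactly.

```python
import re
from io import StringIO

header_re = re.compile(r'^#')  # any line starting with '#' counts as a header


def section_lines(lines, wanted):
    """Collect the lines after the header `wanted` up to the next header."""
    found = False
    result = []
    for line in lines:                 # reads one line at a time
        line = line.rstrip('\n')
        if header_re.match(line):
            if found:
                break                  # next header reached: stop reading
            found = (line == wanted)   # is this the header we want?
        elif found:
            result.append(line)
    return result


# StringIO stands in for open('test234.txt') so the sketch is self-contained
sample = StringIO('# abcdefgh\n1fnrnf\nmrkfr\n# something\nnjndjen kj\n')
print(section_lines(sample, '# abcdefgh'))  # → ['1fnrnf', 'mrkfr']
```

Because the loop breaks at the first header after the wanted one, the rest of the file is never read.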
