23

Let's say I have a text file with the content below:

fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk

Now I need to write Python code that reads the text file and copies the contents between Start and End to another file.

I wrote the following code.

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
keepCurrentSet = True
for line in inFile:
    buffer.append(line)
    if line.startswith("Start"):
        #---- starts a new data set
        if keepCurrentSet:
            outFile.write("".join(buffer))
        #now reset our state
        keepCurrentSet = False
        buffer = []
    elif line.startswith("End"):
        keepCurrentSet = True
inFile.close()
outFile.close()

I'm not getting the desired output; I'm just getting Start. What I want is all the lines between Start and End, excluding Start and End themselves.

    Are these text files large? Commented Sep 18, 2013 at 6:14

9 Answers

59

Just in case you have multiple "Start"s and "End"s in your text file, this will copy the data from all of the blocks together, excluding the "Start" and "End" lines themselves.

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Start":
            copy = True
            continue
        elif line.strip() == "End":
            copy = False
            continue
        elif copy:
            outfile.write(line)

14 Comments

Dears, thanks for your response. I applied the same to a real scenario and got the following error:
D:\Python>Python.exe First.py
Traceback (most recent call last):
  File "First.py", line 3, in <module>
    for line in infile:
  File "D:\Python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4591: character maps to <undefined>
Can you help me out with this?
@user2790219: That's not an error with this code. If you could post the text file that you are using, someone might be able to help (I think you should make that a new question)
This code will not include the strings "Start" and "End" just what is inside them. How would you include the perimeter strings?
@johnnydrama: simply add the outfile.write line within the first two if blocks as well
That's a good observation. However, the presented code is meant to grab all the data from multiple instances of "Start" and "End". I've updated my answer to explicitly state that assumption.
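To illustrate the suggestion about keeping the perimeter strings, here is a minimal sketch of the same flag-based loop wrapped in a generator. The function name `copy_between` and the `include_markers` parameter are made up for this example and are not part of the answer above; the markers are compared after `strip()`, so trailing `\r\n` or spaces do not matter.

```python
def copy_between(lines, start="Start", end="End", include_markers=False):
    """Yield the lines inside every start/end block of `lines`.

    `include_markers` is a hypothetical switch: when True, the marker
    lines themselves are yielded as well.
    """
    copying = False
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            copying = True
            if include_markers:
                yield line
        elif stripped == end:
            # Only emit an end marker that actually closes an open block
            if copying and include_markers:
                yield line
            copying = False
        elif copying:
            yield line


sample = ["junk\n", "Start\n", "Good Morning\n", "Hello World\n", "End\n", "junk\n"]
print("".join(copy_between(sample, include_markers=True)), end="")
```

Because the function only needs an iterable of lines, an open file object can be passed in place of `sample`, and the result can be handed to `outfile.writelines(...)`.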
9

If the text files aren't large, you can read the whole file into memory and then use a regular expression:

import re
with open('data.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
    myfile2.write(text)

2 Comments

Regex is way overkill for this problem. Also, you don't handle the case where one of the lines is Ender's Game (the End in the regex needs a newline). Further, the usage of \n is not cross-platform, as Windows uses \r\n for line endings.
@inspectorG4dget From my experience, regular expressions are never overkill. If you're good with a dialect, it will have predictable behavior. Using them helps to maintain your skills, which is good because they are robust enough to handle nearly every text operation. Still, your answer is elegant and rocks +1.
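For anyone who still prefers a regex, the concerns raised in the first comment can be addressed by anchoring both markers to whole lines. The following is only a sketch, not the answerer's code: the names `BLOCK_RE` and `extract_blocks` are invented here, and the pattern assumes the markers sit alone on their line (optionally followed by spaces or tabs).

```python
import re

# ^ and $ match at line boundaries thanks to re.MULTILINE;
# re.DOTALL lets "." cross newlines so a block can span many lines.
# \r? tolerates Windows line endings if the text was read without
# newline translation.
BLOCK_RE = re.compile(
    r"^Start[ \t]*\r?\n(.*?)^End[ \t]*\r?$",
    re.DOTALL | re.MULTILINE,
)


def extract_blocks(content):
    """Return the text between each Start/End pair, markers excluded."""
    return [m.group(1) for m in BLOCK_RE.finditer(content)]


text = "x\nStart\nGood Morning\nEnder's Game\nEnd\ny\n"
print(extract_blocks(text))
```

A line such as "Ender's Game" no longer terminates a block, because `End` must be followed by the end of the line. The whole file is still held in memory, so this only suits files of modest size.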
5

I'm not a Python expert, but this code should do the job.

inFile = open("data.txt")
outFile = open("result.txt", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith("End"):
        keepCurrentSet = False

    if keepCurrentSet:
        outFile.write(line)

    if line.startswith("Start"):
        keepCurrentSet = True
inFile.close()
outFile.close()


4

Using itertools.dropwhile, itertools.takewhile, itertools.islice:

import itertools

with open('data.txt') as f, open('result.txt', 'w') as fout:
    it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
    it = itertools.islice(it, 1, None)
    it = itertools.takewhile(lambda line: line.strip() != 'End', it)
    fout.writelines(it)

UPDATE: As inspectorG4dget commented, the code above copies only the first block. To copy multiple blocks, use the following:

import itertools

with open('data.txt', 'r') as f, open('result.txt', 'w') as fout:
    while True:
        it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
        if next(it, None) is None: break
        fout.writelines(itertools.takewhile(lambda line: line.strip() != 'End', it))

1 Comment

Two issues: (1) \n is not cross-platform - Windows uses \r\n. (2) This doesn't handle multiple blocks at all - it only copies over the first block
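As a quick check of the updated multi-block version, the same loop can be run against an in-memory list instead of a file. This is a sketch only; the wrapper name `copy_blocks` is made up for the example. It relies on every pass of the loop sharing one iterator, which is how a file object behaves.

```python
import itertools


def copy_blocks(lines, start="Start", end="End"):
    """Collect the lines between every start/end pair in `lines`."""
    it = iter(lines)  # one shared iterator, like a file object
    out = []
    while True:
        # Skip ahead to the next start marker; next() consumes it
        rest = itertools.dropwhile(lambda line: line.strip() != start, it)
        if next(rest, None) is None:
            break  # input exhausted, no further blocks
        # takewhile stops at (and consumes) the end marker
        out.extend(itertools.takewhile(lambda line: line.strip() != end, rest))
    return out


lines = ["a\n", "Start\n", "one\n", "End\n", "b\n",
         "Start\r\n", "two\r\n", "End\r\n", "c\n"]
print(copy_blocks(lines))
```

Since the markers are compared after `strip()`, lines ending in `\r\n` are handled the same way as lines ending in `\n`.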
2

Move the outFile.write call into the 2nd if:

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
for line in inFile:
    if line.startswith("Start"):
        buffer = ['']
    elif line.startswith("End"):
        outFile.write("".join(buffer))
        buffer = []
    elif buffer:
        buffer.append(line)
inFile.close()
outFile.close()


1
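import re

inFile = open("data.txt")
outFile = open("result.txt", "w")
# Read the whole file into one string
buffer1 = inFile.read()

# re.DOTALL lets "." match newlines, so a block may span several lines
blocks = re.findall(r"(?<=Start)(.*?)(?=End)", buffer1, re.DOTALL)
# Drop the newline left over from the end of each "Start" line
outFile.write("".join(block.lstrip("\n") for block in blocks))
inFile.close()
outFile.close()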

1 Comment

This will fail in cases where the lines Starting awesome sentence and Ender's Game exist in the file
1

I would handle it like this:

inFile = open("data.txt")
outFile = open("result.txt", "w")

data = inFile.readlines()

outFile.write("".join(data[data.index('Start\n')+1:data.index('End\n')]))
inFile.close()
outFile.close()

1 Comment

Very inefficient use of memory in the worst case, and doesn't handle multiple blocks
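If the list-slicing style is still preferred, multiple blocks can be handled by calling `list.index` repeatedly with a start offset. This is a rough sketch under the same assumption as the answer (each marker line is exactly "Start\n" or "End\n"); the function name `slices_between` is invented for illustration, and the whole file is still read into memory.

```python
def slices_between(data, start="Start\n", end="End\n"):
    """Return a list of line-lists, one per start/end pair in `data`."""
    blocks = []
    pos = 0
    while True:
        try:
            s = data.index(start, pos)    # next start marker at or after pos
            e = data.index(end, s + 1)    # first end marker after it
        except ValueError:
            break  # no further complete block
        blocks.append(data[s + 1:e])
        pos = e + 1
    return blocks


data = ["a\n", "Start\n", "x\n", "End\n", "Start\n", "y\n", "End\n"]
print(slices_between(data))
```

With a real file, `data` would be the result of `inFile.readlines()`, and each block can be written out with `outFile.writelines(block)`.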
1

Files are iterators in Python, so this means you don't need to hold a "flag" variable to tell you what lines to write. You can simply use another loop when you reach the start line, and break it when you reach the end line:

with open("data.txt") as in_file, open("result.txt", 'w') as out_file:
    for line in in_file:
        if line.strip() == "Start":
            for line in in_file:
                if line.strip() == "End":
                    break
                out_file.write(line)

3 Comments

What if it's the same keyword, and I want to extract everything in between, starting from the 2nd appearance of that word?
@uniquegino In that case you can add a "flag" variable to count the keyword and enter the second loop when the count satisfies your condition.
Yes, the flag does work, thank you. I found a similar idea here that works well: sopython.com/canon/92/…
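The counting idea from the comments might look like the sketch below. It rests on one reading of the question, namely that the wanted lines sit between occurrence number `first` of the keyword and the next occurrence after it; the function name `between_occurrences` and its parameters are hypothetical.

```python
def between_occurrences(lines, keyword, first=2):
    """Yield lines between occurrence `first` of `keyword` and the next one."""
    count = 0
    for line in lines:
        if line.strip() == keyword:
            count += 1
            if count > first:
                return  # reached the closing occurrence
            continue
        if count == first:
            yield line


lines = ["a\n", "KEY\n", "b\n", "KEY\n", "c\n", "d\n", "KEY\n", "e\n"]
print(list(between_occurrences(lines, "KEY")))
```

If the keyword never appears a further time, everything after occurrence `first` is yielded up to the end of the input.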
0

This is for the case where you want to keep the start and end lines/keywords while extracting the lines between two strings.

Please find below the code snippet that I used to extract sql statements from a shell script

def process_lines(in_filename, out_filename, start_kw, end_kw):
    try:
        inp = open(in_filename, 'r', encoding='utf-8', errors='ignore')
        out = open(out_filename, 'w+', encoding='utf-8', errors='ignore')
    except FileNotFoundError as err:
        print(f"File {in_filename} not found", err)
        raise
    except OSError as err:
        print(f"OS error occurred trying to open {in_filename}", err)
        raise
    except Exception as err:
        print(f"Unexpected error opening {in_filename} is",  repr(err))
        raise
    else:
        with inp, out:
            copy = False
            for line in inp:
                # first IF block to handle if the start and end on same line
                if line.lstrip().lower().startswith(start_kw) and line.rstrip().endswith(end_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    copy = False
                    continue
                elif line.lstrip().lower().startswith(start_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    continue
                elif line.rstrip().endswith(end_kw):
                    if copy:  # keep the ends with keyword
                        out.write(line)
                    copy = False
                    continue
                elif copy:
                    # write
                    out.write(line)


if __name__ == '__main__':
    infile = "/Users/testuser/Downloads/testdir/BTEQ_TEST.sh"
    outfile = f"{infile}.sql"
    statement_start_list = ['database', 'create', 'insert', 'delete', 'update', 'merge']
    statement_end = ";"
    process_lines(infile, outfile, tuple(statement_start_list), statement_end)
