Remove string and all lines before string from file

Question

I have a file with thousands of lines of data in it. I am reading the file in and editing it.

The following tag is about 900 lines in or more (it varies per file):

<Report name="test" xmlns:cm="http://www.example.org/cm">

I need to remove that line and everything before it in several files, so I need the code to search for that tag and delete it and everything above it. It will not always be 900 lines down; it will vary. However, the tag will always be the same.

I already have the code to read in the lines and write to a file. I just need the logic behind finding that line and removing it and everything before it.

I tried reading the file in line by line and then writing to a new file once it hits on that string, but the logic is incorrect:

readFile = open(firstFile)
lines = readFile.readlines()
readFile.close()
w = open('test','w')
for item in lines:
    if (item == "<Report name="test" xmlns:cm="http://www.example.org/cm">"):
        w.writelines(item)
w.close()

In addition, the exact string will not be the same in each file. The value "test" will be different. Perhaps I need to check for the tag name "<Report name".

Read the file, line by line, and ignore everything before you reach this line. Then write everything after it to an output file.
– StoryTeller - Unslander Monica, Commented Dec 3, 2012 at 22:55
I tried something like this, but I am not getting the result I need. Perhaps my logic is incorrect here: readFile = open(firstFile) lines = readFile.readlines() readFile.close() w = open('test','w') for item in lines: if (item == "<Report name="test" xmlns:cm="domain.org/cm">"): w.writelines(item) w.close()
– Chango Mango, Commented Dec 3, 2012 at 23:16
Please edit your question and add your code to it. Comments aren't the place to present code.
– StoryTeller - Unslander Monica, Commented Dec 3, 2012 at 23:18

Marius · Accepted Answer · 2012-12-04 00:04:23Z

3

You can use a flag like tag_found to check when lines should be written to the output. You initially set the flag to False, and then change it to True once you've found the right tag. When the flag is True, you copy the line to the output file.

TAG = '<Report name="test" xmlns:cm="http://www.domain.org/cm">'

tag_found = False
with open('tag_input.txt') as in_file:
    with open('tag_output.txt', 'w') as out_file:
        for line in in_file:
            if not tag_found:
                if line.strip() == TAG:
                    tag_found = True
            else:
                out_file.write(line)

PS: The with open(filename) as in_file: syntax uses what Python calls a "context manager". The short explanation is that it automatically takes care of closing the file safely when the with: block finishes, so you don't have to remember to put in my_file.close() statements.
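As a rough illustration of what the with: block saves you, the sketch below writes a file twice: once with an explicit try/finally and once with a context manager. The function names and the demo.txt file are made up for this example and are not part of the answer above.

```python
import os
import tempfile


def write_with_try_finally(path, text):
    # Manual version: close() has to be called explicitly, even on errors.
    f = open(path, "w")
    try:
        f.write(text)
    finally:
        f.close()
    return f.closed


def write_with_context_manager(path, text):
    # Context-manager version: the file is closed when the block exits.
    with open(path, "w") as f:
        f.write(text)
    return f.closed


if __name__ == "__main__":
    # Hypothetical scratch file in a temporary directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    print(write_with_try_finally(demo_path, "hello\n"))
    print(write_with_context_manager(demo_path, "hello\n"))
```

In both functions the file object reports itself as closed afterwards; the with: form simply gets there with less code.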

edited Dec 4, 2012 at 0:04

answered Dec 3, 2012 at 23:22

Marius


6 Comments

Chango Mango Over a year ago

This is good. The only problem is that the exact TAG will not be the same in each file. The value "test" will be different. I perhaps need to check for the tag name "Report name" or contains?

Marius Over a year ago

You know your data better than anyone else on here, if you think something like <Report name will capture all the right lines without giving you false matches, then that's what you should match against (you might need to change the test to if line.startswith(TAG) or something).

Chango Mango Over a year ago

Yes, there is only one line that started with <Report name. I will test this out.

Chango Mango Over a year ago

That worked perfectly! I have never used the "with" statement before. If you don't mind, could you walk me through the logic here?

Marius Over a year ago

Check the little bit I've added: it's a way to open files without having to remember to call .close() on the file. The file should always be closed safely, even if something goes wrong with your program.
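The prefix-matching variant discussed in these comments could be sketched roughly as follows. The function name copy_after_tag and its parameters are invented for illustration, and it assumes only one line in the file starts with <Report name, as stated above.

```python
def copy_after_tag(in_path, out_path, tag_prefix="<Report name"):
    """Copy every line that follows the first line starting with tag_prefix.

    Returns True if the tag was found, False otherwise (the output file is
    then left empty).
    """
    tag_found = False
    with open(in_path) as in_file, open(out_path, "w") as out_file:
        for line in in_file:
            if tag_found:
                out_file.write(line)
            elif line.lstrip().startswith(tag_prefix):
                # The tag line itself is skipped; copying starts on the next line.
                tag_found = True
    return tag_found
```

Returning the flag lets the caller notice files where the tag never appeared, instead of silently producing an empty output.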


score 0 · Answer · 2012-12-04 00:20:59Z

You can use a regular expression to match your line:

regex1 = '^<Report name=.*xmlns:cm="http://www.domain.org/cm">$'

Get the indexes of the lines that match the regex (this needs import re, and lines is the list returned by readlines()):

listIndex = [i for i, item in enumerate(lines) if re.search(regex1, item)]

Slice the list, starting at the line after the first match:

listLines = lines[listIndex[0] + 1:]

And write to a file (the lines already end with newlines, so no separator is needed):

with open("filename.txt", "w") as fileOutput:
    fileOutput.writelines(listLines)

pseudocode

Try something like this:

import re

regex1 = '^<Report name=.*xmlns:cm="http://www.domain.org/cm">$' # Variable @name
regex2 = '^<Report name=.*xmlns:cm=.*>$' # Variable @name & @xmlns:cm

with open(firstFile, "r") as fileInput:
    listLines = fileInput.readlines()

# Indexes of every line that matches the tag
listIndex = [i for i, item in enumerate(listLines) if re.search(regex1, item)]
# listIndex = [i for i, item in enumerate(listLines) if re.search(regex2, item)] # Uncomment for variable @name & @xmlns:cm

# Keep only the lines after the first match; raises IndexError if nothing matched
with open("out_" + firstFile, "w") as fileOutput:
    fileOutput.writelines(listLines[listIndex[0] + 1:])

It's good to explain what you expect your regexes to do, especially if you're suggesting them to people who might not necessarily be familiar with them. Yours assumes that the name= field will change from file to file, but that the xmlns:cm= field won't.
@Marius you're right, the regex would be different in that case
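To make the difference between the two patterns concrete, here is a small sketch that checks a few made-up sample lines against each one. The constant names, the matches helper, and the sample lines are all hypothetical; only the two patterns are taken from the answer.

```python
import re

# Patterns from the answer: the first allows only the name attribute to vary,
# the second allows both name and xmlns:cm to vary.
NAME_VARIES = r'^<Report name=.*xmlns:cm="http://www.domain.org/cm">$'
NAME_AND_NS_VARY = r'^<Report name=.*xmlns:cm=.*>$'


def matches(pattern, line):
    # strip() drops the trailing newline and any surrounding whitespace.
    return re.search(pattern, line.strip()) is not None


# Made-up sample lines for illustration.
same_ns = '<Report name="other" xmlns:cm="http://www.domain.org/cm">\n'
other_ns = '<Report name="test" xmlns:cm="http://www.example.org/cm">\n'
not_report = '<Data name="test">\n'

if __name__ == "__main__":
    for line in (same_ns, other_ns, not_report):
        print(matches(NAME_VARIES, line), matches(NAME_AND_NS_VARY, line))
```

The first pattern rejects a line whose namespace URL differs, while the second accepts any <Report name=... line that also carries an xmlns:cm attribute.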
