
I have a file with thousands of lines of data in it. I am reading in the file and editing it.

The following tag appears roughly 900 lines in or more (it varies per file):

<Report name="test" xmlns:cm="http://www.example.org/cm">

I need to remove that line and everything before it in several files, so I need the code to search for that tag and delete it and everything above it. It will not always be 900 lines down; the position varies, but the tag will always be the same.

I already have the code to read in the lines and write to a file. I just need the logic behind finding that line and removing it and everything before it.

I tried reading the file in line by line and then writing to a new file once it hits on that string, but the logic is incorrect:

readFile = open(firstFile)
lines = readFile.readlines()
readFile.close()
w = open('test','w')
for item in lines:
    if item == '<Report name="test" xmlns:cm="http://www.example.org/cm">':
        w.writelines(item)
w.close()

In addition, the exact string will not be the same in each file: the value "test" will be different. Perhaps I need to check for the tag name "<Report name" instead.

  • Read the file line by line and ignore everything until you reach this line. Then write everything after it to an output file. Commented Dec 3, 2012 at 22:55
  • I tried something like this, but I am not getting the result I need. Perhaps my logic is incorrect here: readFile = open(firstFile) lines = readFile.readlines() readFile.close() w = open('test','w') for item in lines: if (item == "<Report name="test" xmlns:cm="domain.org/cm">"): w.writelines(item) w.close() Commented Dec 3, 2012 at 23:16
  • Please edit your question and add your code to it. Comments aren't the place to present code. Commented Dec 3, 2012 at 23:18

2 Answers

You can use a flag like tag_found to check when lines should be written to the output. You initially set the flag to False, and then change it to True once you've found the right tag. When the flag is True, you copy the line to the output file.

TAG = '<Report name="test" xmlns:cm="http://www.domain.org/cm">'

tag_found = False
with open('tag_input.txt') as in_file:
    with open('tag_output.txt', 'w') as out_file:
        for line in in_file:
            if not tag_found:
                if line.strip() == TAG:
                    tag_found = True
            else:
                out_file.write(line)

PS: The with open(filename) as in_file: syntax uses what Python calls a "context manager". In short, a context manager automatically closes the file safely when the with: block finishes, so you don't have to remember to add my_file.close() statements.
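
If the name attribute differs between files, as the question notes, the same flag loop can match on a prefix instead of on the whole line. The following is a minimal sketch under that assumption. The function name strip_through_tag and the constant TAG_PREFIX are made up for illustration, and the sketch assumes only one line in each file starts with <Report name=.

```python
TAG_PREFIX = '<Report name='  # assumed prefix; adjust if other lines could start the same way


def strip_through_tag(in_path, out_path, prefix=TAG_PREFIX):
    """Copy in_path to out_path, skipping the first line that starts with
    prefix and everything before it. Returns True if the tag was found."""
    tag_found = False
    with open(in_path) as in_file, open(out_path, 'w') as out_file:
        for line in in_file:
            if tag_found:
                out_file.write(line)
            elif line.strip().startswith(prefix):
                # The tag line itself is not written, only what follows it.
                tag_found = True
    return tag_found
```

Calling it once per file, for example strip_through_tag(name, 'out_' + name) inside a loop over the file names, would cover the several-files case. A False return means the tag never appeared, in which case the output file is left empty.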


6 Comments

This is good. The only problem is that the exact TAG will not be the same in each file. The value "test" will be different. I perhaps need to check for the tag name "Report name" or contains?
You know your data better than anyone else on here. If you think something like <Report name will capture all the right lines without giving you false matches, then that's what you should match against (you might need to change the test to if line.startswith(TAG) or something).
Yes, there is only one line that started with <Report name. I will test this out.
That worked perfectly! I have never used the "with" statement before. If you don't mind, could you walk me through the logic here?
Check the little bit I've added- it's a way to open files without having to remember to call .close() on the file. The file should always be closed safely, even if something goes wrong with your program.

You can use a regular expression to match your line:

regex1 = '^<Report name=.*xmlns:cm="http://www.domain.org/cm">$'

Get the indexes of the items that match the regex:

listIndex = [i for i, item in enumerate(lines) if re.search(regex1, item)]

Slice the list just after the first match, so the tag line itself is dropped as well:

listLines = lines[listIndex[0] + 1:]

And write to a file (the lines still end in newlines, so no join is needed):

with open("filename.txt", "w") as fileOutput:
    fileOutput.writelines(listLines)

pseudocode

Try something like this:

import re

regex1 = '^<Report name=.*xmlns:cm="http://www.domain.org/cm">$' # Variable @name
regex2 = '^<Report name=.*xmlns:cm=.*>$' # Variable @name & @xmlns:cm

with open(firstFile, "r") as fileInput:
    listLines = fileInput.readlines()

# Indexes of every line that matches the tag pattern
listIndex = [i for i, item in enumerate(listLines) if re.search(regex1, item)]
# listIndex = [i for i, item in enumerate(listLines) if re.search(regex2, item)] # Uncomment for variable @name & @xmlns:cm

# Keep only the lines after the first match (raises IndexError if nothing matched).
# The lines already end in "\n", so writelines() is used instead of "\n".join().
with open("out_" + firstFile, "w") as fileOutput:
    fileOutput.writelines(listLines[listIndex[0] + 1:])
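
To make clear what the two patterns are expected to match, here is a small sketch that runs them against a few sample lines. The sample strings are invented for illustration and are not taken from any real file. In both patterns ^ and $ anchor the match to the start and end of the line and .* matches any run of characters, so regex1 lets the name value vary while fixing the xmlns:cm value, and regex2 lets both vary.

```python
import re

regex1 = '^<Report name=.*xmlns:cm="http://www.domain.org/cm">$'
regex2 = '^<Report name=.*xmlns:cm=.*>$'

# Made-up sample lines, each with the trailing newline that readlines() would keep.
samples = [
    '<Report name="test" xmlns:cm="http://www.domain.org/cm">\n',
    '<Report name="other" xmlns:cm="http://www.domain.org/cm">\n',
    '<Report name="test" xmlns:cm="http://www.example.org/cm">\n',
    '<Row name="test">\n',
]

# Without re.MULTILINE, $ still matches just before a final newline,
# so the lines do not need to be stripped first.
matches1 = [bool(re.search(regex1, s)) for s in samples]
matches2 = [bool(re.search(regex2, s)) for s in samples]

print(matches1)  # [True, True, False, False]
print(matches2)  # [True, True, True, False]
```

Because of the ^ anchor, a tag line with leading spaces or tabs would not match either pattern; calling item.strip() before the search is one way to allow for indentation.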

3 Comments

It's good to explain what you expect your regexes to do, especially if you're suggesting them to people who might not necessarily be familiar with them. Yours assumes that the name= field will change from file to file, but that the xmlns:cm= field won't.
@Marius you're right, the regex would be different in that case
This is a great solution as well!
