Extracting line data based on specific pattern in text file using python

Question

I have a huge report file in which I need to process the lines that start with the code "MLT-TRR". So far, my script extracts all the lines containing that code and writes them to a separate file, Rules.txt, which looks like this:

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  63   10   Port is not registered [Folder: 'Picture']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 315  10   Port is not registered [Folder: 'Picture.first_inst']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  315  10   Port is not registered [Folder: 'Picture.second_inst']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 317  10   Port is not registered [Folder: 'Picture.third_inst']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  317  10   Port is not registered [Folder: 'Picture.fourth_inst']

For each of these lines I have to extract the data that follows "[Folder: 'Picture". If nothing follows "[Folder: 'Picture", as in the first line, that line should be skipped and processing should move on to the next one. I also want to extract the file name from each of those lines: top.txt, tree.txt.

I couldn't think of a simple way to do this, because it involves a loop and gets messy. Is there a way to extract just the file paths and the trailing data of each line?

import sys

inFile1 = 'Rules.txt'

def open_file(filename):
    try:
        # Collect every line of the report that contains the rule code.
        with open(filename, 'r') as f:
            targets = [line for line in f if "MLT-TRR" in line]
        print(targets)
        with open(inFile1, 'w') as f2:
            for line in targets:
                # Each line already ends with '\n'; the extra one
                # leaves a blank line between entries in Rules.txt.
                f2.write(line + "\n")
    except Exception as e:
        print(e)
        sys.exit(1)


if __name__ == '__main__':
    filename = sys.argv[1]
    open_file(filename)

You've asked several different things here: (a) how to extract the relevant data items from each line; (b) how to check whether a data item exists in another file; (c) how to append a line to a specified file. It is unclear which of these tasks you actually need assistance with, but for whichever ones you do, please break them down into separate questions.
– alani, Commented Aug 27, 2020 at 3:53
@alani For each of the lines, I want to extract the file names (top.txt, tree.txt) and the data that follows the pattern 'Picture.
– k11, Commented Aug 27, 2020 at 5:14
Okay, in that case edit the question to remove all the discussion of what you intend to do with that information once you have found it (checking Report.txt and appending "updated" to another file), because it is a distraction and makes it look like you want people to solve a bigger problem. Simply state what data you want to extract from the line.
– alani, Commented Aug 27, 2020 at 5:17

Aaron Keesing · Accepted Answer · 2020-08-28 05:59:06Z

1

To extract the filenames and other data, you should be able to use a regular expression:

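import re

# Rules.txt is the filtered file from the question.
with open("Rules.txt") as f:
    for line in f:
        match = re.match(r"^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)']", line)
        if match:
            filename = match.group(1)  # full Windows path
            data = match.group(2)      # text after 'Picture.'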

This assumes that the data after 'Picture. contains only alphanumeric characters and underscores. You may have to change the allowed characters in the filename part, [-A-Za-z0-9_:\\.], if your filenames contain other characters such as spaces. It also assumes the filenames start with a Windows drive letter (that is, they are absolute paths), which makes them easier to distinguish from other data in the line.
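As a quick check of the pattern, here is a minimal sketch that runs it against three of the sample lines from the question. The names SAMPLE_LINES, PATTERN and extract_pairs are made up for illustration and are not part of the original answer; the first line has nothing after 'Picture, so it should be skipped.

```python
import re

# Sample lines copied from the question's Rules.txt.
SAMPLE_LINES = [
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  63   10   Port is not registered [Folder: 'Picture']",
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 315  10   Port is not registered [Folder: 'Picture.first_inst']",
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  315  10   Port is not registered [Folder: 'Picture.second_inst']",
]

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(
    r"^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)'\]"
)


def extract_pairs(lines):
    """Return (path, data) tuples for lines that have text after 'Picture.'."""
    pairs = []
    for line in lines:
        match = PATTERN.match(line)
        if match:  # lines without a 'Picture.<something>' suffix do not match
            pairs.append((match.group(1), match.group(2)))
    return pairs


for path, data in extract_pairs(SAMPLE_LINES):
    print(path, data)
```

Returning a list of tuples keeps the path and its suffix together, which should make any later per-file processing easier.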

If you just want the basename of the filename, then after extracting it you can use os.path.basename or pathlib.Path.name.
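One caveat: os.path.basename and pathlib.Path follow the path rules of the operating system the script runs on, so on Linux or macOS a backslash-separated path is not split at all. If the report might be processed on a non-Windows machine, a sketch like the following (my suggestion, not part of the original answer) uses pathlib.PureWindowsPath or ntpath, which apply Windows rules everywhere:

```python
import ntpath
from pathlib import PureWindowsPath

# Example path taken from the question's sample data.
path = r"C:\Users\Di\Pictures\SavedPictures\tree.png"

name = PureWindowsPath(path).name   # file name with extension
stem = PureWindowsPath(path).stem   # file name without extension
same_name = ntpath.basename(path)   # ntpath is the Windows flavour of os.path

print(name, stem, same_name)
```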

edited Aug 28, 2020 at 5:59

answered Aug 27, 2020 at 5:58


2 Comments

k11 Over a year ago

Thanks for the suggestion. What changes do I need to make to your code in case the input file has extra data between the patterns to be extracted?

Aaron Keesing Over a year ago

It might depend on what the extra data is. I looked at your updated question, and modifying the regular expression to ^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)'] might be sufficient. It now requires the filename to start with a Windows drive letter (e.g. C:), and it accepts any characters (.*) between 'MLT-TRR', the filename, and the Picture folder.

WillH · Accepted Answer · 2020-08-27 05:56:02Z

I had a very similar problem and solved it by using a regex to search for the specific line 'key' (in your case "MLT-TRR") and then specifying which 'bytes' (character positions) to take from that line. I then append the selected data to a list.

import re  # regular expression module

# Placeholder path, not from the original answer; set it to your input file.
import_file_path = "survey.p190"

# Make empty lists:
P190 = []       # my file
shot = []       # events in my file (multiple lines of text for each event)
S011east = []   # what I want
S011north = []  # another thing I want

# Create your regex:
S011 = re.compile(r"^S0\w*\W*11\b")

# Search and append:
# Open the P190 file
with open(import_file_path, 'rt') as infile:
    for lines in infile:
        P190.append(lines.rstrip('\n'))
# Locate specific lines and extract data by fixed column positions
for line in P190:
    if S011.search(line) is not None:
        easting = float(line[47:55])
        S011east.append(easting)
        northing = float(line[55:64])
        S011north.append(northing)

If you set up your regex to require "MLT-TRR", then any characters, then "Folder: 'Picture." (including the trailing dot), it will skip any lines that don't have any further information.
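A rough sketch of that filtering idea, assuming the lines look like the ones in the question. The names HAS_SUFFIX and folder_suffix are hypothetical. The regex is used only to decide whether to keep a line, and plain string splitting then pulls out the text after 'Picture.:

```python
import re

# Matches only MLT-TRR lines whose folder name continues past 'Picture.'
HAS_SUFFIX = re.compile(r"^MLT-TRR\b.*\[Folder: 'Picture\.")


def folder_suffix(line):
    """Return the text after 'Picture.' or None if the line should be skipped."""
    if HAS_SUFFIX.search(line) is None:
        return None
    # Take what follows the marker, then cut at the closing quote.
    tail = line.split("[Folder: 'Picture.", 1)[1]
    return tail.split("'", 1)[0]


skipped = folder_suffix("MLT-TRR  Warning  C:\\a\\top.png  63  10  Port is not registered [Folder: 'Picture']")
kept = folder_suffix("MLT-TRR  Warning  C:\\a\\tree.png  317  10  Port is not registered [Folder: 'Picture.third_inst']")
print(skipped, kept)
```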

For the second part of your question: I doubt your file names are a constant length, so the above method won't work, as you can't specify a fixed number of characters to extract. This code extracts the name and extension from a file path; you could apply it to whatever you extract from each line.

import os
tail=os.path.basename(import_file_path) #Get file name from path
