Parse a Generated File in Python

Question

I'm trying to parse generated files into a list of objects.

Unfortunately the structure of the generated files is not always the same, but they contain the same fields (and lots of other garbage).

For example:

    function foo();              # Don't Care
    function maybeanotherfoo();  # Don't Care
    int maybemoregarbage;        # Don't Care

    
    product_serial = "CDE1102"; # I want this <---------------------
    unnecessary_info1 = 10;     # Don't Care
    unnecessary_info2 = "red"   # Don't Care
    product_id = 1134412;       # I want this <---------------------
    unnecessary_info3 = "88"    # Don't Care

    product_serial = "DD1232";  # I want this <---------------------
    product_id = 3345111;       # I want this <---------------------
    unnecessary_info1 = "22"    # Don't Care
    unnecessary_info2 = "panda" # Don't Care

    product_serial = "CDE1102"; # I want this <---------------------
    unnecessary_info1 = 10;     # Don't Care
    unnecessary_info2 = "red"   # Don't Care
    unnecessary_info3 = "bear"  # Don't Care
    unnecessary_info4 = 119     # Don't Care
    product_id = 1112331;       # I want this <---------------------
    unnecessary_info5 = "jj"    # Don't Care

I want a list of objects (each object has: serial and id).

I have tried the following:


import re

class Product:
    def __init__(self, id, serial):
        self.product_id = id
        self.product_serial = serial

linenum = 0
first_string = "product_serial"
second_string = "product_id"
with open('products.txt', "r") as products_file:
    for line in products_file:
        linenum += 1
        if line.find(first_string) != -1:
            product_serial = re.search('\"([^"]+)', line).group(1)
            #How do I proceed?

Any advice would be greatly appreciated! Thanks!

So what does your code do? Does it work? Are there errors? If so, what are they? – MattDMo, Commented Sep 8, 2020 at 17:49

My code can find the first product_serial (CDE1102). But how can I then find the product_id and then continue parsing from that point on? – Tal J, Commented Sep 8, 2020 at 17:50

Please repeat "on topic" and "how to ask" from the intro tour. “Show me how to solve this coding problem” is not a Stack Overflow issue. You have to make an honest attempt, and then ask a specific question about your algorithm or technique. "Any advice" is far too broad for Stack Overflow. There are many tutorials that show you how to read a file, how to process string data, etc. You should be able to identify a constant string in the input and to separate input lines. – Prune, Commented Sep 8, 2020 at 17:52

AKX · Accepted Answer · 2020-09-08 17:54:04Z

I've inlined the data here using an io.StringIO(), but you can substitute your products_file for data.

The idea is that we gather key/values into current_object, and as soon as we have all the data we know we need for a single object (the two keys), we push it onto a list of objects and prime a new current_object.

You could use something like if line.startswith('product_serial') instead of the admittedly complex regexp.

import io
import re

data = io.StringIO("""
    function foo();             
    function maybeanotherfoo(); 
    int maybemoregarbage;       

    
    product_serial = "CDE1102"; 
    unnecessary_info1 = 10;     
    unnecessary_info2 = "red"   
    product_id = 1134412;       
    unnecessary_info3 = "88"    

    product_serial = "DD1232";  
    product_id = 3345111;       
    unnecessary_info1 = "22"    
    unnecessary_info2 = "panda" 

    product_serial = "CDE1102"; 
    unnecessary_info1 = 10;     
    unnecessary_info2 = "red"   
    unnecessary_info3 = "bear"  
    unnecessary_info4 = 119     
    product_id = 1112331;       
    unnecessary_info5 = "jj"    
""")

objects = []

current_object = {}
for line in data:
    line = line.strip()  # Remove leading and trailing whitespace
    # Match `product_id = 123;` or `product_serial = "ABC";` (trailing semicolon optional)
    m = re.match(r'^(product_id|product_serial)\s*=\s*(\d+|"(?:.+?)");?$', line)

    if m:
        key, value = m.groups()
        current_object[key] = value.strip('"')
        if len(current_object) == 2:  # Got the two keys we want, ship the object
            objects.append(current_object)
            current_object = {}

print(objects)
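
The startswith alternative mentioned above could look roughly like the sketch below, which also builds the Product objects the question asks for instead of plain dicts. The names WANTED_KEYS and parse_products, and the dataclass version of Product, are illustrative and not part of the original answer. The sketch assumes each wanted line has the form key = value with an optional trailing semicolon and no trailing comment, as in the inlined data above. Unlike the regexp version, it converts product_id to an int.

```python
import io
from dataclasses import dataclass


@dataclass
class Product:
    product_id: int
    product_serial: str


# Hypothetical constant: the only keys we care about.
WANTED_KEYS = ("product_serial", "product_id")


def parse_products(lines):
    """Collect Product objects from an iterable of text lines."""
    products = []
    current = {}
    for line in lines:
        line = line.strip()
        # str.startswith accepts a tuple of prefixes.
        if not line.startswith(WANTED_KEYS):
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key not in WANTED_KEYS:
            continue  # a different key that merely shares a prefix
        # Drop the optional semicolon and any surrounding quotes.
        current[key] = value.strip().rstrip(";").strip().strip('"')
        if len(current) == len(WANTED_KEYS):  # both keys seen, ship the object
            products.append(
                Product(int(current["product_id"]), current["product_serial"])
            )
            current = {}
    return products


sample = io.StringIO('''
    function foo();
    product_serial = "CDE1102";
    unnecessary_info1 = 10;
    product_id = 1134412;

    product_serial = "DD1232";
    product_id = 3345111;
''')

products = parse_products(sample)
print(products)
```

To read from a real file, pass the open file object instead of sample, for example parse_products(products_file) inside the with open(...) block from the question.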
