
I have a log file, "internet.log", which is about 10 GB. When I parse it in Python I get a "MemoryError" exception. The log file looks something like this:

Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: reply fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: query[A] fd-geoycpi-uno.gycpi.b.yahoodns.net from 192.168.1.33
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.106
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.124
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 216.115.100.123
Jun 15 16:26:21 dnsmasq[1979]: cached fd-geoycpi-uno.gycpi.b.yahoodns.net is 74.6.160.107
Jun 15 16:26:23 dnsmasq[1979]: query[A] armdl.adobe.com from 192.168.1.24

I'm currently using this method to parse the log file:

def parse():
    Date = []
    IPAddress = []
    DomainsVisited = []
    with open("internet.log", "r") as file:
        content = file.readlines()
        for items in content:
            if 'query[A]' in items:
                getDate(Date, items)
                getIPAddress(IPAddress, items)
                getDomainsVisited(DomainsVisited, items)
    finalResult = [[i, j, k] for i, j, k in zip(Date, IPAddress, DomainsVisited)]
    return display(finalResult)

If I parse a log file of about 10 MB, the output is displayed, but when I parse the 10 GB log file I get the error. How can I fix this? Thank you.

  • Well, you're reading the whole file into memory with file.readlines(). Saying for items in file: will read it a line at a time. Commented Jun 22, 2018 at 8:41
  • The rest of your code doesn't look right though. E.g. for each item you are clobbering Date rather than appending to the list. Commented Jun 22, 2018 at 8:42
  • @PeterWood Ya sorry I'll change it Commented Jun 22, 2018 at 8:53
  • @PeterWood for items in file also didn't work. I got this message in the Python console: "Process finished with exit code 247" Commented Jun 22, 2018 at 8:55

2 Answers


You should not use file.readlines(). It reads the whole file into memory at once, which for a 10 GB file is likely to exhaust it. Instead, iterate over the file object, which reads one line at a time:

with open("internet.log", "r") as file:
    for items in file:
        ...  # process one line here

(Of course, depending on what you do with the data, you could still run out of memory as you go through the file, for example if you keep every parsed line in a list.)
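To illustrate that caveat, here is a minimal sketch of processing matching lines one at a time with a generator, so that nothing accumulates in memory. The names iter_query_lines and count_queries are hypothetical helpers introduced only for this example; they are not part of the question's code.

```python
def iter_query_lines(path):
    """Yield each line containing 'query[A]', one at a time (hypothetical helper)."""
    with open(path, "r") as log:
        for line in log:
            if 'query[A]' in line:
                yield line.rstrip("\n")


def count_queries(path):
    """Count matching lines without ever building a list of them."""
    return sum(1 for _ in iter_query_lines(path))
```

Whether this is enough depends on the consumer: anything that collects the yielded lines into a list brings the memory problem back.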


1 Comment

Didn't work :( I got this message in python console "Process finished with exit code 247"

You're reading the whole file into memory with readlines.

You can read a line at a time by saying for items in file.

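Cleaning up your code a little, with better variable names and a list comprehension to build the result (this assumes getDate, getIPAddress and getDomainsVisited take a line and return a value instead of appending to a list):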

with open("internet.log") as log:
    finalResults = [[getDate(line), getIPAddress(line), getDomainsVisited(line)]
                    for line in log
                    if 'query[A]' in line]

I would extract the per-line parsing into a function:

def parse_log_line(line):
    return [getDate(line),
            getIPAddress(line),
            getDomainsVisited(line)]

Then your code would be:

with open("internet.log") as log:
    finalResults = [parse_log_line(line)
                    for line in log
                    if 'query[A]' in line]
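Note that the list comprehension still holds every parsed row in memory, which may be a problem if a large share of a 10 GB log is query[A] lines. A hedged sketch of one alternative is to stream rows straight to a CSV file. The regular expression is inferred from the sample lines in the question and is an assumption about the format; parse_query_line, iter_queries and export_queries are hypothetical names standing in for the question's own helpers.

```python
import csv
import re

# Pattern inferred from the sample log lines in the question (an assumption,
# not dnsmasq's documented format).
QUERY_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+: "
    r"query\[A\] (?P<domain>\S+) from (?P<ip>\S+)"
)


def parse_query_line(line):
    """Return [date, ip, domain] for a query[A] line, or None for other lines."""
    match = QUERY_RE.match(line)
    if match is None:
        return None
    return [match.group("date"), match.group("ip"), match.group("domain")]


def iter_queries(lines):
    """Lazily yield parsed rows from any iterable of log lines."""
    for line in lines:
        row = parse_query_line(line)
        if row is not None:
            yield row


def export_queries(log_path, csv_path):
    """Write parsed rows to a CSV file without collecting them in a list."""
    with open(log_path) as log, open(csv_path, "w", newline="") as out:
        csv.writer(out).writerows(iter_queries(log))
```

Because iter_queries is a generator and the file object is read lazily, only the current line and row are held by this code at any moment.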
