New to Python here.

I am trying to find the most active IP address in a log.txt file and write it to another text file. My first step is to extract all the IP addresses; the second is to find the one that occurs most often. But I am stuck on the first step:

with open('./log_input/log.txt', 'r+') as f:
    # loop over the lines in the text file
    for line in f:
        # split line at whitespace
        cols = line.split()

        # get last column
        byte_size = cols[-1]

        # get the first column [0]
        ip_addresses = cols[0]

        # remove brackets
        byte_size = byte_size.strip('[]')

        # write the byte size in the resource file
        resource_file = open('./log_output/resources.txt', 'a')
        resource_file.write(byte_size + '\n')
        resource_file.truncate()
        # write the ip addresses in the host file
        host_file = open('./log_output/hosts.txt', 'a')
        host_file.seek(0)
        host_file.write(ip_addresses + '\n')
        host_file.truncate()

    resource_file.close()
    host_file.close()

The problem is that in the new hosts.txt file the IP addresses get printed again instead of the file being overwritten. I also tried this:

    resource_file = open('./log_output/resources.txt', 'w')
    host_file = open('./log_output/hosts.txt', 'w')

I tried 'w+' and so on as well, but 'w' or 'w+' leaves only one IP address in the hosts file.

Can someone guide me through this?

Sample Input File

www-c2.proxy.aol.com - - [01/Jul/1995:00:03:52 -0400] "GET /history/skylab/skylab-1.html HTTP/1.0" 200 1659
isdn6-34.dnai.com - - [01/Jul/1995:00:03:52 -0400] "GET /images/kscmap-tiny.gif HTTP/1.0" 200 2537
isdn6-34.dnai.com - - [01/Jul/1995:00:03:52 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635 
ix-ftw-tx1-24.ix.netcom.com - - [01/Jul/1995:00:03:52 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310
  • I'd first suggest opening your resource file only once: resource_file = open('./log_output/resources.txt', 'a') should go before you start the for loop. Same for the host_file. Commented Apr 4, 2017 at 16:47
  • Can you post a few example lines of the input file so we can test? Commented Apr 4, 2017 at 16:49
  • "it reprints the ip addresses instead of overwriting" ... I have no idea what that means. What do you want to be in that file? All addresses with duplicates, or all addresses without duplicates? Commented Apr 4, 2017 at 16:51
  • One problem is that you write and truncate but don't close the file. So the next host_file = open('./log_output/hosts.txt', 'a') opens an outdated version of the file and then as it reassigns host_file, the prior loop's data is flushed to the file. Close the thing after you use it or put it in a with clause. Commented Apr 4, 2017 at 16:53
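
The two suggestions in the comments (open each output file once, before the loop, and let a with block close it) could be sketched roughly like this. The function name split_log and its three path parameters are placeholders made up for illustration, and skipping blank lines is an assumption about what the asker wants:

```python
def split_log(log_path, hosts_path, resources_path):
    """Copy the first and last column of each log line into two files.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Open every file exactly once. Mode 'w' truncates on open, so opening
    # inside the loop would wipe the file on each iteration and leave only
    # the last line behind.
    with open(log_path) as log, \
            open(hosts_path, 'w') as hosts, \
            open(resources_path, 'w') as resources:
        for line in log:
            cols = line.split()
            if not cols:
                continue  # skip blank lines (assumption)
            hosts.write(cols[0] + '\n')       # first column: host / IP
            resources.write(cols[-1] + '\n')  # last column: byte size
```

Because the files are opened with 'w' only once, rerunning the script replaces the old output instead of appending to it, and the with block closes all three files even if a line raises an error.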

2 Answers

collections.Counter is a handy tool for counting things. Feed it an iterable of strings and it builds a dict-like mapping from each string to the number of times it was seen. Counting IP addresses then becomes easy:

>>> import collections
>>> with open('log.txt') as fp:
...     counter = collections.Counter(line.split(' ', 1)[0].lower() for line in fp)
... 
>>> counter
Counter({'isdn6-34.dnai.com': 2, 'ix-ftw-tx1-24.ix.netcom.com': 1, 'www-c2.proxy.aol.com': 1})
>>> counter.most_common(1)
[('isdn6-34.dnai.com', 2)]
>>>
>>>
>>> with open('most_common.txt', 'w') as fp:
...     fp.write(counter.most_common(1)[0][0])
... 
17
>>> open('most_common.txt').read()
'isdn6-34.dnai.com'
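
Outside the interactive prompt, the same idea might look something like the script below. The function names most_active_host and write_most_active_host are made up for this sketch, and ignoring blank lines and returning None for an empty log are assumptions, not something the session above does:

```python
import collections


def most_active_host(log_path):
    """Return (host, count) for the most frequent first column, or None."""
    with open(log_path) as fp:
        counter = collections.Counter(
            # First whitespace-separated field, lowercased as in the session.
            line.split(None, 1)[0].lower()
            for line in fp
            if line.strip()  # ignore blank lines (assumption)
        )
    top = counter.most_common(1)
    return top[0] if top else None


def write_most_active_host(log_path, out_path):
    """Write the most active host to out_path (left empty if the log is empty)."""
    result = most_active_host(log_path)
    with open(out_path, 'w') as out:
        if result is not None:
            out.write(result[0] + '\n')
```

For example, write_most_active_host('log.txt', 'most_common.txt') would read the log and write a single host name to the output file.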

Thanks for all the help and suggestions. This fixed my problem:

new_ip_addresses = ""
new_byte_sizes = ""

with open('./log_input/log.txt', 'r') as f:
    # loop over the lines in the text file
    for line in f:
        # split line at whitespace
        cols = line.split()
        if not cols:
            # skip blank lines
            continue

        # get byte size from the last column
        byte_size = cols[-1]
        new_byte_sizes += byte_size + '\n'

        # get ip/host from the first column
        ip_addresses = cols[0]
        new_ip_addresses += ip_addresses + '\n'

# write the byte sizes to the resource file
with open('./log_output/resources.txt', 'w') as resource_file:
    resource_file.write(new_byte_sizes)

# write the ip addresses to the host file
with open('./log_output/hosts.txt', 'w') as host_file:
    host_file.write(new_ip_addresses)

Basically, accumulating the values in a variable inside the for loop, adding a newline after each one, and writing to the file only once after the loop solved it for me:

new_ip_addresses += ip_addresses + '\n'
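
As a side note, an alternative to growing a string with += is to collect the values in a list and build the text once at the end. This is only a sketch: column_text is a hypothetical helper name, and skipping blank lines is an assumption:

```python
def column_text(lines, index):
    """Return one whitespace-separated column of each line, one value per line.

    Hypothetical helper for illustration; blank lines are skipped.
    """
    values = []
    for line in lines:
        cols = line.split()
        if cols:
            values.append(cols[index])
    # Build the output once, terminating every value with a newline.
    return ''.join(value + '\n' for value in values)
```

With the log file open as f, something like host_file.write(column_text(f, 0)) would then write all hosts in one call.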
