Parse HTTP header using Python and tcpflow

Question

I wrote a program that reads a pcap file and parses the HTTP traffic in the pcap to generate a dictionary that contains HTTP headers for each request and response in this pcap.

My code does the following:

Uses tcpflow to reassemble the TCP segments
Reads the files generated by tcpflow and checks whether each one is related to HTTP
If a file contains HTTP traffic, reads it and generates a corresponding dictionary that contains the HTTP header fields

I tested my code with multiple test cases, but honestly I don't have much experience with Python, so could anyone review it for me, please?

import os
from os import listdir
from os.path import isfile, join
from StringIO import StringIO
import mimetools
def getFields(headers):
    fields={}
    i=1
    for header in headers:
        if len(header)==0:
           continue

        # if this line is complement for the previous line   
        if header.find(" ")==0 or 
           header.find("\t")==0:
           continue

        if len(header.split(":"))>=2:
           key = header.split(":")[0].strip()

           # if the key has multiple values such as cookie
           if fields.has_key(key):
              fields[key]=fields[key]+" "+header[header.find(":")+1:].strip()
           else:
              fields[key]=header[header.find(":")+1:].strip()

              while headers[i].find(" ")==0 or  
                    headers[i].find("\t")==0 :
                    fields[key]=fields[key]+" "+headers[i].strip()
                    i=i+1
              # end of the while loop
          # end of the else

        else:
             # else for [if len(header.split(":"))>=2: ]
             print "ERROR: RFC VIOLATION"

      # end of the for loop
    return fields            




def main():
    # you have to write it in the terminal "cd /home/user/Desktop/empty-dir"
    os.system("tcpflow -r /home/user/Desktop/12.pcap -v")

    for f in listdir("/home/user/Desktop/empty-dir"):
        if f.find("80")==19 or f.find("80")==41:
           with open("/home/user/Desktop/empty-dir"+f) as fh:
                fields={}
                content=fh.read()  #to test you could replace it with content="any    custom http header"
                if content.find("\r\n\r\n")==-1:
                   print "ERROR: RFC VIOLATION"
                   return
                headerSection=content.split("\r\n\r\n")[0]
                headerLines=headerSection.split("\r\n")
                firstLine=headerLines[0]
                firstLineFields=firstLine.split(" ")            
                if len(headerLines)>1:
                   fields=getFields(headerLines[1:])
                if len(firstLineFields)>=3:                     
                   if firstLine.find("HTTP")==0:
                      fields["Version"]=firstLineFields[0]
                      fields["Status-code"]=firstLineFields[1]
                      fields["Status-desc"]=" ".join(firstLineFields[2:])
                   else:
                      fields["Method"]=firstLineFields[0]
                      fields["URL"]=firstLineFields[1]
                      fields["Version"]=firstLineFields[2]
                else:
                  print "ERROR: RFC VIOLATION"
                  continue 
                print fields
                print "__________________"

    return 0

if __name__ == '__main__':
 main()

Welcome to Code Review! This is a good question; thank you for taking the time to write it up so that we can show you proper coding style and techniques. We all look forward to seeing more of your posts!
– Malachi, Commented Jul 22, 2014 at 18:34
Why create a main() function? This is not C. Just put everything that your main() function does under if __name__ == '__main__': and it works like C's main().
– matheussilvapb, Commented Jul 22, 2014 at 19:01
@matheussilvapb A main() function is not entirely a bad idea, I think.
– 200_success, Commented Jul 22, 2014 at 19:04
@200_success It's like writing a = 2; b = 3; c = a + b when I won't need a and b anymore and only need c to equal 5...
– matheussilvapb, Commented Jul 22, 2014 at 19:07

Malachi · Accepted Answer · 2014-07-22 19:34:23Z

5

Newlines and indentation tell the interpreter where statements end and where blocks begin and end, so you have to be very careful with them.

For example, in your if condition you can't put a bare newline between the two conditions.

if header.find(" ")==0 or 
    header.find("\t")==0:
    continue

This code will fail with a SyntaxError because the condition is split across two lines without parentheses or a trailing backslash.

A Python statement ends at the newline. It should read like this:

if header.find(" ")==0 or header.find("\t")==0:
    continue

Same with this piece of code

while headers[i].find(" ")==0 or  
    headers[i].find("\t")==0 :
    fields[key]=fields[key]+" "+headers[i].strip()
    i=i+1

It should read:

while headers[i].find(" ")==0 or headers[i].find("\t")==0 :
    fields[key]=fields[key]+" "+headers[i].strip()
    i=i+1
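
If you would rather keep each condition on its own line, wrapping the whole expression in parentheses should also work, because Python continues a statement across newlines inside brackets. A minimal sketch, where is_folded_line is a hypothetical helper name that is not in your code:

```python
def is_folded_line(line):
    """Return True if the line starts with a space or a tab."""
    # The surrounding parentheses let the condition continue on the
    # next line without a SyntaxError.
    return (line.find(" ") == 0 or
            line.find("\t") == 0)
```

Both the if and the while condition could then call this helper instead of repeating the two find() checks.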

edited Jul 22, 2014 at 19:34

answered Jul 22, 2014 at 18:55

Malachi


When I wrote my code in gedit the indentation was correct, but when I copied and pasted the code here the indentation changed.
– Raghda Hraiz, Commented Jul 22, 2014 at 19:12
@RaghdaHraiz Edit the code in your question, but I think you still had issues with the scope of variables.
– Malachi, Commented Jul 22, 2014 at 19:14
@Malachi Regarding the while statement: it should be inside the for loop, in the inner else branch. The else that prints the error message belongs to the statement if len(header.split(":"))>=2:
– Raghda Hraiz, Commented Jul 22, 2014 at 19:15
@Malachi I edited the code indentation and added comments. Could you check it now?
– Raghda Hraiz, Commented Jul 22, 2014 at 19:30
@RaghdaHraiz: Please be aware that code edits based on answers are normally disallowed, but this is a somewhat different case. If someone mentions other changes that you weren't aware of, then the original code must stay intact.
– Jamal, Commented Jul 22, 2014 at 19:36

jcollado · Accepted Answer · 2014-07-22 22:39:01Z

3

A few brief comments:

Use four spaces for each indentation level
Use a space around each operator (==, >=, ...)
Use the in operator instead of the has_key method
Use subprocess.Popen instead of os.system
Use x.startswith(y) (returns a boolean directly) instead of x.find(y) == 0
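
The points above might look roughly like this when put together. This is only a sketch under some assumptions: the helper names run_tcpflow, is_continuation and add_field are made up for illustration, and subprocess.check_call is used as a simpler wrapper around subprocess.Popen.

```python
import subprocess


def run_tcpflow(pcap_path, output_dir):
    """Run tcpflow on a capture file (hypothetical helper).

    Passing the command as a list avoids shell quoting problems, and
    cwd replaces the manual "cd" mentioned in the question's comment.
    """
    subprocess.check_call(["tcpflow", "-r", pcap_path], cwd=output_dir)


def is_continuation(line):
    """startswith accepts a tuple and returns a boolean directly."""
    return line.startswith((" ", "\t"))


def add_field(fields, key, value):
    """Store a header value, joining repeated keys with a space."""
    if key in fields:  # the in operator instead of has_key
        fields[key] = fields[key] + " " + value
    else:
        fields[key] = value
    return fields
```

check_call raises an exception if tcpflow exits with a non-zero status, which os.system would silently ignore.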

A few longer comments:

I'm not sure what logic you need to implement for the filenames, but I recommend having a look at the fnmatch module.
For parsing the HTTP fields, you might want to use a regular expression.
Also, a comment that makes clear whether a request or a response is being parsed would make the code more readable.
Rename the i variable to make clear what it is being used for (is headers[i] supposed to be the same as header?).
Do not reinvent the wheel unless you need to. Check whether there's an HTTP parsing library around already.
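
As a rough illustration of the longer comments, here is a sketch written for Python 3 (the question's code is Python 2). It assumes that tcpflow names its files with zero-padded five-digit ports, which is what the index checks in the question suggest, and it borrows the standard library's email parser for the header lines. The names is_http_flow and parse_head are hypothetical.

```python
import email
import fnmatch
import re

# Three space-separated parts: a request line or a status line.
START_LINE = re.compile(r"^(\S+) (\S+) (.*)$")


def is_http_flow(filename, port=80):
    """Guess from the file name whether either endpoint uses the port.

    Assumes names of the form <ip>.<port>-<ip>.<port> with 5-digit ports.
    """
    port_field = "%05d" % port
    return (fnmatch.fnmatch(filename, "*.%s-*" % port_field)
            or fnmatch.fnmatch(filename, "*.%s" % port_field))


def parse_head(raw):
    """Return a dict of the start-line parts and header fields."""
    head, separator, _body = raw.partition("\r\n\r\n")
    if not separator:
        raise ValueError("no blank line after the header section")
    start_line, _, header_text = head.partition("\r\n")
    match = START_LINE.match(start_line)
    if match is None:
        raise ValueError("malformed start line: %r" % start_line)
    first, second, third = match.groups()
    if first.startswith("HTTP/"):
        # Response: version, status code, reason phrase.
        fields = {"Version": first, "Status-code": second,
                  "Status-desc": third}
    else:
        # Request: method, target, version.
        fields = {"Method": first, "URL": second, "Version": third}
    # The email parser handles folded (continued) header lines.
    for name, value in email.message_from_string(header_text).items():
        value = " ".join(value.split())  # collapse folds into one line
        if name in fields:
            fields[name] = fields[name] + " " + value
        else:
            fields[name] = value
    return fields
```

Repeated headers are joined with a space here only to mirror the question's behaviour; keeping them in a list would probably be safer for fields such as Set-Cookie.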

answered Jul 22, 2014 at 22:39

jcollado


@jcollado Why is the in operator better than has_key, and why is subprocess.Popen better than os.system?
– Raghda Hraiz, Commented Jul 23, 2014 at 9:22
According to the documentation, has_key has been deprecated (in is more generic and can be used with user-defined classes that implement the __contains__ method), and os.system is not as powerful as subprocess.Popen. The subprocess documentation has a section on how to use subprocess.Popen instead of os.system.
– jcollado, Commented Jul 23, 2014 at 10:38
