
I am running a script that downloads an xls file with HTML tags in it and strips them to create a clean csv file.

Code:

#!/usr/bin/env python

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
import sys
#from pympler.asizeof import asizeof
from pympler import muppy
from pympler import summary

f = urlopen('http://localhost/Classes/sample.xls') #This is 75KB
#f = urlopen('http://supplier.com/xmlfeed/products.xls') #This is 75MB
soup = BeautifulSoup(f)
stable = soup.find('table')
print 'table found'
rows = []
for row in stable.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('th')])
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])

#print sys.getsizeof(rows)
#print asizeof(rows)

print 'row list created'
soup.decompose()
print 'soup decomposed'
f.close()
print 'file closed'

with open('output_file.csv', 'wb') as file:
    writer = csv.writer(file)
    print 'writer started'
    #writer.writerow(headers)
    writer.writerows(row for row in rows if row)

all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
summary.print_(sum1)    

The above code works perfectly for the 75KB file; however, the process gets killed without any error for the 75MB file.

I am very new to Beautiful Soup and Python; please help me identify the problem. The script is running on a machine with 3GB of RAM.

Output for the small file is:

table found
row list created
soup decomposed
file closed
writer started
                                types |   # objects |   total size
===================================== | =========== | ============
                                 dict |        5615 |      4.56 MB
                                  str |        8457 |    713.23 KB
                                 list |        3525 |    375.51 KB
  <class 'bs4.element.NavigableString |        1810 |    335.76 KB
                                 code |        1874 |    234.25 KB
              <class 'bs4.element.Tag |        3097 |    193.56 KB
                              unicode |        3102 |    182.65 KB
                                 type |         137 |    120.95 KB
                   wrapper_descriptor |        1060 |     82.81 KB
           builtin_function_or_method |         718 |     50.48 KB
                    method_descriptor |         580 |     40.78 KB
                              weakref |         416 |     35.75 KB
                                  set |         137 |     35.04 KB
                                tuple |         431 |     31.56 KB
                  <class 'abc.ABCMeta |          20 |     17.66 KB

I don't understand what "dict" is; it is taking much more memory than I would expect for a 75KB file.

Thank you,

1 Answer


It is difficult to say without an actual file to work with, but what you can do is avoid creating the intermediate list of rows and instead write each row directly to the opened csv file.

Also, you can let BeautifulSoup use lxml.html under the hood (lxml needs to be installed).

Improved code:

#!/usr/bin/env python

from urllib2 import urlopen
import csv

from bs4 import BeautifulSoup

f = urlopen('http://localhost/Classes/sample.xls')
soup = BeautifulSoup(f, 'lxml')

with open('output_file.csv', 'wb') as out:
    writer = csv.writer(out)

    for row in soup.select('table tr'):
        # Write each non-empty header/data row directly; no intermediate list is kept.
        headers = [val.text.encode('utf8') for val in row.find_all('th')]
        if headers:
            writer.writerow(headers)
        cells = [val.text.encode('utf8') for val in row.find_all('td')]
        if cells:
            writer.writerow(cells)

soup.decompose()
f.close()
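
If the tree built for the whole 75MB document is still too large, one further option (not part of the original answer) is to tell BeautifulSoup to keep only the table rows while parsing, using a SoupStrainer. A minimal sketch, assuming the feed consists of a single HTML table as in the question and that lxml is installed:

#!/usr/bin/env python

from urllib2 import urlopen
import csv

from bs4 import BeautifulSoup, SoupStrainer

f = urlopen('http://localhost/Classes/sample.xls')

# Keep only <tr> elements (and their children) while parsing; everything
# else in the document is discarded instead of being built into the tree.
only_rows = SoupStrainer('tr')
soup = BeautifulSoup(f, 'lxml', parse_only=only_rows)

with open('output_file.csv', 'wb') as out:
    writer = csv.writer(out)
    for row in soup.find_all('tr'):
        cells = [val.text.encode('utf8') for val in row.find_all(['th', 'td'])]
        if cells:
            writer.writerow(cells)

soup.decompose()
f.close()

Note that all of the row subtrees still end up in memory at once, so this lowers the footprint but does not turn the script into a true streaming converter.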