
I'm retrieving a large amount of data from a database, which I later plot with a scatterplot. However, the program runs out of memory and aborts when I use my full data set. For the record, it takes >30 minutes to run, and the data list is about 20-30 million rows long.

import sqlite3 as lite
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

map = Basemap(projection='merc', resolution='c', area_thresh=10,
              llcrnrlon=-180, llcrnrlat=-75,
              urcrnrlon=180, urcrnrlat=82)

map.drawcoastlines(color='black')
# map.fillcontinents(color='#27ae60')
with lite.connect('database.db') as con:
    start = 1406851200
    end = 1409529600
    cur = con.cursor()
    cur.execute('SELECT latitude, longitude FROM plot WHERE unixtime >= {start} AND unixtime < {end}'.format(start=start, end=end))
    data = cur.fetchall()
    y, x = zip(*data)
    x, y = map(x, y)
    plt.scatter(x, y, s=0.05, alpha=0.7, color="#e74c3c", edgecolors='none')
    plt.savefig('Plot.pdf')
    plt.savefig('Plot.png')

I think my problem may be in the zip(*) function, but I really have no clue. I'm interested both in how I can use less memory by rewriting my existing code and in how to split up the plotting process. My idea is to split the time period in half and do the same thing twice for the two time periods before saving the figure, but I'm unsure whether this will help me at all. If the problem is the plotting itself, I have no idea what to do.
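One way to sketch the splitting idea from the question: compute the sub-intervals, then query and plot one slice at a time so only one slice of rows is ever held in memory. `plot_chunk` here is a hypothetical stand-in for the projection-and-scatter step, since the Basemap instance and `plt` live in the surrounding script:

```python
import sqlite3 as lite

def time_chunks(start, end, n):
    """Split [start, end) into n contiguous [lo, hi) sub-intervals."""
    step = (end - start) // n
    bounds = [start + i * step for i in range(n)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def plot_in_chunks(con, plot_chunk, start, end, n=2):
    """Query and plot one time slice at a time, instead of one
    giant fetchall() over the whole period."""
    cur = con.cursor()
    for lo, hi in time_chunks(start, end, n):
        cur.execute('SELECT latitude, longitude FROM plot '
                    'WHERE unixtime >= ? AND unixtime < ?', (lo, hi))
        rows = cur.fetchall()
        if rows:
            plot_chunk(rows)   # e.g. zip, project via Basemap, plt.scatter
        del rows               # release this slice before the next query
```

With n=2 this is exactly the "split the period in half" plan; the single plt.savefig() calls would then come after the loop, once everything has been scattered onto the axes.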

Comments

  • Just outa curiosity, what is the output of len(data)? Commented May 7, 2015 at 18:17
  • O.O in that case you could try "streaming" the data? Processing a few hundred plot points at a time until you have the full picture instead of loading all 30 million into memory? Commented May 7, 2015 at 18:33
  • Do you have duplicate lon, lat points in the database? Commented May 7, 2015 at 18:37
  • I found solutions based on matplotlib were very slow, I prefer to use mapnik for drawing maps, much quicker and sometimes much nicer. Commented May 7, 2015 at 18:39
  • Instead of getting the whole time interval at once, how about breaking it up into a bunch of smaller intervals and adding each interval to the plot one at a time. You will still run out of memory, but at least you can see how far it is getting before you do. Commented May 7, 2015 at 18:43
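The "streaming" suggestion in the comments can be sketched with sqlite3's fetchmany(), which keeps only a fixed-size batch of rows in memory at once; the generator name and batch size are illustrative, not part of the question's code:

```python
import sqlite3 as lite

def stream_points(con, start, end, batch_size=100000):
    """Yield (latitude, longitude) rows in fixed-size batches
    instead of materializing the whole result with fetchall()."""
    cur = con.cursor()
    cur.execute('SELECT latitude, longitude FROM plot '
                'WHERE unixtime >= ? AND unixtime < ?', (start, end))
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch

# In the plotting script, each batch would then be projected and
# scattered before the next batch is fetched:
# for batch in stream_points(con, start, end):
#     lat, lon = zip(*batch)
#     x, y = map(lon, lat)
#     plt.scatter(x, y, s=0.05, alpha=0.7, color="#e74c3c", edgecolors='none')
```

This trades one big fetchall() for many small ones, but note the Axes still accumulates every scatter artist, so peak memory on the matplotlib side may remain the limiting factor.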

1 Answer


If you think the problem lies in the zip function, why not use a NumPy array to massage your data into the right format? Something like this:

import numpy

data = numpy.array(cur.fetchall())  # one 2-D array instead of a list of tuples
lat = data[:, 0]                    # first column: latitude
lon = data[:, 1]                    # second column: longitude
x, y = map(lon, lat)                # project with the Basemap instance

Also, your generated PDF will be very large and slow for PDF readers to render, because it is a vectorized format by default: all your millions of data points are stored as floats and rendered when the user opens the document. I recommend adding the rasterized=True argument to your plt.scatter() call. This will save the scatter layer as a bitmap inside your PDF (see the docs here).

If none of this helps, I would investigate further by commenting out lines starting from the end. That is, first comment out plt.savefig('Plot.png') and see if the memory use goes down. If it doesn't, comment out the line before that, and so on.
