
I have 40 data sets, each about 115 MB in size, and I would like to overlay them all on a single plot in log-log scale.

# make example data 
import numpy as np
data_x = []
data_y = []
for _ in range(40):
    x, y = np.random.random(size = (2, int(7e6))) # 7e6 chosen to make about 115MB size
    data_x.append(x)
    data_y.append(y)
del x, y

# now show the size of one set in MB
print((data_x[0].nbytes + data_y[0].nbytes)/1e6, 'MB')
# 112.0 MB

My computer has about 30 GB of available RAM, so I fully expect the 40 × 112 MB ≈ 4.5 GB of data to fit.

I would like to make an overlaid log-log plot of every data set:

import matplotlib.pyplot as plt 
for x,y in zip(data_x, data_y):
    plt.loglog(x, y)
plt.show()

But the memory overhead is too large. I'd prefer not to downsample the data. Is there a way I might reduce the memory overhead in order to plot this 4.5 GB of data?

I would prefer to keep the for loop, as I need to modify the point style and color of each plot within it, so concatenating the data sets is not an option.
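For concreteness, here is a minimal sketch of the kind of per-data-set styling I mean (the marker choice and the viridis colormap are just placeholders):

import matplotlib.pyplot as plt
from matplotlib import cm

fig, ax = plt.subplots()
for i, (x, y) in enumerate(zip(data_x, data_y)):
    # marker style and color are chosen per data set, which is why I want the loop
    ax.loglog(x, y, linestyle='none', marker='.', markersize=1,
              color=cm.viridis(i / len(data_x)))
plt.show()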

The most similar question I could find is here, but it differs in that its loop creates distinct plots rather than adding to the same plot, so adding a plt.clf() call inside the loop does not help me.

5 Comments
  • This sounds like the definition of overplotting. Maybe you should bin your data? There is no way that displaying that many points yields any value. Commented Mar 11, 2019 at 23:39
  • Yeah, I could bin, but I'd have to use an exponentially growing bin size, since my data spans multiple orders of magnitude. Definitely not a quick and dirty solution. It'd be much simpler if there were a clean matplotlib capability to control memory overhead, such as sequentially plotting onto the png output of the previous call. I'm just asking the community if an option exists. Commented Mar 11, 2019 at 23:42
  • What about matplotlib makes writing 4.5 GB of data into a 500×500 pixel image cost more than 4.5 GB in overhead? I'm just thinking I'm missing something... Commented Mar 11, 2019 at 23:47
  • You can expect the matplotlib figure object and its children to become much larger than the raw byte size of the data (a rough way to check this is sketched after these comments). I don't think you're missing something. Your data is just too large to be plotted with matplotlib. I would definitely consider some sort of binning. Commented Mar 12, 2019 at 0:12
  • Thanks guys. I'll pursue the binning. Please see stackoverflow.com/questions/55112430/… if you have time. Thx Commented Mar 12, 2019 at 0:43
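A rough way to check the claim above about matplotlib's overhead is to measure the process memory before and after plotting a single data set. This sketch assumes psutil is installed; the exact numbers will vary by backend and version:

import os
import psutil
import numpy as np
import matplotlib
matplotlib.use('Agg')                          # render off-screen
import matplotlib.pyplot as plt

plt.rcParams['agg.path.chunksize'] = 10000     # avoid Agg's cell block limit on very long paths

proc = psutil.Process(os.getpid())
x, y = np.random.random((2, int(7e6)))         # one data set, ~112 MB of raw arrays

before = proc.memory_info().rss
plt.loglog(x, y)
plt.gcf().canvas.draw()                        # force the transform/draw machinery to run
after = proc.memory_info().rss

print((after - before) / 1e6, 'MB of extra memory beyond the raw arrays')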

1 Answer


Here is my attempt at solving your problem:

# make example data 
import numpy as np
import matplotlib.pyplot as plt
import colorsys

data_x = np.random.random((40, int(7e6)))*np.logspace(0, 7, 40)[:, None]
data_y = np.random.random((40, int(7e6)))*np.logspace(0, 7, 40)[:, None]

# now show the size of one set in MB
print((data_x[0].nbytes + data_y[0].nbytes)/1e6, 'MB')

# work in log space, so that equally spaced bins here correspond to
# logarithmically spaced bins in the original data
x, y = np.log(data_x), np.log(data_y)

# one 2-D histogram per data set
hists = [np.histogram2d(x_, y_, bins=1000) for x_, y_ in zip(x, y)]

N = len(hists)

for i, h in enumerate(hists):
    color = colorsys.hsv_to_rgb(i/N, 1, 1)   # a distinct hue per data set
    rows, cols = np.where(h[0] > 0)          # indices of the non-empty bins
    plt.scatter(h[1][rows], h[2][cols], color=color, s=1)

plt.show()

Result: [scatter plot of the non-empty bins for all 40 data sets, each in a different hue]

I take the log of both the x and y data and then bin it. As I don't think you are interested in densities, I just plot a single static color wherever a bin contains at least one element.
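If you would rather keep true log-log axes (ticks at powers of ten instead of logged values), here is a minimal sketch of the same idea in the original data space, continuing from the variables above. The np.logspace range is an assumption about your data, and values outside it are simply not counted by histogram2d:

edges = np.logspace(0, 7, 1001)              # log-spaced bin edges, assumed data range [1, 1e7]
N = len(data_x)
for i, (x_, y_) in enumerate(zip(data_x, data_y)):
    H, xe, ye = np.histogram2d(x_, y_, bins=[edges, edges])
    rows, cols = np.where(H > 0)             # keep only the non-empty bins
    plt.scatter(xe[rows], ye[cols], color=colorsys.hsv_to_rgb(i/N, 1, 1), s=1)
plt.xscale('log')
plt.yscale('log')
plt.show()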


2 Comments

Thanks @user8408080 -- this does work! But I need more control over the bin size. How do I generate the bins more carefully? I'd like one bin between 0 and 10, then ten bins between 10^k and 10^{k+1} for all k > 0.
As stated in the docs, you can set your own edges for the bins.
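For example, a minimal sketch of edges matching the request above (one bin from 0 to 10, then ten bins per decade up to an assumed maximum of 1e7; swap in np.logspace(k, k + 1, 11)[1:] if the ten bins per decade should be log-spaced instead of linearly spaced):

import numpy as np

k_max = 7                                  # assumed upper decade of the data
edges = [0.0, 10.0]                        # one bin covering 0 .. 10
for k in range(1, k_max):
    # ten bins per decade, linearly spaced within [10^k, 10^(k+1)]
    edges.extend(np.linspace(10.0**k, 10.0**(k + 1), 11)[1:])
edges = np.asarray(edges)

H, xe, ye = np.histogram2d(data_x[0], data_y[0], bins=[edges, edges])
rows, cols = np.where(H > 0)               # then plot xe[rows], ye[cols] as in the answer above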
