
I'm trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(item)

and it takes very long compared to:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    item.tolist()

Why does it take much longer to convert a NumPy array to a set than to a list? I mean, basically both have complexity O(n), don't they?

  • Converting to a set involves some hashing and computing "where" to insert the new element in the set. Converting to a list is probably just a straightforward copy. Commented May 28, 2017 at 7:09
  • What's the shape of your x? dtype? I see 3- and 4-element sublists. But yes, a set is a Python object similar to a dictionary (keys but no values). tolist is a compiled NumPy method. Commented May 28, 2017 at 7:43
  • Let n be the length of item in x. In the worst case there could be n items in every sublist. Is there any option to do it faster? Commented May 28, 2017 at 7:56
  • What's the purpose of doing set on each sublist in x and throwing away the result? Commented May 28, 2017 at 8:52
  • In the main application I have a 2D data set. Next I perform a range query with a k-d tree, which returns an np array as above. Each sublist holds the indices of the neighbours of a point (so there could be N in each sublist). But I need the indices of the neighbours of a point as a set for the next step in my application. Commented May 28, 2017 at 9:34

2 Answers


TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.

To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example, the integer ten = 10 could look like this:

[Diagram: the name ten pointing to an int instance that holds the value 10 plus its bookkeeping fields]

The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here I left them out of the images).
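
To get a rough feel for that bookkeeping overhead, you can ask CPython for the size of a boxed integer (a small illustration I added; exact numbers depend on the interpreter build):

import sys

ten = 10
# A Python int is a full object: reference count, type pointer, and the
# value itself. Even a tiny integer therefore occupies about 28 bytes on a
# 64-bit CPython 3.x, far more than the 8 bytes the raw value would need.
print(sys.getsizeof(ten))  # typically 28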

A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:

[Diagram: mylist pointing to a list instance whose slots reference the separate Python int objects 1, 2 and 3]

This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.
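
A quick way to see that a list only stores references and doesn't copy values (my own small check, assuming CPython):

x = 10 ** 10                     # a big int, so CPython's small-integer cache doesn't interfere
mylist = [x, x]
# Both slots reference the very same object; no value was copied:
print(mylist[0] is x)            # True
print(mylist[0] is mylist[1])    # True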

But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

[Diagram: myarray pointing to a NumPy array instance that stores the raw values 1, 2 and 3 directly in its data buffer]

The values 1, 2 and 3 are stored directly in the NumPy array instance.
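
You can observe this directly: because the array only stores raw values, every access has to build a fresh Python object on the fly (a small check I added, relying on CPython's object identity):

import numpy as np

myarray = np.array([1, 2, 3])
# Each indexing operation creates a *new* Python scalar ("box"), so two
# accesses to the same element yield two distinct objects:
print(myarray[0] is myarray[0])  # False - a new box per access
print(type(myarray[0]))          # a NumPy scalar type such as numpy.int64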


With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:

Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).

On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:

[Diagram: iterating over the array creates a new Python "box" for every element]

Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values together with their boxes:

import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The set "constructor" just does an iteration over the object.

One thing I can't answer definitively is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box", so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

arr = np.arange(100000)

%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Each of these has O(n) runtime complexity but the constant factors are very different.
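
One visible difference between the two: list(arr) goes through the iterator and keeps the NumPy scalar boxes, while tolist converts everything to plain Python objects in one C-level pass (easy to check yourself):

import numpy as np

arr = np.arange(5)
print(type(list(arr)[0]))     # a NumPy scalar, e.g. numpy.int64
print(type(arr.tolist()[0]))  # <class 'int'> - a plain Python integer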

In your case you compared set() to tolist() - which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())

%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
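
For reference, the np.unique variant mentioned above would look roughly like this (note that it returns a sorted array of the distinct values per row, not a Python set, so it only helps if that's good enough for your next step):

import numpy as np

arr = np.random.randint(0, 1000, (10000, 3))

def touniques(arr):
    # np.unique stays in NumPy land: it returns a sorted array of the
    # distinct values in each row instead of building Python set objects.
    for line in arr:
        np.unique(line)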


In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will always be faster.

I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

[Diagram: a typed numba set storing its values directly, without Python boxes]

Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume numba will too).

Like the NumPy array, the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:

import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, numba creates a Python set - including "creating the boxes" for all values in the set (that's something numba hides from you).

Just FYI: the same boxing/unboxing happens if you pass lists to numba nopython functions or return lists from these functions. So what's an O(1) operation in Python becomes an O(n) operation with numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)):

[Diagram: passing a Python list into or out of a nopython function boxes/unboxes every element]

I assume that if you did return these sets from the function (not really possible right now because numba currently doesn't support lists of sets) it would be slower (because it creates a numba set and later converts it to a Python set) or only marginally faster (if the conversion numba set -> Python set is really, really fast).

Personally I would use numba for sets only if I don't need to return them from the function, do all operations on the set inside the function, and only if all those operations are supported in nopython mode. In any other case I wouldn't use numba here.
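
For example, in your neighbour-query case you could keep the sets entirely inside the compiled function and return only plain numbers, e.g. the count of distinct neighbours per row (a hypothetical sketch I added, assuming a count is all your next step needs):

import numba as nb
import numpy as np

@nb.njit
def count_unique_per_row(arr):
    # The sets live only inside the compiled function, so no Python boxes
    # are ever created; only a plain integer array crosses the boundary.
    out = np.empty(arr.shape[0], dtype=np.int64)
    for i in range(arr.shape[0]):
        out[i] = len(set(arr[i]))
    return out

counts = count_unique_per_row(np.random.randint(0, 1000, (10000, 3)))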


Just a note: from numpy import * should be avoided; you hide several Python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much extra typing.
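
A classic demonstration of that shadowing (my example; the second argument means something different in each function):

print(sum(range(5), -1))   # built-in sum, -1 is the start value -> 9

from numpy import *
print(sum(range(5), -1))   # numpy.sum, -1 is the axis argument  -> 10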


9 Comments

If I use numba, would it be faster to iterate over the loop?
@FelixHa Yes, numba could be faster - if you don't need the sets outside your function. I updated the answer :)
Thanks, it really speeds up the result! Do you have an idea how to implement it in Cython?
@FelixHa Yes, with Cython you can probably speed up the Python solutions by a factor of 2 (it reduces the loop overhead and the function call overhead). But without a dedicated data structure for the problem (maybe there's some C++ data structure, but I haven't checked) it won't make much sense. If you use IPython you can simply use %load_ext cython and then compile a block with %%cython. I checked the 4 functions I had in the answer and all were 1.5-2 times faster compared to pure Python. But that's a lot slower than numba.
This, like many of your answers, is a very thorough answer. From time to time I go through your answers and learn a lot from them. I know you don't need this but consider this a small gesture. :)

Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick

from multiprocessing.dummy import Pool as ThreadPool  # thread-based Pool, same API as multiprocessing.Pool
import multiprocessing

pool = ThreadPool(multiprocessing.cpu_count())  # one worker thread per CPU core
y = pool.map(set, x)                            # apply set() to every row of x
pool.close()
pool.join()

4 Comments

Interesting, do you have any data (timings) to support the statement that it actually "speeds things up"?
chriskiehl.com/article/parallelism-in-one-line - just try it, it is amazing! It changed my whole approach to coding.
Since I do not think that your task takes long for computational reasons, you could even try a higher int value in ThreadPool() (e.g. for handshakes with distant servers, I use 30).
But if I try your code it's 20-500 times slower than the original code from the question (depending on the kind of input) so I was wondering what data you used or how you did your timings to justify the "speed things up" statement.
