
I'm sorry in advance if this is a duplicate question; I looked for this information but still couldn't find it.

Is it possible to efficiently get the indexes of the N biggest elements of a numpy array (or Python list), in decreasing order of their values?

For instance, the array:

a = array([4, 1, 0, 8, 5, 2])

The indexes of the biggest elements in decreasing order would give (taking N = 6, so all the elements are included):

8 --> 3
5 --> 4
4 --> 0
2 --> 5
1 --> 1
0 --> 2

result = [3, 4, 0, 5, 1, 2]

I know how to do it with a somewhat silly approach (sorting the array and then searching for the index of each of the N largest values), but I was wondering whether there is an efficient library such as bottleneck or heapq, or maybe a Pythonic approach, to make this very fast. I have to apply it to several arrays of 300k elements each, so performance is an issue.
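This is roughly the silly approach I mean (just a sketch; it also assumes the values are unique, since a repeated value would always map back to its first index):

import numpy as np

a = np.array([4, 1, 0, 8, 5, 2])
N = len(a)

# sort the values in decreasing order, then look up where each one
# sits in the original array (an O(len(a)) search per element)
largest = sorted(a, reverse=True)[:N]
result = [int(np.where(a == v)[0][0]) for v in largest]
print(result)   # [3, 4, 0, 5, 1, 2]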

Thanks in advance!

UPDATE

I read the answers and decided to timeit them on an array of 300k random integers; here are the results (a sketch of the timing harness follows the list):

solution 1: sorted(range(len(a)), key=lambda i:a[i]) time: 230 ms

solution 2: heapq.nlargest(len(a), zip(a, itertools.count())) time: 396 ms

solution 3: heapq.nlargest(len(a), enumerate(a), key=operator.itemgetter(1)) time: 864 ms

solution 4: def f(a,N): return np.argsort(a)[::-1][:N] (N = len(a)) time: 104 ms
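For anyone who wants to reproduce the comparison, a timeit harness along these lines should do it (just a sketch: the data here is uniform random integers and the timings obviously depend on the machine):

import heapq, itertools, operator, timeit
import numpy as np

a = np.random.randint(0, 10**6, 300000)   # stand-in for my data: 300k random ints
N = len(a)

solutions = [
    ('sorted + lambda', lambda: sorted(range(len(a)), key=lambda i: a[i])),
    ('nlargest + zip', lambda: heapq.nlargest(N, zip(a, itertools.count()))),
    ('nlargest + key', lambda: heapq.nlargest(N, enumerate(a), key=operator.itemgetter(1))),
    ('np.argsort', lambda: np.argsort(a)[::-1][:N]),
]

for name, fn in solutions:
    t = timeit.timeit(fn, number=10) / 10
    print(name, round(t * 1000), 'ms')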

Thanks a lot for the fast and very good answers!

  • possible duplicate of How to get the N maximum values in a numpy array? Commented Oct 8, 2012 at 20:07
  • If you follow the trail of duplicates, this answer pops up, which seems promising -- although the post is by the developer, a fact that the answer does not disclose... Commented Oct 8, 2012 at 20:09
  • In your test, what is the value of N? As explained above, using heapq is efficient if N is rather small compared to len(a). Commented Mar 13, 2013 at 15:08
  • How do you modify these for N < len(a)? Commented Apr 27, 2015 at 11:04
  • I agree with @lizzie. Can you provide the value of N and len(a) in your experiment? I think heapq.nlargest should be more efficient than np.argsort if N is much smaller than len(a). Commented Jul 20, 2016 at 21:37

4 Answers


Have you looked at the built-in numpy argsort method?

http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

I can sort an array with 300,000 random floats in about 29 ms on my machine using that method.

import numpy as np
def f(a, N):
    # indices of the N largest values of a, in decreasing order of value
    return np.argsort(a)[::-1][:N]
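A quick check against the example from the question (using f as defined above):

a = np.array([4, 1, 0, 8, 5, 2])
print(f(a, len(a)))   # [3 4 0 5 1 2]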

3 Comments

This worked amazingly well! On my machine it's taking 104 ms (it's very busy right now); I'll try it again later, but so far this has been the fastest solution. Thanks!
@joshadel This function argsort's first and then returns the top-N values. Is there a Numpy/Scipy function that is equivalent to the Python heapq.nlargest(N, a) but for finding the top-N index values without argsort'ing the entire array?
@dbv maybe something like berkeleyanalytics.com/bottleneck/…, but that seems to only work for the smallest values.
L = [4, 1, 0, 8, 5, 2]
sorted(range(len(L)), key=lambda i: L[i], reverse=True)  # [3, 4, 0, 5, 1, 2]

6 Comments

More clever than me. +1 to you sir.
key = L.__getitem__ is an alternative (that may be a bit faster in some cases).
@GarethRees: You're right! I hadn't thought of that. lambdas /are/ slow
I tried a simple test and didn't see much difference, so it wouldn't be wrong to go with the lambda.
I tried to make it work using __getitem__, but since I'm very new to Python I couldn't get it right; the solution using the lambda worked well here, thanks for the help! (A working version is sketched after these comments.)
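A minimal sketch of that __getitem__ alternative, with reverse=True added so it matches the decreasing order the question asks for:

L = [4, 1, 0, 8, 5, 2]
# L.__getitem__(i) does the same lookup as L[i], without the lambda overhead
sorted(range(len(L)), key=L.__getitem__, reverse=True)   # [3, 4, 0, 5, 1, 2]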

You can use heapq to do this easily enough:

>>> import heapq, itertools
>>> heapq.nlargest(3, zip(a, itertools.count()))
[(8, 3), (5, 4), (4, 0)]

Tuples are compared on their first value, then the second, and so on. This means that we can simply make (value, index) tuples and take the largest, which gives us the indices of the largest values (the values come along too, but we can easily throw them away).

I am using zip() and itertools.count() because enumerate gives us the pairs in the wrong order, (index, value), so they would be ranked by index rather than by value. Alternatively, you could also do ((value, index) for index, value in enumerate(a)), but I feel that is less clear.

Another alternative is to give a key, doing heapq.nlargest(3, enumerate(a), key=operator.itemgetter(1)).
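If you only need the indices, a small sketch of throwing the values away afterwards (using the example array from the question):

import heapq, itertools

a = [4, 1, 0, 8, 5, 2]
pairs = heapq.nlargest(3, zip(a, itertools.count()))   # [(8, 3), (5, 4), (4, 0)]
indices = [i for value, i in pairs]                    # [3, 4, 0]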

3 Comments

The python documentation recommends using sorted() instead of heapq.nlargest() when working with large lists, although they don't clarify how big a "large list" is. docs.python.org/library/heapq.html
@Matt I can't find the suggestion saying that in the docs - but that may well be the case - I'd suggest the OP runs some timeit tests to find out what's the most efficient for his use.
@Lattyware At docs.python.org/library/heapq.html it says "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function. Also, when n==1, it is more efficient to use the built-in min() and max() functions."

Another way to use heapq:

heapq.nlargest(n, range(len(a)), key=a.__getitem__)

As commented elsewhere, it won't beat a full sort unless a is very large and n << len(a), because sorting is a relatively fast operation in Python. However, for a fixed small n the heap approach is roughly O(len(a)) while the full sort is O(len(a)*log(len(a))), so as a grows the heap eventually wins.
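A rough sketch of where this pays off (the sizes are just illustrative, and with random floats ties are not a concern):

import heapq
import numpy as np

a = np.random.rand(300000)
n = 5   # n << len(a): the heap never holds more than n entries

top = heapq.nlargest(n, range(len(a)), key=a.__getitem__)
print(top)                            # indices of the 5 largest values
print(list(np.argsort(a)[::-1][:n]))  # full sort returns the same indices, more slowly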

3 Comments

Yes, you are correct that a slow O(n) beats O(n*log(n)) for large n, but the heapq module is implemented intelligently: nlargest() uses a heap only when the requested n is relatively small, and switches to sorting when n grows towards the size of the list.
nlargest from at least v2.7.11 will use a heap regardless, but v3.5.2 behaves as you describe @sinister
actually, in v3.5.2, sorted is only used when n is larger than the size of the iterable @sinister
