
I have a very large array, consisting of integers between 0 and N, where each value occurs at least once.

I'd like to know, for each value k, all the indices in my array where the array's value equals k.

For example:

arr = np.array([0,1,2,3,2,1,0])
desired_output = {
    0: np.array([0,6]),
    1: np.array([1,5]),
    2: np.array([2,4]),
    3: np.array([3]),
    }

Right now I am accomplishing this with a loop over range(N+1), calling np.where N+1 times.

indices = {}
for value in range(max(arr)+1):
    indices[value] = np.where(arr == value)[0]

This loop is by far the slowest part of my code. (Both the arr==value evaluation and the np.where call take up significant chunks of time.) Is there a more efficient way to do this?

I also tried playing around with np.unique(arr, return_index=True) but that only tells me the very first index, rather than all of them.
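A minimal sketch of what I mean, using the example array above:

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 2, 1, 0])

# return_index gives only the FIRST occurrence of each unique value
unq, first_idx = np.unique(arr, return_index=True)
print(unq)        # [0 1 2 3]
print(first_idx)  # [0 1 2 3]
# the later occurrences of 2, 1 and 0 (at indices 4, 5, 6) are lost
```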

  • You can iterate over the unique items in the array instead of looping over range. Create a set from the array and iterate over that; it can reduce the overhead. Commented Aug 18, 2016 at 8:59
  • @nishparadox2 That's not going to do anything, because I already know what the unique values are: 0 through N, which you get more efficiently with a range than with a call to unique(). Commented Aug 18, 2016 at 9:05

4 Answers


Approach #1

Here's a vectorized approach to get those indices as a list of arrays -

sidx = arr.argsort()
unq, cut_idx = np.unique(arr[sidx],return_index=True)
indices = np.split(sidx,cut_idx)[1:]

If you want a final dictionary that maps each unique element to its indices, we can finish with a dict comprehension -

dict_out = {unq[i]:iterID for i,iterID in enumerate(indices)}

Approach #2

If you are just interested in the list of arrays, here's an alternative meant for performance -

sidx = arr.argsort()
indices = np.split(sidx,np.flatnonzero(np.diff(arr[sidx])>0)+1)
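For reference (not part of the original answer), here is how both approaches behave on the question's sample array. Note that argsort is not guaranteed stable by default, so the order of indices within each group may vary:

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 2, 1, 0])

# Approach #1: sort once, then split at the first occurrence of each value
sidx = arr.argsort()
unq, cut_idx = np.unique(arr[sidx], return_index=True)
indices = np.split(sidx, cut_idx)[1:]
dict_out = {unq[i]: iterID for i, iterID in enumerate(indices)}

# Approach #2: split wherever the sorted values step up
indices2 = np.split(sidx, np.flatnonzero(np.diff(arr[sidx]) > 0) + 1)
```

Both produce one group of indices per value 0..3; the groups contain the same indices (e.g. value 0 maps to {0, 6}), possibly in a different order within each group.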

3 Comments

That is quite clever. I'll give this a try.
So this is about 6.5 times faster than the answer I actually accepted. Given that the cleverness does obfuscate what the purpose of the code is, I'll stick to the simpler solution, but it definitely deserves an upvote and some props. :)
@acdr I appreciate any feedback on runtimes, as I worry about that a lot. Thanks for bringing in those upvotes ;)

A pythonic way is using collections.defaultdict():

>>> from collections import defaultdict
>>> 
>>> d = defaultdict(list)
>>> 
>>> for i, j in enumerate(arr):
...     d[j].append(i)
... 
>>> d
defaultdict(<type 'list'>, {0: [0, 6], 1: [1, 5], 2: [2, 4], 3: [3]})

And here is a Numpythonic way using a dictionary comprehension and numpy.where():

>>> {i: np.where(arr == i)[0] for i in np.unique(arr)}
{0: array([0, 6]), 1: array([1, 5]), 2: array([2, 4]), 3: array([3])}

And here is a pure Numpythonic approach if you don't want to involve the dictionary:

>>> uniq = np.unique(arr)
>>> args, indices = np.where((np.tile(arr, len(uniq)).reshape(len(uniq), len(arr)) == np.vstack(uniq)))
>>> np.split(indices, np.where(np.diff(args))[0] + 1)
[array([0, 6]), array([1, 5]), array([2, 4]), array([3])]

4 Comments

My input array is rather large, so I'd expected this solution to be even slower than mine (considering it loops over all elements) but it turns out to be very fast. Considering it's simpler than the other answers, I'll accept this one.
Well this deserves an upvote too given it's simpler as OP stated! :)
@acdr I just gave this answer based on your expected output which is a dictionary, Check out the answer for only categorizing the indices in numpy.
@Divakar Forza KISS principle!

I don't know numpy but you could definitely do this in one iteration, with a defaultdict:

from collections import defaultdict

indices = defaultdict(list)
for i, val in enumerate(arr):
    indices[val].append(i)
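If you need numpy arrays rather than lists (to match the desired output in the question), you can convert the lists in a second pass; a small sketch:

```python
import numpy as np
from collections import defaultdict

arr = np.array([0, 1, 2, 3, 2, 1, 0])

# single pass: collect positions per value
indices = defaultdict(list)
for i, val in enumerate(arr):
    indices[val].append(i)

# convert each list of positions to a numpy array
indices = {k: np.array(v) for k, v in indices.items()}
```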

Comments


Fully vectorized solution using the numpy_indexed package:

import numpy_indexed as npi
k, idx = npi.group_by(arr, np.arange(len(arr)))

On a higher level: why do you need these indices? Subsequent grouped operations can usually be computed much more efficiently using the group_by functionality [e.g. npi.group_by(arr).mean(someotherarray)], without explicitly computing the indices of the keys.

Comments
