1

Say I have a massive 2D Database shaped (1.2mil, 6).

I want to find the index of a 1D array (1, 6) in the big_DB. I actually have 64 of these vectors to search for at a time, shaped (64, 6).

Here's my code:

for data in range(64): # I have 64 1d arrays to find
    self.idx = np.where((big_DB == arrays[data]).all(axis=1))

This takes 0.043 sec (for all 64 arrays). Is there a faster method to do this? My project will call the search function over 40,000 times.

Edit) The big_DB is the result of itertools.product, unique in row, float.

20
  • What dtype do you have? Commented Mar 25, 2021 at 20:51
  • Here's a related post I made: stackoverflow.com/q/64215263/2988730. Searching is easier than sorting. Will post once you've clarified a few things. Commented Mar 25, 2021 at 21:24
  • Is big_DB sorted? Can it be? Does it get updated frequently? Are you open to using a mapping type instead? Hashing will be much faster than linear or even binary search (the 400k iterations matters) Commented Mar 25, 2021 at 21:26
  • big_DB is sorted and won't update during the project. Commented Mar 26, 2021 at 5:26
  • What dtype is it? Commented Mar 26, 2021 at 5:27

1 Answer 1

0

The fastest I've been able to get this to work is using O(1) lookup using Python's builtin dict type. You need to pre-process your DB, which may take a second or two at most, but lookups go from >100ms on my machine to <50us: an improvement by 2000x or better for all 64 lookups. You may get slightly worse results because I tested with a 100k-element database. The larger DB you have may cause more hash collisions.

To make the lookup hash-table, I turned each row of big_DB into a bytes object. This makes up the key. Values are then indices of each element, since that's how you want to do the lookup:

dt = f'V{big_DB.shape[1] * big_DB.dtype.itemsize}'
dict_db = dict(zip(map(np.void.item, np.squeeze(big_DB.view(dt))), range(len(big_DB))))

The resulting lookup is as simple as

idx = dict_db[x.view(dt).item()]
Sign up to request clarification or add additional context in comments.

3 Comments

what is x in 'idx = dict_db[x.view(dt).item()]' ?
This raise "descriptor 'item' for 'numpy.generic' objects doesn't apply to a 'numpy.ndarray' object"
In that case, you need to provide a complete MCVE. I made assumptions about how the arrays look, and apparently I was wrong.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.