I am trying to find the most efficient way to look up a set of values (held in a list or in another DataFrame) in a pandas DataFrame without resorting to brute force. Is there a way to vectorize it? I know I can loop over each element of the list (or DataFrame) and extract the data with the loc method, but I was hoping for something faster. I have a DataFrame with 1 million rows, and I need to search within it to extract the index of 600,000 rows.

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'WholeList': np.round(1000000*(np.random.rand(1000000)),0)})
df2 = pd.DataFrame({'ThingsToFind': np.arange(50000)+50000})
df.loc[1:10,:]
# Edit: now that I think about it, 'arange' would have been a better way to populate the arrays.

I want the most efficient way to get the index of df2 in df, where it exists in df.

Thanks!

  • So, the output would be of length 1 million? Commented Apr 5, 2017 at 22:06
  • Also, what should the output be if there isn't a match of df2 in df? Commented Apr 5, 2017 at 22:12
  • Did you try to use the isin() DataFrame method? Commented Apr 5, 2017 at 22:18
  • Either length would be ok now that I think about it. Commented Apr 6, 2017 at 1:37
  • @Andrew L I've mainly tried brute forcing through the loc method, but I assumed this is the most time intensive way to do it. Commented Apr 6, 2017 at 1:47

3 Answers

1

Pandas dataframes have an isin() method that works really well:

df[df.WholeList.isin(df2.ThingsToFind)]

It seems reasonably performant on my MBP:

CPU times: user 3 µs, sys: 5 µs, total: 8 µs
Wall time: 11 µs
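If what you need is the index rather than the rows, the same mask can be applied to df.index. Here is a minimal sketch on small made-up data (the values and the names mask and matching_index are just for illustration, not from the question):

```python
import pandas as pd

# Small illustrative frames; the values are made up for this sketch
df = pd.DataFrame({"WholeList": [3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7]})
df2 = pd.DataFrame({"ThingsToFind": [2, 3, 5, 6, 7, 8, 9]})

# Boolean mask: True where the df value appears anywhere in df2
mask = df["WholeList"].isin(df2["ThingsToFind"])

# Index labels of the rows of df whose value was found in df2
matching_index = df.index[mask]
print(matching_index.tolist())
```

Note that this tells you which rows of df matched, not which row of df2 each one matched.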

4 Comments

But, we need to get the indexes of df2 corresponding to the matches, right?
I guess I don't understand what you mean. There's no explicit index in df2. Are you looking for the row-number index? That's simply df[df.WholeList.isin(df2.ThingsToFind)].index
@Divaker I would say that as long as I can easily have it where WholeList(index our function would provide)=ThingsToFind, then I'd be happy. I'm thinking of a MATLAB command that I love and trying to implement it in Python. Sorry if this is a newbie question, but I'm only in month 2 of the language.
I'll give isin a try. Thanks!

Here's an approach with np.searchsorted as it seems the second dataframe has its elements sorted and unique -

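def find_index(a, b, invalid_specifier=-1):
    # For each element of a, find its insertion position in the sorted array b
    idx = np.searchsorted(b, a)
    # Positions equal to b.size are out of bounds; point them at 0 for the check below
    idx[idx == b.size] = 0
    # Where b does not actually hold the value, mark the element as not found
    idx[b[idx] != a] = invalid_specifier
    return idx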

def process_dfs(df, df2):
    a = df.WholeList.values.ravel()
    b = df2.ThingsToFind.values.ravel()
    return find_index(a,b, invalid_specifier=-1)

Sample run on arrays -

In [200]: a
Out[200]: array([ 3,  5,  8,  4,  3,  2,  5,  2, 12,  6,  3,  7])

In [201]: b
Out[201]: array([2, 3, 5, 6, 7, 8, 9])

In [202]: find_index(a,b, invalid_specifier=-1)
Out[202]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
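The function above relies on b being sorted. If that can't be guaranteed, one possible variation is to pass an argsort result through the sorter argument of np.searchsorted and map the positions back. This is only a sketch: find_index_unsorted is a hypothetical name, and it assumes b is non-empty.

```python
import numpy as np

def find_index_unsorted(a, b, invalid_specifier=-1):
    """Hypothetical variant of find_index for a b that is not sorted.

    Returns, for each element of a, its position in the original b,
    or invalid_specifier where the value is absent. Assumes b is non-empty.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # order[k] is the position in b of the k-th smallest element
    order = np.argsort(b, kind="stable")
    # Search in the sorted view of b without actually reordering b
    pos = np.searchsorted(b, a, sorter=order)
    # Out-of-bounds positions are redirected to 0 and rejected by the check below
    pos[pos == b.size] = 0
    # Translate positions in the sorted view back to positions in b
    idx = order[pos]
    idx[b[idx] != a] = invalid_specifier
    return idx

a = np.array([3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7])
b = np.array([9, 2, 7, 3, 5, 8, 6])  # same values as the sample b, shuffled
print(find_index_unsorted(a, b))
```

The extra argsort costs O(n log n) on b, so when b is already sorted the original function is the cheaper choice.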

Sample run on dataframes -

In [188]: df
Out[188]: 
    WholeList
0           3
1           5
2           8
3           4
4           3
5           2
6           5
7           2
8          12
9           6
10          3
11          7

In [189]: df2
Out[189]: 
   ThingsToFind
0             2
1             3
2             5
3             6
4             7
5             8
6             9

In [190]: process_dfs(df, df2)
Out[190]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
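For comparison, pandas has a built-in that should give the same kind of result: Index.get_indexer returns the position of each target value in the index and -1 where it is missing. A minimal sketch using the same sample data follows; it assumes the values in df2 are unique (get_indexer raises on a non-unique index), and the names lookup and positions are just illustrative.

```python
import pandas as pd

# Same sample data as in the run above
df = pd.DataFrame({"WholeList": [3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7]})
df2 = pd.DataFrame({"ThingsToFind": [2, 3, 5, 6, 7, 8, 9]})

# Build an Index from the values to search in (must be unique for get_indexer)
lookup = pd.Index(df2["ThingsToFind"].to_numpy())

# Position in df2 of each df value, or -1 where there is no match
positions = lookup.get_indexer(df["WholeList"].to_numpy())
print(positions)
```

Unlike the searchsorted version, this does not need df2 to be sorted, only unique.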

2 Comments

Thanks! This is an interesting approach.
This worked beautifully. isin() didn't give me what I wanted, but this was brilliant. Thanks!

I agree with @JDLong - IMO Pandas is pretty fast:

In [49]: %timeit df[df.WholeList.isin(df2.ThingsToFind)]
1 loop, best of 3: 819 ms per loop

In [50]: %timeit df.loc[df.WholeList.isin(df2.ThingsToFind)]
1 loop, best of 3: 814 ms per loop

In [51]: %timeit df.query("WholeList in @df2.ThingsToFind")
1 loop, best of 3: 837 ms per loop
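If you want to reproduce a comparison like this outside IPython, here is a rough sketch using the standard timeit module. The data size, the seed, and the function names are my own assumptions for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data, smaller than in the question so the sketch runs quickly
rng = np.random.default_rng(0)
df = pd.DataFrame({"WholeList": rng.integers(0, 1_000_000, size=100_000)})
df2 = pd.DataFrame({"ThingsToFind": np.arange(50_000) + 50_000})

def with_getitem():
    return df[df["WholeList"].isin(df2["ThingsToFind"])]

def with_loc():
    return df.loc[df["WholeList"].isin(df2["ThingsToFind"])]

def with_query():
    # query resolves @things from the local scope of this function
    things = df2["ThingsToFind"]
    return df.query("WholeList in @things")

# Time each variant over a few runs and report the average per call
for fn in (with_getitem, with_loc, with_query):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

All three variants select the same rows, so any difference is overhead in how the mask is built and applied.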

1 Comment

Thanks. I assumed there were other approaches than the brute force + loc method. I'll give this a try.
