
I am trying to search for a string in a pandas column. I have read that it should be fastest to sort the column first and then look the string up with searchsorted on the values. However, I found this to be slower than a brute-force comparison (array == string) on an equivalent numpy array.
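For reference, the searchsorted variant I mean looks roughly like this (a minimal sketch; e stands for the sorted column values from the setup below, and the needle string is just a placeholder):

needle = u'ASDASD123ASADKHX'
i = e.searchsorted(needle)             # binary search on the sorted values
found = i < len(e) and e[i] == needle  # confirm the hit is an exact match

To see why brute force wins anyway, I have performed the following tests: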

import timeit

setup4 = '''
import numpy as np, string, random

# 20000 random 16-character strings in a plain numpy object array
d = np.array([
        u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
        for _ in range(20000)
        ], dtype=object)
'''



setup5 = '''
import numpy as np, pandas as pd, string, random

header = [u'A', u'B', u'C', u'D', u'E', u'F', u'G',
          u'H', u'I', u'J', u'K', u'L', u'M', u'N']

# 20000 rows; columns B and C hold random 16-character strings,
# the remaining columns are constant fillers
data = [[pd.to_datetime('20140505'),
         u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
         u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
         u'sfgweorfjdfl',
         u'dsiofqjwel;dmfv',
         u'e3ruiwefjvgoiubg',
         u'3124oirjrg;klhbas',
         u';3rhfgfbnvsad3r',
         pd.to_datetime('20140505'),
         u'1234irtjurgbfas',
         u'12;rhfd;hb;oasere',
         u'124urgfdnv.,sadfg',
         u'1rfnhsdjk.dhafgsrdew',
         u'safeklrjh2nerfgsd.'
         ] for _ in range(20000)]

df = pd.DataFrame(data, columns=header)
df_sorted = df.sort(['B','C'])  # pandas >= 0.17: use df.sort_values(['B','C'])
e = df_sorted['B'].values
'''

setup6 = '''
import numpy as np, pandas as pd, string, random

header = [u'A', u'B', u'C', u'D', u'E', u'F', u'G',
          u'H', u'I', u'J', u'K', u'L', u'M', u'N']

# same construction as setup5, but the dataframe is left unsorted
data = [[pd.to_datetime('20140505'),
         u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
         u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
         u'sfgweorfjdfl',
         u'dsiofqjwel;dmfv',
         u'e3ruiwefjvgoiubg',
         u'3124oirjrg;klhbas',
         u';3rhfgfbnvsad3r',
         pd.to_datetime('20140505'),
         u'1234irtjurgbfas',
         u'12;rhfd;hb;oasere',
         u'124urgfdnv.,sadfg',
         u'1rfnhsdjk.dhafgsrdew',
         u'safeklrjh2nerfgsd.'
         ] for _ in range(20000)]

df = pd.DataFrame(data, columns=header)
f = df['B'].values
'''

print(timeit.timeit("index = d == u'ASDASD123ASADKHX'", setup=setup4, number=1000))
print(timeit.timeit("index = e == u'ASDASD123ASADKHX'", setup=setup5, number=1000))
print(timeit.timeit("index = f == u'ASDASD123ASADKHX'", setup=setup6, number=1000))

With the following results:

print(timeit.timeit("index = d == u'ASDASD123ASADKHX'", setup=setup4, number=1000))
0.808505267014

print(timeit.timeit("index = e == u'ASDASD123ASADKHX'", setup=setup5, number=1000))
3.06733738226

print(timeit.timeit("index = f == u'ASDASD123ASADKHX'", setup=setup6, number=1000))
1.64207848896

My question: why is the performance on the pure numpy array so much better, and how can I achieve the same performance with the data extracted from the pandas DataFrame?

Thank you very much.

  • I believe pandas, although it uses numpy arrays underneath, does more dtype checking and aligning, so it is slower: stackoverflow.com/questions/19834075/…
  • Okay, but in all three cases I am operating on numpy arrays. The only difference is that in the first case the array is constructed natively as a numpy array, whereas in the latter two cases the arrays are extracted from pandas dataframes via .values.
  • Your second setup sorts and returns a copy of the dataframe; the third setup does not, but there appears to be some overhead in constructing the dataframe and then returning the data as a numpy array. I don't know the inner workings of pandas well enough to explain more, but it would be useful to time just the creation of the dataframe, to get an idea of the cost of sorting and of accessing the data via .values.
  • Nothing in setup is counted by timeit's timer.
  • Ah OK, I misunderstood what your timeit was profiling. Yes, that is odd; there should not be a significant difference between them apart from the sorted versus unsorted dataframe.

2 Answers


Every string in a DataFrame is a Python object, so what you get from df['B'].values is an object array. But when you create a string array with np.array(), you get a fixed-width array in which every string uses the same byte count.

Here is an example: a is an array with S10 dtype, b is an array with object dtype.

import numpy as np
import random
import string
words = ["".join(random.choice(string.ascii_uppercase) for _ in range(10)) for _ in range(10000)]
a = np.array(words)    # fixed-width 'S10' array: all characters in one contiguous block
b = a.astype("O")      # object array of the same strings, like df['B'].values
%timeit a == "123"
%timeit b == "123"

The output:

10000 loops, best of 3: 122 µs per loop
10000 loops, best of 3: 164 µs per loop
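
Applied to the question's setup, one way to get the fixed-width behaviour back is to copy the object column into a fixed-width unicode array before searching (a sketch, assuming all strings are 16 characters as in the question; the copy itself costs time, so it only pays off when the same column is searched repeatedly):

g = df['B'].values.astype('U16')  # contiguous fixed-width copy of the object column
index = g == u'ASDASD123ASADKHX'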

1 Comment

Yes, I have realized that; that is why the numpy array in my test is created with dtype=object. But it would not explain the difference between the sorted and the unsorted array from pandas (cases 2 and 3).
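
A plausible explanation for that remaining gap, offered as an assumption rather than a verified fact: sorting rearranges the object pointers, so the comparison loop visits the string objects in an order unrelated to their layout in memory and loses cache locality. Repacking the characters into one contiguous fixed-width buffer removes the effect:

e_fixed = e.astype('U16')  # copy the sorted strings into one contiguous block (assumes 16-char strings)
index = e_fixed == u'ASDASD123ASADKHX'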

I tested your code in IPython and got pretty much the same performance for all three variants; only the sorted dataframe was marginally slower:

In [85]:

import numpy as np, pandas as pd, string, random  # pandas import added; the original snippet relied on pd already being bound in the session

d = np.array([
        u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16))
        for _ in range(20000)
        ], dtype=object)

header = [u'A', u'B', u'C', u'D', u'E', u'F', u'G',
          u'H', u'I', u'J', u'K', u'L', u'M', u'N']

data = [[pd.to_datetime('20140505'),
         u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
         u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(16)),
         u'sfgweorfjdfl',
         u'dsiofqjwel;dmfv',
         u'e3ruiwefjvgoiubg',
         u'3124oirjrg;klhbas',
         u';3rhfgfbnvsad3r',
         pd.to_datetime('20140505'),
         u'1234irtjurgbfas',
         u'12;rhfd;hb;oasere',
         u'124urgfdnv.,sadfg',
         u'1rfnhsdjk.dhafgsrdew',
         u'safeklrjh2nerfgsd.'
         ] for _ in range(20000)]

df = pd.DataFrame(data, columns=header)
df_sorted = df.sort(['B','C'])  # pandas >= 0.17: use df.sort_values(['B','C'])
e = df_sorted['B'].values
f = df['B'].values
%timeit index = d == u'ASDASD123ASADKHX'
%timeit index = e == u'ASDASD123ASADKHX'
%timeit index = f == u'ASDASD123ASADKHX'
1000 loops, best of 3: 536 µs per loop
1000 loops, best of 3: 568 µs per loop
1000 loops, best of 3: 538 µs per loop

9 Comments

  • That's interesting. What happens if you use the code exactly as printed above? I have added the import timeit, so it should now be a self-contained Python file.
  • I ran your code and got the following: 2.11338382930262, 1.2496556612022687, 0.6459569358412409, which is the opposite of what you observed. I then ran it again and got 0.5910921373142628, 1.7401513672084548, 0.5598322421719786. If you rerun your code, do you get random results?
  • Just ran it again and got 0.5474049547920004, 0.6093832207843661, 0.5601899379689712; there doesn't seem to be any mystery here to me.
  • No, it is actually very consistent. The numbers vary slightly, but not by much.
  • I have now repeated the test on my Linux PC at home (copying the code from here) and the result is the same: 1.6, 5.5, and 3.2 seconds. I am wondering what the difference between your test and mine is.
