3

Given a dataframe df1 that maps names to ids:

         id
names   
a      535159
b      248909
c      548731
d      362555
e      398829
f      688939
g      674128

and a second dataframe df2 which contains lists of names:

    names      foo
0   [a, b, c]   9
1   [d, e]     16
2   [f]         2
3   [g]         3

What would be a vectorized method to retrieve the ids from df1 for each list item in each row, like this?

    names       foo    ids
0   [a, b, c]    9     [535159, 248909, 548731]
1   [d, e]      16     [362555, 398829]
2   [f]          2     [688939]
3   [g]          3     [674128]

This is a working method to achieve the same result using apply:

import pandas as pd
import numpy as np

mock_uids = np.random.randint(100000, 999999, size=7)

df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')


def with_apply(row):
    row['ids'] = [ df1.loc[name]['id'] for name in row['names'] ]
    return row

df2 = df2.apply(with_apply, axis=1)

3 Answers

2

I think truly vectorizing this is really hard. One idea to improve performance is to map through a dictionary. The if y in d condition keeps the solution working when a name has no match in the dictionary:

df1 = df1.set_index('names')

d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]

If all values match:

d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
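The difference between the two variants only shows up when a name is missing from df1. A minimal sketch, assuming a small made-up table in which the name 'z' has no id (the data here is hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical data: 'z' is deliberately absent from the lookup table.
df1 = pd.DataFrame({'id': [535159, 248909, 548731],
                    'names': ['a', 'b', 'c']}).set_index('names')
df2 = pd.DataFrame({'names': [['a', 'z'], ['b', 'c']], 'foo': [9, 16]})

d = df1['id'].to_dict()

# Guarded version: names that are not in d are silently dropped.
guarded = [[d[y] for y in x if y in d] for x in df2['names']]

# Unguarded version: the first missing name raises KeyError.
try:
    unguarded = [[d[y] for y in x] for x in df2['names']]
except KeyError as exc:
    missing = exc.args[0]  # the offending name
```

The guarded form trades a small membership test per item for robustness, so the resulting lists can be shorter than the input lists when names are missing.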

Test for 4k rows:

np.random.seed(2020)

mock_uids = np.random.randint(100000, 999999, size=7)

df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)

df1 = df1.set_index('names')

def with_apply(row):
    row['ids'] = [ df1.loc[name]['id'] for name in row['names'] ]
    return row

In [8]: %%timeit
   ...: df2.apply(with_apply, axis=1)
   ...: 
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %%timeit
   ...: d = df1['id'].to_dict()
   ...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
   ...: 
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]: %%timeit
    ...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
    ...: 
    ...: 
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
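For comparison, a route that stays inside pandas is to explode the lists, map the names through df1, and regroup by the original row label. This is only a sketch of that alternative, not one of the timed solutions above, and it has not been benchmarked here; it assumes pandas 0.25 or later (for Series.explode) and that every name has a match:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [535159, 248909, 548731, 362555,
                           398829, 688939, 674128],
                    'names': list('abcdefg')}).set_index('names')
df2 = pd.DataFrame({'names': [['a', 'b', 'c'], ['d', 'e'], ['f'], ['g']],
                    'foo': [9, 16, 2, 3]})

# One row per list element; the original row label is repeated in the index.
exploded = df2['names'].explode()

# Look up every name at once, then collect the ids back per original row.
ids = exploded.map(df1['id']).groupby(level=0).agg(list)

# Assignment aligns on the index, so each list lands on its source row.
df2['ids'] = ids
```

Whether this beats the dictionary comprehension depends on the data, since the regrouping step still builds one Python list per row.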

3 Comments

interesting approach, would love to profile this against other answers if they arrive
@lys - I added timings to my answer. It seems your solution is slow.
ah that's great! that's really clear that your solution is much faster/better than the other two (~250x !), and an approach I'd never even considered. thanks :)
1

One way using operator.itemgetter:

from operator import itemgetter

def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]

d = df1["id"]  # df1 is already indexed by 'names', as in the question
df2["ids"] = df2["names"].apply(listgetter)

Output:

       names  foo                       ids
0  [a, b, c]    9  [535159, 248909, 548731]
1     [d, e]   16          [362555, 398829]
2        [f]    2                  [688939]
3        [g]    3                  [674128]
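The isinstance check in listgetter is needed because itemgetter returns a tuple when given several keys but a bare value when given exactly one. A small sketch with a plain dictionary standing in for d (the dictionary contents are illustrative):

```python
from operator import itemgetter

# Illustrative lookup; a plain dict stands in for the Series used above.
d = {'a': 535159, 'b': 248909, 'f': 688939}

many = itemgetter('a', 'b')(d)  # several keys: a tuple of values
one = itemgetter('f')(d)        # a single key: the bare value, not a tuple

def listgetter(x):
    i = itemgetter(*x)(d)
    # Normalise both cases to a list.
    return list(i) if isinstance(i, tuple) else [i]

# An empty list of names is not handled: itemgetter() needs at least one key.
try:
    listgetter([])
    empty_ok = True
except TypeError:
    empty_ok = False
```

If rows with empty lists can occur, listgetter would need an explicit guard for that case.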

Benchmark on 100k rows:

d = df1["id"]  # Common item; df1 is indexed by 'names'
df2 = pd.concat([df2] * 25000, ignore_index=True)

%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]

# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2["ids2"] = df2["names"].apply(listgetter)

# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]

# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3 Comments

@jezrael itemgetter is still slightly faster, but given that it involves an import and defining a new function, I would rather stick to your new answer ;). But I still like itemgetter when speed is important.
Yes, now the difference is 20 ms ;) So it depends on the data.
this is super interesting, I'd not heard of this module - would be great to test this on different datasets, perhaps it performs better in some cases and not others. Thanks @Chris - I'll give it a go
0

this seems to work:

df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])

interested to know if this is the best approach
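One detail worth noting about this approach: passing a list of labels to .loc returns a sub-frame, so .values yields a NumPy array per row rather than a Python list as in the desired output. A minimal sketch of the difference, using the ids from the question; the .tolist() variant is an assumption about what was intended:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [535159, 248909, 548731, 362555,
                           398829, 688939, 674128],
                    'names': list('abcdefg')}).set_index('names')
df2 = pd.DataFrame({'names': [['a', 'b', 'c'], ['d', 'e'], ['f'], ['g']],
                    'foo': [9, 16, 2, 3]})

# As written in the answer: each element is a NumPy array.
arrays = list(df1.loc[name]['id'].values for name in df2['names'])

# Variant that produces plain Python lists instead.
lists = [df1.loc[name, 'id'].tolist() for name in df2['names']]
```

Each .loc call builds an intermediate pandas object per row, which is consistent with this approach being the slowest one in the timings above.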
