Vectorized method for mapping a list from one Dataframe row to another Dataframe row

Question

Given a dataframe df1 table that maps ids to names:

         id
names   
a      535159
b      248909
c      548731
d      362555
e      398829
f      688939
g      674128

and a second dataframe df2 which contains lists of names:

    names      foo
0   [a, b, c]   9
1   [d, e]     16
2   [f]         2
3   [g]         3

What would be a vectorized method for retrieving the ids from df1 for each list item in each row, like this?

       names  foo                       ids
0  [a, b, c]    9  [535159, 248909, 548731]
1     [d, e]   16          [362555, 398829]
2        [f]    2                  [688939]
3        [g]    3                  [674128]

This is a working method to achieve the same result using apply:

import pandas as pd
import numpy as np

mock_uids = np.random.randint(100000, 999999, size=7)

df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')


def with_apply(row):
    row['ids'] = [ df1.loc[name]['id'] for name in row['names'] ]
    return row

df2 = df2.apply(with_apply, axis=1)

jezrael · Accepted Answer · 2020-12-10 07:08:24Z

I think vectorizing this is really hard. One idea to improve performance is to map with a dictionary. This solution uses the condition if y in d so that it still works when a name has no match in the dictionary:

df1 = df1.set_index('names')

d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]

If all values match:

d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
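
The difference between the two variants above only shows up when a name is missing from df1. A minimal sketch, assuming the ids from the question and a hypothetical name 'z' that has no entry in df1; the d.get variant is an extra illustration, not part of the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [535159, 248909, 548731, 362555, 398829, 688939, 674128],
    "names": list("abcdefg"),
}).set_index("names")

# 'z' is a hypothetical name that does not exist in df1
df2 = pd.DataFrame({"names": [["a", "b", "c"], ["d", "z"], ["f"]],
                    "foo": [9, 16, 2]})

d = df1["id"].to_dict()

# Variant with the filter: unmatched names are silently dropped
df2["ids_skip"] = [[d[y] for y in x if y in d] for x in df2["names"]]

# Alternative (assumption, not from the answer): keep positions aligned
# with the names list by using None for unmatched names
df2["ids_none"] = [[d.get(y) for y in x] for x in df2["names"]]

print(df2)
```

Without the filter, d[y] raises a KeyError on 'z', so the shorter form is only safe when every name is known to exist in df1.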

Test for 4k rows:

np.random.seed(2020)

mock_uids = np.random.randint(100000, 999999, size=7)

df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)

df1 = df1.set_index('names')

def with_apply(row):
    row['ids'] = [ df1.loc[name]['id'] for name in row['names'] ]
    return row

In [8]: %%timeit
   ...: df2.apply(with_apply, axis=1)
   ...: 
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %%timeit
   ...: d = df1['id'].to_dict()
   ...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
   ...: 
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [10]: %%timeit
    ...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
    ...: 
    ...: 
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
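
For comparison, here is a pandas-only alternative that was not part of the timings above: explode the lists, map the names through df1['id'], and regroup by the original row index. This is a sketch under the assumption that pandas 0.25 or later is available (for Series.explode); it has not been benchmarked here, so no claim is made about how it ranks against the dictionary approach.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [535159, 248909, 548731, 362555, 398829, 688939, 674128],
    "names": list("abcdefg"),
}).set_index("names")
df2 = pd.DataFrame({"names": [["a", "b", "c"], ["d", "e"], ["f"], ["g"]],
                    "foo": [9, 16, 2, 3]})

# One row per name; the original row index is repeated for each list item
exploded = df2["names"].explode()

# Look each name up in df1['id']; names without a match would become NaN
ids = exploded.map(df1["id"])

# Collect the ids back into one list per original row
df2["ids"] = ids.groupby(level=0).agg(list)

print(df2)
```

Note that an empty list in df2['names'] explodes to a single NaN row, so it would come back as a list containing NaN rather than an empty list.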

interesting approach, would love to profile this against other answers if they arrive
@lys - added timings to my answer. Seems your solution is slow.
ah that's great! that's really clear that your solution is much faster/better than the other two (~250x !), and an approach I'd never even considered. thanks :)

Chris · Accepted Answer · 2020-12-10 07:10:48Z

One way using operator.itemgetter:

from operator import itemgetter

def listgetter(x):
    i = itemgetter(*x)(d)
    return list(i) if isinstance(i, tuple) else [i]

d = df1["id"]  # Series of ids indexed by name (df1 already has 'names' as its index)
df2["ids"] = df2["names"].apply(listgetter)

Output:

       names  foo                       ids
0  [a, b, c]    9  [535159, 248909, 548731]
1     [d, e]   16          [362555, 398829]
2        [f]    2                  [688939]
3        [g]    3                  [674128]
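
The isinstance check in listgetter exists because itemgetter returns a tuple for several keys but a bare value for a single key. A small sketch of that behaviour, using a plain dict in place of the Series and a two-argument variant of listgetter (the mapping parameter is an assumption added here to avoid the global d):

```python
from operator import itemgetter

d = {"a": 535159, "b": 248909, "f": 688939}

multi = itemgetter("a", "b")(d)   # several keys: returns a tuple
single = itemgetter("f")(d)       # one key: returns the value itself

def listgetter(x, mapping):
    """Return the values for the keys in x as a list, for one or many keys."""
    i = itemgetter(*x)(mapping)
    return list(i) if isinstance(i, tuple) else [i]

print(multi, single)
print(listgetter(["a", "b"], d), listgetter(["f"], d))
```

Two caveats: an empty list makes itemgetter(*x) raise a TypeError, and the tuple check would misfire if the mapped values were themselves tuples.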

Benchmark on 100k rows:

d = df1["id"]  # Common item: Series of ids indexed by name
df2 = pd.concat([df2] * 25000, ignore_index=True)

%%timeit
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]

# 453 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2["ids2"] = df2["names"].apply(listgetter)

# 349 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df2['ids2'] = [[d[y] for y in x] for x in df2['names']]

# 371 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@jezrael itemgetter is still slightly faster, but given that it involves an import and defining a new function, I would rather stick to your new answer ;). But I still like itemgetter when speed is important.
this is super interesting, I'd not heard of this module - would be great to test this on different datasets, perhaps it performs better in some cases and not others. Thanks @Chris - I'll give it a go

lys · Accepted Answer · 2020-12-10 06:54:18Z

this seems to work:

df2['ids'] = list(df1.loc[name]['id'].values for name in df2['names'])

interested to know if this is the best approach
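
One detail of this approach: .values yields a NumPy array per row, so the ids column holds arrays rather than Python lists. A minimal sketch, assuming the ids from the question, that shows the difference and uses tolist() to get plain lists instead:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "id": [535159, 248909, 548731, 362555, 398829, 688939, 674128],
    "names": list("abcdefg"),
}).set_index("names")
df2 = pd.DataFrame({"names": [["a", "b", "c"], ["d", "e"], ["f"], ["g"]],
                    "foo": [9, 16, 2, 3]})

# As in the answer: one label-based lookup per row, each giving an ndarray
arrays = [df1.loc[name]["id"].values for name in df2["names"]]
print(type(arrays[0]))

# Same lookup, converted to plain Python lists
df2["ids"] = [df1.loc[name, "id"].tolist() for name in df2["names"]]
print(df2)
```

Each row still triggers a separate df1.loc call, which is consistent with the slow timing reported for this approach in the other answer.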
