Select parts of pandas dataframe based on values in a list in a column

Question

After searching for a while, I can't find an answer to what must be a common problem, so pointers are welcome.

I have a dataframe:

df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5], 'C' : [['a','b'],['b','c'] ,['g','h'],['x','y']]})

and I want to select a subset of its rows: those whose lists in the 'C' column contain any value from a list of things I'm interested in, e.g.

listOfInterestingThings = ['a', 'g']

so when the filter is applied I would have a df1:

df1 = 
A  B      C    
5  1  ['a','b']
3  3  ['g','h']

The dataframe I'm dealing with is a massive raw data import: about 12 GB in RAM in its current df form, and about half that on disk as a series of JSON files.

Standard warning: non-scalar elements (such as lists) in Series and DataFrames don't have good support and are likely to lead to mysterious and unexpected behaviour. Caveat utilitor!
– DSM, Commented Apr 21, 2017 at 17:01
@DSM Interesting, I had no idea. Do you have any suggestions? I'm doing basic manipulation of a large text corpus before trying out some ML to train on topics. The data is ~6 GB of JSON files. Each doc is represented by a JSON element with tags for 'body' and 'topics'; the topics are presented as a list, e.g. ['topic1', 'topic2']. I load the data into a df with pd.DataFrame.from_dict. Can you suggest a better way to manipulate large datasets with this structure?
– Peter Coghill, Commented Apr 21, 2017 at 17:11

Community · Accepted Answer · 2017-05-23 12:25:57Z

I fully agree with @DSM.
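
One way to act on that warning, sketched here under the assumption that pandas 0.25 or later is available (for Series.explode), is to reshape the list column into one scalar per row and filter on that. The name `interesting` is a hypothetical stand-in for the question's list.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})
interesting = ['a', 'g']  # hypothetical name for the values of interest

# One row per list element; the original row label is repeated in the index.
exploded = df['C'].explode()

# Compare scalars, then collapse back to one boolean per original row.
mask = exploded.isin(interesting).groupby(level=0).any()

df1 = df[mask]
print(df1)
```

Whether this is practical for a frame of the size described in the question depends on how long the lists are, since the exploded Series holds one entry per list element.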

As a last resort you can use this:

In [21]: df.loc[pd.DataFrame(df.C.values.tolist(), index=df.index) \
                  .isin(listOfInterestingThings).any(axis=1)]
Out[21]:
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]

or:

In [11]: listOfInterestingThings = set(['a', 'g'])

In [12]: df.loc[df.C.apply(lambda x: len(set(x) & listOfInterestingThings) > 0)]
Out[12]:
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]
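
A possible variation on the set-based filter, offered as a sketch and not as part of the original answer: set.isdisjoint accepts any iterable, so each row's list does not have to be converted to a set first. It assumes every cell in 'C' really is a list; the name `interesting` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})
interesting = {'a', 'g'}  # hypothetical name for the values of interest

# A row is kept when its list shares at least one element with the set.
mask = df['C'].apply(lambda items: not interesting.isdisjoint(items))

df1 = df.loc[mask]
print(df1)
```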

Explanation of the first approach:

In [22]: pd.DataFrame(df.C.values.tolist(), index=df.index)
Out[22]:
   0  1
0  a  b
1  b  c
2  g  h
3  x  y

In [23]: pd.DataFrame(df.C.values.tolist(), index=df.index).isin(listOfInterestingThings)
Out[23]:
       0      1
0   True  False
1  False  False
2   True  False
3  False  False
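
The two intermediate frames above lead to the final mask through a row-wise any. The following sketch runs all three steps on the question's data; the variable names (`wide`, `hits`, `mask`) are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})
listOfInterestingThings = ['a', 'g']

# Step 1: spread each list across columns, keeping the original index.
wide = pd.DataFrame(df['C'].tolist(), index=df.index)

# Step 2: element-wise membership test against the values of interest.
hits = wide.isin(listOfInterestingThings)

# Step 3: a row qualifies if any of its columns matched.
mask = hits.any(axis=1)

df1 = df.loc[mask]
print(df1)
```

If the lists have different lengths, the shorter rows in the wide frame are padded with missing values, which isin treats as non-matches here.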

gold_cy · Accepted Answer · 2017-04-21 17:49:46Z

This also works:

df[[('a' in i) or ('g' in i) for i in df.C]]

   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]

Benchmarks:

time df.loc[df.C.apply(lambda x: len(set(x) & listOfInterestingThings)> 0)]

CPU times: user 873 µs, sys: 193 µs, total: 1.07 ms
Wall time: 987 µs

time df[list(np.any(('a' in i) | ('g' in i) for i in df.C.values))]

CPU times: user 1.02 ms, sys: 224 µs, total: 1.24 ms
Wall time: 1.08 ms

time df.loc[pd.DataFrame(df.C.values.tolist(), index=df.index).isin(listOfInterestingThings).any(1)]

CPU times: user 2.58 ms, sys: 1.01 ms, total: 3.59 ms
Wall time: 5.41 ms

So, in short, on this four-row example the df.C.apply version from @MaxU's answer is the quickest of the three.
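
Timings on a four-row frame may mostly reflect fixed overhead, so the ranking could change on larger data. The sketch below repeats the comparison on a synthetic frame built by tiling the example rows; the row count, the function names and the use of timeit are assumptions made for illustration, and no timings are claimed here.

```python
import timeit

import pandas as pd

# Hypothetical synthetic data: the four example rows repeated many times.
n_repeats = 2_500
df = pd.DataFrame({
    'A': [5, 6, 3, 4] * n_repeats,
    'B': [1, 2, 3, 5] * n_repeats,
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']] * n_repeats,
})
interesting = {'a', 'g'}


def with_apply():
    # Set intersection per row via Series.apply.
    return df['C'].apply(lambda x: len(set(x) & interesting) > 0)


def with_comprehension():
    # Plain Python membership tests in a list comprehension.
    return pd.Series([('a' in i) or ('g' in i) for i in df['C']], index=df.index)


def with_isin():
    # Spread lists into columns, then DataFrame.isin and a row-wise any.
    wide = pd.DataFrame(df['C'].tolist(), index=df.index)
    return wide.isin(list(interesting)).any(axis=1)


for fn in (with_apply, with_comprehension, with_isin):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

All three functions return the same boolean mask, so df[mask] selects the same rows whichever one is used.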
