After searching for a while, I can't find an answer to what must be a common issue, so pointers are welcome.

I have a dataframe:

import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5], 'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']]})

and I want to select a subset of its rows: those whose list in the 'C' column contains at least one value from a list of things I'm interested in, e.g.

listOfInterestingThings = ['a', 'g']

so when the filter is applied I would have a df1:

df1 = 
A  B      C    
5  1  ['a','b']
3  3  ['g','h']

The dataframe I'm dealing with is a massive raw data import: about 12 GB in RAM in its current df form, and about half that on disk as a series of JSON files.

  • Standard warning: non-scalar elements (such as lists) in Series and DataFrames don't have good support and are likely to lead to mysterious and unexpected behaviour. Caveat utilitor! Commented Apr 21, 2017 at 17:01
  • @DSM Interesting, I had no idea. Do you have any suggestions? What I'm doing is basic manipulation of a large text corpus prior to trying out some ML to train on topics. The data is ~6 GB of JSON files. Each doc is represented by a JSON element with tags for 'body' and 'topics'; the topics are presented as a list, e.g. ['topic1', 'topic2']. I load the data into a df with pd.DataFrame.from_dict. Is there a better way to manipulate large datasets with this structure? Commented Apr 21, 2017 at 17:11
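
One scalar-friendly way to run this kind of filter, sketched here under assumptions and not something the commenters proposed: `DataFrame.explode` (available since pandas 0.25) puts each list element on its own row, after which the ordinary `isin` applies. The variable names are illustrative only, and exploding a frame of the size described in the question would cost extra memory, so treat this as a small-scale sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})

# One row per list element; the original index label is repeated for each element.
long_df = df.explode('C')

# Plain scalar membership test on the exploded column.
hit = long_df['C'].isin(['a', 'g'])

# Collapse back to one boolean per original row: True if any element matched.
mask = hit.groupby(level=0).any()

df1 = df.loc[mask]
print(df1)
```

The `groupby(level=0)` step relies on the original index labels being unique, which holds for the default integer index used here.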

2 Answers

I fully agree with @DSM.

As a last resort you can use this:

In [21]: df.loc[pd.DataFrame(df.C.values.tolist(), index=df.index) \
                  .isin(listOfInterestingThings).any(axis=1)]
Out[21]:
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]

or:

In [11]: listOfInterestingThings = set(['a', 'g'])

In [12]: df.loc[df.C.apply(lambda x: len(set(x) & listOfInterestingThings) > 0)]
Out[12]:
   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]

Explanation:

In [22]: pd.DataFrame(df.C.values.tolist(), index=df.index)
Out[22]:
   0  1
0  a  b
1  b  c
2  g  h
3  x  y

In [23]: pd.DataFrame(df.C.values.tolist(), index=df.index).isin(listOfInterestingThings)
Out[23]:
       0      1
0   True  False
1  False  False
2   True  False
3  False  False
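
The explanation stops at the `isin` step, so here is the whole chain as one runnable sketch, including the final row-wise reduction. The intermediate names (`wide`, `hits`, `mask`) are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})
listOfInterestingThings = ['a', 'g']

# Step 1: spread each list over columns, one column per list position.
wide = pd.DataFrame(df['C'].tolist(), index=df.index)

# Step 2: element-wise membership test, same shape as `wide`.
hits = wide.isin(listOfInterestingThings)

# Step 3: a row qualifies if any of its positions matched.
mask = hits.any(axis=1)

# Step 4: use the boolean Series to select rows of the original frame.
df1 = df.loc[mask]
print(df1)
```

If the lists had different lengths, the shorter rows would be padded with missing values in `wide`, which `isin` would not match here.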

This also works:

df[[('a' in i) or ('g' in i) for i in df.C.values]]

   A  B       C
0  5  1  [a, b]
2  3  3  [g, h]
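
The expression above hard-codes 'a' and 'g'. A possible generalisation to an arbitrary collection of targets, as a sketch (the name `interesting` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [5, 6, 3, 4],
    'B': [1, 2, 3, 5],
    'C': [['a', 'b'], ['b', 'c'], ['g', 'h'], ['x', 'y']],
})

# Hypothetical name; a set keeps each membership test cheap.
interesting = {'a', 'g'}

# One Python bool per row: does any element of the row's list appear in the set?
mask = [any(item in interesting for item in items) for items in df['C']]

# A plain list of bools works as a boolean row mask.
df1 = df[mask]
print(df1)
```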

Benchmarks:

time df.loc[df.C.apply(lambda x: len(set(x) & listOfInterestingThings)> 0)]

CPU times: user 873 µs, sys: 193 µs, total: 1.07 ms
Wall time: 987 µs

time df[list(np.any(('a' in i) | ('g' in i) for i in df.C.values))]

CPU times: user 1.02 ms, sys: 224 µs, total: 1.24 ms
Wall time: 1.08 ms

time df.loc[pd.DataFrame(df.C.values.tolist(), index=df.index).isin(listOfInterestingThings).any(axis=1)]

CPU times: user 2.58 ms, sys: 1.01 ms, total: 3.59 ms
Wall time: 5.41 ms

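So, in short, the set-intersection `apply` approach from @MaxU's answer is the quickest of the three on this 4-row example, while the `isin` approach from the same answer is the slowest.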
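
Single `time` runs on a 4-row frame are noisy, so it may be worth repeating the comparison on something larger. The sketch below uses a synthetic frame; the size, the random two-letter lists, and the function names are all assumptions of mine, and the ranking may differ on real data.

```python
import random
import timeit

import pandas as pd

random.seed(0)
letters = 'abcdefghijklmnopqrstuvwxyz'
n_rows = 10_000  # assumed size, adjust as needed

# Synthetic data: each row holds a list of two distinct random letters.
big = pd.DataFrame({'C': [random.sample(letters, 2) for _ in range(n_rows)]})
interesting = {'a', 'g'}


def by_apply():
    # Set intersection per row.
    return big.loc[big['C'].apply(lambda x: len(set(x) & interesting) > 0)]


def by_isin():
    # Spread lists over columns, then element-wise isin and row-wise any.
    wide = pd.DataFrame(big['C'].tolist(), index=big.index)
    return big.loc[wide.isin(list(interesting)).any(axis=1)]


def by_comprehension():
    # Hard-coded membership tests in a list comprehension.
    return big[[('a' in i) or ('g' in i) for i in big['C']]]


for fn in (by_apply, by_isin, by_comprehension):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{fn.__name__}: {seconds * 1e3:.2f} ms per call')
```

Taking the minimum over several repeats reduces the influence of one-off slowdowns on the reported figure.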
