I am trying to filter a dataframe by column values, but I cannot get it to work. Suppose I have the following dataframe:

Index Column1 Column2
1      path1   ['red']
2      path2   ['red' 'blue']
3      path3   ['blue']

My dataframe has exactly that format. I want to create a sub-dataframe with the rows containing only ['red'] in Column2. That would be just the first row.

What I tried so far, among other approaches, is:

classes = ['red']
df=df.loc[df['Column2'].isin(classes)]

But it does not work. I get this warning and the dataframe remains unchanged:

FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  f = lambda x, y: htable.ismember_object(x, values)

How can this be done correctly? Thanks.

Edit: I think I did not explain myself very well.

My data, for example ['red' 'blue'], does not have a comma in the middle, and its type is 'object'. I would like to filter the original dataframe so that it shows the rows whose 'Column2' contains, for example, red. In that case, it would show rows 1 and 2. Is that possible?

2 Answers

3

One possible solution is to compare sets; the advantage is that the order of the elements in lists with more than one item does not matter. First convert the strings to lists:

import ast
df['Column2'] = df['Column2'].str.replace(' ', ', ').apply(ast.literal_eval)

Alternative:

df['Column2'] = df['Column2'].fillna("''").str.findall(r"'(.+?)'")

classes = ['red']
df1 = df[~df.Column2.map(set(classes).isdisjoint)]
print (df1)

   Index Column1      Column2
0      1   path1        [red]
1      2   path2  [red, blue]
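
For anyone who wants to keep Column2 in its original format, here is a minimal sketch of the same set comparison done on a separate helper Series. It is not the answer's exact code: the helper name to_label_set, the extra NaN row and the exact-match filter are assumptions added for illustration. It also accepts cells that are lists or NumPy arrays, since ['red' 'blue'] without commas is how NumPy prints an array; whether the OP's cells are strings or arrays is not confirmed in the thread.

```python
import re

import numpy as np
import pandas as pd


def to_label_set(cell):
    """Hypothetical helper: return the labels in one cell as a set."""
    if isinstance(cell, str):
        # Strings such as "['red' 'blue']": pull out the quoted tokens.
        return set(re.findall(r"'([^']+)'", cell))
    if isinstance(cell, (list, tuple, set, np.ndarray)):
        # Cells that already hold a sequence of labels.
        return set(cell)
    # NaN, None or anything else counts as "no labels".
    return set()


# Sample data modelled on the question; the fourth row (NaN) is an assumption.
df = pd.DataFrame({
    "Index": [1, 2, 3, 4],
    "Column1": ["path1", "path2", "path3", "path4"],
    "Column2": ["['red']", "['red' 'blue']", "['blue']", np.nan],
})

classes = {"red"}

# Helper Series of sets; df["Column2"] itself is left untouched.
labels = df["Column2"].map(to_label_set)

# Rows whose labels share at least one element with classes.
contains_any = df[labels.map(lambda s: not classes.isdisjoint(s))]

# Rows whose labels are exactly the requested classes.
exact_match = df[labels.map(lambda s: s == classes)]

print(contains_any["Index"].tolist())  # [1, 2]
print(exact_match["Index"].tolist())   # [1]
```

The first filter corresponds to the edited question (rows 1 and 2), the second to the original one (only the row containing just ['red']).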

12 Comments

Good to know I am not the only one who thought that OP's dataframe looks exactly the way he described it (no comma between red and blue). That is why I initially deleted my answer too (I had assumed he made a typo). Anyway, I like this solution; I couldn't have thought of that. Thanks. :)
Actually, that is the point! My dataframe does not have the comma; it is as I wrote initially (a mod edited my question and put the comma in). What I have is data of type "object" that looks like ['red' 'blue']. Thank you both, I am going to try these solutions and I will come back.
Do you also have it as ['red' 'blue'] and not [red blue] (without quotes)?
@EAlvarado - Is it possible to use df['Column2'] = df['Column2'].str.replace(' ', ', ').apply(ast.literal_eval) before my solution?
@Ankur Sinha - They have quotes (single quotes). @jezrael - I tried, but I get the following error: ValueError: malformed node or string: nan. I would actually like to keep the original format if possible.
1

Your dataframe, reproduced exactly as you described it:

import pandas as pd

df = pd.DataFrame()
df['Index'] = [1, 2, 3]
df['Column1'] = ['path1', 'path2', 'path3']
df['Column2'] = ["['red']", "['red' 'blue']", "['blue']"]

Dataframe:

   Index Column1         Column2
0      1   path1         ['red']
1      2   path2  ['red' 'blue']
2      3   path3        ['blue']


Possible Solution

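You can try stripping the [, ] and ' characters:

# regex=False treats '[' and ']' as literal characters, not as regex syntax
df['Column2'] = df['Column2'].str.replace('[', '', regex=False)
df['Column2'] = df['Column2'].str.replace(']', '', regex=False)
df['Column2'] = df['Column2'].str.replace('\'', '', regex=False)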

Now do:

classes = ['red']
df = df[df.Column2.str.contains('|'.join(classes), na=False)]

Output:

   Index Column1   Column2
0      1   path1       red
1      2   path2  red blue
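
One caveat with str.contains is that it matches substrings, so searching for red would also hit a label such as darkred. Below is a minimal sketch of a variant that leaves Column2 untouched and only matches a class when it is wrapped in single quotes. The fourth row with 'darkred' is a hypothetical example, not part of the OP's data, and the sketch assumes the cells are strings.

```python
import re

import pandas as pd

# Sample data modelled on the question; the 'darkred' row is hypothetical.
df = pd.DataFrame({
    "Index": [1, 2, 3, 4],
    "Column1": ["path1", "path2", "path3", "path4"],
    "Column2": ["['red']", "['red' 'blue']", "['blue']", "['darkred']"],
})

classes = ["red"]

# Build one alternation of quoted, regex-escaped class names: 'red'|'blue'|...
pattern = "|".join("'" + re.escape(c) + "'" for c in classes)

# na=False makes missing values count as "no match" instead of propagating NaN.
mask = df["Column2"].str.contains(pattern, regex=True, na=False)

print(df.loc[mask, "Index"].tolist())  # [1, 2]
```

Because the filter is a boolean mask, the original strings in Column2 keep their brackets and quotes.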
