0

What's the best way to validate that the values in a column in a data frame match those in a given list (and, if not, are set to a value accordingly)?

I'm trying to replace invalid values for a field ("B") with the value "(not set)".

Here's what I've tried. The goal is to replace 'f' in 'B' with "(not set)". The trouble is validating against a list of values rather than a single boolean expression.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': ['a', 'b', 'f', 'a', 'b', 'c']})
valid_values = ['a', 'b', 'c', 'd']
validated_data = df.'B' not in [valid_values]
column_name = 'B'
df.loc[validated_data, column_name] = "(not set)"

The above taken from here: Conditional Replace Pandas

I've also tried:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': ['a', 'b', 'f', 'a', 'b', 'c']})
valid_values = ['a', 'b', 'c', 'd']

for x in df.B:
    if x not in valid_values:
        df.B = "(not set)"
...

However, for both I get the error "The truth value of a Series is ambiguous." I tried the recommendations at the following post to no avail.

Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Other recommendations greatly appreciated!

2
  • You will get more and better answers if you create a Minimal, Complete, and Verifiable example. Especially make sure that the input and expected data are complete (not pseudo-data), and can be easily cut and and paste into an editor to allow testing proposed solutions. Commented Jan 3, 2018 at 3:42
  • 1
    Thank you. Error's answer below does this, but I will also edit the above for future passers by. Commented Jan 3, 2018 at 17:54

1 Answer 1

1

You need to use the isin method of Series and DataFrame objects.

Consider this example:

df = pd.DataFrame({'A': [1, 2, 3,4,5,6], 'B': ['a', 'b', 'f','a','b','c']})

Resulting in this DataFrame:

    A   B
0   1   a
1   2   b
2   3   f
3   4   a
4   5   b
5   6   c

Let's say we want to remove invalid values in column B and those invallid values are a en b.

When we use the isin method we can run this code:

mask = df['B'].isin(['a','b'])
df.loc[mask,'B'] = 'not set'

Resulting in this DataFrame:

    A   B
0   1   not set
1   2   not set
2   3   f
3   4   not set
4   5   not set
5   6   c

The isin method checks if the values in a Series or DataFrame object are in the given iterable. See the docs for more information.

Sign up to request clarification or add additional context in comments.

1 Comment

Great, thank you very much. I adjusted mask = df['B'].isin(['a','b']) to be mask = ~df['B'].isin(['a','b']) so that it is checking if a value is NOT in the given list.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.