Apply function for multiple dataframes and columns

Question

Hello I am working with two dataframes, and need to apply a custom-made function but I'm getting the following error: ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0'). I know why this is happening, but don't know how to solve the problem.

The first dataframe contains a list of all workable days for the current year:

print(df_workable)
          Date  workable_day  inv_workable_day  day  month
1   2019-01-02           1.0              22.0    2      1
2   2019-01-03           2.0              21.0    3      1
3   2019-01-04           3.0              20.0    4      1
6   2019-01-07           4.0              19.0    7      1
7   2019-01-08           5.0              18.0    8      1
..         ...           ...               ...  ...    ...
364 2019-12-31          20.0               1.0   31     12

The second dataframe contains data regarding some day values and a flag.

print(df)
       day_a1     wday_a1     iwday_a1       flag
0        24.0         4.0          6.0        2.1
1         NaN         NaN          NaN        NaN
3        31.0        22.0          1.0        2.2
4        27.0        18.0          5.0  3.3.2.1.3
26816    25.0        19.0          5.0          1
26817    31.0         NaN          NaN        3.2

I'm trying to apply a function that will return a date from either dataframe depending on multiple conditions (but I'm just using "this" and "that" for simplicity). This is the function:

def rec_date(row):
    if row['flag'] == '2.1':
        if df_workable[df_workable['workable_day'] == int(row['wday_a1']) & df_workable['month'] == 1]['day'] <= dt.datetime.today().day:
            val = "this"
        else:
            val = "that"
    else:
        val = "Still missing"
    return val

The issue arises with condition 2.1, where I need to go through each row of df and check a condition. I believe it arises because, for each row of df, pandas doesn't know which row of df_workable to use, so it needs an extra method call (.all(), .any(), etc.). However, I do not wish to iterate, but simply to extract the value corresponding to:

df_workable[df_workable['workable_day'] == 4 & df_workable['month'] == 1]['day']

(I'm hard-coding 4 because it would be the first value passed from df['wday_a1'].) The output for that should be 7, and comparing that value to dt.datetime.today().day, which is 10, would return True. I've tested both expressions individually and they do return the expected output. However, the problem arises when applying this function over the dataframe, because of (I believe) the reasons explained above. After applying the function I expect to have this:

df['rec_date'] = df.apply(rec_date,axis=1)
           day_a1     wday_a1     iwday_a1       flag         rec_date
    0        24.0         4.0          6.0        2.1             this
    1         NaN         NaN          NaN        NaN    Still missing
    3        31.0        22.0          1.0        2.2    Still missing
    4        27.0        18.0          5.0  3.3.2.1.3    Still missing
    26816    25.0        19.0          5.0          1    Still missing
    26817    31.0         NaN          NaN        3.2    Still missing

Arno Maeckelberghe · Accepted Answer · 2019-10-10 13:51:14Z

There are two small issues with your code:

You want to combine two conditions with &, but & binds more tightly than ==, so you should wrap each condition in parentheses to clearly separate them: (x == ...) & (y == ...)
The result of this check is a Series (with only one observation in it). pandas refuses to convert this Series of booleans into a single boolean because, if the Series had multiple values, it would not know how to aggregate them (should the result be True only if all values are True, or is it enough if at least one of them is True?). Therefore you should make that explicit by adding .all() or .any() to your check.

def rec_date(row):
    if row['flag'] == '2.1':
        if (df_workable[(df_workable['workable_day'] == int(row['wday_a1'])) & (df_workable['month'] == 1)]['day'] <= dt.datetime.today().day).all():
            val = "this"
        else:
            val = "that"
    else:
        val = "Still missing"
    return val

Output:

       day_a1  wday_a1  iwday_a1       flag       rec_date
0        24.0      4.0       6.0        2.1           this
1         NaN      NaN       NaN        NaN  Still missing
3        31.0     22.0       1.0        2.2  Still missing
4        27.0     18.0       5.0  3.3.2.1.3  Still missing
26816    25.0     19.0       5.0          1  Still missing
26817    31.0      NaN       NaN        3.2  Still missing
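
To illustrate the first point about parentheses, here is a minimal sketch. It assumes a hypothetical five-row miniature of df_workable built from the values shown in the question, not the real data. Without parentheses the mask is parsed as a chained comparison, which is what triggers the ValueError:

```python
import pandas as pd

# Hypothetical miniature of df_workable, using values shown in the question.
df_workable = pd.DataFrame({
    "workable_day": [1.0, 2.0, 3.0, 4.0, 5.0],
    "day": [2, 3, 4, 7, 8],
    "month": [1, 1, 1, 1, 1],
})

# Without parentheses, & is evaluated before ==, so Python reads the mask as
#   workable_day == (4 & month) == 1
# a chained comparison that calls bool() on a Series and raises ValueError.
try:
    df_workable[df_workable["workable_day"] == 4 & df_workable["month"] == 1]
    unparenthesized_error = None
except ValueError as exc:
    unparenthesized_error = exc

# With parentheses, each comparison yields a boolean Series and & combines
# the two element-wise.
mask = (df_workable["workable_day"] == 4) & (df_workable["month"] == 1)
matching_days = df_workable.loc[mask, "day"]

print(type(unparenthesized_error).__name__)
print(matching_days.tolist())
```

In this sketch the parenthesized mask selects the single row whose day is 7, matching the value the question expects.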

Alright, I understand what you are saying. The first part returns a Series with a single value (because there's only one day/month combination per year), and comparing it to a single value (today's day) is what raises the issue. Furthermore, from what I understand, in this particular case (the Series has only one value) using .any() and .all() would have the same effect, right? Thanks for your answer.
Your statement is partly correct. Comparing the Series to a single value is not what raises the issue. That part works and simply returns a Series of booleans. The issue is that if Series: fails, because an if statement needs a single boolean. A pd.Series of booleans will not convert to a single boolean, even if its length is one, so you have to explicitly tell Python how to decide whether that Series 'is' True. Glad I could help :) Good luck!
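
The behaviour discussed in these comments can be checked with a small sketch. The Series below are made up for illustration. For a one-element Series, .all() and .any() agree, but they differ on an empty Series, which is what the filter would produce for a row with no match in df_workable:

```python
import pandas as pd

single = pd.Series([True])          # one-element boolean Series
empty = pd.Series([], dtype=bool)   # what a filter with no matching row yields

# An if statement calls bool() on its condition; pandas refuses for any
# Series, regardless of its length.
try:
    bool(single)
    raised = False
except ValueError:
    raised = True

# Explicit reductions return a single boolean.
one_all = single.all()
one_any = single.any()
one_item = single.item()  # raises ValueError unless exactly one element

# With no matching row the two reductions disagree.
empty_all = empty.all()   # vacuously True
empty_any = empty.any()   # False

print(raised, one_all, one_any, one_item, empty_all, empty_any)
```

So in rec_date, .all() sends a row with no matching workable day down the "this" branch, while .any() would send it down the "that" branch.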

felipecgonc · Accepted Answer · 2019-10-10 13:56:24Z

So, let's break this statement down:

df_workable[df_workable['workable_day'] == 4 & df_workable['month'] == 1]['day']

df_workable: the complete DataFrame
df_workable[df_workable['workable_day'] == int(row['wday_a1']) & df_workable['month'] == 1]: you are filtering the DataFrame based on specific values for workable_day and month. This returns a new DataFrame, with the filtered results of the whole DataFrame.
df_workable[df_workable['workable_day'] == int(row['wday_a1']) & df_workable['month'] == 1]['day']: this takes the DataFrame returned in step 2 and accesses its ['day'] column. This returns a pandas.Series object, which contains all values of the filtered DataFrame's day column.

Which means, when you do df_workable[df_workable['workable_day'] == int(row['wday_a1']) & df_workable['month'] == 1]['day'] <= dt.datetime.today().day, you are trying to compare a whole Series object (which may contain multiple values, one per matching row) to a single integer (today's day of the month), NOT iterating through the rows.
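
The breakdown above can be followed step by step in a short sketch. It assumes a hypothetical miniature of df_workable, parenthesized conditions so that the mask evaluates, and a fixed today_day of 10 (the value mentioned in the question) instead of dt.datetime.today().day so that the result is reproducible:

```python
import pandas as pd

# Hypothetical miniature of df_workable, using values shown in the question.
df_workable = pd.DataFrame({
    "workable_day": [1.0, 2.0, 3.0, 4.0, 5.0],
    "day": [2, 3, 4, 7, 8],
    "month": [1, 1, 1, 1, 1],
})
today_day = 10  # assumption: stands in for dt.datetime.today().day

# Step 2: filtering with a boolean mask returns a new DataFrame.
mask = (df_workable["workable_day"] == 4) & (df_workable["month"] == 1)
filtered = df_workable[mask]

# Step 3: selecting one column of that DataFrame returns a Series.
day_series = filtered["day"]

# Comparing the Series to a scalar returns a boolean Series, not a bool.
comparison = day_series <= today_day

print(type(filtered).__name__, type(day_series).__name__, comparison.tolist())
```

The final object is still a Series, so it cannot be used directly as an if condition without first reducing it to a single boolean.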

I don't really get the comparison you are trying to do, but it doesn't seem possible to be done following your current logic.
