Given a data frame df:

 Column A: [0, 1, 3, 4, 6]

 Column B: [0, 0, 0, 0, 0]

The goal is to conditionally replace values in column B. Wherever a value in column A is contained in a set assignedToA, the corresponding value in column B should be replaced with a constant b.

For example: if b=1 and assignedToA={1,4}, the result would be

Column A: [0, 1, 3, 4, 6]

Column B: [0, 1, 0, 1, 0]

My code for finding the matching A values and writing b into the corresponding rows of B looks like this:

df.loc[df['A'].isin(assignedToA),'B']=b

This code works, but it is really slow for a huge dataframe. Do you have any advice on how to speed this up?

The dataframe df has around 5 million rows, and assignedToA has at most 7 values.

1 Answer

You may find a performance improvement by dropping down to NumPy:

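import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

def jp(df, vals, k):
    # Work on the underlying NumPy array instead of going through df.loc
    B = df['B'].values
    # np.in1d expects a sequence, so convert the set to a list first
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df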

def original(df, vals, k):
    df.loc[df['A'].isin(vals),'B'] = k
    return df

df = pd.concat([df]*100000)

%timeit jp(df, {1, 4}, 1)        # 8.55ms
%timeit original(df, {1, 4}, 1)  # 16.6ms
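
As a variation on the same idea, here is a minimal sketch that uses np.isin, which newer NumPy releases recommend over np.in1d, and that returns a new frame instead of writing into the existing array. The function name replace_where_in is made up for illustration, and I have not timed this version, so treat it as something to benchmark on your own data rather than a guaranteed speed-up.

```python
import numpy as np
import pandas as pd

def replace_where_in(df, vals, k):
    """Return a copy of df with B set to k wherever A is in vals."""
    # Build the boolean mask on the raw NumPy array of column A.
    mask = np.isin(df['A'].to_numpy(), list(vals))
    # np.where allocates a fresh array, so the input frame is left untouched.
    return df.assign(B=np.where(mask, k, df['B'].to_numpy()))

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})
result = replace_where_in(df, {1, 4}, 1)
print(result['B'].tolist())  # [0, 1, 0, 1, 0]
```

Because the original frame is not modified, repeated %timeit runs always start from the same input, which makes the comparison a little cleaner.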

2 Comments

Thank you very much! The "np.in1d()" really helped me reduce the processing time by 60%!
@TAmoel, Excellent, feel free to accept if this resolves your question (green tick to left). But, equally, if this isn't fast enough you can wait a little longer for another answer :).
