Python: Select Rows by value in large dataframe

Question

Given a data frame df:

 Column A: [0, 1, 3, 4, 6]

 Column B: [0, 0, 0, 0, 0]

The goal is to conditionally replace values in column B. If a value in column A exists in a set assignedToA, the corresponding value in column B is replaced with a constant b.

For example: if b=1 and assignedToA={1,4}, the result would be

Column A: [0, 1, 3, 4, 6]

Column B: [0, 1, 0, 1, 0]

My code for finding the matching A values and writing b into column B looks like this:

df.loc[df['A'].isin(assignedToA),'B']=b

This code works, but it is really slow on a huge dataframe. Do you have any advice on how to speed it up?

The dataframe df has around 5 million rows, and assignedToA has at most 7 values.

jpp · Accepted Answer · 2018-06-05 12:49:13Z

You may find a performance improvement by dropping down to numpy:

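import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

def jp(df, vals, k):
    # Work on the underlying NumPy array instead of going through .loc
    B = df['B'].values
    # Boolean mask of rows whose A value is in vals
    # (newer NumPy versions recommend np.isin over np.in1d)
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df

def original(df, vals, k):
    # The approach from the question, kept for comparison
    df.loc[df['A'].isin(vals),'B'] = k
    return df

# Repeat the 5 rows 100,000 times to get a 500,000-row frame
df = pd.concat([df]*100000)

# %timeit is an IPython/Jupyter magic
%timeit jp(df, {1, 4}, 1)        # 8.55ms
%timeit original(df, {1, 4}, 1)  # 16.6ms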
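
As a sanity check, here is a minimal sketch (not part of the timings above) that confirms the NumPy route gives the same column B as the .loc version. It assumes a NumPy version that provides np.isin, which the NumPy docs recommend over np.in1d for new code. The helper name set_where_member is hypothetical, and unlike jp() it returns a copy rather than modifying df in place.

```python
import numpy as np
import pandas as pd

def set_where_member(df, vals, k):
    """Hypothetical helper: return a copy of df with B set to k where A is in vals."""
    # Boolean mask computed on the raw NumPy array of column A
    mask = np.isin(df['A'].to_numpy(), list(vals))
    # Copy column B so the input frame is left untouched
    b = df['B'].to_numpy().copy()
    b[mask] = k
    out = df.copy()
    out['B'] = b
    return out

small = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                      'B': [0, 0, 0, 0, 0]})

result = set_where_member(small, {1, 4}, 1)

# Reference result using the .loc approach from the question
expected = small.copy()
expected.loc[expected['A'].isin({1, 4}), 'B'] = 1

print(result['B'].tolist())  # [0, 1, 0, 1, 0]
```

Returning a copy costs an extra allocation, so for the 5 million row case the in-place variant in jp() may still be the faster choice; it is worth timing both on your own data.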

edited Jun 5, 2018 at 12:49

answered Jun 5, 2018 at 12:41


2 Comments

TAmoel Over a year ago

Thank you very much! The "np.in1d()" really helped me reduce the processing time by 60%!

jpp Over a year ago

@TAmoel, Excellent, feel free to accept if this resolves your question (green tick to left). But, equally, if this isn't fast enough you can wait a little longer for another answer :).
