Pandas setting column subset slow

Question

I had to cast a subset of the columns of a big pandas DataFrame, and it was very slow. After a few tests I found that the cast itself is fast; pandas seems to be slow when assigning the newly cast values back to the original DataFrame.

I then came up with another solution that performs a join and avoids assigning to a column subset; it runs pretty fast.
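The original benchmark code was posted as an image, so the following is only a sketch of what a join-based approach of this kind might look like. The frame, the column names, and the `subset` list are hypothetical stand-ins, not the code from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real one in the question is much larger.
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 6)),
                  columns=list("abcdef"))
subset = ["a", "b", "c"]  # hypothetical columns to cast

# Cast only the subset, which yields a new DataFrame.
casted = df[subset].astype(float)

# Join the untouched columns with the casted ones instead of assigning
# into the existing frame, then restore the original column order.
result = df.drop(columns=subset).join(casted)[df.columns]

print(result.dtypes)
```

The join builds a new frame rather than writing into the existing one, which is the part the question reports as slow.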

Why is pandas so slow? Might this be a bug? Can anyone reproduce the results?

Edit:

More tests and the code used to produce the DataFrame.

What dtypes does your DF have before casting? Do you have NaNs in your numeric columns?
– MaxU - stand with Ukraine, Commented Jun 15, 2016 at 21:33
Before casting, all columns in the subset have np.int64 as dtype. There are no NaNs.
– Alan Höng, Commented Jun 15, 2016 at 22:31

chrisb · Accepted Answer · 2016-06-15 23:17:15Z

1

There was just a doc note added about this - see here.

Basically you don't want to use loc when casting - instead do:

df[f] = df[f].astype(float)

Also, fyi the copy=False doesn't do any harm here, but it doesn't do any good either - going from ints to floats you're going to have to allocate a new array.
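A minimal sketch of the point about copy=False, shown with plain NumPy arrays rather than pandas objects (an assumption made to keep the example version-independent): the flag can only avoid a copy when the dtype does not actually change.

```python
import numpy as np

ints = np.arange(5, dtype=np.int64)

# Same dtype with copy=False: NumPy can hand back the input array itself.
same = ints.astype(np.int64, copy=False)

# int64 -> float64 changes the stored representation, so a new array is
# allocated even though copy=False was requested.
floats = ints.astype(np.float64, copy=False)

print(same is ints)                    # True
print(np.shares_memory(ints, floats))  # False
```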

Edit - this was slower than I thought. Here's something of a workaround:

In [61]: df = pd.DataFrame(np.random.randint(0,1000, size=(10000, 1026)))

In [62]: f = list(range(1024))

In [63]: def cast(s):
    ...:     if s.name in f:
    ...:         return s.astype(float)
    ...:     else:
    ...:         return s

In [64]: %timeit df.apply(cast)
1 loop, best of 3: 389 ms per loop
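As a possible alternative to the apply-based workaround, `DataFrame.astype` also accepts a column-to-dtype mapping. The sketch below uses a smaller frame than the one above and does not measure speed; whether it is faster than `apply` on a given pandas version is an open assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 12)))
f = list(range(10))  # columns to cast, mirroring `f` above at a smaller size

# Cast only the listed columns; the others keep their dtype.
# astype returns a new DataFrame, so rebind the name to keep the result.
df = df.astype({col: float for col in f})

print(df.dtypes.value_counts())
```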

edited Jun 15, 2016 at 23:17

answered Jun 15, 2016 at 22:56

chrisb


5 Comments

Alan Höng Over a year ago

Hmm, I tried casting from float to int and the other way round too. I also tried it without loc as you suggested (I didn't include that in the picture), but it is still slow.

Alan Höng Over a year ago

I added the benchmark to the question. Thanks for the info about copy, I was unsure how exactly it works :)

chrisb Over a year ago

My timing was off; this is slower than I'd hoped (you still shouldn't use loc for casting). It's a bit of a hack, but I've updated my answer with something that's faster, and I will open an issue.

chrisb Over a year ago

Also, for future reference, post copy-paste-able code, rather than images.

Alan Höng Over a year ago

Neat solution :). I didn't know how to paste notebook code + output into SO. I'm at work so had to do it fast. Will do next time. Thanks!

user1689987 · Accepted Answer · 2023-10-08 23:52:31Z

0

Dropping the columns before reassigning them speeds things up. You can also try building the values with np.array:

column_names = newShortEntries.select_dtypes(include=[object]).columns
temp = newShortEntries[column_names].astype(bool)  # alternative: np.array(newShortEntries[column_names], dtype=np.bool_)
newShortEntries = newShortEntries.drop(columns=column_names)
newShortEntries[column_names] = temp
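A self-contained sketch of the same drop-then-reassign pattern on a tiny made-up frame. The `frame` variable and its contents are hypothetical and stand in for `newShortEntries`; no timing is claimed here.

```python
import pandas as pd

# Hypothetical stand-in data; object columns are created explicitly.
frame = pd.DataFrame({
    "flag": pd.Series(["x", "", "y"], dtype=object),
    "other": pd.Series(["a", "b", ""], dtype=object),
    "n": [1, 2, 3],
})

column_names = frame.select_dtypes(include=[object]).columns

# Cast first; empty strings become False, non-empty strings become True.
temp = frame[column_names].astype(bool)

# Drop the old columns, then add the casted ones back as new columns.
frame = frame.drop(columns=column_names)
frame[column_names] = temp

print(frame.dtypes)
```

Note that the reassigned columns are appended at the end, so the column order differs from the original frame.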

answered Oct 8, 2023 at 23:52

user1689987
