
I had to cast a subset of the columns of a big pandas DataFrame, and it was very slow. After a few tests I found that the cast itself is very fast; pandas seems to be slow when assigning the newly cast values back to the original DataFrame.

I then came up with another solution that performs a join and avoids assigning to a column subset; it runs pretty fast.
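The join-based code was only posted as an image, so the following is a minimal sketch of what such an approach might look like, not the asker's actual code. The frame size, the `subset` list, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 12 integer columns, of which the first 10 get cast.
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 12), dtype=np.int64))
subset = list(range(10))

# Cast the subset on its own, without writing back into df.
casted = df[subset].astype(float)

# Join the cast columns with the untouched ones (aligned on the index),
# then restore the original column order.
result = casted.join(df.drop(columns=subset))
result = result[df.columns]
```

This builds a new DataFrame instead of assigning into a column subset of the existing one, which is the part the question reports as slow.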

Why is pandas so slow? Might this be a bug? Can anyone reproduce the results?

[image: slow pandas]

Edit:

More tests and the code used to produce the DataFrame.

[image: slow pandas 2]

2
  • What dtypes does your DF have before casting? Do you have NaNs in your numeric columns? Commented Jun 15, 2016 at 21:33
  • Before casting, all columns in the subset have np.int64 as dtype. There are no NaNs. Commented Jun 15, 2016 at 22:31

2 Answers

1

There was just a doc note added about this - see here.

Basically you don't want to use loc when casting - instead do:

df[f] = df[f].astype(float)

Also, FYI, copy=False does no harm here, but it does no good either: going from ints to floats, a new array has to be allocated anyway.
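As a small illustration of the two points above, here is a sketch that casts with plain column assignment and then checks that the float data does not share memory with the original integer data. The frame and the column list `f` are hypothetical stand-ins for the asker's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 6), dtype=np.int64))
f = [0, 1, 2, 3]  # hypothetical subset of columns to cast

before = df[0].to_numpy()        # integer data backing column 0
df[f] = df[f].astype(float)      # plain column assignment, no .loc
after = df[0].to_numpy()         # float data after the cast

# An int64 -> float64 cast cannot reuse the old buffer.
new_buffer = not np.shares_memory(before, after)
```

Columns outside `f` keep their integer dtype.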

Edit - this was slower than I thought. Here's something of a workaround:

In [61]: df = pd.DataFrame(np.random.randint(0,1000, size=(10000, 1026)))

In [62]: f = list(range(1024))

In [63]: def cast(s):
    ...:     if s.name in f:
    ...:         return s.astype(float)
    ...:     else:
    ...:         return s

In [64]: %timeit df.apply(cast)
1 loop, best of 3: 389 ms per loop
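A different way to express the same per-column cast, not taken from the answer, is to pass a column-to-dtype mapping to `DataFrame.astype`. The sketch below uses a smaller hypothetical frame; no claim is made about how its speed compares with the `apply` workaround, so it is worth timing with `%timeit` on real data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 integer columns, of which the first 16 get cast.
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 20), dtype=np.int64))
f = list(range(16))

# astype accepts a mapping of column label -> dtype and returns a new frame;
# columns not named in the mapping keep their dtype.
via_dict = df.astype({col: float for col in f})
```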

5 Comments

Hmm, I tried casting from float to int and the other way round too. I also tried without loc as you suggested (I didn't include it in the picture), but it is still slow.
I added the benchmark to the question. Thanks for the info about copy; I was unsure how exactly it works :)
My timing was off; this is slower than I'd hope (you still shouldn't use loc for casting). It's a bit of a hack, but I've updated my answer with something that's faster, and I will open an issue.
Also, for future reference, post copy-paste-able code rather than images.
Neat solution :). I didn't know how to paste notebook code + output into SO. I'm at work, so I had to do it fast. Will do next time. Thanks!
0

Dropping the columns before reassigning them speeds things up. You can also try np.array:

column_names = newShortEntries.select_dtypes(include=[object]).columns
temp = newShortEntries[column_names].astype(bool)  # alternative: np.array(newShortEntries[column_names], dtype=np.bool_)
newShortEntries = newShortEntries.drop(columns=column_names)
newShortEntries[column_names] = temp
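To try the drop-then-reassign pattern without the original data, here is a self-contained sketch. The small frame and its column names are made up for illustration, and it assumes a pandas version that allows assigning several new columns at once from a DataFrame.

```python
import pandas as pd

# Hypothetical frame: two object-dtype flag columns and one integer column.
new_short_entries = pd.DataFrame({
    "flag_a": pd.Series([True, False, True], dtype=object),
    "flag_b": pd.Series([False, False, True], dtype=object),
    "count": [1, 2, 3],
})

column_names = new_short_entries.select_dtypes(include=[object]).columns
temp = new_short_entries[column_names].astype(bool)

# Drop the old columns first, then add the cast ones back as new columns.
new_short_entries = new_short_entries.drop(columns=column_names)
new_short_entries[column_names] = temp
```

One side effect to keep in mind: the re-added columns end up at the end of the frame, so the column order changes.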
