pandas: Efficiently avoid 0s when taking log of cells in DataFrame

Question

I want to take the log of each cell in a very sparse pandas DataFrame and must avoid the 0s. At first I was checking for 0s with a lambda function, then I thought it might be faster to replace the many 0s with NaNs. I got some inspiration from this closely related question, and tried using a "mask." Is there a better way?

import math
import numpy as np

# first approach
# 7.61 s ± 1.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def get_log_1(df):
    return df.applymap(
        lambda x: math.log(x) if x != 0 else 0)

# second approach (faster!)
# 5.36 s ± 968 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def get_log_2(df):
    return (df
            .replace(0, np.nan)
            .applymap(math.log)
            .replace(np.nan, 0))

# third approach (even faster!!)
# 4.76 s ± 941 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def get_log_3(df):
    return (df
            .mask(df <= 0)
            .applymap(math.log)
            .fillna(0))

The df I'm using has shape (31064, 323) and is ~90% 0s. I think this generates something similar? a = np.zeros((30000, 300)); np.put(a, range(0, 3000), 1); df = pd.DataFrame(a).sample(frac=1)
– Dustin Michels, Commented Mar 10, 2018 at 10:37

jezrael · Accepted Answer · 2018-03-10 10:11:42Z

One possible solution is to use numpy.log:

print(np.log(df.mask(df <= 0)).fillna(0))

Or pure numpy:

df1 = pd.DataFrame(np.ma.log(df.values).filled(0), index=df.index, columns=df.columns)
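
As a quick sanity check, here is a minimal sketch that runs both variants on a tiny frame and confirms they agree. The frame contents and the names via_mask and via_ma are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the real (31064, 323) data.
df = pd.DataFrame({"a": [0.0, 1.0, np.e], "b": [10.0, 0.0, 0.0]})

# pandas route: turn non-positive cells into NaN, take the log,
# then write 0 back into the cells that were hidden.
via_mask = np.log(df.mask(df <= 0)).fillna(0)

# NumPy route: np.ma.log masks entries outside the log domain (<= 0),
# and .filled(0) writes 0 into those masked slots.
via_ma = pd.DataFrame(np.ma.log(df.values).filled(0),
                      index=df.index, columns=df.columns)

# Both routes should produce the same numbers.
assert np.allclose(via_mask.values, via_ma.values)
```

Note that a cell equal to 1 also comes out as 0, so in the result it cannot be told apart from a cell that was originally 0.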

edited Mar 10, 2018 at 10:11

answered Mar 10, 2018 at 10:02

jezrael


Dustin Michels Over a year ago

Wow! So much faster, and cleaner! Using math.log instead of np.log was a rookie move on my part.

jezrael Over a year ago

@DustinMichels - Can you add timings to question?

Dustin Michels Over a year ago

341 ms ± 7.08 ms the first way, 35.6 ms ± 5.57 ms the second way (measured with the IPython %timeit magic)
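
For anyone who wants to repeat the comparison outside IPython, a rough sketch with the standard timeit module might look like the following. The data here is a hypothetical stand-in (random values, roughly 90% zeros, much smaller than the real frame), and the helper names log_mask and log_ma are invented for this example, so the absolute numbers will not match the ones reported above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data: roughly 90% zeros, smaller than the real frame.
rng = np.random.default_rng(0)
values = rng.random((2000, 50))
values[rng.random(values.shape) < 0.9] = 0.0
df = pd.DataFrame(values)


def log_mask(frame):
    # The answer's first way: pandas mask + np.log + fillna.
    return np.log(frame.mask(frame <= 0)).fillna(0)


def log_ma(frame):
    # The answer's second way: NumPy masked-array log.
    return pd.DataFrame(np.ma.log(frame.values).filled(0),
                        index=frame.index, columns=frame.columns)


# Best of 3 repeats of 5 calls each, reported as seconds per call.
timings = {
    fn.__name__: min(timeit.repeat(lambda: fn(df), number=5, repeat=3)) / 5
    for fn in (log_mask, log_ma)
}
print(timings)
```

Taking the minimum over repeats is a common way to reduce noise from other processes; %timeit reports mean and standard deviation instead, so the two are not directly comparable.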
