Finding peaks in a DataFrame

Question

I would like to compare the min/max values of a time-series with a test time-series. Additionally, I would like to compare the time of the "peaks". However, I'm having trouble extracting these features from a Pandas DataFrame.

Given the following data:

from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def fake_phase_data():
    in_li = []
    sample_points = 24 * 4

    for day, bias in zip((11, 12, 13), (.5, .7, 1.)):
        day_time = datetime(2016, 6, day, 0, 0, 0)
        for x in range(int(sample_points)):

            in_li.append((day_time + timedelta(minutes=15*x),
                          bias * np.sin(2 * np.pi * x / sample_points + (1.2*bias)),
                          bias))

    fake_df = pd.DataFrame(in_li, columns=("time", "phase_sig", "bias")).set_index("time")
    return fake_df


fp = fake_phase_data()
# Convert to pivot-table with 24 hour columns
dfs = {
    col: pd.pivot_table(
        fp,
        index=fp.index.date,
        columns=fp.index.hour,
        values=col,
        aggfunc='mean',
    )
    for col in fp.columns
}
ddf = pd.concat(dfs, axis=1)

Which looks like:

for i in range(len(ddf)):
    ddf["phase_sig"].iloc[i].plot()

I process the data:

def col_peaks(df, cols, peak_func):
    return [list(getattr(df[col], peak_func)(axis=1).values) for col in cols]


def peak_vals(df, cols, t_peak):
    peak_v = []

    for c_i, col in enumerate(cols):
        vals = df[col].values
        peak_idx = t_peak[c_i]
        peak_v.append(list(vals[np.arange(len(peak_idx)), peak_idx]))

    return peak_v

# I may want to process multiple columns later
# but let's focus on the single-column case
x_cols = ["phase_sig"]

# Technically, I also want the minimum
# but let's focus on the maximum case first
orig_t_max = col_peaks(ddf, x_cols, "idxmax")
print("Orig t_max", orig_t_max)

orig_v_max = peak_vals(ddf, x_cols, orig_t_max)
print("Orig v_max", orig_v_max)

# actual test data will be a single row in a dataframe
# but this test is fine for now
test_df = ddf.iloc[[0]]
test_t_max = col_peaks(test_df, x_cols, "idxmax")
print("Test t_max", test_t_max)

test_v_max = peak_vals(test_df, x_cols, test_t_max)
print("Test v_max", test_v_max)

And get the result:

Orig t_max [[4, 3, 1]]
Orig v_max [[0.4985414229286749, 0.6989567830263389, 0.9940657122457474]]
Test t_max [[4]]
Test v_max [[0.4985414229286749]]

How do I get both of these values without the weird loops I'm doing? I know I could make them more compact by using a list-comprehension, but I'd rather get rid of them altogether. Is there a way to deal with both DataFrame and Series without the awkward if-statement I use in col_peaks and peak_vals?

You could avoid using Series by e.g. .iloc[[0]] instead of .iloc[0], then just remove the two else branches.
– ferada, Commented Sep 28, 2018 at 22:09
@ferada you are correct. I have edited the question accordingly. I would have sworn I tried that...
– Seanny123, Commented Oct 1, 2018 at 21:59
I can add a reference implementation if requested, but you can numerically differentiate this (using vectorized operations). Where the first derivative is zero (or crosses zero, since this is numerical differentiation), it's an extremum (peak or valley). The sign of the second derivative at that position tells you whether it's a peak or a valley. You can calculate the numerical derivative by doing [x_(t+1) - x_t] / [(t+1) - t]
– Zack, Commented Oct 5, 2018 at 14:40
@Zack if you sketch out a basic answer, I can award the bounty to you. But yes, you're right, I forgot I could just do this operation with differentiation.
– Seanny123, Commented Oct 5, 2018 at 14:56
@Seanny123 I don't have time to do one before this closes (I'd have to brush up on pandas syntax; I haven't used it in a few years), but glad I could help :)
– Zack, Commented Oct 5, 2018 at 15:01
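
The differentiation idea from the comments could be sketched roughly as follows. This is only an illustration, not code from the thread: find_extrema is a hypothetical helper name, unit sample spacing is assumed, and noisy data or flat plateaus are not handled. Instead of computing a second derivative explicitly, it looks at how the sign of the first difference changes, which carries the same information.

```python
import numpy as np


def find_extrema(y: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (peak_indices, valley_indices) of a 1-D signal.

    Hypothetical helper; assumes evenly spaced samples and no plateaus.
    """
    dy = np.diff(y)                 # forward difference x_(t+1) - x_t
    change = np.diff(np.sign(dy))   # non-zero where the slope changes sign
    peaks = np.flatnonzero(change < 0) + 1    # rising, then falling
    valleys = np.flatnonzero(change > 0) + 1  # falling, then rising
    return peaks, valleys


# One day of 15-minute samples, same shape as the first day in the question
x = np.arange(96)
y = 0.5 * np.sin(2 * np.pi * x / 96 + 1.2 * 0.5)
peaks, valleys = find_extrema(y)
print("peak indices:", peaks, "valley indices:", valleys)
```

Unlike idxmax, this finds every interior local extremum rather than only the global one, so it only matches idxmax when the signal has a single peak per window.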

Reinderien · Accepted Answer · 2025-01-25 23:19:09Z

Rather than this:

for day, bias in zip((11, 12, 13), (.5, .7, 1.)):

you should use NumPy broadcasting, for both simplicity and performance.

You shouldn't use datetime(); instead, use the built-in pandas date_range() to generate the timestamp index, which will perform better.

Your pivot code can be simplified with resample(). Your last maximum operation can also use resample(), so long as you don't care about t_max. If you do care about t_max, then unfortunately resample() has no direct idxmax or argmax, so instead you can construct a pseudo-group in which the groups are represented by columns. I expect this to be faster, and it takes less code to write.

import numpy as np
import pandas as pd


def fake_phase_data() -> pd.DataFrame:
    sample_points = 24 * 4
    x = np.arange(sample_points)
    bias = np.array((0.5, 0.7, 1.0))[:, np.newaxis]
    sig = bias * np.sin(2*np.pi/sample_points*x + 1.2*bias)

    return pd.DataFrame(
        index=pd.date_range(
            name='time', inclusive='left', freq='15min',
            start=pd.Timestamp(year=2016, month=6, day=11),
            end=pd.Timestamp(year=2016, month=6, day=14),
        ),
        data={
            'phase_sig': sig.ravel(),
            'bias': np.broadcast_to(bias, (bias.size, sample_points)).ravel(),
        },
    )


def demo() -> None:
    fp = fake_phase_data()

    # Hourly means, then each day's maximum (which discards the time of day)
    ddf = fp.resample('h').mean()
    day_maxima = ddf['phase_sig'].resample('D').max()
    print('Daily sig maxima:')
    print(day_maxima)
    print()

    # resample() has no .argmax or .idxmax, so we need to do a pseudo-group
    by_day = ddf[['phase_sig']].set_index(
        [
            pd.Index(name='day', data=ddf.index.day),
            pd.Index(name='hour', data=ddf.index.hour),
        ],
    ).unstack(level='day')
    print('Daily max times:')
    print(by_day.idxmax())


if __name__ == '__main__':
    demo()
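
If you need the time and the value of both the maximum and the minimum for each day, one alternative to the unstack approach above is a plain groupby on the calendar day. The following is a sketch on made-up hourly data (three identical days), not on the pivoted frame from the demo; the column names t_max, v_max, t_min and v_min are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the hourly-mean signal: three identical days
index = pd.date_range('2016-06-11', periods=72, freq='h', name='time')
hour_of_day = np.arange(72) % 24
sig = pd.Series(
    np.sin(2 * np.pi * hour_of_day / 24 + 0.6),
    index=index, name='phase_sig',
)

# One group per calendar day (midnight-normalized timestamps as keys)
grouped = sig.groupby(sig.index.normalize())

# idxmax/idxmin give the timestamp of each extremum; max/min give its value
summary = pd.DataFrame({
    't_max': grouped.idxmax(),
    'v_max': grouped.max(),
    't_min': grouped.idxmin(),
    'v_min': grouped.min(),
})
print(summary)
```

Each row of summary describes one day, so a single test day can be compared against the others by selecting its row and subtracting.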
