I have grouped the following DataFrame by the host and operation columns:
df
Out[163]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 10069 to 1003
Data columns (total 8 columns):
args 100 non-null values
host 100 non-null values
kwargs 100 non-null values
log_timestamp 100 non-null values
operation 100 non-null values
thingy 100 non-null values
status 100 non-null values
time 100 non-null values
dtypes: float64(1), int64(2), object(5)
g = df.groupby(['host','operation'])
g
Out[165]: <pandas.core.groupby.DataFrameGroupBy object at 0x7f46ec731890>
g.groups.keys()[:10]
Out[166]:
[('yy39.segm1.org', 'gtfull'),
('yy39.segm1.org', 'updateWidg'),
('yy36.segm1.org', 'notifyTestsDelivered'),
('yy32.segm1.org', 'notifyTestsDelivered'),
('yy20.segm1.org', 'gSettings'),
('yy32.segm1.org', 'x_gWidgboxParams'),
('yy39.segm1.org', 'clearElems'),
('yy3.segm1.org', 'gxyzinf'),
('yy34.segm1.org', 'setFlagsOneWidg'),
('yy13.segm1.org', 'x_gbinf')]
Now I need to get an individual DataFrame for each ('host', 'operation') pair. I can do it by iterating over the group keys:
for el in g.groups.keys():
...: print el, 'VALUES', g.groups[el]
...:
('yy25.segm1.org', 'x_gbinf') VALUES [10021]
('yy36.segm1.org', 'gxyzinf') VALUES [10074, 10085]
('yy25.segm1.org', 'updateWidg') VALUES [10022]
('yy25.segm1.org', 'gtfull') VALUES [10019]
('yy16.segm1.org', 'gxyzinf') VALUES [10052, 10055, 10062, 10064]
('yy32.segm1.org', 'addWidging2') VALUES [10034]
('yy16.segm1.org', 'notifyTestsDelivered') VALUES [10056, 10065]
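As a side note, a single group can also be pulled out directly with get_group, which returns the sub-DataFrame for one key, e.g. using a key from the output above:

sub = g.get_group(('yy36.segm1.org', 'gxyzinf'))  # rows 10074, 10085 per the output above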
Questions:
Q1. Should I split the DataFrameGroupBy object into individual DataFrames, or is there a faster way of achieving the goal here?
Strategically: I need to calculate an exponentially weighted moving average (EWMA) and an exponentially weighted standard deviation (although the standard deviation should be attenuated much more slowly).
To this end, I need the data:
a. grouped by host, operation
b. each host/operation subset sorted by log_timestamp
c. EWMA and EWM std calculated for the time column.
Is there a way of achieving this without splitting the DataFrameGroupBy? (A sketch of roughly what I mean follows.)
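An untested sketch (pd.ewma/pd.ewmstd match the old pandas in my session; newer versions replace them with the .ewm() accessor, and the span values are placeholders):

import pandas as pd

def add_ewm(sub):
    # b. sort each host/operation subset by log_timestamp
    sub = sub.sort('log_timestamp')  # sort_values() in newer pandas
    # c. EWMA and EWM std of the time column; the std span is larger,
    #    so it is attenuated more slowly
    sub['time_ewma'] = pd.ewma(sub['time'], span=10)
    sub['time_ewmstd'] = pd.ewmstd(sub['time'], span=50)
    return sub

# a. grouped by host, operation -- apply runs per group, no manual split
result = df.groupby(['host', 'operation']).apply(add_ewm)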
Q2. The goal is to signal when the time for a particular host/operation becomes abnormal over the last several minutes (an overload condition). My idea: if I calculate a 'slow' EWM std and a 'slow' EWMA (over a longer period of time, say 1 hour), then the short-term EWMA (say, over 5 minutes) could be interpreted as an emergency value if it is more than 2 slow standard deviations away from the slow EWMA (along the lines of the three-sigma rule). I'm not even sure whether this is a correct or the best approach. Is it?
It may be, since it is roughly similar to how the UNIX 1m, 5m, and 15m load averages work: if the 15m average is normal but the 1m load average is much higher, you know the load has recently been much higher than usual. But I'm not sure about that.
The general pattern would presumably be something like (my_calc being a placeholder for the per-group calculation):

df.groupby(['host', 'operation']).apply(lambda x: my_calc(x))
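For instance, for Q2, an untested sketch of the alarm rule (the spans and the 2-sigma threshold are placeholders; with irregular timestamps the spans would have to be chosen in terms of samples per hour/minute, or expressed as time-based halflives in newer pandas):

import pandas as pd

def flag_overload(sub):
    sub = sub.sort('log_timestamp')  # sort_values() in newer pandas
    slow_ewma = pd.ewma(sub['time'], span=720)  # roughly 1 hr of samples
    slow_std = pd.ewmstd(sub['time'], span=720)
    fast_ewma = pd.ewma(sub['time'], span=60)   # roughly 5 min of samples
    # emergency when the short-term average drifts more than
    # 2 slow standard deviations above the slow average
    sub['overload'] = fast_ewma > slow_ewma + 2 * slow_std
    return sub

flagged = df.groupby(['host', 'operation']).apply(flag_overload)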