I have grouped the following DataFrame by the host and operation columns:
df
Out[163]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 10069 to 1003
Data columns (total 8 columns):
args 100 non-null values
host 100 non-null values
kwargs 100 non-null values
log_timestamp 100 non-null values
operation 100 non-null values
thingy 100 non-null values
status 100 non-null values
time 100 non-null values
dtypes: float64(1), int64(2), object(5)
g = df.groupby(['host','operation'])
g
Out[165]: <pandas.core.groupby.DataFrameGroupBy object at 0x7f46ec731890>
g.groups.keys()[:10]
Out[166]:
[('yy39.segm1.org', 'gtfull'),
('yy39.segm1.org', 'updateWidg'),
('yy36.segm1.org', 'notifyTestsDelivered'),
('yy32.segm1.org', 'notifyTestsDelivered'),
('yy20.segm1.org', 'gSettings'),
('yy32.segm1.org', 'x_gWidgboxParams'),
('yy39.segm1.org', 'clearElems'),
('yy3.segm1.org', 'gxyzinf'),
('yy34.segm1.org', 'setFlagsOneWidg'),
('yy13.segm1.org', 'x_gbinf')]
Now I need to get an individual DataFrame for each ('host', 'operation') pair. I can do it by iterating over the group keys:
for el in g.groups.keys():
...: print el, 'VALUES', g.groups[el]
...:
('yy25.segm1.org', 'x_gbinf') VALUES [10021]
('yy36.segm1.org', 'gxyzinf') VALUES [10074, 10085]
('yy25.segm1.org', 'updateWidg') VALUES [10022]
('yy25.segm1.org', 'gtfull') VALUES [10019]
('yy16.segm1.org', 'gxyzinf') VALUES [10052, 10055, 10062, 10064]
('yy32.segm1.org', 'addWidging2') VALUES [10034]
('yy16.segm1.org', 'notifyTestsDelivered') VALUES [10056, 10065]
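As a side note, a single group can also be pulled out directly with get_group, which returns the sub-DataFrame for one key, e.g. using a key from the output above:

sub = g.get_group(('yy36.segm1.org', 'gxyzinf'))  # rows 10074, 10085 per the output above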
Questions:
Q1. Should I split the DataFrameGroupBy object into individual DataFrames, or is there a faster way of achieving the goal here?
Strategically: I need to calculate an exponentially weighted moving average (EWMA) and an exponentially weighted standard deviation (although the standard deviation should be attenuated much more slowly).
To this end, I need the data:
a. grouped by host, operation
b. each host/operation subset sorted by log_timestamp
c. EWMA and EWM std calculated for the time column.
Is there a way of achieving this without splitting the DataFrameGroupBy? (A sketch of roughly what I mean follows.)
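An untested sketch (pd.ewma/pd.ewmstd match the old pandas in my session; newer versions replace them with the .ewm() accessor, and the span values are placeholders):

import pandas as pd

def add_ewm(sub):
    # b. sort each host/operation subset by log_timestamp
    sub = sub.sort('log_timestamp')  # sort_values() in newer pandas
    # c. EWMA and EWM std of the time column; the std span is larger,
    #    so it is attenuated more slowly
    sub['time_ewma'] = pd.ewma(sub['time'], span=10)
    sub['time_ewmstd'] = pd.ewmstd(sub['time'], span=50)
    return sub

# a. grouped by host, operation -- apply runs per group, no manual split
result = df.groupby(['host', 'operation']).apply(add_ewm)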
Q2. The goal is to signal when the time for a particular host/operation becomes abnormal over the last several minutes (an overload condition). My idea: if I calculate a 'slow' EWM std and a 'slow' EWMA (over a longer period of time, say 1 hour), then the short-term EWMA (say, over 5 minutes) could be interpreted as an emergency value if it is more than 2 slow standard deviations away from the slow EWMA (along the lines of the three-sigma rule). I'm not even sure whether this is a correct or the best approach. Is it?
It may be, since it is roughly similar to how the UNIX 1m, 5m, and 15m load averages work: if the 15m average is normal but the 1m load average is much higher, you know the load has recently been much higher than usual. But I'm not sure about that.
The general pattern would presumably be something like (my_calc being a placeholder for the per-group calculation):

df.groupby(['host', 'operation']).apply(lambda x: my_calc(x))
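For instance, for Q2, an untested sketch of the alarm rule (the spans and the 2-sigma threshold are placeholders; with irregular timestamps the spans would have to be chosen in terms of samples per hour/minute, or expressed as time-based halflives in newer pandas):

import pandas as pd

def flag_overload(sub):
    sub = sub.sort('log_timestamp')  # sort_values() in newer pandas
    slow_ewma = pd.ewma(sub['time'], span=720)  # roughly 1 hr of samples
    slow_std = pd.ewmstd(sub['time'], span=720)
    fast_ewma = pd.ewma(sub['time'], span=60)   # roughly 5 min of samples
    # emergency when the short-term average drifts more than
    # 2 slow standard deviations above the slow average
    sub['overload'] = fast_ewma > slow_ewma + 2 * slow_std
    return sub

flagged = df.groupby(['host', 'operation']).apply(flag_overload)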