
I'm new to both Python and pandas, and after trying a few approaches, I was hoping to elicit some suggestions on the best way to structure this dataset, given the goals of my analysis.

Given the following DataFrame:

id          event          timestamp
1   "page 1 load"   1/1/2014 0:00:01
1   "page 1 exit"   1/1/2014 0:00:31
2   "page 2 load"   1/1/2014 0:01:01
2   "page 2 exit"   1/1/2014 0:01:31
3   "page 3 load"   1/1/2014 0:02:01
3   "page 3 exit"   1/1/2014 0:02:31
4   "page 1 load"   2/1/2014 1:00:01
4   "page 1 exit"   2/1/2014 1:00:31
5   "page 2 load"   2/1/2014 1:01:01
5   "page 2 exit"   2/1/2014 1:01:31
6   "page 3 load"   2/1/2014 1:02:01
6   "page 3 exit"   2/1/2014 1:02:31
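
For reference, a frame in this shape can be built directly (a sketch; dates are read as month/day, matching the answer's output below):

```python
import pandas as pd

# Build the sample session log; timestamps are parsed explicitly,
# since constructing from strings would otherwise leave them as objects.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "event": ["page 1 load", "page 1 exit", "page 2 load", "page 2 exit",
              "page 3 load", "page 3 exit", "page 1 load", "page 1 exit",
              "page 2 load", "page 2 exit", "page 3 load", "page 3 exit"],
    "timestamp": pd.to_datetime([
        "2014-01-01 00:00:01", "2014-01-01 00:00:31",
        "2014-01-01 00:01:01", "2014-01-01 00:01:31",
        "2014-01-01 00:02:01", "2014-01-01 00:02:31",
        "2014-02-01 01:00:01", "2014-02-01 01:00:31",
        "2014-02-01 01:01:01", "2014-02-01 01:01:31",
        "2014-02-01 01:02:01", "2014-02-01 01:02:31",
    ]),
})
```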

The goal here would be to calculate the time elapsed from a load to an exit. However, I first need to validate that the load and exit timestamps are indeed from the same session (id) before computing the time elapsed. The approach I'm considering is to process the source dataset into a new DataFrame in which each row combines an already-validated load/exit pair, with an added elapsed column to make computation and grouping easier, like this:

id      event_1          timestamp_1        event_2       timestamp_2   elapsed
1   "page 1 load"   1/1/2014 0:00:01    "page 1 exit"    1/1/2014 0:00:31   0:00:30
2   "page 2 load"   1/1/2014 0:01:01    "page 2 exit"    1/1/2014 0:01:31   0:00:30
3   "page 3 load"   1/1/2014 0:02:01    "page 3 exit"    1/1/2014 0:02:31   0:00:30

Is this a good approach? If so, what are the best methods to create this new DataFrame?

1 Answer 1


I would consider breaking event into separate columns, e.g. using str.extract:

In [11]: df['event'].str.extract(r'page (?P<page>\d+) (?P<event>\w+)')
Out[11]: 
   page event
0     1  load
1     1  exit
2     2  load
3     2  exit
4     3  load
5     3  exit
6     1  load
7     1  exit
8     2  load
9     2  exit
10    3  load
11    3  exit

[12 rows x 2 columns]

If you set this to the columns:

In [12]: df[['page', 'event']] = df['event'].str.extract(r'page (?P<page>\d+) (?P<event>\w+)')

In [13]: df
Out[13]: 
    id event           timestamp page
0    1  load 2014-01-01 00:00:01    1
1    1  exit 2014-01-01 00:00:31    1
2    2  load 2014-01-01 00:01:01    2
3    2  exit 2014-01-01 00:01:31    2
4    3  load 2014-01-01 00:02:01    3
5    3  exit 2014-01-01 00:02:31    3
6    4  load 2014-02-01 01:00:01    1
7    4  exit 2014-02-01 01:00:31    1
8    5  load 2014-02-01 01:01:01    2
9    5  exit 2014-02-01 01:01:31    2
10   6  load 2014-02-01 01:02:01    3
11   6  exit 2014-02-01 01:02:31    3

[12 rows x 4 columns]

I think this is now much easier to reason about...

Then you can groupby the session id and the page:

In [14]: g = df.groupby(['id', 'page'])  # perhaps use as_index=False
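
If you'd rather validate up front than inside the apply below, you could check the group sizes first (a sketch, using a frame in the post-extract shape where the second session is missing its exit):

```python
import pandas as pd

# Post-extract shape: id, page, event, timestamp.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "page": ["1", "1", "2"],
    "event": ["load", "exit", "load"],
    "timestamp": pd.to_datetime([
        "2014-01-01 00:00:01", "2014-01-01 00:00:31", "2014-01-01 00:01:01",
    ]),
})

# Every (id, page) group should contain exactly two rows: a load and an exit.
sizes = df.groupby(["id", "page"]).size()
unpaired = sizes[sizes != 2]
```

Here unpaired.index lists the sessions that fail validation, so you can inspect or drop them before pairing.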

One way to get the desired result is to use an apply e.g. via the following function:

def get_load_and_exit(x):
    # each (id, page) group should be exactly one load/exit pair
    assert len(x) == 2
    return pd.Series(x['timestamp'].values, x['event'])

In [16]: res = g.apply(get_load_and_exit)

In [17]: res
Out[17]: 
event                  load                exit
id page                                        
1  1    2014-01-01 00:00:01 2014-01-01 00:00:31
2  2    2014-01-01 00:01:01 2014-01-01 00:01:31
3  3    2014-01-01 00:02:01 2014-01-01 00:02:31
4  1    2014-02-01 01:00:01 2014-02-01 01:00:31
5  2    2014-02-01 01:01:01 2014-02-01 01:01:31
6  3    2014-02-01 01:02:01 2014-02-01 01:02:31

[6 rows x 2 columns]

Assuming these are Timestamps*, you can simply subtract the columns:

In [18]: res['duration'] = res['exit'] - res['load']

In [19]: res
Out[19]: 
event                  load                exit  duration
id page                                                  
1  1    2014-01-01 00:00:01 2014-01-01 00:00:31  00:00:30
2  2    2014-01-01 00:01:01 2014-01-01 00:01:31  00:00:30
3  3    2014-01-01 00:02:01 2014-01-01 00:02:31  00:00:30
4  1    2014-02-01 01:00:01 2014-02-01 01:00:31  00:00:30
5  2    2014-02-01 01:01:01 2014-02-01 01:01:31  00:00:30
6  3    2014-02-01 01:02:01 2014-02-01 01:02:31  00:00:30

[6 rows x 3 columns]

*It's worth ensuring the timestamp column is actually a datetime column (at the moment it's not!):

df['timestamp'] = pd.to_datetime(df['timestamp'])
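
As an aside, once the page/event columns are extracted, the same pairing can be done without a Python-level apply by unstacking the event level (a sketch, again assuming one load and one exit per session):

```python
import pandas as pd

# A small frame in the post-extract shape: id, page, event, timestamp.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "page": ["1", "1", "2", "2"],
    "event": ["load", "exit", "load", "exit"],
    "timestamp": pd.to_datetime([
        "2014-01-01 00:00:01", "2014-01-01 00:00:31",
        "2014-01-01 00:01:01", "2014-01-01 00:01:31",
    ]),
})

# unstack pivots the event level into columns, giving one row per
# (id, page) with 'load' and 'exit' timestamp columns.
res = df.set_index(["id", "page", "event"])["timestamp"].unstack("event")
res["duration"] = res["exit"] - res["load"]
```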

2 Comments

Thanks so much for the detailed response, it is very much appreciated! Yep, I'll ensure the timestamps are valid datetime objects (I think the read_csv method will automatically do this on read?), and good tip on the assert. There's good stuff to unpack here, so I'll get back with my progress and accept the answer.
You can use the parse_dates arg of read_csv (it's not automatic); check for datetime64 with df.dtypes.
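
For completeness, the parse_dates usage mentioned above looks like this (the inline CSV stands in for a real file):

```python
import io
import pandas as pd

# parse_dates tells read_csv which columns to convert to datetime64
# on read; without it, 'timestamp' would stay a plain string column.
csv = io.StringIO("id,event,timestamp\n1,page 1 load,1/1/2014 0:00:01\n")
df = pd.read_csv(csv, parse_dates=["timestamp"])
```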
