I'm new to both Python and pandas, and after trying out a few approaches, I was hoping to elicit some suggestions on the best way to structure this dataset, given the goals of my analysis.
Given the following DataFrame:
id event timestamp
1 "page 1 load" 1/1/2014 0:00:01
1 "page 1 exit" 1/1/2014 0:00:31
2 "page 2 load" 1/1/2014 0:01:01
2 "page 2 exit" 1/1/2014 0:01:31
3 "page 3 load" 1/1/2014 0:02:01
3 "page 3 exit" 1/1/2014 0:02:31
4 "page 1 load" 2/1/2014 1:00:01
4 "page 1 exit" 2/1/2014 1:00:31
5 "page 2 load" 2/1/2014 1:01:01
5 "page 2 exit" 2/1/2014 1:01:31
6 "page 3 load" 2/1/2014 1:02:01
6 "page 3 exit" 2/1/2014 1:02:31
The goal is to calculate the time elapsed from a load to an exit. However, I first need to validate that the load and exit timestamps are from the same session (id) before computing the elapsed time. The approach I am considering is to process the source dataset and create a new DataFrame in which each row combines an already validated load/exit pair and adds an elapsed column, which would make computation and grouping easier, like this:
id event_1 timestamp_1 event_2 timestamp_2 elapsed
1 "page 1 load" 1/1/2014 0:00:01 "page 1 exit" 1/1/2014 0:00:31 0:00:30
2 "page 2 load" 1/1/2014 0:01:01 "page 2 exit" 1/1/2014 0:01:31 0:00:30
3 "page 3 load" 1/1/2014 0:02:01 "page 3 exit" 1/1/2014 0:02:31 0:00:30
Is this a good approach? If so, what are the best methods to create this new DataFrame?
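For concreteness, here is a minimal sketch of the reshaping I have in mind. It is not a definitive solution and rests on a few assumptions of mine: the timestamps are month/day/year, every event string ends in either "load" or "exit", and each id has at most one load and one exit. The names `loads`, `exits`, and `paired` are just illustrative.

```python
import pandas as pd

# First three sessions from the sample data above.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "event": ["page 1 load", "page 1 exit",
              "page 2 load", "page 2 exit",
              "page 3 load", "page 3 exit"],
    "timestamp": ["1/1/2014 0:00:01", "1/1/2014 0:00:31",
                  "1/1/2014 0:01:01", "1/1/2014 0:01:31",
                  "1/1/2014 0:02:01", "1/1/2014 0:02:31"],
})

# Assumption: dates are month/day/year.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%m/%d/%Y %H:%M:%S")

# Split the rows by event type (assumes the event text ends in "load" or "exit").
loads = df[df["event"].str.endswith("load")]
exits = df[df["event"].str.endswith("exit")]

# An inner merge on id keeps only sessions that have both a load and an exit,
# which acts as the "same session" validation step.
paired = loads.merge(exits, on="id", suffixes=("_1", "_2"))

# Subtracting two datetime columns yields a timedelta column.
paired["elapsed"] = paired["timestamp_2"] - paired["timestamp_1"]

print(paired)
```

With this layout, any id that is missing its load or its exit simply drops out of `paired`, and `elapsed` can be grouped or aggregated like any other column.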