This was inspired by the "Aggregate loans report without using Python standard aggregate or group functions" question, but I decided to approach it using pandas.
To recap, sample input:
MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
Desired output:
Network,Product,Month\Year,Currency,Count
Network 1,Loan Product 1,03-16,1000,1
Network 2,Loan Product 1,03-16,1122,1
Network 2,Loan Product 1,04-16,5671,1
Network 3,Loan Product 2,03-16,5182,2
In other words, the task is to group the data from the input.csv file by Network, Product and the month+year of the Date column, then calculate the sum of the Amount column (reported as Currency in the output) while keeping track of the number of rows in each group.
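To make the month+year key concrete, here is a minimal sketch of how a single Date value maps to the Month\Year format, using only the standard library. The helper name month_year_key is hypothetical and exists only for this illustration; it also assumes the month abbreviations are parsed under an English (or default C) locale.

```python
from datetime import datetime


def month_year_key(date_text):
    """Convert a date like '12-Mar-2016' into a key like '03-16'."""
    # %b matches abbreviated month names; this is locale-dependent and
    # assumes English names such as 'Mar' and 'Apr'.
    parsed = datetime.strptime(date_text, "%d-%b-%Y")
    return parsed.strftime("%m-%y")


print(month_year_key("12-Mar-2016"))  # 03-16
print(month_year_key("01-Apr-2016"))  # 04-16
```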
I solved it by first creating a separate Month\Year column: the Date values are parsed into datetime objects and formatted as month-year strings. I then group by the desired columns using .groupby(), aggregate with sum and count, and rename the resulting columns to the desired names:
from datetime import datetime
import pandas as pd
df = pd.read_csv('input.csv')
df['Month\Year'] = df['Date'].apply(lambda s: datetime.strptime(s, "%d-%b-%Y").strftime('%m-%y'))
grouped = df.groupby(['Network', 'Product', 'Month\Year'])['Amount']
df = grouped.agg(['sum', 'count']).rename(columns={'sum': 'Currency', 'count': 'Count'}).reset_index()
df.to_csv('output.csv', index=False)
Is this the most efficient and readable pandas-based solution? Can it be improved further?
I am particularly unhappy with renaming the columns after aggregation: there should be a more straightforward way to aggregate into custom-named columns.
Comments:
If the Date column is a datetime, you could skip the added column and use Grouper, but that leaves you with datetime objects instead of strs. I don't really know a way to avoid the renaming at the end.
If Date is not a datetime, you can use pd.to_datetime() instead of the lambda and strptime.
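The pd.to_datetime() suggestion from the comments could look roughly like the sketch below. The named-aggregation call that sets the column names directly is not from the question or the comments; it is an assumption that requires pandas 0.25 or newer. The sample rows are inlined through io.StringIO so the snippet runs without input.csv.

```python
import io

import pandas as pd

# Sample input from the question, inlined so the sketch is self-contained.
CSV = """MSISDN,Network,Date,Product,Amount
1,Network 1,12-Mar-2016,Loan Product 1,1000
2,Network 2,16-Mar-2016,Loan Product 1,1122
3,Network 3,17-Mar-2016,Loan Product 2,2084
4,Network 3,18-Mar-2016,Loan Product 2,3098
5,Network 2,01-Apr-2016,Loan Product 1,5671
"""

df = pd.read_csv(io.StringIO(CSV))

# Vectorised parsing replaces the per-row lambda with strptime.
dates = pd.to_datetime(df["Date"], format="%d-%b-%Y")
df["Month\\Year"] = dates.dt.strftime("%m-%y")

# Named aggregation (assumption: pandas >= 0.25) names the output
# columns directly, so no rename step is needed afterwards.
report = (
    df.groupby(["Network", "Product", "Month\\Year"])["Amount"]
    .agg(Currency="sum", Count="count")
    .reset_index()
)

print(report.to_csv(index=False))
```

On the sample data this should reproduce the desired output rows. Whether it is more readable than the rename-based version is a matter of taste, and it is unavailable on pandas versions that predate named aggregation.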