groupby in pandas and plot

Question

I have a csv file that looks like this:

,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019-02-01,1
1,50,Technology,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017-01-17,0
2,43,Technology,Undergraduate,Referral,4,1.0,2,0,59143.5,Female,2012-08-27,1
3,44,Sales,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017-07-25,1
4,33,HR,Undergraduate,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019-05-17,1
5,40,Purchasing,Undergraduate,Walk-in,3,3.0,7,1,38177.1,Male,2004-04-22,1
6,26,Purchasing,Undergraduate,Referral,5,5.0,2,0,78075.0,Male,2019-12-10,1
7,25,Technology,Undergraduate,Recruitment Agency,1,1.0,4,0,21668.4,Female,2017-03-18,0
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015-04-02,1
9,45,Technology,Postgraduate,Referral,3,3.0,9,0,38177.1,Female,2004-03-19,0
10,31,Marketing,Undergraduate,Walk-in,4,4.0,6,0,59143.5,Male,2009-01-24,1
11,43,Technology,Postgraduate,Recruitment Agency,2,1.0,9,1,26824.5,Male,2016-03-10,1
12,28,Technology,Undergraduate,On-Campus,3,4.0,0,0,38177.1,Female,2013-04-24,0
13,48,Purchasing,Postgraduate,Referral,3,4.0,8,0,38177.1,Male,2010-07-25,1
14,52,Purchasing,Postgraduate,Recruitment Agency,5,1.0,7,0,78075.0,Male,2018-02-07,1
15,50,Purchasing,Undergraduate,Recruitment Agency,5,5.0,6,0,78075.0,Male,2014-04-24,1
16,34,Marketing,Postgraduate,On-Campus,1,4.0,9,0,21668.4,Male,2014-12-10,0
17,24,Purchasing,Undergraduate,Recruitment Agency,4,4.0,6,0,59143.5,Female,2018-02-18,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014-05-07,1
19,25,Sales,Undergraduate,Recruitment Agency,5,4.0,4,0,78075.0,Male,2012-02-15,1
20,35,HR,Undergraduate,On-Campus,2,4.0,4,0,26824.5,Female,2008-01-15,1
21,50,HR,Postgraduate,Referral,5,4.0,0,0,78075.0,Male,2015-04-13,1
22,34,Purchasing,Postgraduate,Referral,4,2.0,7,1,59143.5,Male,2013-07-02,1
23,37,Sales,Undergraduate,Recruitment Agency,5,5.0,0,1,78075.0,Male,2016-03-22,1
24,31,Sales,Postgraduate,Walk-in,4,4.0,3,1,59143.5,Female,2006-09-05,1
25,53,Sales,Postgraduate,Walk-in,4,5.0,8,1,59143.5,Female,2005-10-08,1
26,45,Marketing,Undergraduate,Walk-in,4,3.0,8,0,59143.5,Male,2008-01-08,1
27,40,Purchasing,Undergraduate,Walk-in,4,3.0,4,1,59143.5,Female,2005-11-19,0

The question that should be answered is how many people are recruited per department as a function of time. This should be shown in a line chart.

This was my solution:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("employees_satisfaction_transformed.csv", index_col=0)
recruitment_groups = df.groupby("recruitment_type")
campus = recruitment_groups.get_group("On-Campus")["entry_date"]
walk_in = recruitment_groups.get_group("Walk-in")["entry_date"]
referral = recruitment_groups.get_group("Referral")["entry_date"]
agency = recruitment_groups.get_group("Recruitment Agency")["entry_date"]

campus = campus.sort_values().reset_index()
campus['index'] = campus.index

walk_in = walk_in.sort_values().reset_index()
walk_in['index'] = walk_in.index

referral = referral.sort_values().reset_index()
referral['index'] = referral.index

agency = agency.sort_values().reset_index()
agency['index'] = agency.index

plt.plot(campus['entry_date'], campus['index'], label="campus")
plt.plot(walk_in['entry_date'], walk_in['index'], label="walk_in")
plt.plot(referral['entry_date'], referral['index'], label="referral")
plt.plot(agency['entry_date'], agency['index'], label="agency")
plt.legend(loc='best')
plt.show()

I'm sort of new to pandas so any critique is welcome.

tdy · Accepted Answer · 2023-07-19 18:38:49Z

J_H's advice about DRY is good for Python in general.

However, for pandas in particular, we should almost never iterate. Instead, we chain methods that operate on the calling object.

Here your repeated grouping/sorting/counting/plotting calls can be reduced to two groupby methods, groupby.cumcount and groupby.plot, which gives a more idiomatic DRY version in pandas:

import pandas as pd
import matplotlib.pyplot as plt

# format the dates properly using `parse_dates`
df = pd.read_csv('employees_satisfaction_transformed.csv', parse_dates=['entry_date'], index_col=0)

# sort once
df = df.sort_values('entry_date')

# count recruits by type using `groupby.cumcount`
df['count'] = df.groupby('recruitment_type').cumcount()

# plot count vs date using `groupby.plot`
df.set_index('entry_date').groupby('recruitment_type')['count'].plot(legend=True)
plt.show()
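
To see what groupby.cumcount actually produces, here is a minimal sketch on a handful of made-up rows. The frame `toy` and its values are hypothetical and only mimic the shape of the question's data.

```python
import pandas as pd

# Hypothetical rows in the same shape as the question's data.
toy = pd.DataFrame({
    "recruitment_type": ["Referral", "Walk-in", "Referral", "Walk-in", "Referral"],
    "entry_date": pd.to_datetime(
        ["2012-08-27", "2004-04-22", "2019-02-01", "2005-11-19", "2015-04-02"]
    ),
})

# Sort by date first so the running count follows time order.
toy = toy.sort_values("entry_date")

# Number each row within its recruitment type, starting at 0.
toy["count"] = toy.groupby("recruitment_type").cumcount()

referral_counts = toy.loc[toy["recruitment_type"] == "Referral", "count"].tolist()
walk_in_counts = toy.loc[toy["recruitment_type"] == "Walk-in", "count"].tolist()
```

Note that cumcount is zero-based, like the index column in the question's code. Adding 1 would turn it into a headcount.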

J_H · Accepted Answer · 2023-05-29 04:23:23Z

break out helpers

You serially assign these four variables:

campus
walk_in
referral
agency

This code is crying out for a helper function that you then call in a loop over those four recruitment types. The helper would include the whole sort / reset step.

There's an opportunity for a for loop to do some plotting, but that's a separate item.
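
One way the helper and the loop could look is sketched below. The names `hires_over_time`, `plot_hires` and `SAMPLE_CSV` are hypothetical, and the inline sample stands in for the real CSV so the sketch runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample: a few rows in the same shape as the question's CSV.
SAMPLE_CSV = """\
,recruitment_type,entry_date
0,Referral,2019-02-01
1,Recruitment Agency,2017-01-17
2,Referral,2012-08-27
3,On-Campus,2017-07-25
4,Walk-in,2004-04-22
"""


def hires_over_time(df: pd.DataFrame, recruitment_type: str) -> pd.DataFrame:
    """Return one recruitment type's entry dates, sorted, with a running count."""
    dates = df.loc[df["recruitment_type"] == recruitment_type, "entry_date"]
    out = dates.sort_values().reset_index(drop=True).to_frame()
    # Same zero-based running count as the question's `index` column.
    out["count"] = out.index
    return out


def plot_hires(df: pd.DataFrame) -> None:
    """Draw one line per recruitment type (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt

    for recruitment_type in sorted(df["recruitment_type"].unique()):
        hires = hires_over_time(df, recruitment_type)
        plt.plot(hires["entry_date"], hires["count"], label=recruitment_type)
    plt.legend(loc="best")
    plt.show()


sample = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0, parse_dates=["entry_date"])
referral = hires_over_time(sample, "Referral")
```

Calling plot_hires(sample) would draw the chart. The loop picks up whatever recruitment types appear in the data, so no type has to be named by hand.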

define function

You created a bunch of top-level global variables. To reduce coupling, define them within a function, perhaps def main():, so they go out of scope once the function exits and don't pollute the global namespace.
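
A minimal sketch of that idea follows. It assumes an in-memory sample (`SAMPLE_CSV`, hypothetical) in place of the real file, and the body of main() is illustrative rather than the full plotting script.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file, so the sketch runs on its own.
SAMPLE_CSV = """\
,recruitment_type,entry_date
0,Referral,2019-02-01
1,On-Campus,2017-07-25
2,Referral,2012-08-27
"""


def main(source) -> int:
    """Load the data and do the work; every name below is local to main()."""
    df = pd.read_csv(source, index_col=0, parse_dates=["entry_date"])
    referral_dates = df.loc[
        df["recruitment_type"] == "Referral", "entry_date"
    ].sort_values()
    print(f"{len(referral_dates)} referral hires, earliest {referral_dates.iloc[0].date()}")
    return len(df)


row_count = main(io.StringIO(SAMPLE_CSV))

# `df` and `referral_dates` went out of scope when main() returned.
leaked = [name for name in ("df", "referral_dates") if name in globals()]
```

In a real script the call would usually sit under an `if __name__ == "__main__":` guard and pass the CSV path instead of the sample.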
