7
\$\begingroup\$

I have a csv file that looks like this:

,age,department,education,recruitment_type,job_level,rating,awards,certifications,salary,gender,entry_date,satisfied
0,28,HR,Postgraduate,Referral,5,2.0,1,0,78075.0,Male,2019-02-01,1
1,50,Technology,Postgraduate,Recruitment Agency,3,5.0,2,1,38177.1,Male,2017-01-17,0
2,43,Technology,Undergraduate,Referral,4,1.0,2,0,59143.5,Female,2012-08-27,1
3,44,Sales,Postgraduate,On-Campus,2,3.0,0,0,26824.5,Female,2017-07-25,1
4,33,HR,Undergraduate,Recruitment Agency,2,1.0,5,0,26824.5,Male,2019-05-17,1
5,40,Purchasing,Undergraduate,Walk-in,3,3.0,7,1,38177.1,Male,2004-04-22,1
6,26,Purchasing,Undergraduate,Referral,5,5.0,2,0,78075.0,Male,2019-12-10,1
7,25,Technology,Undergraduate,Recruitment Agency,1,1.0,4,0,21668.4,Female,2017-03-18,0
8,35,HR,Postgraduate,Referral,3,4.0,0,0,38177.1,Female,2015-04-02,1
9,45,Technology,Postgraduate,Referral,3,3.0,9,0,38177.1,Female,2004-03-19,0
10,31,Marketing,Undergraduate,Walk-in,4,4.0,6,0,59143.5,Male,2009-01-24,1
11,43,Technology,Postgraduate,Recruitment Agency,2,1.0,9,1,26824.5,Male,2016-03-10,1
12,28,Technology,Undergraduate,On-Campus,3,4.0,0,0,38177.1,Female,2013-04-24,0
13,48,Purchasing,Postgraduate,Referral,3,4.0,8,0,38177.1,Male,2010-07-25,1
14,52,Purchasing,Postgraduate,Recruitment Agency,5,1.0,7,0,78075.0,Male,2018-02-07,1
15,50,Purchasing,Undergraduate,Recruitment Agency,5,5.0,6,0,78075.0,Male,2014-04-24,1
16,34,Marketing,Postgraduate,On-Campus,1,4.0,9,0,21668.4,Male,2014-12-10,0
17,24,Purchasing,Undergraduate,Recruitment Agency,4,4.0,6,0,59143.5,Female,2018-02-18,1
18,54,HR,Postgraduate,On-Campus,1,5.0,4,0,21668.4,Female,2014-05-07,1
19,25,Sales,Undergraduate,Recruitment Agency,5,4.0,4,0,78075.0,Male,2012-02-15,1
20,35,HR,Undergraduate,On-Campus,2,4.0,4,0,26824.5,Female,2008-01-15,1
21,50,HR,Postgraduate,Referral,5,4.0,0,0,78075.0,Male,2015-04-13,1
22,34,Purchasing,Postgraduate,Referral,4,2.0,7,1,59143.5,Male,2013-07-02,1
23,37,Sales,Undergraduate,Recruitment Agency,5,5.0,0,1,78075.0,Male,2016-03-22,1
24,31,Sales,Postgraduate,Walk-in,4,4.0,3,1,59143.5,Female,2006-09-05,1
25,53,Sales,Postgraduate,Walk-in,4,5.0,8,1,59143.5,Female,2005-10-08,1
26,45,Marketing,Undergraduate,Walk-in,4,3.0,8,0,59143.5,Male,2008-01-08,1
27,40,Purchasing,Undergraduate,Walk-in,4,3.0,4,1,59143.5,Female,2005-11-19,0

The question that should be answered is how many people are recruited per department as a function of time. This should be shown in a line chart.

This was my solution:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("employees_satisfaction_transformed.csv", index_col=0)
recruitment_groups = df.groupby("recruitment_type")
campus = recruitment_groups.get_group("On-Campus")["entry_date"]
walk_in = recruitment_groups.get_group("Walk-in")["entry_date"]
referral = recruitment_groups.get_group("Referral")["entry_date"]
agency = recruitment_groups.get_group("Recruitment Agency")["entry_date"]

campus = campus.sort_values().reset_index()
campus['index'] = campus.index

walk_in = walk_in.sort_values().reset_index()
walk_in['index'] = walk_in.index

referral = referral.sort_values().reset_index()
referral['index'] = referral.index

agency = agency.sort_values().reset_index()
agency['index'] = agency.index

plt.plot(campus['entry_date'], campus['index'], label="campus")
plt.plot(walk_in['entry_date'], walk_in['index'], label="walk_in")
plt.plot(referral['entry_date'], referral['index'], label="referral")
plt.plot(agency['entry_date'], agency['index'], label="agency")
plt.legend(loc='best')
plt.show()

I'm sort of new to pandas so any critique is welcome.

\$\endgroup\$

2 Answers 2

5
\$\begingroup\$

J_H's advice about DRY is good for Python in general.

However for Pandas in particular, we should almost never iterate. Instead we chain methods that operate on the calling object.

Here your repeated grouping/sorting/counting/plotting calls can be reduced to two groupby methods:

So this is a more idiomatic DRY version in Pandas:

# format the dates properly using `parse_dates`
df = pd.read_csv('employees_satisfaction_transformed.csv', parse_dates=['entry_date'], index_col=0)

# sort once
df = df.sort_values('entry_date')

# count recruits by type using `groupby.cumcount`
df['count'] = df.groupby('recruitment_type').cumcount()

# plot count vs date using `groupby.plot`
df.set_index('entry_date').groupby('recruitment_type')['count'].plot(legend=True)

output of revised code

\$\endgroup\$
3
\$\begingroup\$

break out helpers

You serially assign these four variables:

  • campus
  • walk_in
  • referral
  • agency

This code is just crying out to you to define a helper method and then iterate over those four columns. Which would include the whole sort / reset thing.

There's an opportunity for a for loop to do some plotting, but that's a separate item.


define function

You created a bunch of top-level global variables. To reduce coupling define them within a function, perhaps def main():, so they go out of scope once the function exits and then they won't pollute the global namespace.

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.