\$\begingroup\$

I'm trying to build a function that identifies those who are promoted into a list of jobcodes, or are promoted within that list of jobcodes.

Initially I was using np.where() until I realized it would actually capture those who were demoted as well.

Here's what I'm currently working with. It works, but I'm slightly unhappy with how it looks. Does anyone have a different technique or approach for solving a problem like this?

Mostly curious if anyone has any criticism, thanks!

The idea is that anyone whose current paygroup is in "promotions" should be flagged if they have either:

A) moved into it from a paygroup that isn't in the tuple
OR
B) moved up the hierarchy within the tuple, where AGM4 < GM2 < ADO3

def index_checker(cur,prev):
    promotions = ("AGM4", "GM2","ADO3")
    if cur not in promotions:
        return False
    else:
        return promotions.index(cur) > promotions.index(prev) if prev in promotions and cur in promotions else True
 
df["Promoted"] = np.vectorize(index_checker)(df["PayGroup_cur"].values, df["PayGroup_prev"].values)
df[df["Promoted"]==True].to_csv(r"location.csv")

My earlier np.where() approach, shown below, didn't work because it would count someone who moved from ADO3 to AGM4 as promoted. I tried to add the index-checker logic to the condition, but I kept running into broadcasting shape errors and truth-value ambiguity errors:

promotions = ("AGM4", "GM2","ADO3")
condition = (np.isin(df["PayGroup_cur"].values,promotions) & (df["PayGroup_cur"].values != df["PayGroup_prev"].values) & (promotions.index(df["PayGroup_cur"].values) > promotions.index(df["PayGroup_cur"].values)))
df["Promoted"] = np.where(condition, "Promoted", "Not Promoted")
df[df["Promoted"]=="Promoted"]
\$\endgroup\$
  • \$\begingroup\$ If the first approach didn't work because of false positives and the second approach doesn't work due to shapes and ambiguities, then this isn't on-topic - we are only able to review working code. \$\endgroup\$ Commented Sep 16, 2024 at 23:18
  • \$\begingroup\$ I thought their first approach did work: "Here's what I'm currently working with which works but I'm slightly unhappy with how it looks / appears" \$\endgroup\$ Commented Sep 16, 2024 at 23:28

1 Answer
\$\begingroup\$

The main issue with NumPy's vectorize function is that it's not actually vectorized. It's an unfortunate misnomer, as the NumPy documentation itself notes:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
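
To make that concrete, here is a minimal sketch comparing np.vectorize with an explicit Python loop over the same function from the question. The sample arrays are made up for illustration; the point is only that both paths call the Python function element by element and give the same answer.

```python
import numpy as np

def index_checker(cur, prev):
    promotions = ("AGM4", "GM2", "ADO3")
    if cur not in promotions:
        return False
    # Arriving from outside the list counts as a promotion.
    if prev not in promotions:
        return True
    # Within the list, only a move up the hierarchy counts.
    return promotions.index(cur) > promotions.index(prev)

# Hypothetical sample data, not taken from the real DataFrame.
cur = np.array(["AGM4", "GM2", "ADO3", "???"])
prev = np.array(["---", "ADO3", "AGM4", "AGM4"])

# np.vectorize wraps the function but still calls it once per element.
via_vectorize = np.vectorize(index_checker)(cur, prev)

# The explicit loop does the same work.
via_loop = np.array([index_checker(c, p) for c, p in zip(cur, prev)])

print(via_vectorize.tolist())
print(via_loop.tolist())
```

So np.vectorize mainly buys broadcasting and a tidier call site, not speed.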

Here, a simple way to actually vectorize is to map the jobcodes to numerical ranks and compare the ranks (assuming promotions is ordered from lowest to highest, which is the case in your example):

ranks = dict(zip(promotions, range(len(promotions))))
df['Promoted'] = df['PayGroup_cur'].map(ranks) > df['PayGroup_prev'].map(ranks)

If promotions is large, then vectorize that as well using Series instead of dict-zip:

ranks = pd.Series(range(len(promotions)), index=promotions)  # ~40x faster given 10K jobcodes
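
As a rough sketch of how the Series version slots in: Series.map accepts a Series as the mapping and looks each value up by index label, so the comparison line stays the same. The small DataFrame here is hypothetical sample data.

```python
import pandas as pd

promotions = ("AGM4", "GM2", "ADO3")

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "PayGroup_prev": ["---", "ADO3", "AGM4", "AGM4", "AGM4", "AGM4", "ADO3"],
    "PayGroup_cur": ["AGM4", "GM2", "ADO3", "???", "AGM4", "GM2", "ADO3"],
})

# Rank lookup as a Series: the index holds the jobcode, the value holds its rank.
ranks = pd.Series(range(len(promotions)), index=promotions)

# Jobcodes missing from the index map to NaN, and comparisons with NaN are False.
df["Promoted"] = df["PayGroup_cur"].map(ranks) > df["PayGroup_prev"].map(ranks)
print(df)
```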

Concrete example:

import pandas as pd

promotions = ('AGM4', 'GM2', 'ADO3')
df = pd.DataFrame({"PayGroup_prev": ["---", "ADO3", "AGM4", "AGM4", "AGM4", "AGM4", "ADO3"], "PayGroup_cur": ["AGM4", "GM2", "ADO3", "???", "AGM4", "GM2", "ADO3"]})
#   PayGroup_prev  PayGroup_cur
# 0           ---          AGM4
# 1          ADO3           GM2
# 2          AGM4          ADO3
# 3          AGM4           ???
# 4          AGM4          AGM4
# 5          AGM4           GM2
# 6          ADO3          ADO3

ranks = dict(zip(promotions, range(len(promotions))))
# {'AGM4': 0, 'GM2': 1, 'ADO3': 2}

df['PayGroup_prev_rank'] = df['PayGroup_prev'].map(ranks)
df['PayGroup_cur_rank'] = df['PayGroup_cur'].map(ranks)

df['Promoted'] = df['PayGroup_cur_rank'] > df['PayGroup_prev_rank']
#   PayGroup_prev  PayGroup_cur  PayGroup_prev_rank  PayGroup_cur_rank  Promoted
# 0           ---          AGM4                 NaN                0.0     False
# 1          ADO3           GM2                 2.0                1.0     False
# 2          AGM4          ADO3                 0.0                2.0      True
# 3          AGM4           ???                 0.0                NaN     False
# 4          AGM4          AGM4                 0.0                0.0     False
# 5          AGM4           GM2                 0.0                1.0      True
# 6          ADO3          ADO3                 2.0                2.0     False
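
One caveat with the comparison above: a previous paygroup outside promotions maps to NaN, and any comparison with NaN is False, so row 0 is not flagged. If moves into the list from outside should count as promotions, as the question describes, one possible adjustment is to treat a missing previous rank as lower than every listed rank. This is a sketch under that assumption, using the same sample data, and the -1 sentinel is my choice rather than anything from your code:

```python
import pandas as pd

promotions = ("AGM4", "GM2", "ADO3")
df = pd.DataFrame({
    "PayGroup_prev": ["---", "ADO3", "AGM4", "AGM4", "AGM4", "AGM4", "ADO3"],
    "PayGroup_cur": ["AGM4", "GM2", "ADO3", "???", "AGM4", "GM2", "ADO3"],
})

ranks = {code: rank for rank, code in enumerate(promotions)}

# Current paygroups outside the list stay NaN, so they are never flagged.
cur_rank = df["PayGroup_cur"].map(ranks)

# Assumption: a previous paygroup outside the list ranks below every listed one.
prev_rank = df["PayGroup_prev"].map(ranks).fillna(-1)

df["Promoted"] = cur_rank > prev_rank
print(df)
```

With this change row 0 ("---" to AGM4) is flagged, while row 3 (AGM4 to "???") still is not, which matches what index_checker returns for the same inputs.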
\$\endgroup\$
  • \$\begingroup\$ Keep in mind that dict/zip aren't vectorised, either. \$\endgroup\$ Commented Sep 16, 2024 at 23:14
  • \$\begingroup\$ Fair enough. I added a Series alternative if the promotions length is quite long. \$\endgroup\$ Commented Sep 16, 2024 at 23:26
