Question: Was using get_dummies a good choice for converting categorical strings?
I used get_dummies to convert categorical variables into dummy / indicator variables for a cold start recommender system. It only uses categorical information with a limited set of choices.
The code works and the output seems good. This is my first data science project, and I'm doing it for fun. I put it together from reading documentation and searching Stack Overflow. I plan on making it a hybrid recommender system soon by adding sentiment analysis and topic classification, both of which I also recently finished.
To check another character, just enter a different character name in userInput. The full notebook and Excel sheet for import are on my GitHub.
I would be grateful if someone could comment on whether or not I was able to achieve the following goals (and of course if not, what I can improve):
- Code structure,
- Style and readability: Is the code comprehensible?
- Are there any bad practices?
These are the attributes in my code:
Character's name (must be unique)
herotype (must be one of the following choices)
- Bard
- Sorcerer
- Paladin
- Rogue
- Druid
weapons (can have one or more of the following choices)
- Dagger
- sling
- club
- light crossbow
- battleaxe
- Greataxe
spells (can have one or more of the following choices)
- Transmutation
- Enchantment
- Necromancy
- Abjuration
- Conjuration
- Evocation
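As a minimal sketch of how these attributes turn into indicator columns, the toy rows below (copied from the sample output further down in this post, not the full spreadsheet) are encoded with `Series.str.get_dummies`. It also illustrates one thing to watch for: the multi-valued cells appear to be separated by a comma plus a space, so splitting on `','` alone would produce a separate `' Evocation'` column with a leading space. Using `sep=', '` is an assumption about how the sheet is formatted.

```python
import pandas as pd

# Toy rows taken from the sample output shown in this post (not the full sheet).
toy = pd.DataFrame({
    "name": ["Irv", "yac", "Zed Ryley"],
    "spells": ["Conjuration", "Conjuration, Evocation, Transmutation", "Evocation"],
})

# Splitting on a bare comma keeps the space that follows it, so the same
# spell can land in two different columns ("Evocation" and " Evocation").
naive = toy["spells"].str.get_dummies(sep=",")

# Assumption: values are separated by ", " (comma plus space).
clean = toy["spells"].str.get_dummies(sep=", ")

print(list(naive.columns))
print(list(clean.columns))
```

If the separator in the real sheet is inconsistent, normalizing the strings first (for example with `str.replace`) before encoding may be the safer route.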
Input and Output
Get Another Recommendation
To get another recommendation, change the name 'Irv' in userInput to another one, such as 'Zed Ryley'.
userInput = [
{'name':'Irv', 'rating':1}
]
The results come back formatted like this:
    name              herotype  weapons                    spells
28  Irv               Sorcerer  light crossbow             Conjuration
9   yac               Sorcerer  Greataxe                   Conjuration, Evocation, Transmutation
18  Traubon Durthane  Sorcerer  light crossbow             Evocation, Transmutation, Necromancy
8   wuc               Sorcerer  light crossbow, battleaxe  Necromancy
1   niem              Sorcerer  light crossbow, battleaxe  Necromancy
23  Zed Ryley         Sorcerer  sling                      Evocation
For comparison, here are the scores that show how it ranks the results.
In [5]:
recommendationTable_df.head(6)
Out[5]:
28 1.000000
9 0.666667
18 0.666667
8 0.666667
1 0.666667
23 0.333333
dtype: float64
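These scores can be read as the fraction of the input character's attributes that each candidate shares. As a rough worked example, assuming every attribute value is one indicator column and the single input rating is 1, the sketch below recomputes three of the scores with plain sets. The helper name `overlap_score` is made up for illustration.

```python
# Attribute sets copied from the result table above.
irv = {"Sorcerer", "light crossbow", "Conjuration"}
yac = {"Sorcerer", "Greataxe", "Conjuration", "Evocation", "Transmutation"}
zed = {"Sorcerer", "sling", "Evocation"}

def overlap_score(profile, candidate):
    """Share of the profile's attributes that the candidate also has."""
    return len(profile & candidate) / len(profile)

print(round(overlap_score(irv, irv), 6))  # 1.0
print(round(overlap_score(irv, yac), 6))  # 0.666667
print(round(overlap_score(irv, zed), 6))  # 0.333333
```

The denominator depends only on the input character, so a candidate with many extra weapons or spells is not penalized for them.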
Code
#imports
import pandas as pd
import numpy as np
df = pd.read_excel('dnd-dataframe.xlsx', sheet_name=0, usecols=['name', 'weapons','herotype','spells'])
df.head(30)
dummies1 = df['weapons'].str.get_dummies(sep=',')
dummies2 = df['spells'].str.get_dummies(sep=',')
dummies3 = df['herotype'].str.get_dummies(sep=',')
genre_data = pd.concat([df, dummies1,dummies2, dummies3], axis=1)
userInput = [
{'name':'Irv', 'rating':1} #There is no rating system being used, so by default rating is set to 1
]
inputname = pd.DataFrame(userInput)
inputId = df[df['name'].isin(inputname['name'].tolist())]
#Then merging it so we can get the name. It's implicitly merging on the shared 'name' column.
inputname = pd.merge(inputId, inputname)
#Dropping information we won't use from the input dataframe
inputname = inputname.drop(columns=['weapons', 'spells', 'herotype'])
#Filtering out the names from the input
username = genre_data[genre_data['name'].isin(inputname['name'].tolist())]
#Resetting the index to avoid future issues
username = username.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = username.drop(columns=['name', 'weapons', 'spells', 'herotype'])
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputname['rating'])
genreTable = genre_data.copy()
genreTable = genreTable.drop(columns=['name', 'weapons', 'spells', 'herotype'])
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#df.loc[df.index.isin(recommendationTable_df.head(3).keys())] #adjust the value of 3 here
df.loc[recommendationTable_df.head(6).index, :]
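One possible way to tighten the structure, offered as a sketch rather than a drop-in replacement, is to wrap the steps above in a single function. The function name `recommend`, its parameters, the column prefixes, and the choice of `', '` as the separator are all assumptions, and the tiny DataFrame stands in for the Excel sheet.

```python
import pandas as pd

def recommend(df, name, top_n=6, sep=", "):
    """Rank characters by the share of `name`'s attributes they have in common."""
    # One indicator column per attribute value; prefixes keep the groups apart.
    features = pd.concat(
        [df[col].str.get_dummies(sep=sep).add_prefix(col + "_")
         for col in ("herotype", "weapons", "spells")],
        axis=1,
    )
    # Profile of the chosen character (names are assumed to be unique).
    profile = features[df["name"] == name].iloc[0]
    # Shared attributes divided by the number of attributes in the profile.
    scores = features.mul(profile, axis=1).sum(axis=1) / profile.sum()
    top = scores.sort_values(ascending=False).head(top_n)
    return df.loc[top.index].assign(score=top)

# Stand-in data copied from the sample output in this post.
toy = pd.DataFrame({
    "name": ["Irv", "yac", "Zed Ryley"],
    "herotype": ["Sorcerer", "Sorcerer", "Sorcerer"],
    "weapons": ["light crossbow", "Greataxe", "sling"],
    "spells": ["Conjuration", "Conjuration, Evocation, Transmutation", "Evocation"],
})
result = recommend(toy, "Irv", top_n=3)
print(result)
```

Building the indicator columns separately from `df` means nothing has to be dropped again afterwards, and changing the character is a matter of passing a different `name`.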