0

I am doing a project on movies data.
The sample dataset looks like :

enter image description here

The column genres have 21 unique values.

enter image description here

I want to create a new table/dataframe so that the table will contains the average ratings for the each genres for every user, like

enter image description here

I got the list of genres using the code below:

def split(sent):
    return (sent.split())

new_genres=set()

for i in range(len(genres)):
    a=split(genres[i])
    for g in a:
        new_genres.add(g)

new_genres
2
  • You can use pivot_table. Commented Feb 6, 2022 at 5:40
  • 1
    General advice : Screenshots won’t help anyone searching for similar issues, you should prefer a MCVE showcasing actual code that people can copy paste and study in their own environment. Commented Feb 6, 2022 at 8:37

2 Answers 2

1

Setup:

In [905]: df = pd.DataFrame({'userID':[1,2,3,3,2], 'id':[110, 147, 858, 1246, 1968], 'rating':[1.0, 4.5, 5.0, 5.0, 4.0], 'genres':['Drama Mystery Romance', 'Drama', 'Comedy Drama Romance', 'Drama', 'Drama Comedy Romance']}
     ...: )

In [906]: df
Out[906]: 
   userID    id  rating                 genres
0       1   110     1.0  Drama Mystery Romance
1       2   147     4.5                  Drama
2       3   858     5.0   Comedy Drama Romance
3       3  1246     5.0                  Drama
4       2  1968     4.0   Drama Comedy Romance

Use Series.str.split, df.explode with df.pivot_table:

In [907]: df['genres'] = df['genres'].str.split()
In [910]: res = df.explode('genres').pivot_table(index='userID', columns='genres', values='rating', aggfunc='mean').fillna(0, downcast='infer')

In [911]: res
Out[911]: 
genres  Comedy  Drama  Mystery  Romance
userID                                 
1            0   1.00        1        1
2            4   4.25        0        4
3            5   5.00        0        5
Sign up to request clarification or add additional context in comments.

Comments

1

We can start by using the assign method to get each genre in rows like so :

>>> df = df.assign(genre=df['genres'].str.split(' ')).explode('genre')
>>> df
    userId  id      rating  genres                  genre
0   1       110     1.0     Drama Mystery Romance   Drama
0   1       110     1.0     Drama Mystery Romance   Mystery
0   1       110     1.0     Drama Mystery Romance   Romance
1   1       147     4.5     Drama                   Drama
2   1       858     5.0     Comedy Drama Romance    Comedy
2   1       858     5.0     Comedy Drama Romance    Drama
2   1       858     5.0     Comedy Drama Romance    Romance
3   1       1246    5.0     Drama                   Drama
4   1       1968    4.0     Drama Comedy Romance    Drama
4   1       1968    4.0     Drama Comedy Romance    Comedy
4   1       1968    4.0     Drama Comedy Romance    Romance
5   270896  48780   5.0     Forein                  Forein
6   270896  49530   4.0     Action Thriller Scifi   Action
6   270896  49530   4.0     Action Thriller Scifi   Thriller
6   270896  49530   4.0     Action Thriller Scifi   Scifi
7   270896  54001   4.0     Drama                   Drama
8   270896  54503   4.0     Action Forein           Action
8   270896  54503   4.0     Action Forein           Forein
9   270896  58559   5.0     Drama                   Drama

Then we groupby userId and genre to get the mean of rating:

>>> df_grouped = df.groupby(['userId', 'genre'])['rating'].mean()
>>> df_grouped
userId  genre   
1       Comedy      4.500000
        Drama       3.900000
        Mystery     1.000000
        Romance     3.333333
270896  Action      4.000000
        Drama       4.500000
        Forein      4.500000
        Scifi       4.000000
        Thriller    4.000000
Name: rating, dtype: float64

To finish, we can use unstack to get the expected result :

>>> df_grouped.unstack(level=-1).fillna(0)
genre   Action  Comedy  Drama   Forein  Mystery     Romance     Scifi   Thriller
userId                              
1       0.0     4.5     3.9     0.0     1.0         3.333333    0.0     0.0
270896  4.0     0.0     4.5     4.5     0.0         0.000000    4.0     4.0

UPDATE :

If you want to add the missing genre given in your list, we can do the following :

>>> cols = ['Action',
...         'Adventure',
...         'Animation',
...         'Crime',
...         'Documentary',
...         'Family',
...         'Fantasy',
...         'History',
...         'Horror',
...         'Comedy', 
...         'Drama', 
...         'Forein', 
...         'Mystery', 
...         'Romance', 
...         'Scifi', 
...         'Thriller']
>>> df_res = df_res.reindex(df_res.columns.union(cols, sort=False), axis=1, fill_value=0)
>>> df_res
genre   Action  Comedy  Drama   Forein  Mystery     Romance     Scifi   Thriller    Adventure   Animation   Crime   Documentary     Family  Fantasy     History     Horror
userId                                                              
1       0.0     4.5     3.9     0.0     1.0         3.333333    0.0     0.0         0           0           0       0               0       0           0           0
270896  4.0     0.0     4.5     4.5     0.0         0.000000    4.0     4.0         0           0           0       0               0       0           0           0

1 Comment

Hi @NITISHPANDEY, I updated the answer to add the genre listed in your question even if those are missing in the data, hope it helps ;) !

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.