8

I am using Python 3 (Spyder), and I have a table of type "pandas.core.frame.DataFrame". I want to z-score normalize the values in that table (subtract from each value the mean of its row and divide by the sd of its row), so that each row has mean=0 and sd=1. I have tried two approaches.

First approach

from scipy.stats import zscore
zetascore_table=zscore(table,axis=1)

Second approach

rows=table.index.values
columns=table.columns
import numpy as np
for i in range(len(rows)):
    for j in range(len(columns)):
         table.loc[rows[i],columns[j]]=(table.loc[rows[i],columns[j]] - np.mean(table.loc[rows[i],]))/np.std(table.loc[rows[i],])
table

Both approaches seem to work, but when I check the mean and sd of each row, they are not 0 and 1 as they are supposed to be, but other float values. I don't know what the problem could be.

Thanks in advance for your help!

1
  • Maybe worth noting that (a) df['z score'] = zscore(df['col A']) and (b) df['z score'] = (df['col A']-df['col A'].mean())/df['col A'].std() do not give exactly the same z-scores: (a) computes the std dev with ddof=0 (the default for zscore()), while (b) uses ddof=1 (the pandas default for std()). Depending on the application, you can set the ddof equal, e.g. using df['col A'].std(ddof=0) in (b) will make them match. See stackoverflow.com/questions/59668597/… for ddof. Commented Jun 10, 2023 at 21:41
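The ddof point in this comment can be checked with a small sketch. The column values below are made up for illustration, and the underlying NumPy array is passed to zscore() rather than the Series itself, which is only a cautious choice here:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical sample column, used only for illustration
df = pd.DataFrame({"col A": [1.0, 2.0, 3.0, 4.0, 5.0]})
col = df["col A"]

# (a) scipy: standard deviation computed with ddof=0 by default
z_scipy = zscore(col.to_numpy())

# (b) pandas: Series.std() uses ddof=1 by default
z_pandas_default = ((col - col.mean()) / col.std()).to_numpy()

# (b) with ddof=0, which should match scipy's default
z_pandas_ddof0 = ((col - col.mean()) / col.std(ddof=0)).to_numpy()

same_with_ddof0 = np.allclose(z_scipy, z_pandas_ddof0)
same_with_default = np.allclose(z_scipy, z_pandas_default)
print(same_with_ddof0, same_with_default)
```

The two defaults differ only by a constant factor per column, so the ranking of values is unchanged; only the scale of the z-scores differs.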

2 Answers

14

The code below calculates a z-score for each value in a column of a pandas df. It then saves the z-score in a new column (here, called 'num_1_zscore'). Very easy to do.

from scipy.stats import zscore
import pandas as pd

# Create a sample df
df = pd.DataFrame({'num_1': [1,2,3,4,5,6,7,8,9,3,4,6,5,7,3,2,9]})

# Calculate the zscores and drop zscores into new column
df['num_1_zscore'] = zscore(df['num_1'])

display(df)
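
The example above works on a single column. For the row-wise case in the question, a minimal sketch might look like the following. The table contents are hypothetical, and rewrapping the result in a DataFrame is just one way to keep the row and column labels. Note that checking the result with the pandas default std() (ddof=1) will not give exactly 1, which is one possible reason the rows appeared not to be normalized:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical table standing in for the one in the question
table = pd.DataFrame(
    {"a": [1.0, 4.0, 10.0], "b": [2.0, 6.0, 20.0], "c": [6.0, 11.0, 60.0]},
    index=["r1", "r2", "r3"],
)

# axis=1 standardizes each row; rewrap the array to keep the labels
z_table = pd.DataFrame(
    zscore(table.to_numpy(), axis=1),
    index=table.index,
    columns=table.columns,
)

row_means = z_table.mean(axis=1)            # close to 0 for every row
row_sds_ddof0 = z_table.std(axis=1, ddof=0)  # close to 1 for every row
row_sds_default = z_table.std(axis=1)        # pandas default ddof=1: not 1

print(row_means, row_sds_ddof0, row_sds_default, sep="\n")
```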

2

Sorry, thinking about it, I found an easier way than the for loops to calculate the z-score (subtract the mean of each row and divide the result by the sd of the row):

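table = table.T  # transpose so that each original row becomes a column
# explicit axis and ddof, so the result does not depend on how
# np.std/np.mean dispatch to the DataFrame in a given pandas version
sd = table.std(axis=0, ddof=0)  # population sd per column, as np.std computes
mean = table.mean(axis=0)       # mean per column
numerator = table - mean  # numerator in the formula for z-score
z_score = numerator / sd
z_norm_table = z_score.T  # transpose again to get the initial table, but with
# all the values z-scored by row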

I checked, and now the mean of each row is 0 or very close to 0 and the sd is 1 or very close to 1, so this worked for me. Sorry, I have little experience with coding, and sometimes easy things require a lot of trials until I figure out how to solve them.
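
As a variation on this idea, the two transposes can probably be avoided by computing the row statistics once and aligning them on the index. The sketch below uses a made-up table and assumes the population sd (ddof=0) is wanted. It also sidesteps a likely problem with the nested loop in the question, which overwrites the table cell by cell while recomputing each row's mean and sd from the partly updated row:

```python
import numpy as np
import pandas as pd

# Hypothetical table standing in for the one in the question
table = pd.DataFrame(
    [[1.0, 2.0, 6.0], [4.0, 6.0, 11.0], [10.0, 20.0, 60.0]],
    index=["r1", "r2", "r3"],
    columns=["a", "b", "c"],
)

# Row statistics are computed once, from the untouched table
row_mean = table.mean(axis=1)
row_sd = table.std(axis=1, ddof=0)  # ddof=0 is an assumption; use ddof=1 for sample sd

# axis=0 aligns the per-row Series with the DataFrame's index
z_norm_table = table.sub(row_mean, axis=0).div(row_sd, axis=0)

# Sanity check: each row should have mean ~0 and sd ~1
means_ok = np.allclose(z_norm_table.mean(axis=1), 0.0)
sds_ok = np.allclose(z_norm_table.std(axis=1, ddof=0), 1.0)
print(means_ok, sds_ok)
```

Because a new DataFrame is returned, the original table is left unchanged and can be compared against the result.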
