
I am trying to impute missing values as the mean of other values in the column; however, my code is having no effect. Does anyone know what I may be doing wrong? Thanks!

My code:

    from sklearn.preprocessing import Imputer
    imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imputer = imputer.fit(x[:, 1:3])
    x[:, 1:3] = imputer.transform(x[:, 1:3])
    print(dataset)

Output

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

2 Answers


Assuming df is your dataset, you can do the following:

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values ='NaN', strategy = 'mean', axis = 0)

df[['Age','Salary']]=imputer.fit_transform(df[['Age','Salary']])

print(df)

   Country        Age        Salary Purchased
0   France  44.000000  72000.000000        No
1    Spain  27.000000  48000.000000       Yes
2  Germany  30.000000  54000.000000        No
3    Spain  38.000000  61000.000000        No
4  Germany  40.000000  63777.777778       Yes
5   France  35.000000  58000.000000       Yes
6    Spain  38.777778  52000.000000        No
7   France  48.000000  79000.000000       Yes
8  Germany  50.000000  83000.000000        No
9   France  37.000000  67000.000000       Yes

1 Comment

In newer versions of sklearn, use from sklearn.impute import SimpleImputer instead. Imputer was deprecated in version 0.20 and removed in 0.22.
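To make the comment above concrete, here is a minimal sketch of the same fix using SimpleImputer. The DataFrame is built inline as a stand-in for the asker's CSV (the values are copied from the question's output), so treat the construction step as an assumption. Note that SimpleImputer takes np.nan rather than the string 'NaN' and has no axis argument; it always works column-wise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for the asker's data; values copied from the output in the question.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Replace each NaN with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Write the imputed values back into the DataFrame itself.
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

print(df)
```

The key point is the assignment back into df: the imputer returns a new array and does not modify its input.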

You're assigning an Imputer object to the variable imputer:

imputer = Imputer(missing_values ='NaN', strategy = 'mean', axis = 0)

You then call the fit() function on your Imputer object, and then the transform() function.

Then you print the dataset variable, and I'm not sure where that comes from. Did you mean to print the Imputer object, or the result of one of those calls instead?

1 Comment

Hey Danielle! The dataset variable was created earlier in my code: dataset = pd.read_csv('Data1.csv'). I printed it to see whether the mean had been imputed into the Age and Salary columns in place of the NaN values. Printing it produced the output shown below my code. The NaN values had not been replaced, which led me to believe that my code had no effect.
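One likely explanation, sketched here under an assumption: the question never shows how x was created. If it was taken from the DataFrame with something like dataset.iloc[:, :-1].values, then x is a separate NumPy array, and imputing into x leaves dataset untouched. The inline DataFrame below is a stand-in for Data1.csv, and SimpleImputer replaces the removed Imputer class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for pd.read_csv('Data1.csv'); values copied from the question.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# ASSUMPTION: x was built like this (not shown in the question).
# With mixed column types, .values produces a new object array, not a view.
x = dataset.iloc[:, :-1].values

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

# The array copy was imputed...
print(pd.isna(x[:, 1:3]).sum())
# ...but the DataFrame still holds its original NaN values.
print(dataset[["Age", "Salary"]].isna().sum())
```

If that matches your setup, either print x instead of dataset, or assign the imputed columns back into dataset.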
