Different scores when using a scikit-learn pipeline vs. doing it manually

Question

A simple example below uses MinMaxScaler, PolynomialFeatures and a LinearRegression model.

Doing it via a pipeline:

pipeLine = make_pipeline(MinMaxScaler(),PolynomialFeatures(), LinearRegression())

pipeLine.fit(X_train,Y_train)
print(pipeLine.score(X_test,Y_test))
print(pipeLine.steps[2][1].intercept_)
print(pipeLine.steps[2][1].coef_)

0.4433729905419167
3.4067909278765605
[ 0.         -7.60868833  5.87162697]

Doing it manually:

X_trainScaled = MinMaxScaler().fit_transform(X_train)
X_trainScaledandPoly = PolynomialFeatures().fit_transform(X_trainScaled)

X_testScaled = MinMaxScaler().fit_transform(X_test)
X_testScaledandPoly = PolynomialFeatures().fit_transform(X_testScaled)

reg = LinearRegression()
reg.fit(X_trainScaledandPoly,Y_train)
print(reg.score(X_testScaledandPoly,Y_test))
print(reg.intercept_)
print(reg.coef_)
print(reg.intercept_ == pipeLine.steps[2][1].intercept_)
print(reg.coef_ == pipeLine.steps[2][1].coef_)

0.44099256691782807
3.4067909278765605
[ 0.         -7.60868833  5.87162697]
True
[ True  True  True]

Is it possible that X_test and X_train have different min/max values? Can you try it with a defined dataset and add it to your question?
– Maximilian Peters, Commented Jul 27, 2019 at 9:34
You are not supposed to fit_transform twice. You are supposed to fit using the training data and then ONLY call transform for the test data.
– Axois, Commented Jul 27, 2019 at 13:47

hellpanderr · Accepted Answer · 2019-07-27 13:10:42Z

The problem lies in your manual steps: you refit the scaler on the test data. You need to fit it on the training data and then use that fitted instance to transform the test data. See these questions for details: "How to normalize the Train and Test data using MinMaxScaler sklearn" and "StandardScaler before and after splitting data".
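To make the effect concrete, here is a minimal sketch with made-up numbers (the toy arrays are hypothetical and not taken from the question). It shows that the same test rows end up with different scaled values depending on whether the scaler was fitted on the training data or refitted on the test data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the test set covers a narrower range than the training set.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.0], [4.0]])

# Fit on the training data only, then reuse the fitted scaler on the test data.
scaler = MinMaxScaler().fit(X_train)
correct = scaler.transform(X_test)

# Refitting on the test data learns a different min/max (2 and 4 instead of 0 and 10).
refit = MinMaxScaler().fit_transform(X_test)

print(correct.ravel())  # [0.2 0.4]
print(refit.ravel())    # [0. 1.]
```

The regression was trained on features scaled with the training min/max, so feeding it the refitted values changes the predictions and therefore the score.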

from sklearn.datasets import make_regression
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_features=3, n_samples=50, n_informative=1, noise=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)

pipeLine = make_pipeline(MinMaxScaler(),PolynomialFeatures(), LinearRegression())

pipeLine.fit(X_train,Y_train)
print(pipeLine.score(X_test,Y_test))
print(pipeLine.steps[2][1].intercept_)
print(pipeLine.steps[2][1].coef_)

# Fit the scaler on the training data only
scaler = MinMaxScaler().fit(X_train)
X_trainScaled = scaler.transform(X_train)
poly = PolynomialFeatures().fit(X_trainScaled)
X_trainScaledandPoly = poly.transform(X_trainScaled)

# Reuse the fitted scaler and feature expansion on the test data (no refitting)
X_testScaled = scaler.transform(X_test)
X_testScaledandPoly = poly.transform(X_testScaled)

reg = LinearRegression()
reg.fit(X_trainScaledandPoly,Y_train)
print(reg.score(X_testScaledandPoly,Y_test))
print(reg.intercept_)
print(reg.coef_)
print(reg.intercept_ == pipeLine.steps[2][1].intercept_)
print(reg.coef_ == pipeLine.steps[2][1].coef_)
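As a side note, a slightly more defensive way to run the same check might look like the sketch below. It assumes a fixed `random_state` (not in the original, added only so the run is reproducible), looks up the final step by the name that `make_pipeline` generates, and compares floats with `np.isclose`/`np.allclose` instead of `==`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# random_state=0 is an assumption added here for reproducibility.
X, y = make_regression(n_features=3, n_samples=50, n_informative=1,
                       noise=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version: each transformer is fitted on the training data only.
pipe = make_pipeline(MinMaxScaler(), PolynomialFeatures(), LinearRegression())
pipe.fit(X_train, y_train)
pipeline_score = pipe.score(X_test, y_test)

# Manual version: fit each transformer on train, then only transform the test data.
scaler = MinMaxScaler().fit(X_train)
poly = PolynomialFeatures().fit(scaler.transform(X_train))
reg = LinearRegression().fit(poly.transform(scaler.transform(X_train)), y_train)
manual_score = reg.score(poly.transform(scaler.transform(X_test)), y_test)

# make_pipeline names each step after its lowercased class name.
pipe_reg = pipe.named_steps["linearregression"]

scores_match = bool(np.isclose(manual_score, pipeline_score))
coefs_match = bool(np.allclose(reg.coef_, pipe_reg.coef_))
print(scores_match, coefs_match)
```

Tolerance-based comparison is the safer habit for floating-point results, even when, as here, both code paths perform the same computation.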
