In SelectKBest, what does length of get_support() represent?

Question

When reproducing this cross-validation example, I get len(b.get_support()) equal to 1,000,000 for a 2x4 train matrix (xtrain). Does this mean that 1,000,000 features have been created in the model? Or only 2, since only 2 features have an impact? Thanks!

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold  # replaces the old sklearn.cross_validation module
from sklearn.linear_model import LinearRegression

# Create data
def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result


def make_x(nobs):
    # nobs rows, one million feature columns
    return np.random.uniform(0, 3, (nobs, 10 ** 6))

x = make_x(20)
y = hidden_model(x)

scores = []
clf = LinearRegression()

# KFold now takes n_splits and yields index pairs from .split()
for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # Select the 2 best features using the training fold only
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]  # get_support: get mask or integer index of selected features
    xtest = xtest[:, b.get_support()]
    print(len(b.get_support()))

    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel('Predicted')
plt.ylabel('Observed')

print("CV Score (R_square) is", np.mean(scores))

piman314 · Accepted Answer · 2016-02-05 16:47:14Z

get_support() returns a boolean mask with one entry per column of your x. Applying that mask to x keeps only the features that were selected by the SelectKBest routine.

print(x.shape)
print(b.get_support().shape)
print(np.bincount(b.get_support()))

Outputs:

(20, 1000000)
(1000000,)
[999998      2]

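This shows that x holds 20 examples of 1,000,000-dimensional data, and that get_support() returns a boolean array of length 1,000,000 in which only two entries are True. So len(b.get_support()) is the number of input features, not the number of selected ones; the regression is fit on just the 2 selected columns.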
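If the million-column array is too heavy to experiment with, the same behaviour can be checked on a small example. This is only a sketch: the array sizes, the fixed seed and the names X, rng, selector and mask are my own choices for illustration, not part of the question's code. It compares the length of the mask with the number of True entries, and shows that get_support(indices=True) returns the selected column indices instead of a mask.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Illustrative data (assumed sizes): 200 samples, 30 features,
# with y driven by columns 5 and 10 as in the question.
rng = np.random.RandomState(0)
X = rng.uniform(0, 3, (200, 30))
y = X[:, 5] + X[:, 10] + rng.normal(0, 0.005, 200)

selector = SelectKBest(f_regression, k=2).fit(X, y)

mask = selector.get_support()
print(mask.shape)   # one entry per input feature: (30,)
print(mask.sum())   # number of True entries equals k: 2

# With indices=True you get column indices rather than a boolean mask;
# given the strong signal these should be columns 5 and 10.
print(selector.get_support(indices=True))

# Indexing with the mask gives the same columns as transform().
print(np.array_equal(X[:, mask], selector.transform(X)))
```

So len(get_support()) always matches the number of columns you passed to fit, while the count of True values (or the length of the indices array) matches k.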

Hope that helps!
