KMeans clustering - ValueError: n_samples=1 should be >= n_clusters

Question

I am running an experiment with three time-series datasets that have different characteristics. Their format is as follows.

    0.086206438,10
    0.086425551,12
    0.089227066,20
    0.089262508,24
    0.089744425,30
    0.090036815,40
    0.090054172,28
    0.090377569,28
    0.090514071,28
    0.090762872,28
    0.090912691,27

The first column is a timestamp. For reproducibility, I am sharing the data here. In column 2, I want to read each row and compare it with the value in the previous row. If the current value is greater, I keep going. If it is smaller than the previous row's value, I want to divide the current (smaller) value by the previous (larger) value. Here is the code:

import numpy as np
import matplotlib.pyplot as plt

protocols = {}

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    plt.figure(); plt.clf()
    plt.plot(quotient_times,quotient, ".", label=protname, color="blue")
    plt.ylim(0, 1.0001)
    plt.title(protname)
    plt.xlabel("time")
    plt.ylabel("quotient")
    plt.legend()
    plt.show()

This produces the following three plots, one for each dataset I shared.

As the plots produced by the code above show, data1 is quite consistent, with quotients around 1; the quotients of data2 concentrate around two values (about 0.5 or 0.8); and those of data3 also concentrate around two values (about 0.5 or 0.7). Given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to, building each dataset by stacking the two transformed features quotient and quotient_times. I am trying this with KMeans clustering as follows:

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)

But this is giving me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?

Update: sample quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])

desertnaut · Accepted Answer · 2019-02-22 22:39:13Z

As is, your quotient variable is a single sample. I get a different error message here, probably because of a different Python/scikit-learn version, but the essence is the same:

import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)

This gives the following error:

ValueError: Expected 2D array, got 1D array instead:
array=[0.7        0.7        0.4973262  0.7008547  0.71287129 0.704
 0.49723757 0.49723757 0.70676692 0.5        0.5        0.70754717
 0.5        0.49723757 0.70322581 0.5        0.49723757 0.49723757
 0.5        0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

which, despite the different wording, is not different from yours - essentially it says that your data look like a single sample.
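To make the two reshape options named in that message concrete, here is a minimal sketch on a shortened version of the quotient array. The names as_features and as_sample are only for illustration and do not appear in the question.

```python
import numpy as np

# Shortened version of the quotient data from the question
quotient = np.array([0.7, 0.4973262, 0.7008547, 0.5, 0.49723757])

# 5 samples with 1 feature each: one row per observation
as_features = quotient.reshape(-1, 1)

# 1 sample with 5 features: a single row holding all values
as_sample = quotient.reshape(1, -1)

print(quotient.shape)     # (5,)
print(as_features.shape)  # (5, 1)
print(as_sample.shape)    # (1, 5)
```

scikit-learn estimators read rows as samples and columns as features, so only the first form gives KMeans more than one point to cluster.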

Following the first piece of advice (i.e. treating quotient as a single feature, or column) resolves the issue:

k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
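
Since the question also asks which cluster a new data point belongs to, the fitted model can be queried with predict. The following is a rough sketch under a few assumptions: it uses only the quotient feature, new_points holds made-up values, and n_init=10 is passed explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

quotient = np.array([0.7, 0.7, 0.4973262, 0.7008547, 0.71287129, 0.704,
                     0.49723757, 0.49723757, 0.70676692, 0.5, 0.5,
                     0.70754717, 0.5, 0.49723757, 0.70322581, 0.5,
                     0.49723757, 0.49723757, 0.5, 0.49723757])

# One row per observation, one column for the single feature
X = quotient.reshape(-1, 1)

k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(X)

# Hypothetical new observations; they must be 2D as well
new_points = np.array([[0.5], [0.7]])
new_labels = k_means.predict(new_points)

print(k_means.cluster_centers_.ravel())  # one center per cluster
print(new_labels)                        # cluster index for each new point
```

To cluster on both quotient_times and quotient, the same pattern should apply with np.column_stack((quotient_times, quotient)) as the input, which has shape (n_samples, 2).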

Gustavo Fonseca · Accepted Answer · 2019-02-23 12:41:23Z

Please try the code below. A brief explanation of what I've done:

First I built the dataset with sample = np.vstack((quotient_times, quotient)).T and standardized it so that it would be easier to cluster. Then I applied DBSCAN with several hyperparameter combinations (eps and min_samples) until I found the one that separated the points best. Finally, I plotted the data with their respective labels; since you are working with 2-dimensional data, it is easy to visualize how good the clustering is.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

dataset = np.empty((0, 2))

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T

    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    sample = np.vstack((quotient_times, quotient)).T
    dataset = np.append(dataset, sample, axis=0)

scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)

dbscan = DBSCAN(eps=0.6, min_samples=1)
dbscan.fit(dataset)

# Use the cluster label of each point as its color
colors = dbscan.labels_

plt.figure()
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.show()
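
The hyperparameter search mentioned in the explanation is not shown in the code above. One possible way to run it is sketched below. The CSV files are not available here, so the sketch uses synthetic data as a stand-in for the stacked (quotient_times, quotient) array, and the candidate values for eps and min_samples are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: quotients near 0.5 and near 0.7
times = rng.uniform(0, 100, size=60)
quotients = np.concatenate([rng.normal(0.5, 0.005, 30),
                            rng.normal(0.7, 0.005, 30)])
dataset = StandardScaler().fit_transform(np.column_stack((times, quotients)))

results = {}
for eps in (0.2, 0.4, 0.6):
    for min_samples in (1, 3, 5):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dataset)
        # DBSCAN marks noise points with the label -1
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        results[(eps, min_samples)] = (n_clusters, n_noise)
        print(f"eps={eps}, min_samples={min_samples}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

With min_samples=1 every point counts as a core point, so no point is ever labelled as noise. Larger values make DBSCAN stricter about what counts as a cluster.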

Thank you, you are awesome. But why do we have negative quotients? They should be numbers between 0 and 1. Is it also possible to plot all 3 datasets in one figure so that we can see what the clusters look like?
It's because I applied scaler.fit_transform(dataset). If you want to know more about it, please refer to Feature scaling. It's definitely possible; you just have to combine all the datasets before applying DBSCAN.
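
A small sketch of why standardized quotients can be negative, and of one way to get the original values back for plotting. The four quotient values here are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up quotients, all between 0 and 1
quotient = np.array([0.7, 0.5, 0.7, 0.5]).reshape(-1, 1)

scaler = StandardScaler()
# Each value becomes (value - mean) / standard deviation,
# so anything below the mean turns negative
scaled = scaler.fit_transform(quotient)

# inverse_transform maps the scaled values back to the original units
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

If the plot should show quotients on their original 0 to 1 axis, one option is to cluster on the scaled data but pass the unscaled (or restored) coordinates to plt.scatter.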

Mohit Gaikwad · Accepted Answer · 2022-11-16 16:56:38Z

You are trying to make 3 clusters, but you have only 1 np.array, i.e. n_samples=1.

Try one of the following:
Increasing the number of arrays (samples).
Decreasing the number of clusters.
Reshaping the array (not sure).
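
As a rough check of these suggestions, the sketch below fits KMeans on a shortened quotient array shaped as one sample and then as five samples. The exact wording of the error may vary between scikit-learn versions, and n_init=10 is set explicitly for the same reason.

```python
import numpy as np
from sklearn.cluster import KMeans

quotient = np.array([0.7, 0.4973262, 0.7008547, 0.5, 0.49723757])

# Shape (1, 5): a single sample, fewer than the 3 clusters requested
single_sample = quotient.reshape(1, -1)
try:
    KMeans(n_clusters=3, n_init=10, random_state=0).fit(single_sample)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Shape (5, 1): five samples, enough for 3 clusters
many_samples = quotient.reshape(-1, 1)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(many_samples)
print(model.labels_.shape)  # (5,)
```

Reshaping is the option that keeps all the data and the requested number of clusters.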

