KMeans question #269

ssimontacchi · 2020-06-20T06:27:13Z

Hi, thanks for the awesome library!

I am running KMeans on lots of different datasets that all have roughly four shapes, so I initialize the centroids with those shapes. This works well most of the time, but a few datasets look different enough that I end up with empty clusters and the algorithm just hangs ("Resumed because of empty cluster" again and again).

I conceptually understand why this happens, but do you know of any way to avoid it, or at least to make the run finish? I'm not sure I understand what's going on behind the scenes well enough to debug any further. Thank you!

GillesVandewiele · 2020-06-20T06:30:52Z

Hi @ssimontacchi,

What you are describing is a common problem with KMeans: it affects not only the custom time series variant but also the scikit-learn implementation. For this reason, KMeans is often run multiple times with different random initializations, and a score such as silhouette_score is then used to decide which of those random restarts produced the best clustering.
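The restart-and-score idea can be sketched in plain NumPy. This is only an illustration under stated assumptions, not tslearn's or scikit-learn's implementation: the helper names (`lloyd`, `silhouette`, `best_of_restarts`) are made up for this sketch, each series is assumed to be flattened to a fixed-length vector, and distances are Euclidean.

```python
import numpy as np


def lloyd(X, k, rng, n_iter=50):
    """Plain Lloyd iterations started from k randomly chosen samples."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster simply keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def silhouette(X, labels):
    """Mean silhouette coefficient (Euclidean); needs at least 2 clusters."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, own].sum() / (own.sum() - 1)
        b = min(dist[i, labels == c].mean() for c in ids if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


def best_of_restarts(X, k, n_restarts=10):
    """Run several seeded restarts and keep the highest-silhouette labelling."""
    best_labels, best_score = None, -np.inf
    for seed in range(n_restarts):
        labels = lloyd(X, k, np.random.default_rng(seed))
        if len(np.unique(labels)) < 2:
            continue  # silhouette is undefined when only one cluster is used
        score = silhouette(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score
```

Calling `best_of_restarts(X, 4)` on an array of shape `(n_samples, n_features)` returns the labels of the best-scoring restart together with its score. A restart that collapses to a single cluster is skipped rather than scored.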

That said, the algorithm definitely should not hang... I'll label this with "bug" for now!

GillesVandewiele · 2020-06-20T06:32:01Z

Would it be possible to construct a minimal example with a small dataset? This would help a lot with debugging.

rtavenar · 2020-06-20T07:56:51Z

One important question: which init method are you using?
Then, what would you suggest, Gilles, when a cluster is empty? Does anyone know what sklearn does in this case?

GillesVandewiele · 2020-06-20T08:15:19Z

Good question... I think we need a check that the total number of unique clusters is equal to the specified number of clusters. If that is not the case, a warning should be raised that the number of clusters is probably not set well (and perhaps it should also display the number of clusters with the highest silhouette score that we found over the random initializations).
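A minimal sketch of such a check, assuming the fitted labels are available as a flat sequence of integers. The function name `count_nonempty_clusters` is hypothetical and not part of tslearn, and the warning text is only a suggestion.

```python
import warnings


def count_nonempty_clusters(labels, n_clusters):
    """Return how many clusters received at least one sample.

    Emits a RuntimeWarning when that number is below n_clusters.
    """
    n_found = len(set(labels))
    if n_found < n_clusters:
        warnings.warn(
            f"Only {n_found} of {n_clusters} clusters are non-empty; "
            "n_clusters is probably set too high for this dataset.",
            RuntimeWarning,
        )
    return n_found
```

Returning the count instead of raising keeps the decision with the caller, who can retry with a different initialization or a smaller number of clusters.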

Sklearn will just assign some random values to the cluster in case there is an empty cluster apparently. (source)

rtavenar · 2020-06-29T10:12:24Z

> Sklearn will just assign some random values to the cluster in case there is an empty cluster apparently. (source)

By the way, the link you provided states that it is not chosen at random:

> A problem with k-means is that one or more clusters can be empty. However, this problem is accounted for in the current k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the empty cluster. Then it will reassign the centroid to be this farthest point.
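Read literally, the quoted rule can be sketched as follows. This is an illustration of the description above, not a copy of scikit-learn's code, and whether scikit-learn behaves exactly this way has not been verified here. The function name `relocate_empty_centroids` is hypothetical, the samples are assumed to be flat vectors in a 2D float array, and skipping samples that are the only member of their cluster is an extra safeguard added for this sketch so that a relocation cannot empty another cluster.

```python
import numpy as np


def relocate_empty_centroids(X, centroids, labels):
    """Move each empty cluster's centroid onto the sample farthest from it.

    Returns new (centroids, labels) arrays; the inputs are left untouched.
    """
    centroids = centroids.copy()
    labels = labels.copy()
    k = len(centroids)
    for j in range(k):
        counts = np.bincount(labels, minlength=k)
        if counts[j] > 0:
            continue  # cluster j already has members
        dist = np.linalg.norm(X - centroids[j], axis=1)
        # Never steal the only member of another cluster.
        dist[counts[labels] <= 1] = -np.inf
        if not np.isfinite(dist).any():
            raise ValueError("not enough samples to fill every cluster")
        far = int(dist.argmax())
        centroids[j] = X[far]  # the centroid jumps onto the farthest sample
        labels[far] = j        # and that sample now belongs to cluster j
    return centroids, labels
```

The centroid of the cluster that lost a sample is left as it was; the next update step of the surrounding KMeans loop would recompute it.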

We'll have to check.

ssimontacchi · 2020-07-01T18:50:26Z

Hi, I have tried to make a minimal example but am having trouble reproducing the problem (and unfortunately I can't share the datasets that trigger it). I believe the example in the colab shows the general situation, though.

Again, it should be pretty fast on these datasets, but it just seems to hang. My guess is that it's reassigning empty clusters indefinitely, but I'm not sure. Here is what the log looks like:

WARNING: QApplication was not created in the main() thread.
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster

It only shows "Resumed because of empty cluster" 10 times though, and I'm not sure why it would stop there if it were reassigning clusters indefinitely.

I'm trying to figure out what's different about this data and I'll let you know when I figure it out. Thanks!

ssimontacchi · 2020-07-03T04:13:24Z

I'm really having a hard time getting it to hang, but I found that many, though not all, of the datasets with this problem have tiny amounts of data. If you run it with only a few data points, you will definitely get lots of empty cluster messages.

I guess the question is figuring out if there is some condition that could cause it not to converge?
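One condition that can be ruled out before fitting is having fewer distinct series than clusters: if identical series always receive the same label, at least one cluster must then stay empty. A small sketch of that pre-check, assuming equal-length series stored in an array of shape `(n_series, length, dims)` or `(n_series, length)`. The helper names are hypothetical, and this does not claim to be the cause of the hang reported here.

```python
import numpy as np


def n_distinct_series(X):
    """Count distinct series after flattening each one to a vector."""
    X = np.asarray(X)
    flat = X.reshape(len(X), -1)
    return len(np.unique(flat, axis=0))


def can_fill_clusters(X, n_clusters):
    """True if there are at least n_clusters distinct series in X."""
    return n_distinct_series(X) >= n_clusters
```

If `can_fill_clusters(X, n_clusters)` is false, reducing `n_clusters` for that dataset avoids a run that cannot end with every cluster populated.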

GillesVandewiele added the bug label Jun 20, 2020

rtavenar added the good first issue label Jun 20, 2020
