KMeans question #269
Comments
|
Hi @ssimontacchi, what you are describing is a common problem with KMeans (both the custom time series variant and the scikit-learn variant have this issue). For that reason, the KMeans algorithm is often run multiple times with different random initializations, and a score such as silhouette_score is then used to decide which of those random restarts gave the best clustering. That said, the algorithm definitely should not hang... I'll label this with "bug" for now! |
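A rough sketch of that restart-and-score loop, in case it helps. It uses scikit-learn's `KMeans` on plain 2-D points rather than the time series variant, and the helper name `best_of_restarts` and the toy data are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_of_restarts(X, n_clusters, seeds=range(10)):
    """Fit KMeans once per seed and keep the run with the highest silhouette score."""
    best_score, best_model = float("-inf"), None
    for seed in seeds:
        # n_init=1 so that each seed corresponds to exactly one random initialization
        model = KMeans(n_clusters=n_clusters, init="random", n_init=1, random_state=seed)
        labels = model.fit_predict(X)
        # silhouette_score is undefined when fewer than 2 distinct labels are present
        if len(np.unique(labels)) < 2:
            continue
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_model = score, model
    return best_model, best_score


# Toy data (hypothetical): three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
model, score = best_of_restarts(X, n_clusters=3)
```

Note that scikit-learn's own `n_init` parameter also does restarts, but it keeps the run with the lowest inertia rather than the best silhouette score, which is why the loop above is written out by hand.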
|
Would it be possible to construct a minimal example with a small dataset? This would help a lot with debugging. |
|
One important question: which init method are you using? |
|
Good question... I think we need a check that the number of unique clusters is equal to the specified number of clusters. If that is not the case, a warning should be raised that the number of clusters is probably not set well (and perhaps we should also display the number of clusters with the highest silhouette score that we found over the random initializations). Apparently sklearn just assigns some random values to the cluster when there is an empty cluster. (source) |
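Something along these lines is what I have in mind for the check. This is only a sketch: `check_cluster_count` is a hypothetical helper, not an existing function in the library, and the warning text is a placeholder:

```python
import warnings


def check_cluster_count(labels, n_clusters):
    """Warn when fewer distinct clusters are in use than were requested.

    Returns the number of non-empty clusters found in `labels`.
    """
    n_found = len(set(labels))
    if n_found < n_clusters:
        warnings.warn(
            f"Requested {n_clusters} clusters but only {n_found} are non-empty; "
            "n_clusters is probably set too high for this data.",
            RuntimeWarning,
        )
    return n_found


# Cluster 1 is never assigned here, so this call warns and returns 2
n_found = check_cluster_count([0, 0, 2, 2, 2], n_clusters=3)
```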
In the link you provide, they state that it's not chosen at random btw:
We'll have to check. |
|
Hi, I have tried to make a minimal example but am having trouble reproducing the problem (and unfortunately I can't share the datasets that trigger it). I believe the example in the colab shows the general situation, though. Again, it should be pretty fast on these datasets, but it just seems to hang. My guess is that it's reassigning empty clusters indefinitely, but I'm not sure. Here is what the log looks like: It only shows "Resumed because of empty cluster" 10 times, though, and I'm not sure why that would happen if it were failing to reassign clusters indefinitely. I'm trying to figure out what's different about this data and I'll let you know when I figure it out. Thanks! |
|
I'm really having a hard time getting it to hang, but I found that many, though not all, of the datasets that had this problem contain very little data. If you run it with only a few data points, you will definitely get lots of empty cluster messages. I guess the question is whether there is some condition that could cause it not to converge. |
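To make the question concrete, here is a toy pure-Python k-means I would expect to always finish. It is not how the library does it: `bounded_kmeans`, the size check, and the "reseed an empty cluster with the worst-fitting point" rule are all my own assumptions. It refuses datasets with fewer points than clusters and caps the number of iterations, so it returns even if the assignments never settle:

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def bounded_kmeans(points, k, max_iter=50, seed=0):
    """Lloyd-style k-means with a hard iteration cap.

    Returns (labels, centers, converged).
    """
    if len(points) < k:
        # With fewer points than clusters, empty clusters are unavoidable
        raise ValueError(f"need at least k={k} points, got {len(points)}")
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels, converged = None, False
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            converged = True  # assignments are stable
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean(members)
            else:
                # Empty cluster: reseed it with the point farthest from its own center
                far = max(range(len(points)),
                          key=lambda i: sq_dist(points[i], centers[labels[i]]))
                centers[j] = points[far]
    return labels, centers, converged


pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels, centers, converged = bounded_kmeans(pts, k=2)
```

If `converged` comes back `False`, the run hit the cap instead of settling, which is the case I'd want to be able to detect on the problematic datasets.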


Hi, Thanks for the awesome library!
So I am running KMeans on lots of different datasets, which all have roughly four shapes, so I initialize with those shapes and it works well, except for just a few times. There are a few datasets that look different enough that I end up with empty clusters and the algorithm just hangs ("Resumed because of empty cluster" again and again).
I conceptually understand why this happens, but do you know of any way to avoid it, or at least to make the algorithm finish? I'm not sure I understand what's going on behind the scenes well enough to debug any further. Thank you!