A robust clustering model [17] creates clusters with high intra-cluster similarity and low inter-cluster similarity. However, cluster quality can be difficult to define, and your choice of linkage criterion and number of clusters can significantly affect your results. When building a clustering model, try different options and select those that best help you explore and reveal patterns in the dataset for further consideration. Factors to consider [18] include:
- The number of clusters that are practical or logical for the dataset (given dataset size, cluster shapes, noise and so on)
- Statistics, such as the mean, maximum and minimum values for each cluster
- The best dissimilarity metric or linkage criterion to apply
- The impact of any outliers or outcome variables
- Any specific domain or dataset knowledge
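Several of the factors above can be checked side by side in a few lines. The following is a minimal sketch, assuming NumPy and SciPy are available. The toy dataset and the helper name `cluster_summary` are hypothetical and chosen only for illustration. It fits the same data with several linkage criteria and reports the size, mean, minimum and maximum of each cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two well-separated 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])


def cluster_summary(X, method, n_clusters):
    """Cluster X hierarchically and summarize each resulting cluster."""
    # Build the merge tree with the given linkage criterion
    # (Euclidean distance is SciPy's default metric).
    Z = linkage(X, method=method)
    # Cut the tree so that at most n_clusters flat clusters remain.
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    summary = {}
    for label in np.unique(labels):
        members = X[labels == label]
        summary[int(label)] = {
            "size": len(members),
            "mean": members.mean(axis=0),
            "min": members.min(axis=0),
            "max": members.max(axis=0),
        }
    return labels, summary


# Compare how different linkage criteria partition the same data.
for method in ("single", "complete", "average", "ward"):
    labels, summary = cluster_summary(X, method, n_clusters=2)
    sizes = sorted(stats["size"] for stats in summary.values())
    print(f"{method:>8}: cluster sizes {sizes}")
```

On real data the criteria often disagree, and very unbalanced cluster sizes (for example, one large cluster plus a few singletons under single linkage) can be a hint that outliers are driving the result.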
Other methods to help determine the optimal number of clusters [19] include:
- The elbow method, where you plot the within-cluster sum of squares against the number of clusters and determine the "elbow" (the point where the plot levels off)
- The gap statistic, where you compare the actual within-cluster sum of squares to the within-cluster sum of squares expected under a null reference distribution and choose the number of clusters with the largest gap
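Both methods can be sketched on top of a single hierarchical clustering. The example below is an illustrative sketch, not a definitive implementation. It assumes NumPy and SciPy, Ward linkage, a hypothetical three-blob dataset, and a null reference drawn uniformly from the data's bounding box. The helper names `within_cluster_ss`, `wcss_curve` and `gap_statistic` are made up for this example. The gap is computed on the log scale, and the simple "largest gap" rule from the list above is used to pick the number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


def wcss_curve(X, ks, method="ward"):
    """Within-cluster sum of squares for each number of clusters in ks."""
    Z = linkage(X, method=method)  # build the tree once, cut it many times
    return np.array([
        within_cluster_ss(X, fcluster(Z, t=k, criterion="maxclust"))
        for k in ks
    ])


def gap_statistic(X, ks, n_refs=10, seed=0):
    """Mean log-WCSS of uniform reference data minus log-WCSS of X."""
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    log_w = np.log(wcss_curve(X, ks))
    ref_log_w = np.empty((n_refs, len(ks)))
    for b in range(n_refs):
        # Assumed null reference: uniform over the bounding box of X.
        reference = rng.uniform(low, high, size=X.shape)
        ref_log_w[b] = np.log(wcss_curve(reference, ks))
    return ref_log_w.mean(axis=0) - log_w


# Hypothetical toy data: three well-separated 2-D blobs of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

ks = list(range(1, 7))
wcss = wcss_curve(X, ks)   # elbow method: plot or inspect wcss against ks
gaps = gap_statistic(X, ks)
best_k = ks[int(np.argmax(gaps))]

for k, w, g in zip(ks, wcss, gaps):
    print(f"k={k}  WCSS={w:10.2f}  gap={g:6.3f}")
print("k with the largest gap:", best_k)
```

To apply the elbow method, plot `wcss` against `ks` and look for the point where the curve flattens. Keep every value in `ks` well below the number of points, since a clustering made up entirely of singletons has a within-cluster sum of squares of zero and its logarithm is undefined.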