I have an array of images, and a function that calculates the distance between two images.
I wish to cluster the images based on this distance, so that each cluster contains images that are all at a short distance from each other.
So only the distance between two images can be used to form the clusters; an image has no usable properties on its own. As many common clustering algorithms expect objects to have usable properties, it seems I cannot use these algorithms.
To limit computational complexity, I've introduced a few thresholds to decide if two images may end up in the same cluster (see the sketch after this list):
- calculated distance (I want to optimize for shortest distance, but above a certain distance, images are not close enough to ever share a cluster)
- input array distance (e.g. an image at index 23 may only cluster with images up to index 123)
- creation date interval (images created more than N hours apart may not share a cluster)
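
To make these thresholds concrete, here is a minimal sketch of the pruning I have in mind; the `Image` fields, the `distance` signature and the parameter names are placeholders, not my actual code:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

@dataclass
class Image:
    # Placeholder record: only the fields the thresholds need.
    index: int            # position in the input array
    created_at: datetime  # creation timestamp

def candidate_pairs(
    images: List[Image],
    distance: Callable[[Image, Image], float],  # the existing distance function
    max_distance: float,     # threshold 1: calculated distance
    max_index_gap: int,      # threshold 2: input array distance (e.g. 100)
    max_hours_apart: float,  # threshold 3: creation date interval
) -> List[Tuple[int, int, float]]:
    """Return (i, j, distance) for pairs that pass all three thresholds."""
    pairs = []
    for i, a in enumerate(images):
        # Cheap filters first: only look ahead within the index window.
        for j in range(i + 1, min(i + 1 + max_index_gap, len(images))):
            b = images[j]
            if abs(a.created_at - b.created_at) > timedelta(hours=max_hours_apart):
                continue
            d = distance(a, b)  # the expensive call comes last
            if d <= max_distance:
                pairs.append((i, j, d))
    return pairs
```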
Some details:
- There can be up to 100k images.
- Roughly 25-50% of images will be 'unique' and will not end up in a cluster with others. Of the remainder, most are expected to end up in clusters of fewer than 10 images, with about 5% of outliers above that.
- The maximum cluster size is something I want to set somewhere between 10 and 50.
Given the above and the nature of the image set, a graph (image: vertex, calculated distance: edge) would consist of disjoint components of up to a few hundred images each. So the algorithm only needs to cluster sets of well below 1000 images.
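
For illustration only, these disjoint components could be found with a simple union-find over the pairs that survive the thresholds (reusing the `pairs` output from the sketch above); this is just to show the structure I expect, not a proposed solution:

```python
from typing import List, Tuple

def connected_components(n_images: int,
                         pairs: List[Tuple[int, int, float]]) -> List[List[int]]:
    """Group image indices into disjoint components (union-find over the
    surviving pairs). Singleton components are the 'unique' images."""
    parent = list(range(n_images))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j, _ in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    components = {}
    for i in range(n_images):
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```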
The clusters must be optimised for distance, such that closer images end up together in a cluster. I'm aware that parameters like minimum and maximum cluster size would be needed; it's a trade-off between creating a few large clusters with a higher maximum distance, or more, smaller clusters with images that are closer together.
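
For example, the quantity being traded off is essentially the cluster 'diameter', i.e. the largest pairwise distance inside a cluster. A hypothetical helper (reusing the placeholder `Image` type and `distance` function from the sketches above) to make that explicit:

```python
from typing import Callable, List

def cluster_diameter(cluster: List[int], images: List["Image"],
                     distance: Callable[["Image", "Image"], float]) -> float:
    """Largest pairwise distance inside a candidate cluster: the quantity
    that grows when clusters are allowed to get bigger."""
    members = [images[i] for i in cluster]
    return max(
        (distance(a, b) for k, a in enumerate(members) for b in members[k + 1:]),
        default=0.0,  # a singleton cluster has diameter 0
    )
```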
What kind of algorithm(s) could I use?