I have an array of images, and a function that calculates the distance between two images.
I wish to cluster the images based on this distance, so that each cluster contains images that are all at a short distance from each other.
So only the distance between two images can be used to form the clusters; an image has no usable properties on its own. As many common clustering algorithms expect objects to have usable properties, it seems I cannot use these algorithms.
To limit computational complexity, I've introduced a few thresholds to decide if two images may end up in the same cluster (see the sketch after this list):
- calculated distance (I want to optimize for shortest distance, but above a certain distance, images are not close enough to ever share a cluster)
- input array distance (e.g. an image at index 23 may only cluster with images up to index 123)
- creation date interval (images created more than N hours apart may not share a cluster)
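
To make these thresholds concrete, here is a minimal sketch of the pruning I have in mind; the `Image` fields, the `distance` signature and the parameter names are placeholders, not my actual code:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

@dataclass
class Image:
    # Placeholder record: only the fields the thresholds need.
    index: int            # position in the input array
    created_at: datetime  # creation timestamp

def candidate_pairs(
    images: List[Image],
    distance: Callable[[Image, Image], float],  # the existing distance function
    max_distance: float,     # threshold 1: calculated distance
    max_index_gap: int,      # threshold 2: input array distance (e.g. 100)
    max_hours_apart: float,  # threshold 3: creation date interval
) -> List[Tuple[int, int, float]]:
    """Return (i, j, distance) for pairs that pass all three thresholds."""
    pairs = []
    for i, a in enumerate(images):
        # Cheap filters first: only look ahead within the index window.
        for j in range(i + 1, min(i + 1 + max_index_gap, len(images))):
            b = images[j]
            if abs(a.created_at - b.created_at) > timedelta(hours=max_hours_apart):
                continue
            d = distance(a, b)  # the expensive call comes last
            if d <= max_distance:
                pairs.append((i, j, d))
    return pairs
```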
Some details:
- There can be up to 100k images.
- Roughly 25-50% of images will be 'unique' and will not end up in a cluster with others. Of the remainder, most are expected to end up in clusters of fewer than 10 images, with about 5% of outliers above that.
- The maximum cluster size is something I want to set somewhere between 10 and 50.
Given the above and the nature of the image set, a graph (image: vertex, calculated distance: edge) would consist of disjoint components of up to a few hundred images each. So the algorithm only needs to cluster sets of well below 1000 images.
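
For illustration only, these disjoint components could be found with a simple union-find over the pairs that survive the thresholds (reusing the `pairs` output from the sketch above); this is just to show the structure I expect, not a proposed solution:

```python
from typing import List, Tuple

def connected_components(n_images: int,
                         pairs: List[Tuple[int, int, float]]) -> List[List[int]]:
    """Group image indices into disjoint components (union-find over the
    surviving pairs). Singleton components are the 'unique' images."""
    parent = list(range(n_images))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j, _ in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    components = {}
    for i in range(n_images):
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```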
The clusters must be optimised for distance, such that closer images end up together in a cluster. I'm aware that parameters like minimum and maximum cluster size would be needed; it's a trade-off between creating a few large clusters with a higher maximum distance, or more, smaller clusters with images that are closer together.
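
For example, the quantity being traded off is essentially the cluster 'diameter', i.e. the largest pairwise distance inside a cluster. A hypothetical helper (reusing the placeholder `Image` type and `distance` function from the sketches above) to make that explicit:

```python
from typing import Callable, List

def cluster_diameter(cluster: List[int], images: List["Image"],
                     distance: Callable[["Image", "Image"], float]) -> float:
    """Largest pairwise distance inside a candidate cluster: the quantity
    that grows when clusters are allowed to get bigger."""
    members = [images[i] for i in cluster]
    return max(
        (distance(a, b) for k, a in enumerate(members) for b in members[k + 1:]),
        default=0.0,  # a singleton cluster has diameter 0
    )
```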
What kind of algorithm(s) could I use?