Revisions to Algorithm to identify degree of similarity based on prior categorization

added 177 characters in body

edited Oct 2, 2015 at 20:27

Sirisian's answer started along the right path, but then he lost me, so here is my take on the issue.

The fact that you are dealing with images is irrelevant, and the fact that one of the properties must be identical is also irrelevant. They are both red herrings.

Basically, what you have is a set of entities where each entity has a list of N properties, and you have a separate entity which we will call the "reference" entity, and you want to search through that set to find entities that have as dissimilar as possible values from the reference entity.

(The fact that some property value must be identical is irrelevant because we will just redefine the problem to exclude that property and to consider only entities that already have the value of that property identical.)

So, first of all you need to compute the N-dimensional vector of the values of your reference entity. This is essentially the coordinates of a point in N-dimensional space. (If you want, you can imagine that N = 3, so you can be thinking of the problem in 3-dimensional space.)

Then, you need to loop through the rest of your entities, and for each entity you need to calculate its N-dimensional vector, (imagine another point in 3-dimensional space,) then you need to calculate the difference between the reference vector and this vector, which is going to be another N-vector, and then you need to take the absolute value, a.k.a. magnitude of that vector, which will be a single value (the distance between the two points), and store it.

Once all the magnitudes have been collected, you need to sort all your entities by this magnitude value, and your desired results will be gathered near the end of the list, where the largest magnitudes will be.

This is the standard way of solving the bulk of your problem, and I would strongly advise solving it precisely like that, so as to be implementing an algorithm which will be understandable by others, and also by yourself, should you revisit it a few months later.

Now, your particular situation has certain peculiarities:

The values of your properties are not nice real numbers, they are sets of discrete values. So, you will need to map them to real numbers. A range from 0 to 1 will work just fine, as Sirisian suggested. You can probably combine your discrete color and your discrete brightness together into three property values, whether they be HSV, HSL, or perhaps even RGB, I don't think it will matter much.
You want a variable number of results, and you want some randomness. So, select a percentage of entities at the end of the sorted list, and choose a specific number from them, at random.

Sirisian's answer started along the right path, but then he lost me, so here is my take on the issue.

The fact that you are dealing with images is irrelevant, and the fact that one of the properties must be identical is also irrelevant. They are both red herrings. All I see here is entities with properties, and the problem can easily be redefined to exclude the property which must be identical, and to only consider the subset of entities that do in fact have the same value on that excluded property.

So, basically, what you have is a set of entities, each with a list of N properties, plus a separate entity which we will call the "reference" entity, and you want to search through that set for the entities whose property values are as dissimilar as possible from those of the reference entity.

Essentially, you want an algorithm that finds the points at maximum distance from a given point in N-dimensional space.

First of all, you need to compute the N-dimensional vector of the values of your reference entity. This is essentially the coordinates of a point in N-dimensional space. (If you want, you can imagine that N = 3, so that you can think of the problem in 3-dimensional space.)

Then, you need to loop through the rest of your entities, and for each entity calculate its N-dimensional vector (imagine another point in 3-dimensional space). Next, calculate the difference between the reference vector and this vector, which is going to be another N-vector, and then take the absolute value, a.k.a. magnitude, of that difference vector. This will be a single value, the distance between the two points; store it.

Once all the magnitudes have been collected, you need to sort all your entities by this magnitude value, and your desired results will be gathered near the end of the list, where the largest magnitudes will be.
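The distance-and-sort steps described so far can be sketched in a few lines of Python. This is only an illustration, assuming each entity has already been reduced to a tuple of numbers; the names `distance`, `rank_by_distance`, and the sample entities are made up for the example and do not come from the question.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Square each component of the difference vector, sum, take the root:
    # this is the magnitude of (a - b).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(reference, entities):
    """Return (distance, name) pairs sorted in ascending order of distance,
    so the most dissimilar entities end up at the end of the list."""
    scored = [(distance(reference, vec), name) for name, vec in entities.items()]
    scored.sort(key=lambda pair: pair[0])
    return scored

# Hypothetical data: three entities with N = 3 properties each.
reference = (0.0, 0.0, 0.0)
entities = {
    "a": (1.0, 0.0, 0.0),
    "b": (0.0, 3.0, 4.0),
    "c": (0.5, 0.5, 0.5),
}
ranked = rank_by_distance(reference, entities)
most_dissimilar = ranked[-1]  # the entity farthest from the reference
```

Sorting in ascending order is what puts the largest magnitudes at the end of the list; sorting in descending order and reading from the front would work equally well.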

This is the standard way of solving the bulk of your problem, and I would strongly advise solving it precisely like that, so that you end up with an algorithm that is understandable by others, and also by yourself, should you revisit it a few months later.

Now, your particular situation has certain peculiarities:

The values of your properties are not nice real numbers; they are sets of discrete values. So, you will need to map them to real numbers. A range from 0 to 1 will work just fine, as Sirisian suggested. You can probably combine your discrete color and your discrete brightness together into three property values, whether they be HSV, HSL, or perhaps even RGB; I don't think it will matter much.
You want a variable number of results, and you want some randomness. So, select a percentage of entities at the end of the sorted list, and choose a specific number from them, at random.
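Both peculiarities can be handled with small helpers. The sketch below is one possible reading, not the only one: it assumes each discrete property has a meaningful ordering so that its values can be spread evenly over the range 0 to 1, and it assumes the list of entities is already sorted by ascending distance. The names `to_unit_range`, `pick_dissimilar`, and `BRIGHTNESS` are hypothetical.

```python
import math
import random

def to_unit_range(value, ordered_values):
    """Map a discrete value to an evenly spaced number in [0, 1],
    based on its position in an ordered list of possible values."""
    index = ordered_values.index(value)
    if len(ordered_values) == 1:
        return 0.0
    return index / (len(ordered_values) - 1)

def pick_dissimilar(ranked, tail_fraction, count, rng=random):
    """Choose `count` entries at random from the last `tail_fraction`
    of a list that is sorted by ascending distance."""
    if count <= 0:
        return []
    # Make the tail at least `count` long so there is enough to choose from.
    tail_size = max(count, math.ceil(len(ranked) * tail_fraction))
    tail = ranked[-tail_size:]
    return rng.sample(tail, min(count, len(tail)))

# Hypothetical discrete property with three ordered levels.
BRIGHTNESS = ["dark", "medium", "bright"]
medium = to_unit_range("medium", BRIGHTNESS)

# Stand-in for a sorted list of entities: 10 items, farthest last.
ranked = list(range(10))
chosen = pick_dissimilar(ranked, 0.3, 2, random.Random(0))
```

Passing a seeded `random.Random` instance makes the selection reproducible for testing; in normal use the default module-level generator gives a different selection each time.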

added 177 characters in body

edited Oct 2, 2015 at 20:19

Mike Nakis

Sirisian's answer started along the right path, but then he lost me, so here is my take on the issue.

The fact that you are dealing with images is irrelevant, and the fact that one of the properties must be identical is also irrelevant. They are both red herrings.

Basically, what you have is a set of entities where each entity has a list of N properties, and you have a separate entity which we will call the "reference" entity, and you want to search through that set to find entities that have as dissimilar as possible values from the reference entity.

(The fact that some property value must be identical is irrelevant because we will just redefine the problem to exclude that property and to consider only entities that already have the value of that property identical.)

So, first of all you need to compute the N-dimensional vector of the values of your reference entity. This is essentially the coordinates of a point in N-dimensional space. (If you want, you can imagine that N = 3, so you can be thinking of the problem in 3-dimensional space.)

Then, you need to loop through the rest of your entities, and for each entity you need to calculate its N-dimensional vector, (imagine another point in 3-dimensional space,) then you need to calculate the difference between the reference vector and this vector, which is going to be another N-vector, and then you need to take the absolute value, a.k.a. magnitude of that vector, which will be a single value (the distance between the two points), and store it.

Once all the magnitudes have been collected, you need to sort all your entities by this magnitude value, and your desired results will be gathered near the end of the list, where the largest magnitudes will be.

This is the standard way of solving the bulk of your problem, and I would strongly advise solving it precisely like that, so as to be implementing an algorithm which will be understandable by others, and also by yourself, should you revisit it a few months later.

Now, your particular situation has certain peculiarities:

The values of your properties are not nice real numbers, they are sets of discrete values. So, you will need to map them to real numbers. A range from 0 to 1 will work just fine, as Sirisian suggested. You can probably combine your discrete color and your discrete brightness together into three property values, whether they be HSV, HSL, or perhaps even RGB, I don't think it will matter much.
You want a variable number of results, and you want some randomness. So, select a percentage of entities at the end of the sorted list, and choose a specific number from them, at random.

Sirisian's answer started along the right path, but then he lost me, so here is my take on the issue.

The fact that you are dealing with images is irrelevant, and a red herring.

Basically, what you have is a set of entities where each entity has N properties, one of them is a "reference" entity, and you want to search through that set to find entities that have the same value as the reference entity on one of the properties, (call it the "fixed" property,) and as dissimilar as possible values from the reference entity on the rest of the properties.

Let us call the number of the remaining properties M, meaning that M = N - 1.

So, first of all you need to compute the M-dimensional vector of the values of your reference entity. This is essentially the coordinates of a point in M-dimensional space. You can temporarily imagine that N = 4, therefore M = 3, so you can be thinking of the problem in 3-dimensional space.

Then, you need to loop through the rest of your entities, and for each entity you need to calculate its M-dimensional vector, (imagine another point in 3-dimensional space,) and then you need to calculate and store the distance between the reference vector and this vector. (See https://en.wikipedia.org/wiki/Euclidean_distance#n_dimensions, look for n-dimensions.)

Once all distances have been collected, you need to sort all your entities by this distance value, and your desired results will all be gathered near the end of the list, where the largest distances will be.

This is the standard way of solving the bulk of your problem, and I would strongly advise solving it precisely like that, so as to be implementing an algorithm which will be understandable by others.

Now, your particular situation has certain peculiarities:

The values of your properties are not nice real numbers, they are sets of discrete values. So, you will need to map them to real numbers. A range from 0 to 1 will work just fine, as Sirisian suggested. You can probably combine your discrete color and your discrete brightness together into three property values, whether they be HSV, HSL, or perhaps even RGB, I don't think it will matter much.
You want a variable number of results, and you want some randomness. So, select a percentage of entities at the end of the sorted list, and choose a specific number from them, at random.