Cosine similarity

Question

Is there anything to make the cosine similarity function more optimized in terms of CPU execution time?

double cosine_similarity(double *A, double *B, unsigned int size)
{
    double mul = 0.0, d_a = 0.0, d_b = 0.0 ;

    for(unsigned int i = 0; i < size; ++i) 
    {
        mul += A[i] * B[i] ;
        d_a += A[i] * A[i] ;
        d_b += B[i] * B[i] ;
    }
    return mul / (sqrt(d_a) * sqrt(d_b)) ;
}

Depending on the size of A and B, passing by (const) reference could be faster, since there is less memory to be copied onto the stack. — M.Hoffmann
– M.Hoffmann, Commented Sep 24, 2016 at 11:41
Also you could reverse the loop, as 'for(int i = size; i--){...}', so you dont have to compare i to size in every step. — M.Hoffmann
– M.Hoffmann, Commented Sep 24, 2016 at 11:56
@M.Hoffmann Comments are for seeking clarification to the question. Please write all suggestions as answers. — 200_success
– 200_success, Commented Sep 24, 2016 at 13:49

Jamal · Accepted Answer · 2016-09-27 15:30:10Z

Check for Possible Errors

There are a number of possible errors in the code:

The variable size may be zero which will lead to division by zero.
The vectors A and B may not be the same length, if the size variable is the size of the shorter vector this is not a problem, but if the size variable is the size of the larger vector this can lead to unknown behavior.
Either d_a or d_b can add up to zero, which can lead to division by zero as @coderodde pointed out.

C++ provides the exception class std:exception that has multiple sub classes. Two of the classes are std::runtime_error and std::logic_error. Item 1 and 2 above belong in the std::logic_error class, item 3 belongs in the std::runtime_error class.

Code Readability and Maintainability

The code has:

double mul = 0.0, d_a = 0.0, d_b = 0.0 ;

This is less readable and maintainable than putting the initialization on separate lines:

double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;

Direct Addressing Versus Indirect Addressing

The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers use the pointers rather than indexing them.

double cosine_similarity(double *A, double *B, unsigned int size)
{
    double mul = 0.0;
    double d_a = 0.0;
    double d_b = 0.0 ;

    for(unsigned int i = 0; i < size; ++i)
    {
        mul += *A * *B;
        d_a += *A * *A;
        d_b += *B * *B;
        A++;
        B++;
    }

    if (d_a == 0.0f || d_b == 0.0f)
    {
        throw std::logic_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a) * sqrt(d_b)) ;
}

Using Container Classes Can be Safer as Well as Efficient

The container classes such as std::vector and std::array provide safer containers than C style arrays. If the number of elements in the array are know, the std::array class is more efficient. If the number of elements in the array are not known std::vector provides a flexible as well as convenient way to store values.

These container classes provide additional information such as the size of the array/vector and iterators for direct addressing. They also allow for indexing through the 'array'.

#include <math.h>
#include <vector>
#include <stdexcept>

double cosine_similarity_vectors(std::vector<double> A, std::vector<double>B)
{
    double mul = 0.0;
    double d_a = 0.0;
    double d_b = 0.0 ;

    if (A.size() != B.size())
    {
        throw std::logic_error("Vector A and Vector B are not the same size");
    }

    // Prevent Division by zero
    if (A.size() < 1)
    {
        throw std::logic_error("Vector A and Vector B are empty");
    }

    std::vector<double>::iterator B_iter = B.begin();
    std::vector<double>::iterator A_iter = A.begin();
    for( ; A_iter != A.end(); A_iter++ , B_iter++ )
    {
        mul += *A_iter * *B_iter;
        d_a += *A_iter * *A_iter;
        d_b += *B_iter * *B_iter;
    }

    if (d_a == 0.0f || d_b == 0.0f)
    {
        throw std::logic_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a) * sqrt(d_b));
}

As far as i know you can put B_iter and A_iter into the for loop as they are the same type. for( std::vector<double>::iterator A_iter = A.begin(), B_iter = B.begin(); A_iter != A.end(); A_iter++ , B_iter++ ) — miscco
– miscco, Commented Sep 25, 2016 at 20:50

coderodde · Accepted Answer · 2016-09-24 13:43:48Z

8

Since $$\sqrt{D_a} \sqrt{D_b} = \sqrt{D_a D_b}$$ you can get rid of one call to sqrt

double cosine_similarity2(double *A, double *B, unsigned int size)
{
    double mul = 0.0, d_a = 0.0, d_b = 0.0 ;

    for(unsigned int i = 0; i < size; ++i)
    {
        mul += A[i] * B[i] ;
        d_a += A[i] * A[i] ;
        d_b += B[i] * B[i] ;
    }

    if (d_a == 0 || d_b == 0)
    {
        throw runtime_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a * d_b)) ;
}

Minor

Usually people don't put a single space before the semicolon ;. However, usually a single space is put straight after for:

for (int i = 0; ..., ++i) 
{  ^
    ...
}

Divide by zero

Note that sqrt(d_a * d_b) == 0 only when at least one of the input vectors is a zero vector. However, cosine similarity assumes that the two input vectors are not zero-vectors, so it makes sense to me to throw an exception when it happens.

Divide by zero ++

As 200_success advised, you can omit the entire exception-throwing code and just proceed to division by zero: that will return a special NaN value ("Not a Number").

edited Sep 24, 2016 at 13:43

answered Sep 24, 2016 at 10:10

coderodde

32k15 gold badges78 silver badges204 bronze badges

\$\begingroup\$ How can I get rid of divide by zero as (d_a * d_b) can be zero if all the elements of the two array is all zero? \$\endgroup\$

MD. Nazmul Kibria
– MD. Nazmul Kibria

2016-09-24 10:16:57 +00:00
Commented Sep 24, 2016 at 10:16
1

\$\begingroup\$ 1) I'm not convinced that an exception would be better than a NaN result. 2) sqrt(a * b) is not the same as sqrt(a) * sqrt(b) if a or b is negative. \$\endgroup\$

200_success
– 200_success

2016-09-24 13:39:20 +00:00
Commented Sep 24, 2016 at 13:39
1

\$\begingroup\$ @200_success Take another look at \$a\$ or \$b\$. They can't be negative. However, thank you for your suggestion; I will add it to my answer. \$\endgroup\$

coderodde
– coderodde

2016-09-24 13:41:48 +00:00
Commented Sep 24, 2016 at 13:41
\$\begingroup\$ @200_success I am aware of existence of NaN, yet I would personally choose to avoid it since it "infects" everything it comes to operate with, yielding (silently) a NaN result. \$\endgroup\$

coderodde
– coderodde

2016-09-24 13:46:33 +00:00
Commented Sep 24, 2016 at 13:46
\$\begingroup\$ You might also want to benchmark the results of splitting the single loop into three, if this function will be used on large vectors or very frequently. \$\endgroup\$

Alex Reinking
– Alex Reinking

2016-09-24 19:08:27 +00:00
Commented Sep 24, 2016 at 19:08

Add a comment |

Jamal · Accepted Answer · 2016-09-27 15:27:57Z

1

Calculating cosine similarity between \$n\$ things only require calculating your square root (\$d\$) for \$n\$ times instead of \$n \space x \space n\$ times.

If you write a function that calculate cosine similarity between n things instead of just two things, you can save some time on calculating the square root.

Alternatively, write a function that calculate cosine similarity between two things with an array of the square roots as an argument.

edited Sep 27, 2016 at 15:27

Jamal

35.2k13 gold badges134 silver badges238 bronze badges

answered Sep 27, 2016 at 13:52

hamster on wheels

3483 silver badges11 bronze badges

Add a comment |

Stack Exchange Network

Cosine similarity

3 Answers 3

You must log in to answer this question.

Hot Network Questions

Cosine similarity

3 Answers 3

You must log in to answer this question.

Related

Hot Network Questions