1

Suppose that I have an array

a = np.array([[1,2.5,3,4],[1, 2.5, 3,3]])

I want to find the mode of each column without using stats.mode().

The only way I can think of is the following:

result = np.zeros(a.shape[1])
for i in range(len(result)):
    curr_col = a[:,i]
    result[i] = curr_col[np.argmax(np.unique(curr_col, return_counts = True))]

update: There is some error in the above code and the correct one should be:

   values, counts = np.unique(a[:,i], return_counts = True)
   result[i] = values[np.argmax(counts)]

I have to use the loop because np.unique does not output compatible result for each column and there is no way to use np.bincount because the dtype is not int.

2
  • if you try with a=np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]]) you will get the result=[1. 5. 3. 4.] which is not correct. The last column mode is 3. Commented Jun 29, 2020 at 21:27
  • Thanks for pointing that out and it should be correct now. Commented Jun 29, 2020 at 22:09

1 Answer 1

1

If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying todo is to find the value associated to the highest count:

import numpy as np
a = np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
  values, counts = np.unique(a[:,i], return_counts = True)
  result[i] = values[np.argmax(counts)]
print(result)

Output:

% python3 script.py
[1. 5. 3. 4.]

Here is a code tha compares your solution with the scipy.stats.mode function:

import numpy as np
import scipy.stats as sps
import time

a = np.random.randint(1,100,(100,100))

t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
  values, counts = np.unique(a[:,i], return_counts = True)
  result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time()-t_start), 's')

t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time()-t_start), 's')

print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')

Output:

% python3 script.py
Timer 1:  0.002721071243286133 s
Timer 2:  0.003339052200317383 s
Matrices are equal!

I tried several values for parameters and your code is actually faster than scipy.stats.mode function so it is probably close to optimal.

Sign up to request clarification or add additional context in comments.

3 Comments

Thank you!!! That is a very thorough answer. Do you know what is a good rule of thumb in telling whether the code can be optimized or not? I am new to numpy, so every time I write a piece of code I start to worry whether it is efficient or not but I just have no good way to tell.
You are welcome. I think it's a very difficult question. In general, depending on the application, you should stop optimizing at soon as the computation time is acceptable because optimization often implies a complexification of the code. However you have a good state of mind because you should always worry about the efficiency of your code. If you come back to the code above, the optimization lies mainly in the np.unique function that probably use some tree structure to perform counts of an unsorted array. For instance, if your array was sorted, may have been able to go even faster.
You may start to implement a code that works. Often it possible to cut the problem into known subproblems like you did here with np.unique. You may always focus on complexity instead of computation time: here np.unique have a O(N log(N)) for unsorted arrays and O(N) for sorted arrays for which it is way easier to compute counts. If you are dealing with large quantity of data (if a is a matrix of size 1000000x1000000), probably the next step will be to work on parallelization on multiple CPUs or GPUs but this will complexify the code a lot.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.