Given the following array:

import numpy as np
a = np.array([[1,2,3],[4,5,6],[7,8,9]])

[[1 2 3]
 [4 5 6]
 [7 8 9]]

How can I replace certain values with other values?

bad_vals = [4, 2, 6]
update_vals = [11, 1, 8]

I currently use:

for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]

Which gives:

[[ 1  1  3]
 [11  5  8]
 [ 7  8  9]]

But it is rather slow for large arrays with many values to be replaced. Is there any good alternative?

The input array can be changed to anything (list of lists/tuples) if that is necessary to access certain speedy black magic.

EDIT:

Based on the great answers from @Divakar and @charlysotelo, I did a quick comparison on my real use-case data using the benchit package. My input array has a rows:columns ratio of roughly 100:1, and the number of replacement values is on the order of 3 x the number of rows.

Functions:

# current approach
def enumerate_values(a, bad_vals, update_vals):
    for idx, v in enumerate(bad_vals):
        a[a==v] = update_vals[idx]
    return a

# provided solution @Divakar
def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

# provided solution @charlysotelo
def vectorize_values(a, bad_vals, update_vals):
    bad_to_good_map = {}
    for idx, bad_val in enumerate(bad_vals):
        bad_to_good_map[bad_val] = update_vals[idx]
    f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
    a = f(a)

    return a

# define benchit input functions
import benchit
funcs = [enumerate_values, map_values, vectorize_values]

# define benchit input variables to bench against
in_ = {
    n: (
        np.random.randint(0,n*10,(n,int(n * 0.01))), # array
        np.random.choice(n*10, n*3,replace=False), # bad_vals
        np.random.choice(n*10, n*3) # update_vals
    ) 
    for n in [300, 1000, 3000, 10000, 30000]
}

# do the bench
# note: timing the slow approaches (my own function here) takes a while
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, grid=False)

[Plot: benchit timings]

  • Are the values (positive) integral? Can we thus make a list like [0,1,1,3,11,5,8] (that thus defines the mapping) Commented Jun 3, 2020 at 21:52
  • You could use answers from Fast replacement of values in a numpy array by making a dictionary from bad_vals and update_vals. Commented Jun 3, 2020 at 21:57
  • @WillemVanOnsem Yes, all values are positive integers Commented Jun 3, 2020 at 21:58
  • @Divakar, yes will give! Had to sleep a bit.. Commented Jun 4, 2020 at 7:17

2 Answers

Here's one way, based on the mapping-array method hinted at in the comments, for positive integers -

def map_values(a, bad_vals, update_vals):
    N = max(a.max(), max(bad_vals))+1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    out = mapar[a]
    return out

Sample run -

In [94]: a
Out[94]: 
array([[1, 2, 1],
       [4, 5, 6],
       [7, 1, 1]])

In [95]: bad_vals
Out[95]: [4, 2, 6]

In [96]: update_vals
Out[96]: [11, 1, 8]

In [97]: map_values(a, bad_vals, update_vals)
Out[97]: 
array([[ 1,  1,  1],
       [11,  5,  8],
       [ 7,  1,  1]])
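As a variation on the function above (a sketch, not the exact code from this answer), the table can be initialised as an identity mapping with np.arange, so every value maps to itself until it is overwritten. The name map_values_lut is made up for illustration, and the sketch assumes all values are non-negative integers small enough to serve as array indices.

```python
import numpy as np

def map_values_lut(a, bad_vals, update_vals):
    """Replace bad_vals with update_vals via a lookup table (non-negative ints only)."""
    a = np.asarray(a)
    bad_vals = np.asarray(bad_vals)
    # The table must cover every value that is used as an index.
    n = max(a.max(), bad_vals.max()) + 1
    # Identity mapping: lut[v] == v, so untouched values pass through unchanged.
    lut = np.arange(n)
    # Overwrite only the entries that should be replaced.
    lut[bad_vals] = update_vals
    # Integer-array indexing applies the table to every element at once.
    return lut[a]

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
out = map_values_lut(a, [4, 2, 6], [11, 1, 8])
print(out)
# [[ 1  1  3]
#  [11  5  8]
#  [ 7  8  9]]
```

Note that a lookup table applies all replacements at once, whereas the loop in the question applies them one after another. The two agree on this sample, but they can differ when a value in update_vals also appears later in bad_vals.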

Benchmarking

# Original soln
def replacevals(a, bad_vals, update_vals):
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out==v] = update_vals[idx]
    return out

The given sample had a 2D input of shape n x n with n values to be replaced. Let's set up input datasets with the same structure.

Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.

import benchit
funcs = [replacevals, map_values]
in_ = {
    n: (np.random.randint(0, n*10, (n, n)),        # array
        np.random.choice(n*10, n, replace=False),  # bad_vals
        np.random.choice(n*10, n))                 # update_vals
    for n in [3, 10, 100, 1000, 2000]
}
t = benchit.timings(funcs, in_, multivar=True, input_name='Len')
t.plot(logx=True, save='timings.png')
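If benchit is not installed, a rough comparison can also be made with the standard-library timeit module. This is only a sketch under assumptions: the size n = 200 and the random seed are arbitrary choices, and the numbers it prints will vary by machine. The two functions are copied from this answer so the block runs on its own.

```python
import timeit
import numpy as np

def replacevals(a, bad_vals, update_vals):
    # Loop approach from the question, working on a copy.
    out = a.copy()
    for idx, v in enumerate(bad_vals):
        out[out == v] = update_vals[idx]
    return out

def map_values(a, bad_vals, update_vals):
    # Mapping-array approach from this answer.
    N = max(a.max(), max(bad_vals)) + 1
    mapar = np.empty(N, dtype=int)
    mapar[a] = a
    mapar[bad_vals] = update_vals
    return mapar[a]

# Same structure as the benchit inputs: n x n array, n values to replace.
rng = np.random.default_rng(0)  # arbitrary seed
n = 200                         # arbitrary size, kept small so this runs quickly
a = rng.integers(0, n * 10, size=(n, n))
bad_vals = rng.choice(n * 10, size=n, replace=False)
update_vals = rng.choice(n * 10, size=n)

timings = {}
for func in (replacevals, map_values):
    # Best of three single runs.
    timings[func.__name__] = min(
        timeit.repeat(lambda: func(a, bad_vals, update_vals), number=1, repeat=3)
    )
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s")
```

The outputs of the two functions are deliberately not compared here: with random inputs a value in update_vals can also occur in bad_vals, in which case the sequential loop and the one-shot mapping give different results.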

Plot :

[Image: benchit timings plot for replacevals and map_values]

15 Comments

This is a really nice solution. It is 740X faster than my solution for my real use case. Thanks for sharing this. The benchit package is nice too. Let me see if I can combine it with the other solution (which was 55X faster than my approach) in a chart and update my question with this. Thanks again!
@Mattijn Yeah you can just add any other approach into funcs = [replacevals, map_values] with the function name(s). Should be convenient that way. Would like to see your chart(s), if you would like to share.
@Divakar--Benchit looks interesting. How does benchit compare to Perfplot which I have used? Any advantages/disadvantages?
@Divakar--OK, will give it a try for my next benchmark. Two advantages I see benchit has are: 1) it shows the test environment information on the top left of the screen, 2) it has a nicer grid (horizontal & vertical) to display the results.
@Divakar--Thanks! I was able to run your basic test on the online Python with the new release. Adding `bench = "^0.0.3"` to the Python spec file is needed for it to load benchit and its dependencies, although it still loads bench-it as well.

This really depends on the size of your array, and the size of your mappings from bad to good integers.

For a larger number of bad to good integers - the method below is better:

import numpy as np
import time

ARRAY_ROWS = 10000
ARRAY_COLS = 1000

NUM_MAPPINGS = 10000

bad_vals = np.random.rand(NUM_MAPPINGS)
update_vals = np.random.rand(NUM_MAPPINGS)

bad_to_good_map = {}
for idx, bad_val in enumerate(bad_vals):
    bad_to_good_map[bad_val] = update_vals[idx]

# np.vectorize with mapping
# Takes about 4 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
f = np.vectorize(lambda x: (bad_to_good_map[x] if x in bad_to_good_map else x))
start = time.perf_counter()
a = f(a)
print(time.perf_counter() - start)  # elapsed seconds


# Your way
# Takes about 60 seconds
a = np.random.rand(ARRAY_ROWS, ARRAY_COLS)
start = time.perf_counter()
for idx, v in enumerate(bad_vals):
    a[a==v] = update_vals[idx]
print(time.perf_counter() - start)  # elapsed seconds

Running the code above, the np.vectorize(lambda) approach finished in less than 4 seconds, whereas your approach took almost 60 seconds. However, with NUM_MAPPINGS set to 100, your method takes less than a second for me, which is faster than the 2 seconds for the np.vectorize approach.
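When the values are not small non-negative integers (the example above draws floats from np.random.rand), a table indexed by value is not an option. One vectorized alternative, which is not part of this answer, is to sort the keys and locate each element with np.searchsorted. The sketch below uses a made-up function name, replace_sorted, and assumes bad_vals is non-empty and contains no duplicates.

```python
import numpy as np

def replace_sorted(a, bad_vals, update_vals):
    """Replace exact matches of bad_vals in a, using binary search on sorted keys."""
    a = np.asarray(a)
    bad_vals = np.asarray(bad_vals)
    update_vals = np.asarray(update_vals)
    # Sort the keys once, keeping the replacements aligned with them.
    order = np.argsort(bad_vals)
    keys = bad_vals[order]
    vals = update_vals[order]
    # Candidate position of every element of `a` within the sorted keys.
    pos = np.searchsorted(keys, a)
    # Clamp so elements larger than every key do not index out of bounds.
    pos = np.clip(pos, 0, len(keys) - 1)
    # Only exact matches are replaced; everything else is left as-is.
    hit = keys[pos] == a
    return np.where(hit, vals[pos], a)

a_int = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(replace_sorted(a_int, [4, 2, 6], [11, 1, 8]))
# [[ 1  1  3]
#  [11  5  8]
#  [ 7  8  9]]

a_float = np.array([[0.5, 1.5], [2.5, 0.5]])
print(replace_sorted(a_float, [2.5, 0.5], [9.0, 7.0]))
# [[7.  1.5]
#  [9.  7. ]]
```

Like the dictionary lookup, this applies all replacements in one pass rather than one after another, and its cost grows with the logarithm of the number of mappings rather than linearly.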

1 Comment

Thanks a lot for sharing your solution, which provided a 55X speedup compared to my solution on my real data. While that is amazing, the solution provided by @Divakar had a speedup of 741X. Thanks again!
