Numpy: Get random set of rows from 2D array

Question

I have a very large 2D array which looks something like this:

a=
[[a1, b1, c1],
 [a2, b2, c2],
 ...,
 [an, bn, cn]]

Using numpy, is there an easy way to get a new 2D array with, e.g., 2 random rows from the initial array a (without replacement)?

e.g.

b=
[[a4,  b4,  c4],
 [a99, b99, c99]]

It's silly to have one question for sampling with replacement and one for sampling without; you should just allow both answers, and in fact encourage both.
– Charlie Parker, Commented Jun 19, 2016 at 21:54

Daniel · Accepted Answer · 2016-10-31 02:19:38Z

298

>>> A = np.random.randint(5, size=(10,3))
>>> A
array([[1, 3, 0],
       [3, 2, 0],
       [0, 2, 1],
       [1, 1, 4],
       [3, 2, 2],
       [0, 1, 0],
       [1, 3, 1],
       [0, 4, 1],
       [2, 4, 2],
       [3, 3, 1]])
>>> idx = np.random.randint(10, size=2)
>>> idx
array([7, 6])
>>> A[idx,:]
array([[0, 4, 1],
       [1, 3, 1]])

Putting it together for a general case:

A[np.random.randint(A.shape[0], size=2), :]

For sampling without replacement (NumPy 1.7.0+):

A[np.random.choice(A.shape[0], 2, replace=False), :]

I do not believe there is a good way to generate a random list without replacement before 1.7. Perhaps you can set up a small function that ensures the two values are not the same.
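The difference between the two calls above can be checked with a short sketch. It assumes NumPy 1.7 or later; the array contents and the seed are illustrative only and not part of the original answer.

```python
import numpy as np

np.random.seed(0)  # illustrative seed, only so the run is repeatable

A = np.arange(30).reshape(10, 3)  # stand-in for the real data: 10 rows, 3 columns

# With replacement: the same row index may be drawn more than once.
idx_with = np.random.randint(A.shape[0], size=2)

# Without replacement: the drawn row indices are guaranteed to be distinct.
idx_without = np.random.choice(A.shape[0], 2, replace=False)

rows_with = A[idx_with]
rows_without = A[idx_without]
print(rows_without)
```

With only two draws from ten rows a collision is unlikely either way, but only the second form rules it out.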

edited Oct 31, 2016 at 2:19

answered Jan 10, 2013 at 16:35

Daniel


5 Comments

seberg Over a year ago

There is maybe not a good way, but there is a way that is just as good as np.random.choice: np.random.permutation(A.shape[0])[:2]. It's actually not great, but that is what np.random.choice does at this time... Or, if you don't mind changing your array in place, np.random.shuffle.

denis Over a year ago

Before numpy 1.7, use random.sample( xrange(10), 2 )

Charlie Parker Over a year ago

why are you naming your variables A and B and stuff? it makes it harder to read.

jtlz2 Over a year ago

@CharlieParker Does it? Matrices are often denoted by single capital letters.

zr0gravity7 Over a year ago

Colon-slicing along the second axis is not necessary: A[idx] is equivalent to A[idx, :].

Hezi Resheff · Answer · 2017-04-02 10:12:13Z

80

This is an old post, but this is what works best for me:

A[np.random.choice(A.shape[0], num_rows_2_sample, replace=False)]

Change replace=False to replace=True to get the same thing, but with replacement.

edited Apr 2, 2017 at 10:12

answered Jan 7, 2015 at 8:37

Hezi Resheff


2 Comments

0x24a537r9 Over a year ago

@SalvadorDali I've edited Hezi's post to not choose with replacement. Once the edit is peer-reviewed, you'll see the added replace=False param to choice.

Him Over a year ago

@SalvadorDali why not?

jtlz2 · Answer · 2021-07-08 08:47:56Z

34

This is similar to the answer Hezi Resheff provided, but simplified so that newer Python users can understand what's going on (I noticed many new data science students fetch random samples in the weirdest ways because they don't know what they are doing in Python).

You can get a number of random indices from your array by using:

indices = np.random.choice(A.shape[0], number_of_samples, replace=False)

You can then use fancy indexing with your numpy array to get the samples at those indices:

A[indices]

This will get you the specified number of random samples from your data.
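The two steps can be wrapped in a small function. This is only a sketch: the name sample_rows and the guard against asking for too many rows are hypothetical additions, not part of the original answer.

```python
import numpy as np

def sample_rows(arr, k, replace=False):
    """Return k randomly chosen rows of a 2D array (hypothetical helper)."""
    n_rows = arr.shape[0]
    if not replace and k > n_rows:
        raise ValueError("cannot draw more distinct rows than the array has")
    # Step 1: draw k row indices.
    indices = np.random.choice(n_rows, k, replace=replace)
    # Step 2: fancy indexing returns a copy of the selected rows, not a view.
    return arr[indices]

A = np.arange(20).reshape(5, 4)  # illustrative data with 5 distinct rows
sample = sample_rows(A, 3)
print(sample)
```

Because fancy indexing copies, modifying the returned sample leaves the original array untouched.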

edited Jul 8, 2021 at 8:47

jtlz2


answered Dec 20, 2018 at 10:35

CB Madsen


2 Comments

mins Over a year ago

Seems to be the best solution, and should be the selected answer. "You can then use slicing", typo: fancy indexing.

CB Madsen Over a year ago

@mins "Fancy indexing" is indeed the correct terminology rather than "Slicing". I fixed this. Thank you.

isosceleswheel · Answer · 2015-08-03 18:58:14Z

33

Another option is to create a random mask if you just want to down-sample your data by a certain factor. Say I want to down-sample to 25% of my original data set, which is currently held in the array data_arr:

# generate random boolean mask the length of data
# use p 0.75 for False and 0.25 for True
mask = numpy.random.choice([False, True], len(data_arr), p=[0.75, 0.25])

Now you can call data_arr[mask] and return ~25% of the rows, randomly sampled.
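Since each row is kept independently, the number of rows returned varies from run to run. If an exact count is needed, one possible variant (an assumption on my part, not from the original answer) is to build a mask with exactly k True entries and shuffle it:

```python
import numpy as np

data_arr = np.arange(4000).reshape(1000, 4)  # illustrative data

# Approximate: each row is kept independently with probability 0.25,
# so the number of rows kept is only close to 25%.
mask = np.random.choice([False, True], len(data_arr), p=[0.75, 0.25])
approx_sample = data_arr[mask]

# Exact: build a mask with exactly k True entries, then shuffle it in place.
k = len(data_arr) // 4
exact_mask = np.zeros(len(data_arr), dtype=bool)
exact_mask[:k] = True
np.random.shuffle(exact_mask)
exact_sample = data_arr[exact_mask]

print(approx_sample.shape[0], exact_sample.shape[0])
```

Either way, boolean masking keeps the selected rows in their original order, which differs from index-based sampling.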

answered Aug 3, 2015 at 18:58

isosceleswheel


4 Comments

Sarah Over a year ago

You may want to add replace = False if you don't want sampling with replacement.

isosceleswheel Over a year ago

@Sarah Replacement is not an issue with this sampling method because a True/False value is returned for every position in data_arr. In my example, a random ~25% of the positions will be True and those positions are sampled from data_arr.

Sarah Over a year ago

You are right. We don't need the replace=False. And as you pointed out, the number of records sampled is only approximated and not exact.

Eb Abadi Over a year ago

It's an interesting method. However, the number of sampled rows is an approximate to the desired (as stated in the answer). It may not work if you need exactly k rows sampled.

orli · Answer · 2018-10-19 21:35:49Z

5

I see permutation has been suggested. In fact it can be made into one line:

>>> A = np.random.randint(5, size=(10,3))
>>> np.random.permutation(A)[:2]
array([[0, 3, 0],
       [3, 1, 2]])
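Note that np.random.permutation(A) returns a shuffled copy of the whole array. A sketch of the index-based variant mentioned in the comments on the first answer, which shuffles only the row indices, might look like this (array contents are illustrative):

```python
import numpy as np

A = np.random.randint(5, size=(10, 3))

# Shuffles a copy of the whole array along the first axis, then keeps two rows.
rows_a = np.random.permutation(A)[:2]

# Shuffles only the row indices, then uses them to pick two rows.
idx = np.random.permutation(A.shape[0])[:2]
rows_b = A[idx]

print(rows_a)
print(rows_b)
```

Both forms sample without replacement; the second avoids copying every row when only a few are needed.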

answered Oct 19, 2018 at 21:35

orli


Antiez · Answer · 2021-08-17 08:35:49Z

3

One can generate a random sample from a given array with a random number generator:

>>> rng = np.random.default_rng()
>>> b = rng.choice(a, 2, replace=False)
>>> b
[[a4,  b4,  c4],
 [a99, b99, c99]]
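A runnable sketch of the same idea, assuming NumPy 1.17 or later (where default_rng is available). The seed value and the array contents are arbitrary choices for illustration; seeding makes the draw reproducible.

```python
import numpy as np

a = np.arange(15).reshape(5, 3)  # illustrative data with 5 distinct rows

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for the example
b = rng.choice(a, 2, replace=False, axis=0)  # axis=0 (the default) samples rows

# A second generator with the same seed reproduces the same draw.
rng_again = np.random.default_rng(seed=42)
b_again = rng_again.choice(a, 2, replace=False, axis=0)

print(b)
```

Passing axis=1 instead would sample columns rather than rows.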

answered Aug 17, 2021 at 8:35

Antiez


Ben · Answer · 2018-10-23 11:24:15Z

2

If you want to generate multiple random subsets of rows, for example if you're doing RANSAC, you can argsort a matrix of random numbers and keep the first few indices of each row:

num_pop = 10
num_samples = 2
pop_in_sample = 3
rows_to_sample = np.random.random([num_pop, 5])
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
# will be shape [num_samples, pop_in_sample, 5]
row_subsets = rows_to_sample[samples, :]
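The reason this works is that argsorting a row of random numbers yields a random permutation of the row indices, so the first pop_in_sample entries of each row are distinct. A short check of that property, reusing the names from the snippet above with illustrative sizes:

```python
import numpy as np

num_pop, num_samples, pop_in_sample = 10, 4, 3
rows_to_sample = np.random.random([num_pop, 5])

# Sorting each row of random numbers gives an independent random permutation
# of 0..num_pop-1 per row; the first pop_in_sample entries are distinct indices.
random_numbers = np.random.random([num_samples, num_pop])
samples = np.argsort(random_numbers, axis=1)[:, :pop_in_sample]
row_subsets = rows_to_sample[samples]

# Each subset should contain pop_in_sample different row indices.
all_distinct = all(len(set(s.tolist())) == pop_in_sample for s in samples)
print(row_subsets.shape, all_distinct)
```

Rows are sampled without replacement within each subset, but the same row may appear in several subsets.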

answered Oct 23, 2018 at 11:24

Ben


Snoopy · Answer · 2020-10-21 20:59:47Z

An alternative is to use the choice method of the Generator class (see https://github.com/numpy/numpy/issues/10835):

import numpy as np

# generate the random array
A = np.random.randint(5, size=(10,3))

# use the choice method of the Generator class
# (samples with replacement by default; pass replace=False to avoid repeated rows)
rng = np.random.default_rng()
A_sampled = rng.choice(A, 2)

which gives sampled data such as:

array([[1, 3, 2],
       [1, 2, 1]])

The running times compare as follows:

%timeit rng.choice(A, 2)
15.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit np.random.permutation(A)[:2]
4.22 µs ± 83.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit A[np.random.randint(A.shape[0], size=2), :]
10.6 µs ± 418 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But when the array is large, e.g. A = np.random.randint(10, size=(1000,300)), working on the index is the best way.

%timeit A[np.random.randint(A.shape[0], size=50), :]
17.6 µs ± 657 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit rng.choice(A, 50)
22.3 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.random.permutation(A)[:50]
143 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So the permutation method seems to be the most efficient when the array is small, while indexing with random row indices is the best option when the array is large.
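The %timeit lines above require IPython. For readers who want to repeat the comparison in a plain script, here is a sketch using the standard timeit module; the repeat counts are arbitrary, and absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

A = np.random.randint(10, size=(1000, 300))
rng = np.random.default_rng()

# The three approaches timed in the answer, wrapped as zero-argument callables.
candidates = {
    "randint index": lambda: A[np.random.randint(A.shape[0], size=50), :],
    "Generator.choice": lambda: rng.choice(A, 50),
    "permutation": lambda: np.random.permutation(A)[:50],
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 200 calls, reported as seconds per call.
    timings[name] = min(timeit.repeat(fn, number=200, repeat=3)) / 200
    print(f"{name:>18}: {timings[name] * 1e6:.1f} µs per call")
```

Keep in mind that the first two candidates sample with replacement while the permutation approach does not, so the timings do not compare strictly equivalent operations.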

Kit · Answer · 2017-05-16 22:55:10Z

1

If you need the same rows but just a random sample of them, then:

import random
new_array = random.sample(old_array,x)

Here, x must be an int giving the number of rows you want to pick at random.
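random.sample expects a sequence, so passing a NumPy array directly may fail. One possible workaround, offered as a sketch rather than as the author's method, is to sample row indices with the standard library and then index the array (the data and the value of x are illustrative):

```python
import random
import numpy as np

old_array = np.arange(12).reshape(6, 2)  # illustrative data
x = 3

# Draw x distinct row indices with the standard library, then index the array.
row_indices = random.sample(range(old_array.shape[0]), x)
new_array = old_array[row_indices]

print(new_array)
```

The result is still a NumPy array, and the rows are drawn without replacement.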

answered May 16, 2017 at 22:55

Kit


1 Comment

leermeester Over a year ago

This only works if old_array is a sequence or a set, not a NumPy array (see docs.python.org/3/library/random.html#functions-for-sequences).

Skippy le Grand Gourou · Answer · 2023-01-17 11:11:35Z

1

I am quite surprised that this much easier-to-read solution has not been proposed after more than 10 years:

import random

b = np.array(
    random.choices(a, k=2)
)

Edit : Ah, maybe because it was only introduced in Python 3.6, but still…
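One thing to keep in mind is that random.choices draws with replacement, so it does not meet the "without replacement" requirement of the question. A small sketch that makes this visible by asking for more rows than the array has (the data is illustrative):

```python
import random
import numpy as np

a = np.arange(6).reshape(3, 2)  # only 3 rows

# random.choices draws with replacement, so k may exceed the number of rows,
# and some rows are then necessarily repeated.
b = np.array(random.choices(a, k=5))
distinct_rows = {tuple(row) for row in b.tolist()}

print(b.shape, len(distinct_rows))
```

For distinct rows, np.random.choice or Generator.choice with replace=False, as in the other answers, is the safer option.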

edited Jan 17, 2023 at 11:11

answered Jan 17, 2023 at 11:05

Skippy le Grand Gourou


1 Comment

dawid Over a year ago

How does this compare in speed to the numpy random generator?
