Generate a random binary matrix with all rows different using numpy

Question

I need to generate a random binary matrix with dimensions m x n in which all rows are distinct. Using numpy I tried:

import numpy as np
import random
n = 512
m = 1000

a = random.sample(range(0, 2**n), m)
a = np.array(a)
a = np.reshape(a, (m, 1))
np.unpackbits(a.view(np.uint8), axis=1)

But it is not suitable for my case, because n > 128 and m > 1000, and the code above only generates rows with at most 62 elements. Could you help me, please?

2^512 is an insanely large number. Are you sure it is 2**n and not 2*n?
– DollarAkshay, Commented Mar 22, 2022 at 7:24
Also, is number_of_samples simply m, or some other number? If number_of_samples is replaced with m, and 2**n with 2*n, the result is an array with 1000 elements. With 2**n, the code won't run due to OverflowError: Python int too large to convert to C ssize_t
– Grismar, Commented Mar 22, 2022 at 7:26
Yes, number_of_samples = m. This is exactly the problem: OverflowError: Python int too large to convert to C. Maybe another approach is necessary.
– Juan, Commented Mar 22, 2022 at 7:35
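
The OverflowError arises because random.sample needs len(range(0, 2**n)), which does not fit in a C ssize_t when n = 512. One possible workaround is sketched below. It is an illustration, not something the asker or the commenters proposed. It draws n-bit Python integers with random.getrandbits, deduplicates them in a set, and expands their bytes with np.unpackbits. The function name unique_binary_rows_from_ints and the seed parameter are hypothetical.

```python
import random

import numpy as np


def unique_binary_rows_from_ints(m, n, seed=None):
    """Return an (m, n) uint8 array of 0/1 values whose rows are all distinct."""
    if n < 1 or m > 2 ** n:
        raise ValueError("need n >= 1 and m <= 2**n")
    rng = random.Random(seed)

    # Python ints have arbitrary precision, so n-bit values are fine here.
    seen = set()
    while len(seen) < m:
        seen.add(rng.getrandbits(n))

    values = list(seen)
    rng.shuffle(values)  # set iteration order is not a random order

    # Serialize each value as big-endian bytes, then expand bytes into bits.
    n_bytes = (n + 7) // 8
    raw = b"".join(v.to_bytes(n_bytes, "big") for v in values)
    packed = np.frombuffer(raw, dtype=np.uint8).reshape(m, n_bytes)
    bits = np.unpackbits(packed, axis=1)

    # When n is not a multiple of 8, the leading bits are zero padding.
    return bits[:, -n:]


a = unique_binary_rows_from_ints(1000, 512, seed=0)
```

Distinct integers below 2**n always have distinct n-bit representations, so no uniqueness check on the final array is needed.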

Tonechas · Accepted Answer · 2022-03-22 08:36:23Z

You could generate a random array of 0's and 1's with numpy.random.choice and then make sure that the rows are different through numpy.unique:

import numpy as np

m = 1000
n = 512

while True:
    a = np.random.choice([0, 1], size=(m, n))
    if len(np.unique(a, axis=0)) == m:
        break
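
Note that the loop above never terminates if m > 2**n, because fewer than m distinct rows exist in that case. Below is a minimal sketch of the same retry idea wrapped in a function, assuming NumPy's Generator API (np.random.default_rng) is available. The name random_unique_rows and the seed and max_tries parameters are hypothetical additions, not part of the answer.

```python
import numpy as np


def random_unique_rows(m, n, seed=None, max_tries=100):
    """Draw an (m, n) 0/1 matrix and retry until all rows are distinct."""
    if m > 2 ** n:
        # Only 2**n distinct rows exist, so the request cannot be satisfied.
        raise ValueError("m must not exceed 2**n")
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        a = rng.integers(0, 2, size=(m, n), dtype=np.uint8)
        # np.unique with axis=0 returns the distinct rows of a.
        if len(np.unique(a, axis=0)) == m:
            return a
    raise RuntimeError("no matrix with distinct rows found within max_tries")


a = random_unique_rows(1000, 512, seed=0)
```

Capping the number of attempts trades the guarantee of eventually succeeding for a clear failure when m is close to 2**n.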


Tomer Geva · Accepted Answer · 2022-03-22 18:44:00Z

I would try creating one row at a time and checking whether that row already exists via a set, which has O(1) average-time membership testing. If the row exists, simply generate another one; if not, add it to the array and move to the next row until you are done. This principle can be made faster by:

1. Setting the unique counter to 0.
2. Generating m - counter rows and adding the unique rows to the solution.
3. Increasing the counter by the number of unique rows added.
4. If counter == m you are done; otherwise return to step 2.

The implementation is as follows:

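import numpy as np

n = 128
m = 1000
a = np.zeros((m, n))  # output matrix, filled row by row
rows = set()          # rows accepted so far, stored as hashable tuples
counter = 0           # number of unique rows written to a
while counter < m:
    # Generate only as many candidate rows as are still missing
    temp = np.random.randint(0, 2, (m - counter, n))
    for row in temp:
        key = tuple(row)
        if key not in rows:
            rows.add(key)
            a[counter] = row
            counter += 1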

Runtime comparison

Generating the whole matrix at once and checking that all rows are unique saves a lot of time, but only if n >> log2(m).

Example 1 with the following:

n = 128
m = 1000

I ran my suggestion and the solution mentioned in the other answer, resulting in:

# my suggestion
17.7 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# generating the whole matrix at once and checking if all rows are unique
4.62 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is because the probability of generating m different rows on the first attempt is very high in this situation.

Example 2, when changing to:

n = 10
m = 1024

I ran my suggestion and the solution mentioned in the other answer, resulting in:

# my suggestion
26.3 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The approach of generating the whole matrix at once and checking that all rows are unique did not finish running. This is because when math.log2(m) == n there are exactly m possible distinct rows, so a valid matrix must contain every one of them exactly once. The probability of drawing such a matrix at random is m!/m^m, which approaches 0 rapidly as m grows.
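
To put rough numbers on both examples, here is a small sketch that computes the probability that m independently drawn n-bit rows are all distinct. It evaluates the standard birthday-problem product, the product over k = 0..m-1 of (1 - k/2^n), in log space because the value underflows a float in the second example. The function name log_prob_all_rows_distinct is hypothetical.

```python
import math


def log_prob_all_rows_distinct(m, n):
    """Natural log of P(m uniform random n-bit rows are pairwise distinct)."""
    total = 2 ** n  # number of possible rows
    if m > total:
        return -math.inf  # more rows than distinct patterns: impossible
    # log1p keeps precision when k / total is tiny (large n).
    return sum(math.log1p(-k / total) for k in range(m))


log_p_example_1 = log_prob_all_rows_distinct(1000, 128)
log_p_example_2 = log_prob_all_rows_distinct(1024, 10)
```

For n = 128, m = 1000 the log-probability is negative but vanishingly close to 0, so a single draw almost always succeeds. For n = 10, m = 1024 it is roughly -1020, which is why the retry loop effectively never finishes.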

Kevin · Accepted Answer · 2022-03-22 10:53:14Z

You could create a matrix with unique rows and shuffle the rows:

import numpy as np

n = 512
m = 1000

d = np.arange(m) # m unique numbers
d = ((d[:, None] & (1 << d[:n])) > 0).astype(np.uint8) # convert to binary array
i = np.random.randn(m).argsort() # indices used for shuffling rows
a = d[i] # output

All rows are unique:

assert len(np.unique(a, axis=0)) == m

Timings

n=128, m=1000:

271 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

n=2**10, m=2**14:

50.9 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This works best for n <= m; otherwise you need to swap d[:n] with np.arange(n), resulting in a longer runtime.
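
The uniqueness in the code above comes entirely from the binary digits of the indices 0..m-1, which need only ceil(log2(m)) columns. A possible variant, sketched here as an illustration rather than taken from the answer, keeps the shuffled-index idea for those columns and fills the remaining columns with random bits. This does not sample uniformly from all matrices with distinct rows. The name shuffled_unique_rows and the seed parameter are hypothetical.

```python
import numpy as np


def shuffled_unique_rows(m, n, seed=None):
    """Return an (m, n) 0/1 matrix with distinct rows, without any retry loop."""
    k = (m - 1).bit_length()  # bits needed to tell m indices apart
    if k > n:
        raise ValueError("m must not exceed 2**n")
    rng = np.random.default_rng(seed)

    # Unique integers 0..m-1 in random order, expanded into their k low bits.
    ids = rng.permutation(m)
    shifts = np.arange(k, dtype=np.int64)
    id_bits = ((ids[:, None] >> shifts) & 1).astype(np.uint8)

    # The remaining n - k columns can hold arbitrary random bits.
    filler = rng.integers(0, 2, size=(m, n - k), dtype=np.uint8)
    a = np.concatenate([id_bits, filler], axis=1)

    # Permuting columns hides where the index bits sit and keeps rows distinct.
    return a[:, rng.permutation(n)]


a = shuffled_unique_rows(1000, 512, seed=1)
```

Because the index bits alone distinguish every row, the result is unique by construction, and the runtime does not depend on how close m is to 2**n.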
