I would create one row at a time and check whether that row already exists using a set, which has O(1) membership testing. If the row exists, simply generate another one; if not, add it to the array and move on to the next row until you are done. This principle can be made faster by:
- Setting the unique-row counter to 0
- Generating m - counter rows and adding the unique rows to the solution
- Increasing the counter by the number of unique rows added
- If counter == m you are done; otherwise return to step 2
The implementation is as follows:
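import numpy as np

n = 128
m = 1000
a = np.zeros((m, n))
rows = set()   # rows accepted so far, stored as hashable tuples
counter = 0    # number of unique rows already written to a
while counter < m:
    # draw only as many candidate rows as are still missing
    temp = np.random.randint(0, 2, (m - counter, n))
    for row in temp:
        if tuple(row) not in rows:
            rows.add(tuple(row))
            a[counter] = row
            counter += 1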
Runtime comparison
Generating the whole matrix at once and checking that all rows are unique saves a lot of time, but only if n >> log2(m).
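The other answer's code is not reproduced here, so the following is only a rough sketch of what "generate everything at once and check" might look like. The function name `generate_all_at_once` and the `max_tries` cap are hypothetical, and using `np.unique(..., axis=0)` for the uniqueness check is an assumption rather than that answer's actual method.

```python
import numpy as np

def generate_all_at_once(m, n, max_tries=1000, rng=None):
    """Draw a full m x n binary matrix, retrying until all rows are distinct.

    Hypothetical helper for illustration; not the other answer's code.
    """
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        a = rng.integers(0, 2, size=(m, n))
        # np.unique with axis=0 collapses duplicate rows, so the matrix
        # is valid only if no row was dropped
        if np.unique(a, axis=0).shape[0] == m:
            return a
    raise RuntimeError(f"no valid matrix found in {max_tries} tries")
```

The retry cap is there because, without it, the loop can run effectively forever when n is close to log2(m).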
Example 1
with the following:
n = 128
m = 1000
I ran my suggestion and the solution mentioned in the other answer, resulting in:
# my suggestion
17.7 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# generating the whole matrix at once and checking if all rows are unique
4.62 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is because the probability of generating m different rows is very high in this situation.
Example 2
When changing to:
n = 10
m = 1024
I ran my suggestion and the solution mentioned in the other answer, resulting in:
# my suggestion
26.3 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The approach of generating the whole matrix at once and checking that all rows are unique did not finish running. This is because when math.log2(m) == n there are exactly m possible distinct rows, so a valid matrix must contain every one of them exactly once. The probability of generating such a matrix at random approaches 0 as the matrix grows.
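To put rough numbers on both examples, here is a small sketch that computes the chance that a single random draw has no duplicate rows. It assumes each row is drawn independently and uniformly from the 2**n possible rows (which is what `np.random.randint(0, 2, ...)` does); the function name is hypothetical. It works in log space because the probability in Example 2 is too small for a float.

```python
import math

def log10_prob_all_rows_unique(m, n):
    """log10 of the probability that m uniform random n-bit rows are all distinct."""
    total = 2 ** n
    if m > total:
        return float("-inf")  # more rows than distinct values: impossible
    # P = prod_{i=0}^{m-1} (1 - i / 2**n); sum the logs instead of multiplying
    log_p = sum(math.log1p(-i / total) for i in range(m))
    return log_p / math.log(10)

print(log10_prob_all_rows_unique(1000, 128))  # essentially 0, i.e. P is about 1
print(log10_prob_all_rows_unique(1024, 10))   # about -442.8, i.e. P is about 10**-443
```

With n = 10 and m = 1024 the product equals m!/m^m, so on average a retry loop would need on the order of 10^443 draws before it succeeds.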
2^512 is an insanely large number. Are you sure it is 2**n and not 2*n? Is number_of_samples simply m, or some other number? If number_of_samples is replaced with m, and 2**n with 2*n, the result is an array with 1000 elements. With 2**n, the code won't run due to OverflowError: Python int too large to convert to C ssize_t