26

I used:

df['ids'] = df['ids'].values.astype(set)

to turn the lists into sets, but the output was still an array of numbers, not sets:

>>> x = np.array([[1, 2, 2.5],[12,35,12]])

>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)

Is there an efficient way to turn a list into a set in NumPy?

EDIT 1:
My input is large: 3,000 records, each with 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]

2 Comments

  • @AlirezaHos it doesn't seem to me that processing x = np.array([[1, 2, 2.5],[12,35,12]]) should take 19 seconds with any method. Care to elaborate? Commented Oct 18, 2015 at 9:59
  • astype(set) does not do what you think. There isn't a numpy set dtype, so it just returns an object array (see the sketch below). Commented Oct 18, 2015 at 15:03
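To illustrate the second comment, a minimal sketch (variable names are mine) showing that astype(set) only changes the dtype, while set() has to be called explicitly:

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# astype(set) just produces an object array; each element is untouched
print(x.astype(set).dtype)        # object
print(type(x.astype(set)[0, 0]))  # still a plain number, not a set

# to actually build sets, call set() explicitly, e.g. once per row
print([set(row) for row in x])    # [{1.0, 2.0, 2.5}, {12.0, 35.0}]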

3 Answers

31

First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:

set(x.flatten())

Edit: since it seems you want a set per row rather than a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
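For example, with the array from the question, a quick sketch of the two options:

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# one set over every element of the 2-D array
print(set(x.flatten()))     # 5 distinct values: 1.0, 2.0, 2.5, 12.0, 35.0

# one set per row, collected in a plain Python list
print([set(v) for v in x])  # [{1.0, 2.0, 2.5}, {12.0, 35.0}]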


12 Comments

@AlirezaHos any reason to believe this solution is not efficient? How much data is processed in 19 seconds? 10 elements? 100? 10^10? And any reason for not including your complete problem in the original question?
@AlirezaHos Any specific reason to convert all of that data into sets? Storing it as numpy arrays must be pretty efficient.
@Divakar especially since converting to sets involves going over each element and checking for duplicates. No wonder it's slow :)
@AlirezaHos a question which leads to 10 comments is generally a sign that information was missing in the original post. Please take note for your future posts.
set(x.ravel()) should be more efficient because it doesn't make a copy. (Actually set(x.flat) is even better.)
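To illustrate the last comment, a small sketch of the three variants; they produce the same set and differ only in whether a temporary copy of the data is made:

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

s1 = set(x.flatten())  # flatten() always allocates a copy of the data
s2 = set(x.ravel())    # ravel() returns a view when possible, avoiding the copy
s3 = set(x.flat)       # x.flat iterates over elements without building a new array

assert s1 == s2 == s3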
14

The current state of your question (can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?

import numpy as np

rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]

Runtimes in an IPython shell:

>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly be unique. So here's a more realistic example with integers:

>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))

>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
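As a quick illustration of those varying lengths, here is a sketch with a much smaller integer array (my own dummy data):

import numpy as np

rng = np.random.default_rng()
small = rng.integers(low=1, high=5, size=(3, 10))

out = [np.unique(row) for row in small]
print([len(u) for u in out])  # e.g. [4, 3, 4]; each row can keep a different number of values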

1 Comment

Should not the problem be stated as "keep unique elements" or "remove duplicates" rather than "remove unique"?

1

A couple of earlier 'row-wise' unique questions:

vectorize numpy unique for subarrays

Numpy: Row Wise Unique elements

Count unique elements row wise in an ndarray

In a couple of these the count is more interesting than the actual unique values.

If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
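For instance, if only the per-row counts are needed (a sketch, not taken from the linked posts), a plain loop over the rows is enough:

import numpy as np

rng = np.random.default_rng()
arr = rng.integers(low=1, high=15000, size=(3000, 30000))

# number of distinct values in each row; the counts differ, so the result is a 1-D array
counts = np.array([len(np.unique(row)) for row in arr])
print(counts.shape, counts.min(), counts.max())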

1 Comment

@fabrik, those are SO links, in a 3 yr old answer. By your logic we couldn't mark posts as duplicates without repeating the old answers.
