26

I used:

df['ids'] = df['ids'].values.astype(set)

to turn the lists into sets, but the output was still an array of numbers, not sets:

>>> x = np.array([[1, 2, 2.5],[12,35,12]])

>>> x.astype(set)
array([[1.0, 2.0, 2.5],
       [12.0, 35.0, 12.0]], dtype=object)

Is there an efficient way to turn a list into a set in NumPy?

EDIT 1:
My input is large: 3,000 records, each with 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]

2 Comments

  • @AlirezaHos it doesn't seem to me that processing x = np.array([[1, 2, 2.5],[12,35,12]]) should take 19 seconds with any method. Care to elaborate? Commented Oct 18, 2015 at 9:59
  • astype(set) does not do what you think. There isn't a numpy set dtype, so it just returns an object array (see the sketch below). Commented Oct 18, 2015 at 15:03
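To illustrate the second comment, a minimal sketch (variable names are mine) showing that astype(set) only changes the dtype, while set() has to be called explicitly:

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# astype(set) just produces an object array; each element is untouched
print(x.astype(set).dtype)        # object
print(type(x.astype(set)[0, 0]))  # still a plain number, not a set

# to actually build sets, call set() explicitly, e.g. once per row
print([set(row) for row in x])    # [{1.0, 2.0, 2.5}, {12.0, 35.0}]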

3 Answers

31

First flatten your ndarray to obtain a one-dimensional array, then apply set() to it:

set(x.flatten())

Edit: since it seems you want a set per row rather than a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
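For example, with the array from the question, a quick sketch of the two options:

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

# one set over every element of the 2-D array
print(set(x.flatten()))     # 5 distinct values: 1.0, 2.0, 2.5, 12.0, 35.0

# one set per row, collected in a plain Python list
print([set(v) for v in x])  # [{1.0, 2.0, 2.5}, {12.0, 35.0}]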


12 Comments

@AlirezaHos any reason to believe this solution is not efficient? How much data is processed in 19 seconds? 10 elements? 100? 10^10? And any reason for not including your complete problem in the original question?
@AlirezaHos Any specific reason to convert all of that data into sets? Storing it as numpy arrays must be pretty efficient.
@Divakar especially since converting to sets involves going over each element and checking for duplicates. No wonder it's slow :)
@AlirezaHos a question which leads to 10 comments is generally a sign that information was missing in the original post. Please take note for your future posts.
set(x.ravel()) should be more efficient because it doesn't make a copy. (Actually set(x.flat) is even better.)
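To illustrate the last comment, a small sketch of the three variants; they produce the same set and differ only in whether a temporary copy of the data is made:

import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

s1 = set(x.flatten())  # flatten() always allocates a copy of the data
s2 = set(x.ravel())    # ravel() returns a view when possible, avoiding the copy
s3 = set(x.flat)       # x.flat iterates over elements without building a new array

assert s1 == s2 == s3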
14

The current state of your question (can change any time): how can I efficiently remove duplicate elements from a large array of large arrays?

import numpy as np

rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]

Runtimes in an IPython shell:

>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly be unique. So here's a more realistic example with integers:

>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))

>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
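As a quick illustration of those varying lengths, here is a sketch with a much smaller integer array (my own dummy data):

import numpy as np

rng = np.random.default_rng()
small = rng.integers(low=1, high=5, size=(3, 10))

out = [np.unique(row) for row in small]
print([len(u) for u in out])  # e.g. [4, 3, 4]; each row can keep a different number of values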

1 Comment

Should not the problem be stated as "keep unique elements" or "remove duplicates" rather than "remove unique"?

1

A couple of earlier 'row-wise' unique questions:

vectorize numpy unique for subarrays

Numpy: Row Wise Unique elements

Count unique elements row wise in an ndarray

In a couple of these the count is more interesting than the actual unique values.

If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
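For instance, if only the per-row counts are needed (a sketch, not taken from the linked posts), a plain loop over the rows is enough:

import numpy as np

rng = np.random.default_rng()
arr = rng.integers(low=1, high=15000, size=(3000, 30000))

# number of distinct values in each row; the counts differ, so the result is a 1-D array
counts = np.array([len(np.unique(row)) for row in arr])
print(counts.shape, counts.min(), counts.max())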

1 Comment

@fabrik, those are SO links, in a 3 yr old answer. By your logic we couldn't mark posts as duplicates without repeating the old answers.
