45

For example:

>>> x = [1, 1, 2, 'a', 'a', 3]
>>> unique(x)
[1, 2, 'a', 3]

Assume list elements are hashable.

Clarification: The result should keep the first duplicate in the list. For example, [1, 2, 3, 2, 3, 1] becomes [1, 2, 3].

3 Comments
  • 1
    Are we keeping the first of duplicates, or last, or somewhere in the middle? e.g., [1,2,3,2,3,1], does that become [1,2,3], or [2,3,1], or something else? Commented Sep 18, 2008 at 1:29
  • 1
    Benchmark and a clear answer here. Commented Sep 18, 2008 at 11:54
  • 2
How do I apply the Homework tag to something? When it says assume elements are hashable, your prof is asking you to put the entries in a hashtable; then it's easy to see if you've come across them before as you walk down the list. Commented Nov 12, 2008 at 0:06

27 Answers

33
def unique(items):
    found = set()
    keep = []

    for item in items:
        if item not in found:
            found.add(item)
            keep.append(item)
            
    return keep

print(unique([1, 1, 2, 'a', 'a', 3]))

8 Comments

set() is better than set([]).
In-place algorithms are faster. See james' and mine answers.
This is an old thread, but if you make the add() and append() methods local functions (put add = found.add and app = keep.append before the loop, and then use add(item) and app(item)), this becomes the fastest by far. The reason the dictionary version was faster is that it didn't require an attribute lookup for each add and append. Just my two cents.
If you put it into a list comprehension afterwards, you get another speed improvement. Taking all the changes together, speed nearly doubles. See my comparison further down this page.
@so.very.tired Because keep is a list, and checking for membership in a list takes on average linear time in the length of the list. Meanwhile, checking for membership in a set takes on average constant time (see this). Choosing the appropriate data structure makes all the difference for performance. In any case, this answer is outdated. Check out this question instead.
21

Using:

lst = [8, 8, 9, 9, 7, 15, 15, 2, 20, 13, 2, 24, 6, 11, 7, 12, 4, 10, 18, 13, 23, 11, 3, 11, 12, 10, 4, 5, 4, 22, 6, 3, 19, 14, 21, 11, 1, 5, 14, 8, 0, 1, 16, 5, 10, 13, 17, 1, 16, 17, 12, 6, 10, 0, 3, 9, 9, 3, 7, 7, 6, 6, 7, 5, 14, 18, 12, 19, 2, 8, 9, 0, 8, 4, 5]

And using the timeit module:

$ python -m timeit -s 'import uniquetest' 'uniquetest.etchasketch(uniquetest.lst)'

And so on for the various other functions (which I named after their posters), I have the following results (on my first generation Intel MacBook Pro):

Allen:                  14.6 µs per loop [1]
Terhorst:               26.6 µs per loop
Tarle:                  44.7 µs per loop
ctcherry:               44.8 µs per loop
Etchasketch 1 (short):  64.6 µs per loop
Schinckel:              65.0 µs per loop
Etchasketch 2:          71.6 µs per loop
Little:                 89.4 µs per loop
Tyler:                 179.0 µs per loop

[1] Note that Allen modifies the list in place – I believe this has skewed the time, in that the timeit module runs the code 100000 times and 99999 of them are with the dupe-less list.
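One way to keep in-place functions comparable would be to pass a fresh copy on every call, e.g. (a minimal sketch; dedupe_inplace is a placeholder for whichever function is being timed):

import timeit

# Placeholder in-place deduplicator; substitute the function under test.
def dedupe_inplace(lst):
    seen = set()
    lst[:] = [x for x in lst if x not in seen and not seen.add(x)]

lst = [8, 8, 9, 9, 7, 15, 15, 2, 20, 13, 2, 24]

# lst[:] hands each call a fresh copy, so run N cannot benefit from
# the deduplication already performed by runs 1..N-1.
print(timeit.timeit(lambda: dedupe_inplace(lst[:]), number=100000))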


Summary: Straight-forward implementation with sets wins over confusing one-liners :-)

4 Comments

james suggested a faster version. See stackoverflow.com/questions/89178/#91430
Bump for using OSX. ;-)
@jfs: Both james and Allen's versions mutate in-place, so unless the microbenchmark used accounts for that (e.g. by calling the functions using a fresh list each time and/or including no duplicates at all), the timings aren't comparable. The fastest solutions nowadays (from 3.6 onwards) are either the unique_everseen recipe from itertools (if you need to process each element as it is found) which is about 10% faster than Terhorst's solution, or, at 40-80% of the runtime of unique_everseen, list(dict.fromkeys(iterable)) (or lst[:] = dict.fromkeys(lst) to operate "in place").
@ShadowRanger: my microbenchmarks include f(lst[:]) (i.e., the copying happens on each call). Though Python has changed in 12+ years. I would not rely on the microbenchmarks results from so many years ago.
18

Update: on Python 3.7+:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

old answer:

Here is the fastest solution so far (for the following input):

def del_dups(seq):
    seen = {}
    pos = 0
    for item in seq:
        if item not in seen:
            seen[item] = True
            seq[pos] = item
            pos += 1
    del seq[pos:]

lst = [8, 8, 9, 9, 7, 15, 15, 2, 20, 13, 2, 24, 6, 11, 7, 12, 4, 10, 18, 
       13, 23, 11, 3, 11, 12, 10, 4, 5, 4, 22, 6, 3, 19, 14, 21, 11, 1, 
       5, 14, 8, 0, 1, 16, 5, 10, 13, 17, 1, 16, 17, 12, 6, 10, 0, 3, 9, 
       9, 3, 7, 7, 6, 6, 7, 5, 14, 18, 12, 19, 2, 8, 9, 0, 8, 4, 5]
del_dups(lst)
print(lst)
# -> [8, 9, 7, 15, 2, 20, 13, 24, 6, 11, 12, 4, 10, 18, 23, 3, 5, 22, 19, 14, 
#     21, 1, 0, 16, 17]

Dictionary lookup is slightly faster than set lookup in Python 3.
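A quick way to check this claim on your own interpreter (a sketch; results vary across Python versions):

import timeit

setup = "s = set(range(1000)); d = dict.fromkeys(range(1000))"
print(timeit.timeit("500 in s", setup=setup))  # set membership
print(timeit.timeit("500 in d", setup=setup))  # dict key membership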

9 Comments

Could you explain why the Dictionary lookup is faster than a test for set membership in this case?
@Stephen Emslie: I don't know. It might be a benchmark artifact. Try it yourself. A pure speculation: dictionary is a fundamental data structure for CPython (namespaces, classes are/were implemented via dictionaries) therefore dicts are more tuned/optimized than sets.
Good timing, but wrong conclusion. The timing shows only that operator access such as d[k] = v is faster than method call access such as d.__setitem__(k, v) even if the latter has been pre-bound using d_setitem = d.__setitem__ and then timing d_setitem(k, v).
Using Python 3.4, I tried your test script; and the function that uses a set is consistently slightly faster.
@jfs: In 3.6+, this can be made much simpler and even faster by simplifying to just def del_dups(seq): seq[:] = dict.fromkeys(seq) (to modify in-place) or def del_dups(seq): return list(dict.fromkeys(seq)) to make a new copy without duplicates.
15

What's going to be fastest depends on what percentage of your list is duplicates. If it's nearly all duplicates, with few unique items, creating a new list will probably be faster. If it's mostly unique items, removing them from the original list (or a copy) will be faster.

Here's one for modifying the list in place:

def unique(items):
  seen = set()
  # Iterate backwards so deletions don't disturb upcoming indices
  # (use xrange on Python 2).
  for i in range(len(items)-1, -1, -1):
    it = items[i]
    if it in seen:
      del items[i]
    else:
      seen.add(it)

Iterating backwards over the indices ensures that removing items doesn't affect the iteration.
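A quick usage sketch; note that, as the comments below point out, this variant keeps the last occurrence of each duplicate rather than the first:

items = [1, 2, 1]
unique(items)
print(items)  # -> [2, 1]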

2 Comments

This gives different results from the other solutions (the OP didn't specify which is correct) as regards which duplicate to keep. This solution: [1, 2, 1] -> [2, 1]. Other solutions: [1, 2, 1] -> [1, 2].
I added a clarification about this in the question text.
10

This is the fastest in-place method I've found (assuming a large proportion of duplicates):

def unique(l):
    s = set(); n = 0
    for x in l:
        if x not in s: s.add(x); l[n] = x; n += 1
    del l[n:]

This is 10% faster than Allen's implementation, on which it is based (timed with timeit.repeat, JIT compiled by psyco). It keeps the first instance of any duplicate.

repton-infinity: I'd be interested if you could confirm my timings.

1 Comment

Dictionaries are slightly faster than sets. See my answer stackoverflow.com/questions/89178/#282589
7

Obligatory generator-based variation:

def unique(seq):
  seen = set()
  for x in seq:
    if x not in seen:
      seen.add(x)
      yield x
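Since this returns a generator, wrap it in list() when a materialized list is needed:

>>> list(unique([1, 1, 2, 'a', 'a', 3]))
[1, 2, 'a', 3]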

Comments

7

This may be the simplest way:

from collections import OrderedDict
list(OrderedDict.fromkeys(iterable))

As of Python 3.5, OrderedDict is implemented in C, so this is now the shortest, cleanest, and fastest.

2 Comments

Elegant, but unfortunately about an order of magnitude slower than the fastest solution, and surprisingly one of the slowest solutions in general. The OrderedDict seems to be a real performance killer.
Perhaps if OrderedSet ever becomes a builtin we'd have a very fast solution
5

Taken from http://www.peterbe.com/plog/uniqifiers-benchmark

def f5(seq, idfun=None):  
    # order preserving 
    if idfun is None: 
        def idfun(x): return x 
    seen = {} 
    result = [] 
    for item in seq: 
        marker = idfun(item) 
        # in old Python versions: 
        # if seen.has_key(marker) 
        # but in new ones: 
        if marker in seen: continue 
        seen[marker] = 1 
        result.append(item) 
    return result
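The idfun hook lets you deduplicate by a derived key; for example, case-insensitively:

>>> f5(['A', 'a', 'b', 'B'], idfun=str.lower)
['A', 'b']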

1 Comment

It is slower than the corresponding in-place version (at least for some inputs). See stackoverflow.com/questions/89178/#282589
5

One-liner:

new_list = reduce(lambda x,y: x+[y][:1-int(y in x)], my_list, [])
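The slice trick works because [y][:1-int(y in x)] evaluates to [y] when y is unseen (slice [:1]) and to [] when it is a duplicate (slice [:0]). Note that on Python 3, reduce lives in functools. A more readable equivalent (a sketch):

from functools import reduce  # builtin on Python 2, moved to functools in Python 3

new_list = reduce(lambda acc, y: acc if y in acc else acc + [y], my_list, [])

Both versions are O(n^2), since each step tests membership against the growing result list.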

Comments

4

An in-place one-liner for this:

>>> x = [1, 1, 2, 'a', 'a', 3]
>>> [ item for pos,item in enumerate(x) if x.index(item)==pos ]
[1, 2, 'a', 3]

3 Comments

Hi Mario, how does this work? Please explain. What I understood is that index returns only one value, so is that why it's unique?
The list.index(item) method returns the position of the first item found in the list, and comparing this with the actual position of the item (from enumerate) we can therefore tell whether that item is the first occurring or not, keeping only the first occurring.
Very nice solution. But is it really in-place? When you use list comprehension, isn't a new list being created?
4

This is the fastest one, comparing all the stuff from this lengthy discussion and the other answers given here, referring to this benchmark. It's another 25% faster than the fastest function from the discussion, f8. Thanks to David Kirby for the idea.

def uniquify(seq):
    seen = set()
    seen_add = seen.add  # local binding avoids an attribute lookup per item
    # seen_add(x) returns None, so "not seen_add(x)" is always True and
    # merely records x as a side effect.
    return [x for x in seq if x not in seen and not seen_add(x)]

Some time comparison:

$ python uniqifiers_benchmark.py 
* f8_original 3.76
* uniquify 3.0
* terhorst 5.44
* terhorst_localref 4.08
* del_dups 4.76

4 Comments

I don't see a time comparison with the solutions from the top answers here. In my experience, explicit loops are faster than list comprehensions in CPython (at least, it requires a benchmark for each particular case).
I added the timings above. The main overhead in the presented solutions is the attribute lookup for add, append and so on, but even if you factor that out, the list comprehension is about 25% faster than terhorst_localref.
could you include the complete benchmark code in your answer? I don't see terhorst (or any other relevant code) in the file you linked.
3

You can actually do something really cool in Python to solve this. You can create a list comprehension that would reference itself as it is being built. As follows:

   # remove duplicates...
   def unique(my_list):
       return [x for x in my_list if x not in locals()['_[1]']]

Edit: I removed the "self", and it works on Mac OS X, Python 2.5.1.

The _[1] is Python's "secret" reference to the new list. The above, of course, is a little messy, but you could adapt it to fit your needs as necessary. For example, you can actually write a function that returns a reference to the comprehension; it would look more like:

return [x for x in my_list if x not in this_list()]

2 Comments

The example as given does not compile for me -- the trailing ".__self__" is not valid [[Linux 2.6 w/ Python 2.5.1]]
Holy cow, you're turning Python into Perl with the magic underscore business. Just say no.
2

Do the duplicates necessarily need to be in the list in the first place? There's no overhead for looking elements up, only a little extra overhead when adding elements (though that should be O(1)).

>>> x  = []
>>> y = set()
>>> def add_to_x(val):
...     if val not in y:
...             x.append(val)
...             y.add(val)
...     print x
...     print y
... 
>>> add_to_x(1)
[1]
set([1])
>>> add_to_x(1)
[1]
set([1])
>>> add_to_x(1)
[1]
set([1])
>>> 
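The same idea can be packaged as a small class instead of a pair of module-level variables (a minimal sketch; the class name is made up):

class UniqueList:
    """Append-only list that silently ignores values it has already seen."""
    def __init__(self):
        self.items = []
        self._seen = set()

    def append(self, val):
        if val not in self._seen:
            self._seen.add(val)
            self.items.append(val)

ul = UniqueList()
for v in [1, 1, 2, 'a', 'a', 3]:
    ul.append(v)
print(ul.items)  # -> [1, 2, 'a', 3]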

Comments

2

Remove duplicates and preserve order:

This is a fast 2-liner that leverages built-in functionality of list comprehensions and dicts.

x = [1, 1, 2, 'a', 'a', 3]

tmpUniq = {} # temp variable used below 
results = [tmpUniq.setdefault(i,i) for i in x if i not in tmpUniq]

print(results)
[1, 2, 'a', 3]

The dict.setdefault() method returns the value as well as adding it to the temp dict, directly in the list comprehension. Using built-in methods and the dict's hashing maximizes the efficiency of the process.

Comments

1

O(n) if dict is a hash table, O(n log n) if dict is a tree; simple and fixed. Thanks to Matthew for the suggestion. Sorry, I don't know the underlying types.

def unique(x):
  output = []
  y = {}
  for item in x:
    y[item] = ""

  for item in x:
    if item in y:
      output.append(item)
      del y[item]  # remove the key so later duplicates are skipped

  return output

1 Comment

FYI, you can also do that with a set so you don't have to set it equal to an empty string.
1

Dictionary membership testing in Python is O(1) (has_key was removed in Python 3; use the in operator). Insertion and retrieval from a hash table are also O(1). The function loops through the n items twice, so it is O(n).

def unique(lst):   # renamed from "list" to avoid shadowing the builtin
  s = {}
  output = []
  for x in lst:
    count = 1
    if x in s:     # has_key() was removed in Python 3
      count = s[x] + 1

    s[x] = count
  for x in lst:
    count = s[x]
    if count > 0:
      s[x] = 0
      output.append(x)
  return output

Comments

1

There are some great, efficient solutions here. However, for anyone not concerned with the most efficient O(n) solution, I'd go with the simple one-liner O(n^2) solution (sorted computes the key once per unique element, and each xs.index call is itself O(n)):

def unique(xs):
    return sorted(set(xs), key=lambda x: xs.index(x))

or the more efficient two-liner O(n*log(n)) solution:

def unique(xs):
    positions = dict((e,pos) for pos,e in reversed(list(enumerate(xs))))
    return sorted(set(xs), key=lambda x: positions[x])

3 Comments

That code is difficult to understand, and you say it's less efficient than the other solutions already presented here. So why would you go with it?
I consider this easy to understand; passing a lambda function as the key parameter of sorted is really the canonical way to sort a list in Python. Most of my Python work involves generating reports on lists of statistics, and so to me this seems like the simplest and most Pythonic approach.
While I agree your solution is succinct, the question asked for the fastest algorithm, not the most Pythonic.
1

Here are two recipes from the itertools documentation:

from itertools import groupby, ifilterfalse, imap  # Python 2 names
from operator import itemgetter

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return imap(next, imap(itemgetter(1), groupby(iterable, key)))
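These are the Python 2 versions; on Python 3, substitute itertools.filterfalse for ifilterfalse and the builtin map for imap. Usage, echoing the docstring examples:

>>> list(unique_everseen('AAAABBBCCDAABBB'))
['A', 'B', 'C', 'D']
>>> list(unique_justseen('ABBCcAD', str.lower))
['A', 'B', 'C', 'A', 'D']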

Comments

0

I have no experience with python, but an algorithm would be to sort the list, then remove duplicates (by comparing to previous items in the list), and finally find the position in the new list by comparing with the old list.

Longer answer: http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560
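A minimal Python sketch of the described approach (assuming the elements are comparable; the function name is made up):

def unique_via_sort(lst):
    # Sort a copy so that duplicates become adjacent.
    ordered = sorted(lst)
    # Keep each item that differs from its predecessor.
    deduped = [v for i, v in enumerate(ordered) if i == 0 or v != ordered[i - 1]]
    # Restore the original order by position of first appearance.
    deduped.sort(key=lst.index)
    return deduped

print(unique_via_sort([1, 1, 2, 3, 2]))  # -> [1, 2, 3]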

Comments

0
>>> def unique(list):
...   y = []
...   for x in list:
...     if x not in y:
...       y.append(x)
...   return y

1 Comment

To explain why: searching for x in a list structure (y) is O(n), while searching for x in a set (or dictionary) is O(1).
0

If you take out the empty list from the call to set() in Terhorst's answer, you get a little speed boost.

Change: found = set([])
to: found = set()

However, you don't need the set at all.

def unique(items):
    keep = []

    for item in items:
        if item not in keep:
            keep.append(item)

    return keep

Using timeit I got these results:

with set([]) -- 4.97210427363
with set() -- 4.65712377445
with no set -- 3.44865284975

1 Comment

Yeah, when you have little data, I bet the set's internal mechanism is slower than iterating over a list. But if you've got maaaaaaaaaaany elements, I think sets are faster. Or what would be the point of this data structure? ;-)
0
x = [] # Your list  of items that includes Duplicates

# Assuming that your list contains items of only immutable data types

dict_x = {} 

dict_x = {item : item for i, item in enumerate(x) if item not in dict_x.keys()}
# Average time complexity = O(n) * O(1); furthermore, the dict comprehension and the generator-like behaviour of enumerate add a certain efficiency and Pythonic feel to it.

x = dict_x.keys() # if you want your output in list format 

5 Comments

What can go wrong with items of mutable types?
This doesn't work the way you think it does. if item not in dict_x.keys() is checking the keys of the original, empty dict_x, not the dictionary being created. It's always true. The duplicates are removed simply because trying to create a duplicate key is ignored.
Why are you using enumerate()?
@ByteEater Mutable types can't be used as dictionary keys.
@Bramar, maybe he tried to enable mutable types by writing i : item instead of item : item (although a single name for the pair would suffice) and then doing .values() instead of .keys() in both places. But that won't work because of your first comment.
-1
>>> x=[1,1,2,'a','a',3]
>>> y = [ _x for _x in x if not _x in locals()['_[1]'] ]
>>> y
[1, 2, 'a', 3]


"locals()['_[1]']" is the "secret name" of the list being created.

2 Comments

Presence of _[1] local is not guaranteed by language.
"<item> in <list>" is O(n), so this is slow.
-1

I don't know if this one is fast or not, but at least it is simple.

Simply convert it first to a set and then back to a list:

def unique(container):
  return list(set(container))

1 Comment

This does not preserve order.
-2

One pass.

a = [1,1,'a','b','c','c']

new_list = []
prev = None

while 1:
    try:
        i = a.pop(0)
        if i != prev:
            new_list.append(i)
        prev = i
    except IndexError:
        break
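Note this only collapses adjacent duplicates, so it effectively assumes sorted (or at least grouped) input; itertools.groupby expresses the same idea more directly (a sketch):

from itertools import groupby

a = [1, 1, 'a', 'b', 'c', 'c']
new_list = [key for key, _ in groupby(a)]
print(new_list)  # -> [1, 'a', 'b', 'c']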

1 Comment

Requires sorted input, doesn't it?
-2

I haven't done any tests, but one possible algorithm might be to create a second list, and iterate through the first list. If an item is not in the second list, add it to the second list.

x = [1, 1, 2, 'a', 'a', 3]
y = []
for each in x:
    if each not in y:
        y.append(each)

5 Comments

I find your use of the variable name "each" really confusing to read, probably because in many languages it is a keyword. It's much clearer to use item or just i.
'i' to me implies an index - we aren't iterating through indices, we are iterating through objects. I'd prefer item, but I don't see 'each' as bad; just because it is a keyword in another language, why prevent its use here? Syntax highlighting (as shown above) picks it up fine...
Other than AppleScript, what languages use the word 'each' as a keyword?
You should have used a set. This is unlikely to be the fastest.
Marcin: "... while preserving order".
-2
a = [1,2,3,4,5,7,7,8,8,9,9,3,45]

def unique(l):
    ids = {}
    for item in l:
        if item not in ids:   # has_key() was removed in Python 3
            ids[item] = item
    return ids.keys()

print(a)
print(unique(a))

Inserting the elements takes Θ(n); checking whether an element exists takes constant time; and testing all the items also takes Θ(n), so we can see that this solution is Θ(n). Bear in mind that dictionaries in Python are implemented as hash tables.

1 Comment

The questions says "while preserving order". A Python dictionary doesn't preserve order.
