Remove duplicate lines without sorting [duplicate]

Question

I have a utility script in Python:

#!/usr/bin/env python
import sys
unique_lines = []
duplicate_lines = []
for line in sys.stdin:
  if line in unique_lines:
    duplicate_lines.append(line)
  else:
    unique_lines.append(line)
    sys.stdout.write(line)
# optionally do something with duplicate_lines

This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?

Reason for asking: I need this functionality on a system where I cannot execute Python at all.

Unrelated: you should really use a set rather than a list in that Python script; checking for membership in a list is a linear-time operation.
– Nicholas Riley, Commented Jul 17, 2012 at 23:18
I removed "Python" from your tags and title since this really has nothing to do with Python.
– Michael Hoffman, Commented Jul 17, 2012 at 23:20
If this had to be done in Python, a better approach would be the unique_everseen itertools recipe: docs.python.org/library/itertools.html#recipes
– iruvar, Commented Jul 23, 2012 at 17:02

Dalija Prasnikar · Accepted Answer · 2024-07-28 11:40:51Z

406

The UNIX Bash Scripting blog suggests:

awk '!x[$0]++'

This command is telling awk which lines to print. The variable $0 holds the entire contents of a line and square brackets are array access. So, for each line of the file, the node of the array x is incremented and the line printed if the content of that node was not (!) previously set.
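
For readers more comfortable with Python than awk, the same logic can be sketched roughly as below. This only illustrates what the one-liner does, not how awk works internally; the function name first_occurrences and the sample data are made up for the example.

```python
def first_occurrences(lines):
    """Yield each line the first time it appears, preserving input order."""
    counts = {}  # plays the role of the awk array x
    for line in lines:
        # Like !x[$0]++ : look at the old count, then increment it.
        previous = counts.get(line, 0)
        counts[line] = previous + 1
        if previous == 0:  # "not previously set", so print the line
            yield line


sample = ["b", "a", "b", "c", "a"]
print(list(first_occurrences(sample)))  # ['b', 'a', 'c']
```

As with the awk version, every distinct line stays in memory until the input ends.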

edited Jul 28, 2024 at 11:40 by Dalija Prasnikar

answered Jul 17, 2012 at 23:17 by Michael Hoffman

16 Comments

Jeff Klukas Over a year ago

For a short awk statement like this (no curly brackets involved), the command is simply telling awk which lines to print. The variable $0 holds the entire contents of a line and square brackets are array access. So, for each line of the file, we are incrementing a node of the array named x and printing the line if the content of that node was not (!) previously set.

Josip Rodin Over a year ago

Surely it would be less obfuscated to name that array e.g. seen instead of x, to avoid giving newbies the impression that awk syntax is line noise

Hitechcomputergeek Over a year ago

Keep in mind that this will load the entire file into memory, so don't try this on a 3GB text file without lots of RAM to spare.

deltaray Over a year ago

@Hitechcomputergeek This won't necessarily load the whole file into memory, only the unique lines. This of course could end up being the whole file though if all the lines are unique.

Jonas Elfström Over a year ago

stackoverflow.com/a/1444448/44620 with a detailed description of how it works.


aksh1618 · Accepted Answer · 2020-06-29 17:12:45Z

111

A late answer - I just ran into a duplicate of this - but perhaps worth adding...

The principle behind @1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:

cat -n file_name | sort -uk2 | sort -n | cut -f2-

Use cat -n to prepend line numbers
Use sort -u to remove duplicate data (-k2 says 'start at field 2 for sort key')
Use sort -n to sort by prepended number
Use cut to remove the line numbering (-f2- says 'select field 2 till end')
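
The four steps above amount to a decorate, sort, de-duplicate, undecorate sequence. A minimal Python sketch of the same idea follows, assuming the whole input fits in memory (the real sort can spill to temporary files, which this does not attempt). The name dedupe_by_sorting is hypothetical.

```python
def dedupe_by_sorting(lines):
    """Mimic: cat -n | sort -uk2 | sort -n | cut -f2-"""
    # cat -n: pair every line with its position in the input.
    numbered = list(enumerate(lines))
    # sort -k2: order by content. Python's sort is stable, so equal
    # lines remain in ascending position order.
    numbered.sort(key=lambda pair: pair[1])
    # -u: keep only the first entry in each run of equal content.
    kept = []
    for position, text in numbered:
        if not kept or kept[-1][1] != text:
            kept.append((position, text))
    # sort -n: restore the original order using the saved position.
    kept.sort(key=lambda pair: pair[0])
    # cut -f2-: drop the position again.
    return [text for _, text in kept]


print(dedupe_by_sorting(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```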

edited Jun 29, 2020 at 17:12 by aksh1618

answered Dec 17, 2013 at 16:39 by Digital Trauma

8 Comments

Sopalajo de Arrierez Over a year ago

Easy to understand, and this is often valuable. Any idea how this performs on big files compared with Michael Hoffman's shorter solution above?

Petru Zaharia Over a year ago

More readable/maintainable. Needed the same but with a reverse sort to keep only the last occurrence of each unique value. Using both --reverse and --unique in the same sort command doesn't return the results one might expect. Apparently, sort does a premature optimization by 1st applying --unique on the input (in order to reduce processing in subsequent steps). This removes data needed for the --reverse step too early. To fix this, insert a sort --reverse -k2 as the 1st sort in the pipeline: cat -n file_name | sort -rk2 | sort -uk2 | sort -nk1 | cut -f2-
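
The goal described in this comment (keep only the last occurrence of each line, ordered by where those last occurrences sit) can be sketched in Python as below. This illustrates the intended result only and makes no claim about how sort handles --reverse together with --unique; last_occurrences is a made-up name.

```python
def last_occurrences(lines):
    """Keep only the final occurrence of each line, in input order."""
    lines = list(lines)
    last_index = {}
    for index, line in enumerate(lines):
        last_index[line] = index  # later occurrences overwrite earlier ones
    # Emit a line only at the position of its last occurrence.
    return [line for index, line in enumerate(lines) if last_index[line] == index]


print(last_occurrences(["b", "a", "b", "c", "a"]))  # ['b', 'c', 'a']
```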

ynn Over a year ago

Took just 60 seconds for a 900MB+ text file with so many (randomly placed) duplicate lines that the result is only 39KB. Sufficiently fast.

Victor Yarema Over a year ago

"Pipe" version: cat file_name | cat -n | sort -uk2 | sort -nk1 | cut -f2-.


AzizSM · Accepted Answer · 2017-08-22 03:32:34Z

10

To remove duplicates across two files:

awk '!a[$0]++' file1.csv file2.csv
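
The point here is that one "seen" table is shared across both inputs, so a line in file2.csv that already appeared in file1.csv is dropped. A rough Python sketch of that behaviour, using the standard fileinput module; unique_across_files is a hypothetical name, and the list of paths is assumed to be non-empty (with an empty list, fileinput falls back to sys.argv and stdin).

```python
import fileinput


def unique_across_files(paths):
    """Yield lines (without trailing newline) the first time they occur in any file."""
    seen = set()  # shared by all files, like the single awk array a
    with fileinput.input(files=paths) as stream:
        for raw in stream:
            # awk compares records without the newline, so strip it here
            # to treat "x" at end-of-file and "x\n" as the same line.
            line = raw.rstrip("\n")
            if line not in seen:
                seen.add(line)
                yield line
```

Something like `for line in unique_across_files(["file1.csv", "file2.csv"]): print(line)` would then print the combined, de-duplicated lines.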

answered Aug 22, 2017 at 3:32 by AzizSM

Mateen Ulhaq · Accepted Answer · 2024-03-02 01:53:17Z

8

`uq`

uq is a small tool written in Rust. It performs uniqueness filtering without having to sort the input first, so it can be applied to a continuous stream.

There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:

uq remembers the occurrence of lines using their hash values, so it uses less memory when the lines are long.
uq can keep memory usage constant by setting a limit on the number of entries to store (when the limit is reached, a flag controls whether to override or to die), whereas the awk solution could run out of memory when there are too many lines.
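
To make the two points concrete, here is a small Python sketch of hash-based filtering with an optional entry limit. It is not uq's actual implementation: the function name, the max_entries parameter, and the choice of a 16-byte BLAKE2b digest are all assumptions for illustration, and only the "die" behaviour is shown.

```python
import hashlib


def unique_by_hash(lines, max_entries=None):
    """Yield first occurrences, storing a fixed-size digest instead of each line."""
    seen = set()
    for line in lines:
        # 16 bytes per entry regardless of line length. A hash collision
        # could in principle drop a distinct line, though it is very unlikely.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue
        if max_entries is not None and len(seen) >= max_entries:
            # Assumed "die" behaviour once the limit is hit.
            raise RuntimeError("entry limit reached")
        seen.add(digest)
        yield line


print(list(unique_by_hash(["b", "a", "b", "c", "a"])))  # ['b', 'a', 'c']
```

Because it yields as it goes, the generator also works on a stream that never ends, which is the streaming property mentioned above.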

edited Mar 2, 2024 at 1:53 by Mateen Ulhaq

answered Apr 30, 2018 at 8:45 by shouya

1 Comment

ahmet alp balkan Over a year ago

Quite inconvenient and less portable, given awk already does this.

iruvar · Accepted Answer · 2012-07-23 16:43:38Z

4

Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach uses less memory: add an index field with awk, then run multiple rounds of sort and uniq. The following snippet works in bash:

awk '{print(NR"\t"$0)}' file_name | sort -s -t$'\t' -k2 | uniq --skip-fields 1 | sort -n -t$'\t' -k1,1 | cut -f2- -d$'\t'

answered Jul 23, 2012 at 16:43 by iruvar

1 Comment

galois Over a year ago

this seems to be rather slow, though

hwertz · Accepted Answer · 2013-10-23 18:26:14Z

Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave 1 copy of duplicates). The awk and perl solutions can't really be modified to do this, yours can! I may have also needed the lower memory use since I will be uniq'ing like 100,000,000 lines 8-). Just in case anyone else needs it, I just put a "-u" in the uniq portion of the command:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2 | uniq -u --skip-fields 1 | sort -n -t$'\t' -k1,1 | cut -f2- -d$'\t'
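
For comparison, the "uniq -u" behaviour (drop every line that has a duplicate, keep the rest in original order) can be sketched in Python with a counter. Unlike the sort pipeline, this sketch holds the whole input in memory, so it only illustrates the result; lines_occurring_once is a made-up name.

```python
from collections import Counter


def lines_occurring_once(lines):
    """Return the lines that appear exactly once, in their original order."""
    lines = list(lines)
    counts = Counter(lines)  # first pass: count every line
    # second pass: keep only lines whose total count is one
    return [line for line in lines if counts[line] == 1]


print(lines_occurring_once(["b", "a", "b", "c", "a", "d"]))  # ['c', 'd']
```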

Bence Kaulics · Accepted Answer · 2016-02-05 10:50:59Z

0

I just wanted to remove duplicates that appear on consecutive lines, not everywhere in the file. So I used:

awk '{
  if ($0 != PREVLINE) print $0;
  PREVLINE=$0;
}'
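
The same "collapse adjacent repeats only" behaviour can be sketched in Python with itertools.groupby, which groups consecutive equal items. The function name collapse_adjacent is hypothetical.

```python
from itertools import groupby


def collapse_adjacent(lines):
    """Drop a line only when it equals the line immediately before it."""
    # groupby yields one (key, group) pair per run of consecutive equal items.
    return [key for key, _group in groupby(lines)]


print(collapse_adjacent(["a", "a", "b", "a", "a", "c"]))  # ['a', 'b', 'a', 'c']
```

Note that the later "a" survives, because it is not adjacent to the first run.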

edited Feb 5, 2016 at 10:50 by Bence Kaulics

answered Feb 5, 2016 at 10:08 by speedolli

1 Comment

Mischa Molhoek Over a year ago

doesn't uniq do just that...

Master James · Accepted Answer · 2017-10-06 11:03:21Z

-4

The uniq command works, even in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html

answered Oct 6, 2017 at 11:03 by Master James
