28
[root@server]# awk '!seen[$0]++' out.txt > cleaned
awk: (FILENAME=out.txt FNR=8547098) fatal error: internal error
Aborted
[root@server]#

The ""server"" has: 8 GByte RAM + 16 GByte SWAP, x>300 GByte free space, amd64, desktop CPU. Scientific Linux 6.6. Nothing else runs on it to make LOAD. Awk aborts after a few seconds.. out.txt is ~1.6 GByte. GNU Awk 3.1.7.

Question: How can I remove the duplicate lines while keeping the order of the lines? Case matters too: e.g. "A" and "a" are two different lines and both have to be kept, but a second "a" is a duplicate and only the first one is needed.

The answer can use anything; if awk is not good for this, then perl/sed is fine. What could the problem be?
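For illustration (this tiny sample is not in the original post), the intended behavior is: keep the first occurrence of each exact line, respect case, and preserve order.

$ cat sample.txt
A
a
a
B
A
$ awk '!seen[$0]++' sample.txt
A
a
B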

[root@server]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61945
max locked memory       (kbytes, -l) 99999999
max memory size         (kbytes, -m) unlimited
open files                      (-n) 999999
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 99999999
cpu time               (seconds, -t) unlimited
max user processes              (-u) 61945
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@server]# 

Update: I tried this on a RHEL machine; it doesn't abort, but I didn't have time to wait for it to finish. Why does Scientific Linux differ from RHEL?

Update: I'm trying it on an Ubuntu 14 virtual guest, and so far it works! It's not an ulimit problem; that box has mawk 1.3.3.
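(Not in the original post: one quick way to check which awk implementation a box runs -- both gawk and mawk answer -W version, and on Debian/Ubuntu awk is an alternatives symlink.)

$ awk -W version 2>&1 | head -n1
$ readlink -f "$(command -v awk)"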

root@asdf-VirtualBox:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 51331
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 51331
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
root@asdf-VirtualBox:~# 
4
  • 2
    There are no duplicate lines in your example...? Commented Apr 7, 2015 at 9:51
  • 1
    What are the awk versions on the two machines? Commented Apr 7, 2015 at 9:57
  • Up-to-date RHEL and up-to-date Scientific Linux; I don't know the RHEL awk version. SL has GNU Awk 3.1.7. Commented Apr 7, 2015 at 10:02
  • How large is out.txt? Does the same command work if you try it on a smaller file? How many users on the machine? Was there enough available memory for the process? Is there anything special about line 8547098 of the input file? Commented Apr 7, 2015 at 10:46

6 Answers

43

I doubt it will make a difference but, just in case, here's how to do the same thing in Perl:

perl -ne 'print if ++$k{$_}==1' out.txt

If the problem is keeping the unique lines in memory, that will have the same issue as the awk you tried. So, another approach could be:

cat -n out.txt | sort -k2 -k1n  | uniq -f1 | sort -nk1,1 | cut -f2-

How it works:

  1. On a GNU system, cat -n will prepend the line number to each line, padded with some leading spaces and followed by a <tab> character. cat pipes this representation of the input to sort.

  2. sort's -k2 option instructs it to consider only the characters from the second field to the end of the line when sorting; sort splits fields on whitespace by default (here, cat's inserted spaces and the <tab>).
    When followed by -k1n, sort considers the 2nd field first and then, in the case of identical -k2 fields, considers the 1st field, sorted numerically. So repeated lines are sorted together, in the order in which they appeared.

  3. The results are piped to uniq, which is told to ignore the first field (-f1, fields again being separated by whitespace). This yields a list of the unique lines of the original file, which is piped back to sort.
  4. This time sort sorts numerically on the first field (cat's inserted line number), restoring the order of the original file, and pipes the result to cut.
  5. Lastly, cut removes the line numbers that cat inserted. This is effected by cut printing only from the 2nd field through the end of the line (cut's default delimiter is a <tab> character).

To illustrate:

$ cat file
bb
aa
bb
dd
cc
dd
aa
bb
cc
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1 | cut -f2-
bb
aa
dd
cc
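To make the intermediate stages concrete, this is roughly what GNU cat/sort/uniq produce for the sample file above (the leading spaces and the <tab> come from cat -n; the exact padding width may differ):

$ cat -n file | sort -k2 -k1n
     2  aa
     7  aa
     1  bb
     3  bb
     8  bb
     5  cc
     9  cc
     4  dd
     6  dd
$ cat -n file | sort -k2 -k1n | uniq -f1 | sort -nk1,1
     1  bb
     2  aa
     4  dd
     5  cc

after which cut -f2- strips the line numbers.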
10
  • 1
    Hi Terdon, the OP needs to keep the order of lines, so the cat|sort|uniq method will not work... Like your perl version though... Commented Apr 7, 2015 at 11:09
  • 2
    Nice solution with sort! But most sort implementations can do uniq by themselves, so you can shorten your script to sort -uk2 | sort -bk1,1n Commented Apr 7, 2015 at 11:22
  • @Costas is it most sort? I thought -u was a GNU feature. Commented Apr 7, 2015 at 12:14
  • @don_crissti ah, so it is, thanks. How could I use it here though? As I just noticed (and edited to fix), I need to sort on the 2nd field first and then on the 1st numerically to keep the line order. How can I then use -u and specify that it should ignore the 1st field? According to man sort, the -u is not one of the possible options for -f, so I don't think it can be used here. Commented Apr 7, 2015 at 12:25
  • 1
    this is the Schwartzian transform! (+1) Commented Apr 7, 2015 at 17:49
7
#!/usr/bin/perl 
use DB_File;
tie %h, 'DB_File';

while(<>){ not $h{$_} and print and $h{$_}=1 }
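A hedged usage sketch (dbfile-uniq is the script name used in the benchmark below; out.txt and cleaned are the asker's file names):

perl dbfile-uniq out.txt > cleaned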

EDIT 1: Does it really work? (comparing)

Sol1: terdon et al.'s Schwartzian-transform-like one-liner
    cat -n _1 | sort -uk2 | sort -nk1 | cut -f2-

Sol2: perl + DB_File (this answer)
    perl dbfile-uniq _1

Sol3: the OP's awk (John W. Gill's solution has a similar behavior)
    awk '!seen[$0]++' _1

Sol4: terdon's perl
    perl -ne 'print if ++$k{$_}==1' _1

Case 1: 100_000_000 random numbers (5 digits each), 566 MB, 31_212 distinct values:

$ while true ; do echo $RANDOM; done | head -100000000 > _1

Case 2: 50_000_000 random numbers (10 digits each), 516 MB, 48_351_464 distinct values:

$ shuf _1 |  sed 'N;s/\n/ /' > _11

(The following timings are not very precise):

┌────────┬────────┬────────────────┬────────┬──────┐
│        │ Sol1   │ Sol2           │ Sol3   │ Sol4 │
│        │ sort...│ perl DB        │ awk    │ perl │
├────────┼────────┼────────────────┼────────┼──────┤
│ case 1 │ 6m15   │ 6m17           │ 0m28   │ 0m28 │
├────────┼────────┼────────────────┼────────┴──────┤
│ case 2 │ 11m15  │ 81m44          │ out of memory │
├────────┼────────┼────────────────┼────────┬──────┤
│ case 2 │        │ 5m54 /cache=2G │        │      │
└────────┴────────┴────────────────┴────────┴──────┘

Sol2 with a cache is:

use DB_File;
use Fcntl ;

$DB_HASH->{'cachesize'} = 2000_000_000;
tie %h, 'DB_File', "_my.db", O_RDWR|O_CREAT|O_TRUNC, 0640, $DB_HASH;

while(<>){ not $h{$_} and print and $h{$_}=1 }

sort could also be optimized by adding a cache size option (not done).
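For reference, a hedged sketch of what that could look like using GNU sort's -S/--buffer-size option (not part of the measurements above; 2G is an arbitrary value):

cat -n _1 | sort -S 2G -uk2 | sort -S 2G -nk1 | cut -f2-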

One quick conclusion:

  • sort is a fantastic command!
4
  • 1
    sort -uk2 and sort -nk1,1 are different. The first considers from the 2nd key to the end of the line, the second considers only the first key. You should change your sort -nk1 there - it might even be faster that way, but it will definitely be more reliable. By the way - those are some pretty boxes. Commented Apr 8, 2015 at 16:03
  • @mikeserv, thank you for the comment. As k1,1 is unique, sort -nk1 and sort -nk1,1 return the same result. I tried both; the result was the same and the time was not noticeably different. Commented Apr 8, 2015 at 17:32
  • That makes sense - thanks for trying it, though. So cat -n does a tab? I don't know how that command works. Commented Apr 8, 2015 at 18:02
  • 1
    @mikeserv, happily cat -n transforms each line into spaces + the number + \t + line -- the ideal format for sort and cut. Commented Apr 8, 2015 at 18:33
2

I've used

awk -v BINMODE=rw '!($0 in a){a[$0];print}' infile >> outfile

BINMODE=rw: to keep end-of-line terminators happy (I live in a mixed OS environment).

The logic is simple:

If the current line isn't in the associative array, add it to the associative array and print it to output.

There may be memory limitations with this approach. For very large files and sets of files, I've used variations on this, using file storage to get past the limitations.

0

The order-preserving semantics of your problem have a marvelous property: you can subdivide the problem. You can run split -l 1000000 on the input file; the 1000000-line pieces it produces have lexically-ordered names, which is good. Then uniqify the pieces, and then (as a second pass) uniqify the outputs of those.

This solves the out-of-memory problem (by capping the memory requirement) at the expense of turning it into a multipass solution.

Specifically:

Generate input data:

$ cat make-uniqm-input.py
#!/usr/bin/env python
import random
n = 1000000
for i in xrange(0, n):
    print random.randint(1000, 2000)

$ python make-uniqm-input.py  > uniqm-input.txt

$ wc -l uniqm-input.txt
 1000000 uniqm-input.txt

Split up the input data:

$ split -l 10000 uniqm-input.txt

$ ls x?? | head
xaa
xab
xac
xad
xae
xaf
xag
xah
xai
xaj

$ ls x?? | wc -l
     100

$ cat x?? | wc -l
 1000000

Run the uniqifier all at once (retains all unique input lines in memory):

# 'uniqm' is any order-preserving uniq implementation, such as
# gawk '!counts[$0]++'.
$ uniqm < uniqm-input.txt > output-no-splitting.txt

$ wc -l output-no-splitting.txt
    1001 output-no-splitting.txt

Run the uniqifier on split pieces (retains only unique input lines from each piece in memory), then reduce as a second pass:

$ for x in x??; do uniqm < $x; done | uniqm > output-with-splitting.txt

$ wc -l output-with-splitting.txt
    1001 output-with-splitting.txt

Compare:

$ diff output-no-splitting.txt output-with-splitting.txt

$ head uniqm-input.txt
1506
1054
1623
1002
1173
1400
1226
1340
1824
1091

$ head output-with-splitting.txt
1506
1054
1623
1002
1173
1400
1226
1340
1824
1091

I don't know the ratio of unique to non-unique lines in your input, nor how well-mixed the input lines are -- so there is some tuning to do in terms of the number of split files you need.
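As a hedged sketch of that tuning: with GNU split you can also pick a piece count directly instead of a fixed line count (20 pieces and the piece. prefix are arbitrary choices, not from the original answer):

$ split -n l/20 uniqm-input.txt piece.
$ for x in piece.*; do uniqm < "$x"; done | uniqm > output-with-splitting.txt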

0

Another approach (worth posting as a separate answer) is: instead of the split-file approach which creates temp files, do the batching within the uniqifier software itself. For example, using a Ruby uniqifier implementation for explanatory purposes:

require 'set'
line_batch_count = 50000 # tunable parameter
lines_seen = Set.new
line_number = 0
ARGF.each do |line|
  line_number += 1
  if (line_number % line_batch_count) == 0
    lines_seen.clear
  end
  unless lines_seen.include? line
    puts line
    lines_seen << line
  end
end

The idea is to clear out the hash-set every so often. Then this becomes iterative:

$ cat uniqm-input.txt | ruby uniqm-capped.rb | wc -l
   20021

$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | wc -l
    1001

$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | head
1506
1054
1623
1002
1173
1400
1226
1340
1824
1091

So you could run this capped version repeatedly, until the line-count doesn't change from one iteration to the next.
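A hedged sketch of that iteration in shell (the file names are made up; a pass that removes nothing means the output is fully deduplicated, so the loop stops):

cp uniqm-input.txt pass.txt
while :; do
  ruby uniqm-capped.rb < pass.txt > next.txt
  [ "$(wc -l < next.txt)" -eq "$(wc -l < pass.txt)" ] && break
  mv next.txt pass.txt
done
mv next.txt deduped.txt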

Note that this capped-uniqm technique is language-independent: you can clear the lines_seen array every N lines whether you are using awk, python, perl, C++, etc. There are set-clear methods for all these languages; I believe awk's delete is non-standard but common.
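For instance, a hedged awk rendering of the same idea (50000 mirrors line_batch_count above; delete seen on a whole array is the common-but-non-standard extension just mentioned):

awk 'NR % 50000 == 0 { delete seen } !seen[$0]++' out.txt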

0

Using Raku (formerly known as Perl_6)

raku -ne 'state %seen; .put if ++%seen{$_} == 1;'

Sample Input:

bb
aa
bb
dd
cc
dd
aa
bb
cc

Sample Output:

bb
aa
dd
cc

Above is a solution coded in Raku, a member of the Perl family of programming languages. Note the need to declare the %seen hash using the state declarator (keyword). [Compare to @terdon's Perl(5) code elsewhere in this thread.]

https://docs.raku.org/syntax/state
https://raku.org
