Another approach (worth posting as a separate answer): instead of the split-file approach, which creates temp files, do the batching within the uniqifier software itself. For example, using the Ruby implementation for explanatory purposes (since it has a low line count):

require 'set'
line_batch_count = 50000 # tunable parameter
lines_seen = Set.new
line_number = 0
ARGF.each do |line|
  line_number += 1
  # start a fresh batch: forget everything seen so far
  if (line_number % line_batch_count) == 0
    lines_seen.clear
  end
  # print each line the first time it appears within the current batch
  unless lines_seen.include? line
    puts line
    lines_seen << line
  end
end

The idea is to clear out the hash-set every `line_batch_count` lines, which caps memory use; the trade-off is that deduplication becomes iterative rather than exact in a single pass:

$ cat uniqm-input.txt | ruby uniqm-capped.rb | wc -l
   20021

$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | wc -l
    1001

$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | head
1506
1054
1623
1002
1173
1400
1226
1340
1824
1091

So you could run this capped version repeatedly, until the line-count doesn't change from one iteration to the next.
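To make the "run repeatedly until stable" step concrete, here is a sketch of a fixed-point driver in Ruby. The helper names `capped_uniq` and `capped_uniq_fixpoint` are mine (not part of the original script), and this version works over an in-memory array of lines rather than `ARGF`, purely for illustration; the batch-clearing logic is the same as above.

```ruby
require 'set'

# One batch-clearing pass: same logic as the capped uniqifier above,
# but applied to an in-memory array of lines.
# (capped_uniq is a hypothetical helper name, not from the original.)
def capped_uniq(lines, batch_count = 50_000)
  seen = Set.new
  out = []
  lines.each_with_index do |line, idx|
    # clear the set every batch_count lines, as in the script above
    seen.clear if ((idx + 1) % batch_count).zero?
    next if seen.include?(line)
    out << line
    seen << line
  end
  out
end

# Re-run the capped pass until a pass removes nothing,
# i.e. until the line count stops shrinking.
def capped_uniq_fixpoint(lines, batch_count = 50_000)
  loop do
    out = capped_uniq(lines, batch_count)
    return out if out.size == lines.size
    lines = out
  end
end
```

Each pass can only shrink (or preserve) the line count, so the loop is guaranteed to terminate; with a batch size at least as large as the input, a single pass is already exact.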

Note that this capped-uniqm technique is language-independent: you can clear the lines_seen set every N lines whether you are using awk, Python, Perl, C++, etc.