Another approach (worth posting as a separate answer): instead of the split-file approach, which creates temp files, do the batching within the uniqifier software itself. For example, here is a Ruby uniqifier implementation, used for explanatory purposes since it has a low line count:
require 'set'
line_batch_count = 50000 # tunable parameter
lines_seen = Set.new
line_number = 0
ARGF.each do |line|
  line_number += 1
  # Periodically dump the set, capping memory usage
  if (line_number % line_batch_count) == 0
    lines_seen.clear
  end
  # Print a line only the first time it appears within the current batch
  unless lines_seen.include? line
    puts line
    lines_seen << line
  end
end
The idea is to clear out the hash-set every so often. The deduplication then becomes iterative:
$ cat uniqm-input.txt | ruby uniqm-capped.rb | wc -l
20021
$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | wc -l
1001
$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | head
1506
1054
1623
1002
1173
1400
1226
1340
1824
1091
So you could run this capped version repeatedly until the line count stops changing from one iteration to the next.
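That iteration can be automated with a small shell loop. A minimal sketch (the out.txt/next.txt working-file names are invented here; the input and script names are from the transcript above):

prev=-1
cp uniqm-input.txt out.txt                # work on a copy of the input
while [ "$(wc -l < out.txt)" != "$prev" ]; do
  prev=$(wc -l < out.txt)                 # remember the current line count
  ruby uniqm-capped.rb < out.txt > next.txt && mv next.txt out.txt
done

Note that the loop necessarily does one final pass that changes nothing, since that is the only way to observe that the count has stabilized.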
Note that this capped-uniqm technique is language-independent: you can clear the lines_seen set every N lines whether you are using awk, Python, Perl, C++, etc. There are set-clear methods in all of these languages; I believe awk's whole-array delete is non-standard but common.
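For instance, here is a sketch of the same batching idea in awk, using the same 50000 batch size (the array name seen is chosen here; if your awk lacks whole-array delete, the portable idiom split("", seen) clears it instead):

awk '
  (NR % 50000) == 0 { delete seen }           # clear the array every N lines
  !($0 in seen)     { print; seen[$0] = 1 }   # print first occurrence in batch
' uniqm-input.txt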