Revisions to Remove duplicate lines while keeping the order of the lines

stick more closely to the OP's question

edited Nov 20, 2016 at 15:37

Another approach (worth posting as a separate answer): instead of the split-file approach, which creates temp files, do the batching within the uniqifier software itself. For example, using a Ruby uniqifier implementation for explanatory purposes:

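require 'set'
line_batch_count = 50000 # tunable parameter
lines_seen = Set.new
line_number = 0
ARGF.each do |line|
  line_number += 1
  # Forget everything seen so far once per batch, which caps memory use
  if (line_number % line_batch_count) == 0
    lines_seen.clear
  end
  # Print a line only the first time it appears within the current batch
  unless lines_seen.include? line
    puts line
    lines_seen << line
  end
end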

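The idea is to clear out the hash-set every so often, which caps its memory use. A single pass then removes only the duplicates that fall within the same batch, so this becomes iterative: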

$ cat uniqm-input.txt | ruby uniqm-capped.rb | wc -l
   20021

$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | wc -l
    1001

$ cat uniqm-input.txt | ruby uniqm-capped.rb | ruby uniqm-capped.rb | head
1506
1054
1623
1002
1173
1400
1226
1340
1824
1091

So you could run this capped version repeatedly, until the line-count doesn't change from one iteration to the next.
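
The repeat-until-stable loop can be sketched in Ruby as below. This is only an illustration under some assumptions: the method names capped_uniq and capped_uniq_until_stable are made up for this sketch, and it holds all lines in memory for brevity, which a real run would avoid by streaming through pipes or files as in the shell commands above. Note also that a stable line count only means another pass would remove nothing; duplicates that always land in different batches can still survive.

```ruby
require 'set'

# One capped pass (hypothetical helper): drop lines already seen since the
# last reset, forgetting everything every batch_size lines.
def capped_uniq(lines, batch_size)
  seen = Set.new
  out = []
  lines.each_with_index do |line, index|
    # Same reset rule as the script above: clear when the 1-based line
    # number is a multiple of the batch size.
    seen.clear if ((index + 1) % batch_size).zero?
    next if seen.include?(line)
    out << line
    seen << line
  end
  out
end

# Repeat passes until a pass removes nothing (hypothetical helper).
# Each unfinished pass shrinks the input, so the loop terminates.
def capped_uniq_until_stable(lines, batch_size)
  loop do
    result = capped_uniq(lines, batch_size)
    return result if result.length == lines.length
    lines = result
  end
end
```

For example, capped_uniq_until_stable(%w[a a b b a b], 4) shrinks the input over several passes and ends with the two lines a and b, whereas %w[a b a c b a] with the same batch size stabilizes while still containing repeats.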

Note that this capped-uniqm technique is language-independent: you can clear the lines_seen set every N lines whether you are using awk, Python, Perl, C++, etc. There are set-clear methods for all these languages; I believe awk's whole-array delete is non-standard but common.

added 168 characters in body

edited Nov 20, 2016 at 15:14

John Kerl

deleted 4 characters in body

edited Nov 20, 2016 at 14:54

John Kerl

answered Nov 20, 2016 at 14:48

John Kerl
