I have a log containing an execution trace in which infinite recursion eventually terminates when the stack gets too deep. The log is long, and the recurring block itself contains legitimate nested recursion, so it is difficult to identify the largest block that recurs. Lines can be compared whole: nothing in them is unique, so no part of a line needs to be filtered out to make this determination.
What is a good one-liner/script (in POSIX/OS X, but best if it can work in Linux and OS X) that, given a filename/pathname, could output only the largest set of lines that recur sequentially more than once?
Clarification: in my case the log file is 432003 lines and 80 MB:
$ wc -l long_log.txt
432003 long_log.txt
$ du -sm long_log.txt
80 long_log.txt
To create a similar input file, try the following (thanks to the post here for the method of creating a file containing random words):
ruby -e 'a=STDIN.readlines;200000.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > head.txt
ruby -e 'a=STDIN.readlines;2.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > recurrence1.txt
ruby -e 'a=STDIN.readlines;20.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > recurrence2.txt
ruby -e 'a=STDIN.readlines;200000.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > tail.txt
cat head.txt recurrence1.txt recurrence1.txt recurrence2.txt recurrence1.txt recurrence1.txt recurrence2.txt recurrence1.txt tail.txt > log.txt
cat recurrence1.txt recurrence1.txt recurrence2.txt > expected.txt
This results in:
$ wc -l log.txt
400050 log.txt
$ du -sm log.txt
89 log.txt
Then you should be able to do:
$ recurrence log.txt > actual.txt
$ diff actual.txt expected.txt
$
It is also ok if it recognizes the other same-length block instead, i.e.
$ cat recurrence1.txt recurrence2.txt recurrence1.txt > expected2.txt
$ diff actual.txt expected2.txt
$
Ideally it would find the expected result in under 10 seconds on OS X/Linux with a 2.6 GHz quad-core Intel Core i7 and 16 GB of memory.
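To make the requirement concrete, here is a minimal Ruby sketch of one possible approach, not a tuned solution. It treats "recurs sequentially" as "a block of lines immediately followed by an identical copy of itself" and compares whole lines. The method name longest_adjacent_repeat is made up for illustration. The sketch assumes most lines in the log are unique: it only tests block lengths equal to the distance between two identical lines, which can become slow if a single line occurs thousands of times. It has not been measured against the 10-second target.

```ruby
#!/usr/bin/env ruby
# Sketch: print the largest block of lines that is immediately followed by
# an identical copy of itself. All names here are illustrative assumptions.

# Returns [start, length] such that
# lines[start, length] == lines[start + length, length], or nil if none.
def longest_adjacent_repeat(lines)
  n = lines.size

  # Group line indices by line content.
  positions = Hash.new { |hash, key| hash[key] = [] }
  lines.each_with_index { |line, i| positions[line] << i }

  # A block of length len can only repeat if some line recurs exactly len
  # lines later, so only those distances are candidate block lengths.
  candidates = {}
  positions.each_value do |indices|
    indices.combination(2) do |a, b|
      candidates[b - a] = true if b - a <= n / 2
    end
  end

  # Try the longest candidate first, so the first hit is the largest block.
  candidates.keys.sort.reverse_each do |len|
    run = 0 # consecutive indices i with lines[i] == lines[i + len]
    (0...(n - len)).each do |i|
      run = lines[i] == lines[i + len] ? run + 1 : 0
      # len matches in a row means the block starting at i - len + 1 repeats.
      return [i - len + 1, len] if run == len
    end
  end
  nil
end

# Only run as a script when exactly one pathname is given.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 1
  lines = File.readlines(ARGV[0])
  start, length = longest_adjacent_repeat(lines)
  puts lines[start, length] if start
end
```

Saved under a hypothetical name such as recurrence.rb, this could be invoked as ruby recurrence.rb log.txt > actual.txt. When several blocks of the same length qualify, it reports the earliest one.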
awk would do it, but without any example output, it is impossible to provide a one-liner of anything...depth argument and pass down depth + 1 to the recursive calls. Have a check for excessive depth: log the situation, or abort or whatever.
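The depth-argument suggestion above could look roughly like this in Ruby. This is only a sketch: the traced program is not shown in the question, so walk, MAX_DEPTH, DepthExceeded, and the hash-based node layout are all hypothetical stand-ins.

```ruby
# Sketch of a recursion depth guard. Every name here is hypothetical.
MAX_DEPTH = 100 # assumed limit; pick one that suits the real call depth

class DepthExceeded < StandardError; end

# Counts the nodes in a tree of hashes shaped like { children: [...] }.
# Each recursive call receives depth + 1, so runaway recursion is caught
# long before the interpreter's own stack limit is reached.
def walk(node, depth = 0)
  if depth > MAX_DEPTH
    warn "recursion depth #{depth} exceeds #{MAX_DEPTH}" # log the situation
    raise DepthExceeded, "aborting at depth #{depth}"    # then abort
  end
  children = node.fetch(:children, [])
  1 + children.sum { |child| walk(child, depth + 1) }
end
```

With a guard like this, the failure is reported at a known depth with a clear message, instead of producing hundreds of thousands of trace lines to sift through afterwards.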