OS X/Linux one-liner/script to find the largest recurring group of lines in a text file?

Question

I have a log containing an execution trace in which infinite recursion eventually terminates when the stack gets too deep. There are enough lines, and enough legitimate recursion nested within the larger block, that it is difficult to identify the largest block that is recurring. There is nothing unique that would require me to filter out part of each line to make this determination.

What is a good one-liner/script (in POSIX/OS X, but best if it can work in Linux and OS X) that, given a filename/pathname, could output only the largest set of lines that recur sequentially more than once?

Clarification: in my case the log file is 432003 lines and 80M:

$ wc -l long_log.txt 
432003 long_log.txt
$ du -sm long_log.txt
80  long_log.txt

To create a similar input file, try the following (thanks to the post here for the method of creating a file of random words):

ruby -e 'a=STDIN.readlines;200000.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > head.txt
ruby -e 'a=STDIN.readlines;2.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > recurrence1.txt
ruby -e 'a=STDIN.readlines;20.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > recurrence2.txt
ruby -e 'a=STDIN.readlines;200000.times do;b=[];22.times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > tail.txt
cat head.txt recurrence1.txt recurrence1.txt recurrence2.txt recurrence1.txt recurrence1.txt recurrence2.txt recurrence1.txt tail.txt > log.txt
cat recurrence1.txt recurrence1.txt recurrence2.txt > expected.txt

This results in:

$ wc -l log.txt 
400050 log.txt
$ du -sm log.txt
89  log.txt

Then you should be able to do:

$ recurrence log.txt > actual.txt
$ diff actual.txt expected.txt
$

It is also ok if it recognizes the other same-length block instead, i.e.

$ cat recurrence1.txt recurrence2.txt recurrence1.txt recurrence1.txt recurrence2.txt recurrence1.txt > expected2.txt
$ diff actual.txt expected2.txt
$

Would really like it to find the expected result in < 10 sec in OS X/Linux w/ a 2.6GHz quad-core Intel Core i7 and 16 GB memory.

If the output is structured, then awk would do it, but without any example output, it is impossible to provide a one-liner of anything...
– jasonwryan, Commented Oct 25, 2013 at 21:47
It will take a script to do this if you're looking for multiple lines.
– terdon ♦, Commented Oct 26, 2013 at 17:43
Is this something that is going to be run regularly, or just once to find a runaway recursion bug? You know, if you need some kind of watchdog system against runaway recursion in some function, maybe it's worth building into that function, so that it doesn't have to be ferreted out of logs. Add an extra depth argument and pass down depth + 1 to the recursive calls. Have a check for excessive depth: log the situation, or abort or whatever.
– Kaz, Commented Oct 29, 2013 at 3:50
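
The depth check suggested in the last comment might look like the following Python sketch. The names count_nodes, MAX_DEPTH and RecursionTooDeep, the dictionary-based tree, and the limit of 100 are all assumptions for illustration; the traced program in the question is not shown.

```python
MAX_DEPTH = 100  # assumed limit; pick one that legitimate recursion never reaches


class RecursionTooDeep(RuntimeError):
    """Raised when the explicit depth counter exceeds MAX_DEPTH."""


def count_nodes(node, depth=0):
    """Count nodes in a hypothetical tree of dicts, guarding against runaway recursion."""
    if depth > MAX_DEPTH:
        # Fail early, long before the interpreter's own stack limit is hit.
        raise RecursionTooDeep(f"depth {depth} exceeds {MAX_DEPTH}")
    children = node.get("children", [])
    # Pass depth + 1 down to every recursive call.
    return 1 + sum(count_nodes(child, depth + 1) for child in children)
```

A well-formed tree is counted normally, while a structure that refers back to itself raises RecursionTooDeep at a known point instead of filling the log.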

Kaz · Accepted Answer · 2013-10-28 17:18:43Z

Solution in the TXR language.

@(next :args)
@(bind rangelim nil)
@(block)
@  (cases)
@filename
@    (maybe)
@rlim
@      (set rangelim @(int-str rlim))
@    (end)
@    (eof)
@  (or)
@    (output)
arguments are: filename [ range-limit ]
@    (end)
@    (fail)
@  (end)
@(end)
@(do
   (defun prefix-match (list0 list1)
     (let ((c 0))
       (each ((l0 list0)
              (l1 list1))
         (if (not (equal l0 l1))
           (return c))
         (inc c))
       c))

   (defun line-stream (s)
     (let (li) (gen (set li (get-line s)) li)))

   (let* ((s (line-stream (open-file filename "r")))
          (lim rangelim)
          (s* (if lim s nil))
          (h (hash :equal-based))
          (max-len 0)
          (max-line nil))
     (for ((ln 1)) (s) ((set s (rest s)) (inc ln))
       (let ((li (first s)))
         (let ((po (gethash h li))) ;; prior occurrences
           (each ((line [mapcar car po])
                  (pos [mapcar cdr po]))
             (let ((ml (prefix-match pos s)))
               (cond ((and 
                        (= ml (- ln line))
                        (> ml max-len))
                      (set max-len ml)
                      (set max-line line))))))
         (pushhash h li (cons ln s))
         (if (and lim (> ln lim))
           (let* ((oldli (first s*))
                  (po (gethash h oldli))
                  (po* (remove-if (op eq s* (cdr @1)) po)))
             (if po*
               (sethash h oldli po*)
               (remhash h oldli))
             (set s* (cdr s*))))))
     (if max-line
       (format t "~a line(s) starting at line ~a\n" max-len max-line)
       (format t "no repeated blocks\n"))))

The program consists almost entirely of TXR's embedded Lisp dialect. The approach here is to keep each line from the file in a hash table. At any position in the file, we can ask the hash table, "at what positions have we seen this exact line before, if any?". If there are any, we can compare the lines starting at each such previous position to the lines starting at the current position. If the match extends all the way from the previous position to the current position, it means that we have a consecutive match: all the N lines from the previous position to just before the current line match the N lines starting at the current line. All we then have to do is find, among all these candidate places, the one that yields the longest match. (If there are ties, only the first one is reported.)
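
For readers without TXR, the same idea can be sketched in Python. This is a minimal illustration, not a port of the program: the function name longest_adjacent_repeat is made up here, it expects all lines in memory, and it compares exactly the lines between the earlier and the current position, so which of several overlapping candidates it reports may differ from the TXR output.

```python
from collections import defaultdict


def longest_adjacent_repeat(lines):
    """Return (length, start) of the longest block that is immediately repeated.

    start is a 0-based index into lines; (0, None) means no block repeats.
    """
    seen = defaultdict(list)  # line text -> indices where it has appeared so far
    best_len, best_start = 0, None
    n = len(lines)
    for cur, text in enumerate(lines):
        for prev in seen[text]:
            gap = cur - prev
            # Skip candidates that cannot beat the best so far, or whose
            # second copy would run past the end of the input.
            if gap <= best_len or cur + gap > n:
                continue
            # A consecutive repeat means lines[prev:cur] equals lines[cur:cur+gap].
            if lines[prev:cur] == lines[cur:cur + gap]:
                best_len, best_start = gap, prev
        seen[text].append(cur)
    return best_len, best_start
```

Called as longest_adjacent_repeat(open(path).read().splitlines()), it returns the block length and the 0-based index of the block's first line. Like the TXR program, it slows down when many lines are identical, because every earlier occurrence is a candidate.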

Hey look, there is a repeating two-line sequence in an Xorg log file:

$ txr longseq.txr  /var/log/Xorg.0.log
2 line(s) starting at line 168

What's at line 168? These four lines:

[    19.286] (**) VBoxVideo(0):  Built-in mode "VBoxDynamicMode": 56.9 MHz (scaled from 0.0 MHz), 44.3 kHz, 60.0 Hz
[    19.286] (II) VBoxVideo(0): Modeline "VBoxDynamicMode"x0.0   56.94  1280 1282 1284 1286  732 734 736 738 (44.3 kHz)
[    19.286] (**) VBoxVideo(0):  Built-in mode "VBoxDynamicMode": 56.9 MHz (scaled from 0.0 MHz), 44.3 kHz, 60.0 Hz
[    19.286] (II) VBoxVideo(0): Modeline "VBoxDynamicMode"x0.0   56.94  1280 1282 1284 1286  732 734 736 738 (44.3 kHz)

On the other hand, the password file is all unique:

$ txr longseq.txr  /etc/passwd
no repeated blocks

The additional second argument can be used to speed up the program. If we know that the longest repeating sequence is, say, no more than 50 lines, then we can specify this. The program will then not look back farther than 50 lines. Furthermore, the memory use is proportional to the range size, not to the file size, so we win in another way.
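
The effect of a range limit can also be sketched in Python. The following is a different technique from the TXR program, chosen because it is short: for every gap g up to the limit it counts how many consecutive lines have equalled the line g positions earlier, and a count of at least g means a g-line block has just been repeated. The name longest_repeat_within and its limit parameter are assumptions made for this illustration.

```python
from collections import deque


def longest_repeat_within(lines, limit):
    """Return (length, start) of the longest immediately repeated block of at most limit lines.

    lines may be any iterable of strings, including an open file.
    start is 0-based; (0, None) means no block repeats.
    """
    recent = deque(maxlen=limit)  # only the last `limit` lines are kept
    run = [0] * (limit + 1)       # run[g]: consecutive lines equal to the line g earlier
    best_len, best_start = 0, None
    for i, text in enumerate(lines):
        for g in range(1, min(limit, len(recent)) + 1):
            if recent[-g] == text:
                run[g] += 1
                # g matches in a row: lines i-g+1..i repeat lines i-2g+1..i-g.
                if run[g] >= g and g > best_len:
                    best_len, best_start = g, i - 2 * g + 1
            else:
                run[g] = 0
        recent.append(text)
    return best_len, best_start
```

Memory use depends on limit rather than on the file size, at the cost of up to limit comparisons per input line.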

+1 Thanks a ton for the script. Unfortunately, it was pretty slow for the large log file I have. Maybe it is just a tough problem to solve quickly. I've added some additional info to the question so you can see.
– Gary S. Weaver, Commented Oct 28, 2013 at 14:11
Note that with the current script aliased as recurrence, time recurrence log.txt > actual.txt takes 1m31s (real 1m31.452s), which isn't terrible. It would be nice if it could run in < 10 sec. Also, for some reason the output includes some additional text, as shown in the diff (24 line(s) starting at line 200003 for me, at least). Maybe I made a mistake above in an assumption?
– Gary S. Weaver, Commented Oct 28, 2013 at 14:24
Actually, I think the diff between expected and actual is ok in this case, as it is just including the last part rather than the first as part of the recurring block. So, really the only slight problem is the amount of time it takes. But this is a good solution.
– Gary S. Weaver, Commented Oct 28, 2013 at 14:41
One thing that affects the performance of the script is the presence of numerous "hits" for the viable start of a match, which happens if there are many identical lines in the file. Execution traces from programs will invariably be like this. If you run this on a file that consists of nothing but identical lines, then it's like the hash table is not even there.
– Kaz, Commented Oct 28, 2013 at 15:11
Thanks, @Kaz. Still waiting on the real log to be parsed. Though it is the same size as the example above, it has more duplicate lines, so it is taking much longer than one with more unique lines. Once it is complete, I'll make the change you suggested and see how much that helps. Just a tough problem to solve quickly and simply, I guess.
– Gary S. Weaver, Commented Oct 28, 2013 at 15:24

Community · Accepted Answer · 2017-05-23 12:40:03Z

It turns out that the fastest and simplest way to find large blocks of duplication in a large log for me, especially when there is a lot of repetition, is:

sort long_log.txt | uniq -c | sort -k1n

(Pieced together from answers here and here.)

That took 54 sec for long_log.txt, the file with more repetition (which appears to be a problem for a script that does exactly what I was asking for), and 47 sec for the randomly generated one, log.txt.

The lines are out of order, and if there is recursion within the recursion, it may group those lines separately (since they may have an even greater count), but perhaps you could use the data from this method and then go back into the log to find and extract the relevant portion(s).

This command could be put into your .bashrc/.bash_profile as a function:

recurrence() {
  sort "$1" | uniq -c | sort -k1n
}

So that it could be called as:

recurrence long_log.txt
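
As a rough Python sketch of the follow-up step mentioned above (using the counts to go back into the log), the function below counts identical lines the way sort | uniq -c does and also records where each line first appeared. The name frequent_lines and its top parameter are invented for this example.

```python
from collections import Counter


def frequent_lines(lines, top=10):
    """Return [(count, first_line_number, text)] for the most repeated lines.

    lines may be any iterable of strings, such as an open file.
    Line numbers are 1-based; lines that occur only once are left out.
    """
    counts = Counter()
    first_seen = {}
    for number, text in enumerate(lines, start=1):
        text = text.rstrip("\n")
        counts[text] += 1
        first_seen.setdefault(text, number)  # remember the first occurrence only
    return [
        (count, first_seen[text], text)
        for text, count in counts.most_common(top)
        if count > 1
    ]
```

The first line numbers give a place to start reading in the log, which the sorted uniq -c output alone does not.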

user26053 · Accepted Answer · 2013-11-01 23:54:16Z

Here is a solution in bash for you. I actually have a script for it, but here is the one-liner:

find $PWD -regextype posix-extended -iregex '.*\.(php|pl)$' -type f | xargs wc -L 2> /dev/null | grep -v 'total' | sort -nrk1 | head -n 30 | awk 'BEGIN { printf "\n%-15s%s\n", "Largest Line", "File"; } { printf "%-15s%s\n", $1, $2; }'

I use it to find hacked files on hacked sites, so for other uses you can remove the -regextype and -iregex options.

Are you sure that the answer you provided matches the question? It isn't about finding the largest line.
– Gary S. Weaver, Commented Nov 4, 2013 at 2:11
