Diff head of files

Question

I have two files. One file, I suspect, is a subset of the other. Is there a way to diff the files to identify (in a succinct manner) where in the first file the second file fits?

Do you mean the lines of one file are a subsequence of the other, or actually a contiguous substring?
– Kaz, Commented Oct 30, 2013 at 3:40

score 14 · Accepted Answer · 2013-10-29 19:52:39Z

diff -e bigger smaller will do the trick, but requires some interpretation, as the output is a "valid ed script".

I made two files, "bigger" and "smaller", where the contents of "smaller" are identical to lines 5 through 9 of "bigger". Running `diff -e bigger smaller` got me:

% diff -e bigger smaller
10,15d
1,4d

Which means "delete lines 10 through 15 of 'bigger', and then delete lines 1 through 4, to get 'smaller'". That means "smaller" is lines 5 through 9 of "bigger".

Reversing the file names got me something more complicated. If "smaller" truly constitutes a subset of "bigger", only 'd' (for delete) commands will show up in the output.
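If you would rather not read the ed script by eye, the interpretation above can be automated. The following is only a sketch in Python: the function name kept_range is made up for illustration, and it assumes you pass in the text printed by diff -e together with the number of lines in "bigger". It accepts the script only if every command is a delete and the surviving lines form one contiguous block.

```python
import re

def kept_range(ed_script, total_lines):
    """Return (first, last), the 1-based lines of the bigger file that survive
    a deletions-only ed script, or None if the script contains anything other
    than 'd' commands or leaves a non-contiguous set of lines."""
    keep = [True] * (total_lines + 1)          # index 0 is unused
    for line in ed_script.splitlines():
        line = line.strip()
        if not line:
            continue
        m = re.fullmatch(r"(\d+)(?:,(\d+))?d", line)
        if m is None:
            return None                        # an 'a' or 'c' command: not a pure subset
        start = int(m.group(1))
        end = int(m.group(2) or start)
        if start < 1 or end > total_lines or start > end:
            return None                        # script does not fit this file
        # diff -e lists commands from the end of the file backwards, so every
        # line number still refers to the original, unedited file.
        for n in range(start, end + 1):
            keep[n] = False
    kept = [n for n in range(1, total_lines + 1) if keep[n]]
    if not kept or kept[-1] - kept[0] + 1 != len(kept):
        return None
    return kept[0], kept[-1]

print(kept_range("10,15d\n1,4d\n", 15))
```

With the example script above and a 15-line "bigger", this reports lines 5 through 9. A script such as "4d" followed by "2d" returns None, because the surviving lines would not be contiguous.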

terdon · Accepted Answer · 2013-10-29 20:12:37Z

5

You can do this visually with meld. Unfortunately, it is a GUI tool, but if you just want to do this once, and on a relatively small file, it should be fine.

[Image: output of meld a b]

edited Oct 29, 2013 at 20:12

answered Oct 29, 2013 at 19:47


Meld's nice, but it doesn't play quite as well with 100MB+ files.
– Richard, Commented Oct 29, 2013 at 20:10

@Richard no it doesn't and I would prefer a command line tool anyway, I just thought I'd mention it.
– terdon, Commented Oct 29, 2013 at 20:12

Looks a lot like vimdiff, which is available in terminal.
– phemmer, Commented Nov 5, 2013 at 23:35


Joseph R. · Accepted Answer · 2013-10-29 20:03:26Z

2

If the files are small enough, you can slurp them both into Perl and have its regex engine do the trick:

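perl -0777e '
        # -0777 undefines $/, so each <...> read below returns a whole file
        open $FILE1, "<", "file_1" or die "file_1: $!";
        open $FILE2, "<", "file_2" or die "file_2: $!";
        $file_1 = <$FILE1>;
        $file_2 = <$FILE2>;
        # \Q...\E quotes file_2 so it is matched literally, not as a regex
        print "file_2 is", $file_1 =~ /\Q$file_2\E/ ? "" : " not";
        print " a subset of file_1\n";
'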

The -0777 switch instructs Perl to set its input record separator $/ to the undefined value so as to slurp files completely.

edited Oct 29, 2013 at 20:03

answered Oct 29, 2013 at 19:50


What does 777 do? I take it you are passing NULL as $/ but why? Also since these are kinda esoteric switches, an explanation would be nice for the non-perl people.
– terdon, Commented Oct 29, 2013 at 19:57

@terdon I am indeed doing it to slurp the files whole. Explanation added.
– Joseph R., Commented Oct 29, 2013 at 20:04

But why is that necessary? $a=<$fh> should slurp anyway right?
– terdon, Commented Oct 29, 2013 at 20:06

@terdon Not that I know of, no. By default $/ is set to \n so that $a=<$fh> would read only one line of the file $fh has been opened to. Unless of course perl's command-line behavior has different defaults that I'm unaware of?
– Joseph R., Commented Oct 29, 2013 at 20:09

Argh, yes, my bad, I almost never slurp files or use the while $foo=<FILE> idiom so I wasn't sure and ran a (wrong) test which seemed to work. Never mind :).
– terdon, Commented Oct 29, 2013 at 20:11


Stéphane Chazelas · Accepted Answer · 2013-10-29 21:14:29Z

1

If the files are text files, and smaller starts at the beginning of a line within bigger, it's not too difficult to implement with awk:

awk -v i=0 'NR==FNR{l[n++]=$0;next}
    {if ($0 == l[i]) {if (++i == n) {print FNR-n+1;exit}} else i=0}
    ' smaller bigger
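One caveat with resetting the counter on a mismatch: a match can be skipped when the opening lines of smaller repeat in bigger. For example, if bigger is the lines A, A, B and smaller is A, B, the second A resets the counter and the match at line 2 is missed. A sliding-window comparison avoids this. The Python below is a rough sketch under that assumption, not a replacement for the awk one-liner; find_block is a hypothetical name, and it takes any iterable of lines for bigger, so an open file can be streamed through it.

```python
from collections import deque

def find_block(small_lines, big_lines):
    """Return the 1-based line number in big_lines where small_lines first
    appears as a contiguous run of whole lines, or None if it never does."""
    needle = list(small_lines)
    if not needle:
        return None
    # Keep only the most recent len(needle) lines of the bigger input.
    window = deque(maxlen=len(needle))
    for lineno, line in enumerate(big_lines, start=1):
        window.append(line)
        if len(window) == len(needle) and list(window) == needle:
            return lineno - len(needle) + 1
    return None

print(find_block(["A", "B"], ["A", "A", "B"]))
```

This compares the whole window at every line, so it is slower than the awk version in the worst case, but it only ever holds smaller plus one window of bigger in memory.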

answered Oct 29, 2013 at 21:14


Joseph R. · Accepted Answer · 2013-11-05 22:54:31Z

Your question is titled "Diff head of files". If you really mean that one file is the head of the other, then a simple cmp will tell you that:

cmp big_file small_file
cmp: EOF on small_file

That tells you that a difference between the two files was not detected until end-of-file was reached while reading small_file.
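If you want the same head check from a script rather than by reading cmp's message, something along these lines should work. It is a sketch only: is_head is a made-up name, both arguments are assumed to be binary file objects opened on regular files (a pipe could return short reads), and it also returns True when the two files are identical.

```python
import io

def is_head(big, small, chunk_size=65536):
    """Return True if the bytes of `small` equal the leading bytes of `big`.
    Both arguments are binary file objects positioned at their start."""
    while True:
        s = small.read(chunk_size)
        if not s:
            return True            # small ran out before any difference appeared
        b = big.read(len(s))
        if b != s:
            return False           # a differing byte, or big ended first

# In-memory stand-ins for open("big_file", "rb") and open("small_file", "rb").
print(is_head(io.BytesIO(b"hello world"), io.BytesIO(b"hello")))
```

Reading in chunks means neither file has to fit in memory.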

If, however, you mean that the entire text of small_file can occur anywhere inside big_file, then assuming you can fit both files in memory, you can use

perl -le '
   use autodie;
   undef $/;
   open SMALL, "<", "small_file";
   open BIG, "<", "big_file";
   $small = <SMALL>;
   $big = <BIG>;
   $pos = index $big, $small;
   print $pos if $pos >= 0;
'

This will print the offset within big_file where the contents of small_file are located (e.g. 0 if small_file matches at the beginning of big_file). If small_file does not match inside big_file, then nothing will be printed. If there is an error, the exit status will be non-zero.
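For a big_file that is awkward to hold in memory, one possible variation is to memory-map it and let the operating system page it in as needed. The Python sketch below assumes small_file still fits in memory; find_offset is a hypothetical helper, and like the Perl above it yields a byte offset, with -1 standing in for "no match".

```python
import mmap
import os

def find_offset(big_path, small_path):
    """Return the byte offset of small_path's contents inside big_path,
    or -1 if they do not occur there."""
    with open(small_path, "rb") as f:
        needle = f.read()
    if not needle:
        return 0                   # empty content matches at offset 0
    if os.path.getsize(big_path) == 0:
        return -1                  # an empty file cannot be memory-mapped
    with open(big_path, "rb") as f:
        # Map the file read-only instead of reading it all at once.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.find(needle)
```

Call it as find_offset("big_file", "small_file") and compare the result with -1.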
