10

I have two files. One file, I suspect, is a subset of the other. Is there a way to diff the files to identify (in a succinct manner) where in the first file the second file fits?

3
  • Related: unix.stackexchange.com/questions/79135/… Commented Oct 29, 2013 at 20:00
  • Do you mean the lines of one file are a subsequence of the other, or actually a contiguous substring? Commented Oct 30, 2013 at 3:40
  • A contiguous substring, @Kaz. Commented Oct 30, 2013 at 3:58

5 Answers

14

diff -e bigger smaller will do the trick, but requires some interpretation, as the output is a "valid ed script".

I made two files, "bigger" and "smaller", where the contents of "smaller" are identical to lines 5 through 9 of "bigger". Running diff -e bigger smaller got me:

% diff -e bigger smaller
10,15d
1,4d

Which means "delete lines 10 through 15 of 'bigger', and then delete lines 1 through 4, to get 'smaller'". That means "smaller" is lines 5 through 9 of "bigger".

Reversing the file names got me something more complicated. If "smaller" truly constitutes a subset of "bigger", only 'd' (for delete) commands will show up in the output.
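As a rough illustration of that interpretation step, here is a minimal Python sketch that reads the text of an ed script and reports which lines of "bigger" survive. The function name kept_range and its total_lines parameter (the line count of "bigger") are made up for this example. It relies on diff -e emitting its commands bottom-up, so every line number refers to the original "bigger".

```python
import re

# Matches a delete command such as "10,15d" or "7d" in an ed script.
_DELETE = re.compile(r"^(\d+)(?:,(\d+))?d$")

def kept_range(ed_script, total_lines):
    """Return (first, last) of the lines of the bigger file that survive,
    or None if the script is not a pure-delete script leaving one block."""
    deleted = set()
    for command in ed_script.splitlines():
        m = _DELETE.match(command)
        if m is None:
            return None  # an "a" or "c" command (or its text): not a subset
        start = int(m.group(1))
        end = int(m.group(2) or start)
        deleted.update(range(start, end + 1))
    kept = [n for n in range(1, total_lines + 1) if n not in deleted]
    if not kept or kept[-1] - kept[0] + 1 != len(kept):
        return None  # nothing left, or the surviving lines are not contiguous
    return kept[0], kept[-1]

print(kept_range("10,15d\n1,4d\n", 15))  # (5, 9)
```

Feeding it the output shown above with a 15-line "bigger" gives lines 5 through 9, matching the manual reading.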

5

You can do this visually with meld. Unfortunately, it is a GUI tool but if you just want to do this once, and on a relatively small file, it should be fine:

[Screenshot: output of meld a b]

3
  • 1
    Meld's nice, but it doesn't play quite as well with 100MB+ files. Commented Oct 29, 2013 at 20:10
  • @Richard no it doesn't and I would prefer a command line tool anyway, I just thought I'd mention it. Commented Oct 29, 2013 at 20:12
  • Looks a lot like vimdiff, which is available in terminal. Commented Nov 5, 2013 at 23:35
2

If the files are small enough, you can slurp them both into Perl and have its regex engine do the trick:

perl -0777e '
        open my $fh1, "<", "file_1" or die "file_1: $!";
        open my $fh2, "<", "file_2" or die "file_2: $!";
        my $file_1 = <$fh1>;
        my $file_2 = <$fh2>;
        print "file_2 is", $file_1 =~ /\Q$file_2\E/ ? "" : " not";
        print " a subset of file_1\n";
'

The -0777 switch instructs Perl to set its input record separator $/ to the undefined value so as to slurp files completely.
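When the bigger file is too large to slurp comfortably, one possible alternative is to memory-map it and let the operating system page it in on demand. The following Python sketch assumes only the smaller file needs to fit in memory. The name is_contained and its path arguments are hypothetical, chosen for this example.

```python
import mmap

def is_contained(big_path, small_path):
    """Return True if the bytes of small_path occur somewhere in big_path."""
    with open(small_path, "rb") as f:
        needle = f.read()  # only the smaller file is read into memory
    if not needle:
        return True  # an empty file is trivially contained
    with open(big_path, "rb") as f:
        try:
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return False  # an empty file cannot be mapped and holds nothing
        with view:
            return view.find(needle) != -1
```

Calling is_contained("file_1", "file_2") answers the same yes/no question as the Perl one-liner.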

9
  • 1
    What does 777 do? I take it you are passing NULL as $/ but why? Also since these are kinda esoteric switches, an explanation would be nice for the non-perl people. Commented Oct 29, 2013 at 19:57
  • 1
    @terdon I am indeed doing it to slurp the files whole. Explanation added. Commented Oct 29, 2013 at 20:04
  • But why is that necessary? $a=<$fh> should slurp anyway right? Commented Oct 29, 2013 at 20:06
  • 1
    @terdon Not that I know of, no. By default $/ is set to \n so that $a=<$fh> would read only one line of the file $fh has been opened to. Unless of course perl's command-line behavior has different defaults that I'm unaware of? Commented Oct 29, 2013 at 20:09
  • Argh, yes, my bad, I almost never slurp files or use the while $foo=<FILE> idiom so I wasn't sure and ran a (wrong) test which seemed to work. Never mind :). Commented Oct 29, 2013 at 20:11
1

If the files are text files and the place where smaller occurs within bigger starts at the beginning of a line, it's not too difficult to implement with awk:

awk 'NR==FNR{l[n++]=$0;next}
    {b[FNR]=$0; delete b[FNR-n]    # keep only the last n lines of bigger
     if (FNR<n) next
     i=0; while (i<n && b[FNR-n+1+i]==l[i]) i++
     if (i==n) {print FNR-n+1;exit}}
    ' smaller bigger
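For comparison, the same line-based search can be sketched in Python. This is a simple version that assumes both files fit in memory as lists of lines. The name find_block is made up for this example. It restarts the comparison at every candidate line, so a partial match that overlaps the real one is not missed.

```python
def find_block(big_lines, small_lines):
    """Return the 1-based line number in big_lines where small_lines
    starts as a contiguous run, or None if it never does."""
    n = len(small_lines)
    for start in range(len(big_lines) - n + 1):
        if big_lines[start:start + n] == small_lines:
            return start + 1
    return None

bigger = ["a", "a", "b", "c"]
smaller = ["a", "b"]
print(find_block(bigger, smaller))  # 2
```

With real files, the two lists could come from open(path).read().splitlines().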
1

Your question is "Diff head of files". If you really mean that one file is the head of the other, then a simple cmp will tell you that:

cmp big_file small_file
cmp: EOF on small_file

That tells you that a difference between the two files was not detected until end-of-file was reached while reading small_file.
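The same head-of-file check can be sketched in Python by reading both files in chunks and stopping at the first difference. The function name is_head and the chunk_size default are assumptions for this example. It expects binary file objects for regular files, such as those returned by open(path, "rb").

```python
import io

def is_head(small, big, chunk_size=65536):
    """Return True if the bytes of `small` are a prefix of `big`.
    Both arguments are binary file objects."""
    while True:
        a = small.read(chunk_size)
        if not a:
            return True  # reached EOF on small with no difference
        b = big.read(len(a))
        if a != b:
            return False  # mismatch, or big ended first

print(is_head(io.BytesIO(b"abc"), io.BytesIO(b"abcdef")))  # True
```

Reaching end-of-file on the smaller input without a mismatch corresponds to the "EOF on small_file" message from cmp.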

If, however, you mean that the entire text of small_file can occur anywhere inside big_file, then, assuming you can fit both files in memory, you can use

perl -le '
   use autodie;
   undef $/;
   open SMALL, "<", "small_file";
   open BIG, "<", "big_file";
   $small = <SMALL>;
   $big = <BIG>;
   $pos = index $big, $small;
   print $pos if $pos >= 0;
'

This will print the offset within big_file where the contents of small_file are located (e.g. 0 if small_file matches at the beginning of big_file). If small_file does not match inside big_file, then nothing will be printed. If there is an error, the exit status will be non-zero.
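A byte offset is not always the most convenient answer for text files. As a small sketch of one way to turn it into a line number, the following Python counts the newlines that precede the match. The name locate is hypothetical, and the inputs are assumed to be the two files' contents read in binary mode.

```python
def locate(big, small):
    """Return (offset, line) of the first occurrence of `small` in `big`,
    where line is the 1-based line containing that offset, or None."""
    offset = big.find(small)
    if offset < 0:
        return None
    line = big.count(b"\n", 0, offset) + 1  # newlines before the match
    return offset, line

data = b"a\nb\nc\nd\n"
print(locate(data, b"c\nd\n"))  # (4, 3)
```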
