I have two files. One file, I suspect, is a subset of the other. Is there a way to diff the files to identify (in a succinct manner) where in the first file the second file fits?
-
Related: unix.stackexchange.com/questions/79135/… – slm ♦, Oct 29, 2013 at 20:00
-
Do you mean the lines of one file are a subsequence of the other, or actually a contiguous substring? – Kaz, Oct 30, 2013 at 3:40
-
A contiguous substring, @Kaz. – Richard, Oct 30, 2013 at 3:58
5 Answers
diff -e bigger smaller will do the trick, but requires some interpretation, as the output is a "valid ed script".
I made two files, "bigger" and "smaller", where the contents of "smaller" are identical to lines 5 through 9 of "bigger". Running `diff -e bigger smaller` got me:
% diff -e bigger smaller
10,15d
1,4d
Which means "delete lines 10 through 15 of 'bigger', and then delete lines 1 through 4, to get 'smaller'". That means "smaller" is lines 5 through 9 of "bigger".
Reversing the file names got me something more complicated. If "smaller" truly constitutes a subset of "bigger", only 'd' (for delete) commands will show up in the output.
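If you want to automate that interpretation, something along the following lines might work. It is only a sketch: the sample files are made up for illustration, and the regular expression assumes the usual `N d` / `N,M d` form of ed delete commands. Note that "only deletions" shows that the lines of smaller appear in bigger in order; for a contiguous block you would additionally expect at most two delete commands, one for the tail and one for the head.

```shell
#!/bin/sh
# Illustrative sample data: "bigger" has 15 lines, "smaller" is lines 5-9 of it.
dir=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > "$dir/bigger"
sed -n '5,9p' "$dir/bigger" > "$dir/smaller"

# diff exits with status 1 when the files differ, which is expected here.
script=$(diff -e "$dir/bigger" "$dir/smaller" || true)
printf '%s\n' "$script"

# Count ed-script lines that are NOT plain delete commands.
others=$(diff -e "$dir/bigger" "$dir/smaller" | grep -cvE '^[0-9]+(,[0-9]+)?d$' || true)
if [ "$others" -eq 0 ]; then
    echo "only deletions: smaller can be carved out of bigger"
else
    echo "smaller contains material that is not in bigger"
fi
```

For the sample data this prints the same two delete commands as above, followed by the "only deletions" message.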
You can do this visually with meld. Unfortunately, it is a GUI tool, but if you just want to do this once, and on a relatively small file, it should be fine:
[Image: output of meld a b]
-
Meld's nice, but it doesn't play quite as well with 100MB+ files. – Richard, Oct 29, 2013 at 20:10
-
@Richard no it doesn't and I would prefer a command line tool anyway, I just thought I'd mention it. – Oct 29, 2013 at 20:12
-
Looks a lot like vimdiff, which is available in terminal. – phemmer, Nov 5, 2013 at 23:35
If the files are small enough, you can slurp them both into Perl and have its regex engine do the trick:
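perl -0777e '
# Lexical filehandles; die with the system error if a file cannot be opened.
open my $FILE1, "<", "file_1" or die "file_1: $!";
open my $FILE2, "<", "file_2" or die "file_2: $!";
# With -0777 each read returns the whole file as one string.
my $file_1 = <$FILE1>;
my $file_2 = <$FILE2>;
# \Q...\E quotes regex metacharacters so file_2 is matched literally.
print "file_2 is", $file_1 =~ /\Q$file_2\E/ ? "" : " not";
print " a subset of file_1\n";
'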
The -0777 switch instructs Perl to set its input record separator $/ to the undefined value so as to slurp files completely.
-
What does 777 do? I take it you are passing NULL as $/ but why? Also since these are kinda esoteric switches, an explanation would be nice for the non-perl people. – Oct 29, 2013 at 19:57
-
@terdon I am indeed doing it to slurp the files whole. Explanation added. – Joseph R., Oct 29, 2013 at 20:04
-
But why is that necessary? $a=<$fh> should slurp anyway right? – Oct 29, 2013 at 20:06
-
@terdon Not that I know of, no. By default $/ is set to \n so that $a=<$fh> would read only one line of the file $fh has been opened to. Unless of course perl's command-line behavior has different defaults that I'm unaware of? – Joseph R., Oct 29, 2013 at 20:09
-
Argh, yes, my bad, I almost never slurp files or use the while $foo=<FILE> idiom so I wasn't sure and ran a (wrong) test which seemed to work. Never mind :). – Oct 29, 2013 at 20:11
If the files are text files and smaller, within bigger, starts at the beginning of a line, it's not too difficult to implement with awk:
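awk 'NR==FNR{l[n++]=$0;next}            # load smaller into l[0..n-1]
{w[FNR%n]=$0; if (FNR<n) next          # ring buffer of the last n lines of bigger
for (i=0;i<n;i++) if (w[(FNR-n+1+i)%n] != l[i]) next
print FNR-n+1;exit}                    # line of bigger where smaller starts
' smaller bigger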
Your question is "Diff head of files". If you really mean that one file is the head of the other, then a simple cmp will tell you that:
cmp big_file small_file
cmp: EOF on small_file
That tells you that a difference between the two files was not detected until end-of-file was reached while reading small_file.
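For use in a script, a sketch like the one below may be more convenient, since it relies on the exit status rather than on parsing the message. It assumes a `head` that supports `-c` (the GNU and BSD versions do), and the sample files are invented for illustration.

```shell
#!/bin/sh
# Illustrative sample data: small_file is the first two lines of big_file.
dir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\ndelta\n' > "$dir/big_file"
printf 'alpha\nbeta\n' > "$dir/small_file"

# Size of small_file in bytes (tr strips the padding some wc versions add).
size=$(wc -c < "$dir/small_file" | tr -d ' ')

# Compare the first $size bytes of big_file with small_file.
# cmp -s prints nothing; exit status 0 means the two inputs are identical.
if head -c "$size" "$dir/big_file" | cmp -s - "$dir/small_file"; then
    is_head=yes
else
    is_head=no
fi
echo "small_file is the head of big_file: $is_head"
```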
If, however, you mean that the entire text of small_file can occur anywhere inside big_file, then, assuming you can fit both files in memory, you can use
perl -le '
use autodie;
undef $/;
open SMALL, "<", "small_file";
open BIG, "<", "big_file";
$small = <SMALL>;
$big = <BIG>;
$pos = index $big, $small;
print $pos if $pos >= 0;
'
This will print the offset within big_file where the contents of small_file are located (e.g. 0 if small_file matches at the beginning of big_file). If small_file does not match inside big_file, then nothing will be printed. If there is an error, the exit status will be non-zero.
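Since the offset is in bytes, you may want to turn it into a line number and double-check the match. The following is a rough sketch under a few assumptions: the sample files are made up, `pos=11` stands in for the value the Perl snippet would print for them, and `head -c` is available (it is in GNU and BSD userlands).

```shell
#!/bin/sh
# Illustrative sample data: small_file is the last two lines of big_file.
dir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\ndelta\n' > "$dir/big_file"
printf 'gamma\ndelta\n' > "$dir/small_file"

# Stand-in for the byte offset reported for these sample files.
pos=11

# Newlines before the offset = complete lines preceding the match.
line=$(( $(head -c "$pos" "$dir/big_file" | wc -l) + 1 ))
echo "small_file starts at line $line of big_file"

# Sanity check: the bytes at that offset really are small_file.
# tail -c +N starts at byte N (1-based), hence pos + 1.
size=$(wc -c < "$dir/small_file" | tr -d ' ')
if tail -c "+$((pos + 1))" "$dir/big_file" | head -c "$size" | cmp -s - "$dir/small_file"; then
    verified=yes
else
    verified=no
fi
echo "match verified: $verified"
```

If the match does not begin at a line boundary, the reported line is simply the one that contains the first matching byte.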