
I need to diff a file that isn't ASCII and has no newlines. Ideally, I want to see what is new, what is deleted, and what is modified, and maybe also the proportion (percentage, size) that differs from the other file.

The problem is that diff only seems to tell me about deleted/added lines, which is a problem in my case. It also doesn't seem to cope well with non-ASCII files at all: it tries to show them as ASCII, so it doesn't display any relevant data.

  • What about using cmp? Commented Mar 5, 2022 at 15:27
  • Given the characters are non-ASCII, how would you like to "see" them? I have a method that would be able to show changes (including longer / shorter realignment), but would it need to show "Modified: NL ESC v 1 7 to CR J 0 1 2 7 8 9"? Would plain hex do? Do you need to see the offsets of the changes in the files? Commented Mar 5, 2022 at 22:33
  • @MC68020 The issue with cmp is that it does not look ahead to realign on the next match. As soon as it gets one inserted character, it reports every following character pair as different. Commented Mar 5, 2022 at 23:53
  • Depends on what the files contain. As you said, diff works over lines, so to use it, you have to turn the data into lines before feeding to diff. A binary file in general is just a blob of bytes, and seeing some of them change might not be meaningful. Also what should be done if some data is inserted in the middle or removed, with the remainder shifting to another position? If the file has some sorts of records, you might want to run the diff over those. Just turn the data into text and then diff works. But e.g. an image file would need completely different treatment, esp. if compressed. Commented Mar 6, 2022 at 18:54

2 Answers


diff only produces useful output for text files (given binary input, GNU diff just reports that the files differ). To run diff on binary files, you first need to convert them to text, e.g. with xxd or hexdump. This is easy enough to do on-the-fly with process substitution.

e.g.

$ cat file1
A B C D E

$ cat file2
A B X D E

$ diff -u <(xxd file1) <(xxd file2)
--- /dev/fd/63  2022-03-06 15:40:23.811027810 +1100
+++ /dev/fd/62  2022-03-06 15:40:23.811027810 +1100
@@ -1 +1 @@
-00000000: 4120 4220 4320 4420 450a                 A B C D E.
+00000000: 4120 4220 5820 4420 450a                 A B X D E.

And, yes, file1 and file2 are text files, but text files are a subset of binary files that happen to only contain "text" characters. It was easier to create text files for this example.

Worth noting: Even tiny changes in a binary file (e.g. the addition or deletion of even one byte) can cause diff's output to be enormous. That's because every line of xxd or hexdump's output after that tiny change will be different. Hence doing this is not recommended. You could redirect diff's output to /dev/null and check the exit code, but if you only wanted to know if the files were different, it would be better to just run cmp instead.
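
As a rough sketch of the cmp route, the snippet below first checks whether the files differ at all and then uses cmp -l to count the differing byte positions as a share of the file size. The helper names count_differing and percent_differing are made up for this example, and the count is only meaningful when bytes were changed in place: after an insertion or deletion, cmp reports nearly everything that follows as different.

```shell
#!/bin/sh
# Hypothetical helper: count the byte positions at which two files differ.
# cmp -l prints one line per differing byte (offset plus both octal values).
count_differing() {
    echo $(( $(cmp -l "$1" "$2" 2>/dev/null | wc -l) ))
}

# Hypothetical helper: differing bytes as a whole-number percentage of file 1.
percent_differing() {
    total=$(( $(wc -c < "$1") ))
    [ "$total" -gt 0 ] || { echo 0; return; }
    echo $(( $(count_differing "$1" "$2") * 100 / total ))
}

# Recreate the two example files from the answer above.
printf 'A B C D E\n' > file1
printf 'A B X D E\n' > file2

# cmp -s is silent: exit status 0 means identical, 1 means different.
if cmp -s file1 file2; then
    echo "identical"
else
    echo "different: $(count_differing file1 file2) byte(s), $(percent_differing file1 file2)%"
fi
```

For the two example files this should print "different: 1 byte(s), 10%", since one of the ten bytes changed.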

Solution: Use a binary diff tool, like one of those shown below, which are often used to generate patch-files for binaries. For example:

$ apropos diff | grep binary
radiff2 (1)          - unified binary diffing utility
bsdiff (1)           - generate a patch between two binary files
xdelta3 (1)          - VCDIFF (RFC 3284) binary diff tool

e.g.

$ radiff2 file1 file2
0x00000004 43 => 58 0x00000004
$ 

radiff2 also has a unidiff output option, which may be more readable (I don't know whether the output is any smaller than diffing xxd dumps of two large files, though):

$ radiff2 -u file1 file2 
-0x00000004:43  "C D E\n"
+0x00000004:58  "X D E\n"

If the files are the same, radiff2 won't output anything:

$ radiff2 file1 file1
$ 

  • Just dumping the data into numbers, one byte per line, without offsets would be far, far more useful if there's any chance of insertions or deletions. The default output is especially nasty since having the file position printed on each line makes it completely impossible for diff to detect deletions and insertions. Something like od -tx1 -w1 -An, with GNU od at least, could produce something that diff might be able to look into. Commented Mar 6, 2022 at 19:00
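
A minimal sketch of that one-byte-per-line idea follows. The function names bytes_per_line and bytediff are invented for the example. Instead of GNU od's -w1 it splits the standard od output with tr, which is an assumption made for portability, and it uses temporary files rather than process substitution so that it runs under plain sh. The -v flag matters here: without it, od collapses runs of identical output lines into a single "*".

```shell
#!/bin/sh
# Dump a file as hex, one byte per line, with no offsets.
# -An drops the offset column; -v stops od collapsing repeated lines into "*".
bytes_per_line() {
    od -An -v -tx1 "$1" | tr -s ' ' '\n' | grep .
}

# Diff two binary files byte by byte; prints a unified diff of the hex dumps.
# Returns diff's exit status: 0 = same, 1 = different, 2 = trouble.
bytediff() {
    tmp=$(mktemp -d) || return 2
    bytes_per_line "$1" > "$tmp/a"
    bytes_per_line "$2" > "$tmp/b"
    diff -u "$tmp/a" "$tmp/b"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Example: new.bin has a single byte (X, hex 58) inserted in the middle.
printf 'ABCDE' > old.bin
printf 'ABXCDE' > new.bin
bytediff old.bin new.bin
```

Because there are no offsets in the dump, diff can realign after the inserted byte, so the output for this example should contain a single added line, +58, rather than marking everything after the insertion as changed. On large files this gets slow and memory-hungry, since every byte becomes a line.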

On the command line I use vbindiff.
