1

I've got a fragment of an image file produced by data-recovery software. I suspect the complete original is somewhere on my home fileserver.

If this were a fragment of a text file, I could just grab a unique-looking snippet, run grep -r -l -F with it as the pattern, and come back in a few hours for the answer. However, since this is a binary file, it contains all sorts of things that grep doesn't like (such as null bytes), and even if I can get past that, I don't know how to give grep input that isn't valid UTF-8.

How can I search for the original, preferably without writing my own search program?

(This is not a duplicate of this question: despite the likely-sounding title, that one is about finding strings in binary files, where I'm looking for binary data in binary files.)

2

5 Answers

0

What I would do:

grep -a -r -l -F <fixed string> .

-a, --text
Process a binary file as if it were text; this is equivalent to the --binary-files=text option.

or

find . -type f -exec sh -c '
    strings "$1" | grep -qF <fixed pattern> && printf "%s\n" "$1"
' sh {} \;

strings - print the sequences of printable characters in files
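
For the "what to search for" half, one possible approach is to slice a short run of bytes out of the fragment and turn it into \xHH escapes for grep -P. This is only a sketch, assuming GNU grep built with PCRE support and a head that accepts -c; the function names, paths, offset and length are hypothetical. Because grep still works line by line, a slice that contains a 0x0a byte cannot match, so pick another offset if that happens.

```shell
#!/bin/sh
# Sketch only: assumes GNU grep with -P (PCRE) support; names are hypothetical.

# hex_pattern FILE OFFSET LENGTH
# Print LENGTH bytes of FILE, starting at byte OFFSET, as \xHH escapes.
hex_pattern() {
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3" |
        od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

# find_bytes PATTERN DIR
# List the files under DIR that contain the byte sequence PATTERN describes.
find_bytes() {
    LC_ALL=C grep -arlP -- "$1" "$2"
}

# Example (hypothetical paths): 16 bytes taken 4096 bytes into the fragment.
# find_bytes "$(hex_pattern fragment.bin 4096 16)" /srv/files
```

A short slice keeps the pattern manageable; if it produces too many hits, a longer slice or a second slice from a different offset narrows the list.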

7
  • In the general case, how would I get the "fixed pattern"? For this particular file, strings produces some fragments that look unique (as well as several chapters of an ebook -- data recovery does funny things sometimes), but I can't count on that happening with any arbitrary file. Commented Feb 28, 2023 at 23:09
  • I dunno what exactly you are trying to do. The commands I gave are meant to search binary files as requested Commented Feb 28, 2023 at 23:14
  • The commands are half of what I need: how to search. The other half is "what to search for": how do I pull a piece out of the file I've got and tell grep "search for this"? Commented Mar 1, 2023 at 0:37
  • open the recovered fragment file with a text editor ... copy a section ... confirm that grep finds the recovered fragment file ... then search for the original file ... actually Notepad++ can do what you are asking Commented Mar 1, 2023 at 1:08
  • @jsotola, the whole point of this question is that I'm dealing with a binary file, not a text file. Opening it in a text editor will give me nothing useful, what with all the nulls, control characters, and other non-text things in the file. Commented Mar 1, 2023 at 1:35
0

You could first dump the binary fragment as hex using od.

I suggest the -x and -w256 options to keep the dump small and its line count low, which helps grep run efficiently; -A n to drop the offset column, which would otherwise prevent any match; and -v so that runs of identical lines are written out in full instead of being collapsed into a "*" line:

od -v -x -A n -w256 yourbinary_fragment > pattern.txt

You could also make aggressive use of the -j, -N and -w options, or edit pattern.txt by hand, to cut the pattern down to a bare minimum of lines and ease grep's work considerably. A file counts as a hit as soon as any one line of pattern.txt is found in its dump. Note that a full-width pattern line can only match when the fragment starts at a multiple of 256 bytes within the original.

Then dump each candidate file the same way and look for the pattern lines in it:

find . -type f -exec sh -c '
    od -v -x -A n -w256 "$1" | grep -qFf pattern.txt && printf "%s\n" "$1"
' sh {} \;

If you are using the machine for other things at the same time, I'd suggest running the search under the SCHED_BATCH scheduling policy (for example with chrt -b 0).
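
The -j and -N trimming mentioned above could be wrapped up along these lines. This is a sketch assuming GNU od (for -w); the two function names and the file names in the example are hypothetical, and -v is added so that repeated lines are not collapsed into "*".

```shell
#!/bin/sh
# Sketch only: assumes GNU od (for -w); both function names are hypothetical.

# make_pattern FRAGMENT SKIP COUNT
# Dump COUNT bytes of FRAGMENT, starting SKIP bytes in, 256 bytes per line.
make_pattern() {
    od -v -x -A n -w256 -j "$2" -N "$3" "$1"
}

# dump_contains FILE PATTERN_FILE
# Succeed if any line of PATTERN_FILE occurs in the hex dump of FILE.
dump_contains() {
    od -v -x -A n -w256 "$1" | grep -qFf "$2"
}

# Example (hypothetical names): use only the second and third 256-byte lines.
# make_pattern yourbinary_fragment 256 512 > pattern.txt
# dump_contains candidate.img pattern.txt && echo candidate.img
```

Keeping SKIP a multiple of 256 keeps the pattern lines aligned with the lines of the candidate dumps.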

0

With perl and the Sys::Mmap module (in libsys-mmap-perl package on Debian):

fragment=/path/to/your/fragment
size=$(( $(wc -c < "$fragment") - 1 ))
find . -type f -size "+${size}c" -print0 | 
  perl -MSys::Mmap -l -0sne '
    BEGIN {
      open N, "<", $needle or die "$needle: $!\n";
      mmap($n, 0, PROT_READ, MAP_SHARED, N);
    }
    if (open H, "<", $_) {
      mmap($h, 0, PROT_READ, MAP_SHARED, H);
      print if index($h, $n) >=0;
    } else {
      warn "$_: $!\n";
    }' -- -needle="$fragment"
0

If you suspect one file is the first part of a different file, you could take the first few bytes from both files and compare these:

# Omit or change the bytes arguments as needed, see `man head`
head --bytes=1032 file1.bin > /tmp/file1.head.bin
head --bytes=1032 file2.bin > /tmp/file2.head.bin
diff --text /tmp/file1.head.bin /tmp/file2.head.bin

You could also inspect the files visually with xxd /tmp/file1.head.bin. Finally, programs like Meld or Beyond Compare give you side-by-side comparisons of the files.
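
If the fragment really is the first part of the original, the same idea can be applied to the whole fragment rather than a fixed number of bytes. The following is a sketch assuming GNU cmp (for -n); the function name and the paths in the example are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU cmp (for -n); the function name is hypothetical.

# is_prefix_of FRAGMENT CANDIDATE
# Succeed if CANDIDATE starts with the complete contents of FRAGMENT.
is_prefix_of() {
    n=$(( $(wc -c < "$1") ))
    [ "$n" -gt 0 ] && cmp -s -n "$n" -- "$1" "$2"
}

# Example (hypothetical paths):
# for f in /srv/files/*; do
#     is_prefix_of fragment.bin "$f" && printf '%s\n' "$f"
# done
```

Comparing every byte of the fragment avoids false positives from files that merely share a common header.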

2
  • The problem with this is that the beginnings of binary files tend to be very formulaic. For example, indexed Windows Bitmap files of a given size will tend to be identical for the first 1078 bytes. Commented Mar 1, 2023 at 21:50
  • For binaries you're indeed right, though it shouldn't be a problem for Image Files. Otherwise, it's simply a matter of grabbing the first 10k or so. Commented Mar 3, 2023 at 15:27
0

Here's how you can use grep to search for a snippet of a binary file on your computer, even though grep defaults to treating its input as text.

In short

LC_ALL=C grep -aRl -F --mmap <fixed string> <path>

In long

Setting LC_ALL=C is important to tell grep to use only ASCII formatting. By default, grep is likely using unicode formatting. If grep tries to use the unicode format on a binary file that is not unicode formatted, it can cause matches to be missed!!!

Let's say you have a binary file with the following content:

$ hexdump -C test_file.bin
00000000  c6 67 72 65 70 0a |.grep.|
00000006 

In the Unicode format, the c6 byte is treated as a continuation byte, meaning that the following character (g) is read as part of a Unicode character rather than as a literal "g", so running grep 'grep' test_file.bin will fail unless it is prefixed with LC_ALL=C.

-a Tell grep to try to match in the file even if it is a "binary" file, i.e. one that contains null characters.

-R Allow grep to search the specified path recursively.

-l Tell grep to report "files with match": don't print parts of the file, just print the file path. Helpful when you are searching your system for a file.

-F Specify that the pattern to search for is a "fixed string" that should not be interpreted as a regular expression.

--mmap Tell grep to use mmap for a minor performance improvement when supported.

On the LC_ALL environment variable, see also https://superuser.com/a/1772561/971386
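
As a small end-to-end illustration of the command above, the sketch below recreates test_file.bin in a temporary directory and searches for a pattern that starts with the non-UTF-8 byte 0xc6. The variable names are hypothetical, and --mmap is left out because not every grep accepts it. A shell variable cannot hold a null byte, and -F treats a newline in the pattern as a separator between alternatives, so this form only suits slices without 0x00 or 0x0a.

```shell
#!/bin/sh
# Sketch only: variable names are hypothetical; the file mirrors test_file.bin above.
demo_dir=$(mktemp -d)

# Bytes c6 67 72 65 70 0a, as in the hexdump.
printf '\306grep\n' > "$demo_dir/test_file.bin"

# A pattern containing a raw non-UTF-8 byte (octal 306 is hex c6).
needle=$(printf '\306gr')

# LC_ALL=C makes grep compare bytes; -a treats binary files as text.
matches=$(LC_ALL=C grep -aRlF -- "$needle" "$demo_dir")
printf '%s\n' "$matches"
```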

2
  • While procedurally correct, this description conflates "Unicode" (the charset) and "UTF-8" (an 8-bit encoding often used for Unicode; and note that's an encoding, not a formatting). Also, UTF-8 is self-synchronising; if you see byte 0x67 then it's always 'g'; it can never be a continuation byte. The fact that there's no valid continuation byte following 0xc6 is an error, but it doesn't cause misinterpretation of 0x67. Commented Nov 14, 2024 at 12:50
  • As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. Commented Nov 14, 2024 at 12:52
