1

I would like to create a tool (a shell script, or maybe a Python script) to generate a hash code and associate it with a file, and then be able to use that hash to retrieve the file in the filesystem (open the parent folder and highlight the file, and/or open the file with the default application).

I am used to something similar, because I was using BibDesk, a database application for managing scientific articles that runs only on macOS. BibDesk uses a similar method to link the PDF files to their entries in the database, so that the association still works if you change the name of the file or move it to another location in the filesystem.

See also this answer to a related question.

5
  • 3
    What gives the location of a file in a filesystem on Unix is the file's pathname. If you want to hash the contents of the file and use that hash as a sort of locator of the data, then you would have to associate the hash with a pathname in a database. Locating the file using the hash would involve a simple key lookup in the database. Moving the file would involve updating the database. Would you want the update and lookup (and removal) to be automatic? This sounds a bit like building a filesystem on top of the already existing filesystem. Commented Nov 20, 2022 at 11:25
  • @Kusalananda I am aware of what you say. But I just know that it is possible to do it because BibDesk does it! I am just not sure how. Commented Nov 20, 2022 at 12:01
  • macOS doesn’t have a way to retrieve files by hash. That article seems to have been written by someone who doesn’t truly understand how it works. See Marcus's answer. Watch the folder (ask macOS to alert you to changes in directories, in much the same way a backup tool would subscribe), then hash each changed file to catch renames. Commented Nov 20, 2022 at 12:23
  • @JamesRisner it seems that it all boils down to the "alias", that in MacOS does the job (or at least part of it): apple.stackexchange.com/questions/2991/… Commented Nov 20, 2022 at 17:36
  • Related - Does Linux support invoking a program directly via its inode number? Commented Nov 20, 2022 at 19:16

3 Answers

4

Of course, I can't look inside BibDesk. However, by its functional description I'd say the main job it does is keep a database. In that database it would associate hashes with files.

It would then watch the folders it's supposed to, and look for files with changes. Even large personal literature databases will not have millions of files, so a full rescan to verify that the hashes of the files it finds are still as expected would hardly be noticeable, especially if done in the background.

The files on the file systems on your computer are path-addressed, not content- or hash-addressed – any additional lookup information needs to be stored separately. (You can store additional information about files in most filesystems, but to look up that information you would need to know the path of the file, so that does not solve your problem.)

So, your answer over there is a tiny bit misleading – you can't use a file's content hash to retrieve a file from a filesystem. (You can of course change the name of the file to be the hash, but that's not what you meant, I think.)

However, keeping the hash in the database might be a good idea for integrity reasons (you can check that the hash is correct before delivering the file) and, as you said, for re-discovery reasons, if you can afford to track file changes or rescan frequently.
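
The database-plus-rescan idea described above could be sketched in Python roughly as follows. This is only a minimal illustration and says nothing about how BibDesk actually works: the function names (file_hash, build_index, locate) and the choice of SHA-256 are assumptions, and the "database" is a plain in-memory dict rather than anything persistent.

```python
import hashlib
import os


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root):
    """Walk `root` and map each content hash to one path that has it."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                index[file_hash(path)] = path
            except OSError:
                continue  # unreadable file: skip it
    return index


def locate(index, wanted_hash, root):
    """Look up a hash; rescan `root` if the stored path is stale."""
    path = index.get(wanted_hash)
    if path and os.path.isfile(path) and file_hash(path) == wanted_hash:
        return path
    # The stored path is missing or its content changed: rescan everything.
    index.clear()
    index.update(build_index(root))
    return index.get(wanted_hash)
```

The lookup itself is a key lookup; the expensive part is the rescan, which only happens when the stored path no longer matches. A change-notification service could replace the rescan, at the cost of a long-running process.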

8
  • Well, I am not sure how they do it in BibDesk, but they don't have a specific folder to watch and rescan: you can move the file wherever you want in the filesystem, and they are able to "find" it. And by the speed of this process I would say that they do not rely on some sort of scanning of the filesystem. Maybe it has to do with some specific feature of the Mac/FreeBSD filesystem...? Commented Nov 20, 2022 at 12:20
  • 1
    That's a valid folder to pick, "the whole filesystem". Also note that under OS X there's operating system services where you directly subscribe to changes: You don't have to find this yourself. That works on Linux, too, the API is not necessarily the same. Commented Nov 20, 2022 at 12:23
  • 1
    developer.apple.com/documentation/coreservices/… Commented Nov 20, 2022 at 12:24
  • 1
    If you want to observe file system changes, use inotify (and fanotify if you need not only to observe but to intercept or do something fancy); if you want to specifically track what users do with desktop applications on documents and are targeting GNOME, the tracker suite of services comes to mind, which also tries to extract content/metadata info from the file for you. I'm sure KDE has something similar. Under the hood all would use the same operating system routines to get notified about file changes. Commented Nov 20, 2022 at 12:39
  • 2
    Aliases were an HFS feature; I’m not entirely certain APFS still has that functionality. Commented Nov 20, 2022 at 19:51
0

You can do this in either Python or Bash.

A key/value database could be used to store the hashes and their associated values. Search for the gdbm tools and libraries. The original tool originated on BSD Unix; gdbm is the GNU version, and these tools are generally not included in a standard install these days.

I installed gdbmtool (sudo apt install gdbmtool) on Ubuntu 20.04 to get a utility that helps me create the key/value database from Bash. I used sha256sum, again a command-line utility, to generate the hashes of the files.

My aim was different from yours: I wanted a tool to detect photos (and later music) that were identical but that I had renamed in the past, so that I could remove the duplicates. I also had them all located under one directory, so I wasn't searching across the entire filesystem. I started with a tool called dedupe, but I found it cumbersome for my needs. This was some time back; there are other tools that do the same these days (fslint, dupeguru, ...).

In addition to the initial creation of the database, you will need something that runs regularly to update the db: setting the new path value when files are moved (renamed), and adding new entries when new files appear.

I don't think you want to hash the file paths, but rather the contents of the files, so that you can detect an existing identical file found in a new location.

Once you have the path of the file, it can be opened using xdg-open from the command line (or your script).
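
As a small sketch of that last step in Python, assuming xdg-open is installed and a desktop session is running: the helper names open_command and open_path are made up for illustration. xdg-open has no option to highlight a file, so the "reveal" variant below simply opens the parent folder.

```python
import os
import subprocess


def open_command(path, reveal=False):
    """Build the xdg-open command for a file, or for its parent folder."""
    target = os.path.abspath(path)
    if reveal:
        # xdg-open cannot highlight a file; opening the parent folder
        # is the closest file-manager-independent equivalent.
        target = os.path.dirname(target)
    return ["xdg-open", target]


def open_path(path, reveal=False):
    """Open the file (or its folder) with the default application."""
    subprocess.run(open_command(path, reveal), check=True)
```

Calling open_path(path) opens the file itself, and open_path(path, reveal=True) opens the folder that contains it.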

A relational database, perhaps even SQLite, might be more appropriate if you want to track multiple paths against the same hash, because that is not allowed in a key-value db. But I would be venturing slightly off-topic (Unix & Linux) if I went further; I have given you the names of some Unix tools to search for and read up on, to get you going.
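
To illustrate the multiple-paths-per-hash point, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (one row per path, with an index on the hash) and the function names open_db, record, paths_for and duplicates are assumptions chosen for the example, not something taken from the tools mentioned above.

```python
import hashlib
import sqlite3


def open_db(db_path=":memory:"):
    """Open (or create) the database; one row per path, many paths per hash."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " hash TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_hash ON files(hash)")
    return conn


def record(conn, path):
    """Hash a file's contents and insert or update its row."""
    with open(path, "rb") as fh:
        # Reads the whole file at once; fine for a sketch, not for huge files.
        digest = hashlib.sha256(fh.read()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO files (path, hash) VALUES (?, ?)",
        (path, digest),
    )
    conn.commit()
    return digest


def paths_for(conn, digest):
    """Return every recorded path whose contents have this hash."""
    rows = conn.execute(
        "SELECT path FROM files WHERE hash = ? ORDER BY path", (digest,)
    )
    return [row[0] for row in rows]


def duplicates(conn):
    """Return {hash: [paths]} for hashes recorded under more than one path."""
    rows = conn.execute(
        "SELECT hash FROM files GROUP BY hash HAVING COUNT(*) > 1"
    ).fetchall()
    return {digest: paths_for(conn, digest) for (digest,) in rows}
```

A periodic job would call record for every file it finds and delete rows whose paths no longer exist; that cleanup step is left out here.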

0

If I understood your question correctly, you could achieve this with user xattrs (extended attributes). Install the attr package in the distro you use; for example, on Debian/Ubuntu, with the command sudo apt install attr. Then read man attr, man setfattr and man getfattr. After that you can save the hash in the file's extended attributes with a command like this:

setfattr -n user.md5sum -v "$(md5sum test.bin | awk '{print $1}')" test.bin

And read it back with the command

getfattr -n user.md5sum test.bin
# file: test.bin
user.md5sum="a7fd41d58563137a6f73e738008d9970"

Note that the attribute name must be prefixed with the namespace user.
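
The same idea can be sketched from Python with os.setxattr and os.getxattr, which are available on Linux only. Note the assumptions: the attribute name user.sha256 and the function names are invented for this example, SHA-256 is swapped in for the md5sum used above, and finding a file by hash still means walking the tree and reading each file's attribute.

```python
import hashlib
import os

# Hypothetical attribute name; the "user." namespace prefix is required.
ATTR_NAME = "user.sha256"


def store_hash(path):
    """Save the file's SHA-256 in an extended attribute; False if unsupported."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    try:
        os.setxattr(path, ATTR_NAME, digest.encode("ascii"))
    except (AttributeError, OSError):
        # os.setxattr is Linux-only, and some filesystems reject user xattrs.
        return False
    return True


def read_hash(path):
    """Return the stored hash, or None if it is absent or unsupported."""
    try:
        return os.getxattr(path, ATTR_NAME).decode("ascii")
    except (AttributeError, OSError):
        return None


def find_by_hash(root, wanted):
    """Walk `root` and return paths whose stored attribute equals `wanted`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if read_hash(path) == wanted:
                matches.append(path)
    return matches
```

The stored value is not updated when the file's contents change, and many copy and archive tools drop extended attributes unless told to preserve them.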

1
  • 1
    How are you suggesting to address OP's objective: finding a file by its hash? Going through the entire filesystem and comparing stored hashes to the target? Sure, it's better than computing hashes on the fly for each lookup, but still highly inefficient. Commented Feb 17, 2024 at 16:09
