
I have a file that contains more than a hundred thousand IDs. Each ID is composed of 8 to 16 hexadecimal digits:

178540899f7b40a3
6c56068d
8c45235e9c
8440809982cc
6cb8fef5e5
7aefb0a014a448f
8c47b72e1f824b
ca4e88bec
...

I need to find the related files in a directory tree that contains around 2×10⁹ files.

Given an ID like 6c56068d219144dd, I can find its corresponding files with:

find /dir -type f -name '* 6[cC]56068[dD]219144[dD][dD] *'

But that takes at least two days to complete...

What I would like to do is to call find with as many -o -iname GLOB triplets as ARG_MAX allows.

Here's what I've thought of doing:

sed -e 's/.*/-o -iname "* & *"/' ids.txt |
xargs find /dir -type f -name .

My problem is that I can't force xargs to take in only complete triplets.

How can I do it?
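Would capping xargs at a multiple of three work? Each ID expands to exactly three arguments, so something along these lines might do it (untested, and assuming xargs' default quote processing keeps each "* id *" glob as a single argument):

sed -e 's/.*/-o -iname "* & *"/' ids.txt |
  xargs -n 3000 sh -c 'find /dir -type f \( -name . "$@" \) -print' sh

Here -n 3000 is just an arbitrary multiple of 3 kept well below ARG_MAX, and -name . only serves as a never-matching seed for the -o chain; adding -x would make xargs fail loudly instead of silently passing fewer arguments if the size limit were ever hit.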

  • Apologies to the OP and to ilkkachu. I thought I knew how xargs worked, but I was obviously wrong. Yet another reminder not to touch that utility again :-) Commented Aug 31, 2023 at 20:13
  • Your idea was good, just missing an additional step that I'm conceiving right now Commented Aug 31, 2023 at 20:15
  • @ilkkachu Each call takes a long time, with almost no difference with any given number of arguments; if I can do the job with the least number of calls then it would be great. Commented Aug 31, 2023 at 20:42
  • Thanks for editing, but please don't put placeholder text since that makes the question completely useless. It's OK, we'll wait until you have finished editing. Commented Sep 1, 2023 at 12:32
  • 1
    With the edit, this does seem like a case where find | grep might make sense Commented Sep 1, 2023 at 13:45

2 Answers


That's the wrong approach. If the point is to find all the files whose name contains one of those IDs as one of its space-delimited words, then you could do:

find /dir -type f -print0 |
  gawk '
    # first pass (ids.txt): store each ID as a key of the ids hash
    !ids_processed {ids[$0]; next}
    # second pass (NUL-delimited file list; with FS=/, $NF is the base name)
    {
      n = split(tolower($NF), words, " ")
      for (i = 1; i <= n; i++)
        if (words[i] in ids) {
          print
          break
        }
    }' ids.txt ids_processed=1 RS='\0' FS=/ -

That way you process the file list only once, and checking a name against the 100k IDs is just one hash-table lookup per word instead of up to 100k regex/wildcard matches per file.

  • I've never seen awk being called like that, with some variables being defined after the first file name. Where is this behaviour defined? I couldn't find it in the man page, maybe I missed it. I understand what you're trying to do, which is setting some variables only after the first file has been read completely, but how is this allowed? For instance, how come awk doesn't look for a file named ids_processed=1 and instead treats it as a variable assignment? Commented Sep 4, 2023 at 8:08
  • Ok found it: "If a filename on the command line has the form var=val it is treated as a variable assignment. The variable var will be assigned the value val. (This happens after any BEGIN block(s) have been run.) Command line variable assignment is most useful for dynamically assigning values to the variables AWK uses to control how input is broken into fields and records. It is also useful for controlling state if multiple passes are needed over a single data file." Nice! I never knew you could do that! (A standalone example follows these comments.) Commented Sep 4, 2023 at 8:10
  • 1
    Note that contrary to -v var=value, that was also available in the original awk from the late 70s (-v was added in nawk in the 80s). That means it can't process files whose name contains = characters if what's left of the first = is a valid awk variable name. That's why you need awk '...' ./*.txt instead of awk -- '...' *.txt for instance (you'll find several answers here mentioning this kind or problem). See also the -E of gawk to work around it. Commented Sep 4, 2023 at 10:45
  • Excellent: a much faster approach, and with lots of precise, not-well-known information, as usual for your answers! Thank you. (I learned about the variable interpretation in the argument list, and its workaround. Using ./files* instead of files* is almost always preferable anyway and a good habit to adopt, as it avoids several pitfalls with many commands (e.g. a filename beginning with - being interpreted as an option by rm, etc.).) You should write a book with all the tips and "good to know" things about the shell and many utilities. Commented Sep 4, 2023 at 14:54
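To see the var=val mechanism discussed in these comments in isolation, here is a tiny standalone illustration (data.txt is just a made-up file with numbers in its first column, nothing to do with the question's data): the assignment takes effect when awk reaches that point in its argument list, so the second reading of the same file can behave differently from the first:

awk 'pass == 1 { total += $1; next }   # first pass: sum the first column
     pass == 2 { print $1 / total }    # second pass: print each value as a fraction of the total
    ' pass=1 data.txt pass=2 data.txt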

What I would do:

Write a script to save all the file names to a temporary file:

# maybe run this from cron or behind inotifywait
find dir -type f -print > /tmp/filelist

Then do a lookup as needed using your input file:

fgrep -if hexids /tmp/filelist 

The -i option makes the match case-insensitive and -f reads the strings to test against from hexids.

I might suggest using -wif instead of -if, but from the other comments it's not clear that you are providing accurate information in your question. The -w option matches the input strings/names only against whole words in the file list; see man grep for more information about what it considers a word.
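For example (a made-up two-line file list, with 6c56068d standing in for one of the IDs in hexids), -w accepts the ID when it stands on its own but rejects it when it is embedded in a longer run of word characters:

printf '%s\n' '/dir/report 6C56068D final.pdf' '/dir/old6c56068dcopy.pdf' > /tmp/filelist
printf '6c56068d\n' > hexids
fgrep -wif hexids /tmp/filelist    # prints only the first line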

  • That would look for the ids in the whole file paths, not just their (base) names. Commented Sep 4, 2023 at 10:51
  • Yes, and...? The original question appears to try to separate on word boundaries, and says all the files are in one directory. I provided the -w option. The sample shows a solution to the problem posted, not any other. Commented Sep 5, 2023 at 12:03
