5

I’m writing something that deals with file matches, and I need an inversion operation. I have a list of files (e.g. from find . -type f -print0 | sort -z >lst) and a list of matches (e.g. from grep -z foo lst >matches; note that this is only an example: matches can be any arbitrary subset of lst, including empty or full), and now I want to invert this list, i.e. get the entries of lst that are not in matches.

Background: I’m sorta implementing something like find(1) except on file lists (although the files do exist in the filesystem at the point of calling, the list may have been pre-filtered). If the list of files weren’t potentially so large, I could use find "${files[@]}" -maxdepth 0 -somecondition -print0, but even moderate use of what I’m writing would go beyond the Linux or BSD argv size limit.

If the lines were not NUL-separated, I could use comm -23 lst matches >inverted. If the matches were not NUL-separated, I could use grep -Fvxzf matches lst. But, from the generators I mentioned in the first paragraph, both are.

Assume GNU tools are installed, so this need not be portable beyond e.g. Debian, as I’m using find -print0, sort -z and friends already (although some BSDs have them, so if it can be done more portably, I won’t complain).

I’m trying to do code reuse here; plus, comm -23 is basically the perfect tool for this already, except it doesn’t support changing the input line separator (yet), and comm is an underrated and not well enough known tool anyway. If the Unix/Linux toolbox doesn’t offer anything sensible, I’m likely to reimplement a form of comm -23 (reduced to just this one use case) in shell, as the script already (for other reasons) requires a shell that happens to support read -d '' for NUL-delimited input, but that’s going to be slow (and effort… I posted this at the end of the workday in the hopes someone has got an idea for when I pick this up tomorrow or on the 28th).
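For reference, a minimal sketch of that shell fallback, assuming bash and assuming that matches is a subsequence of lst (the same records in the same relative order, which holds when matches comes from grep -z on lst). The function name invert_matches is made up for illustration. Because of the subsequence assumption, only equality tests are needed, so collation order never comes into play.

```shell
#!/usr/bin/env bash
# invert_matches LIST MATCHES   (hypothetical helper name)
# Prints the NUL-terminated records of LIST that are not in MATCHES.
# Assumption: MATCHES is a subsequence of LIST (same relative order).
invert_matches() {
    local rec cur have_cur
    {
        # Load the first pending match, if any, from fd 3.
        if IFS= read -r -d '' cur <&3; then have_cur=1; else have_cur=0; fi
        while IFS= read -r -d '' rec; do
            if (( have_cur )) && [[ $rec == "$cur" ]]; then
                # Matched: drop this record and advance to the next match.
                if IFS= read -r -d '' cur <&3; then have_cur=1; else have_cur=0; fi
            else
                printf '%s\0' "$rec"
            fi
        done <"$1"
    } 3<"$2"
}
```

Usage would be invert_matches lst matches >inverted. It reads one record per loop iteration, so it will be noticeably slower than comm on large lists, as expected.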

  • Since you already have the filenames, then maybe stackoverflow.com/a/32775270/2072269 to build the set, instead of os.walk(). But the same principle applies. Commented Dec 23, 2015 at 21:08
  • Exactly: the shell won't perform as well. A decent implementation of set will. Commented Dec 23, 2015 at 21:10
  • If you'd like, you're free to use comm with two inversions: comm -23 <(tr '\n\0' '\0\n' <lst) <(tr '\n\0' '\0\n' <matches) | tr '\n\0' '\0\n' Commented Dec 23, 2015 at 21:34
  • @Costas hm, that is an interesting idea. I will have to try it out. Commented Dec 23, 2015 at 21:39
  • Related: How to do `head` and `tail` on null-delimited input in bash? Commented Dec 23, 2015 at 21:46

2 Answers

6

If your comm supports non-text input (like GNU tools generally do), you can always swap NUL and nl (here with a shell supporting process substitution (have you got any plan for that in mksh btw?)):

comm -23 <(tr '\0\n' '\n\0' < file1) <(tr '\0\n' '\n\0' < file2) |
  tr '\0\n' '\n\0'

That's a common technique.
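As a rough sketch, the same pipeline can be wrapped in a function. The name comm23_nul is only illustrative, and this assumes bash (for process substitution) plus a comm and tr that tolerate NUL bytes in their input, as the GNU versions do.

```shell
#!/usr/bin/env bash
# comm23_nul FILE1 FILE2   (hypothetical wrapper name)
# Prints the NUL-terminated records of FILE1 that are not in FILE2.
# Both inputs must be sorted under the locale that comm runs in.
comm23_nul() {
    # Swap NUL and newline so comm sees one record per line,
    # then swap back on the way out.
    comm -23 \
        <(tr '\0\n' '\n\0' <"$1") \
        <(tr '\0\n' '\n\0' <"$2") |
        tr '\0\n' '\n\0'
}
```

Called as comm23_nul lst matches >inverted. One caveat: after the swap, a newline inside a file name becomes a NUL, which sorts differently from a newline, so in rare cases comm may report that its input is not sorted; GNU comm has --nocheck-order for that. Also, as far as I know, newer GNU coreutils releases (8.25 and later) give comm a -z/--zero-terminated option, which would make the swap unnecessary; comm --help will tell you whether yours has it.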

  • Hm yes, @Costas suggested the same in a comment 2 minutes earlier. Interesting idea, will have to try; sometimes, things need input from more than one person to truly shine, apparently (do you have a patch? it’s been on the wishlist for ages… the job management is the tricky part, not parsing). Commented Dec 23, 2015 at 21:42
  • @mirabilos, looks like other shells ignore the job management issue altogether (start them like background tasks in their own process group and don't care for them after they're started) and get away with it. Commented Dec 23, 2015 at 21:50
-2

If you are using grep to find the matches, you can use grep's -v option to get the lines that do not match.

  • Comments are not for extended discussion; this conversation has been moved to chat. Commented Dec 23, 2015 at 22:21
