I want to print each filename together with the matching pattern, but only once, even if the pattern occurs multiple times in the file.

E.g. I have a list of patterns, list_of_patterns.txt, and the directory I need to search is /path/to/files/*.

list_of_patterns.txt:

A
B
C
D
E

/path/to/files/

/file1
/file2
/file3

Let's say /file1 contains the pattern A multiple times, like this:

/file1:

A
4234234
A
435435435
353535
A

(The same goes for the other files, where a pattern also matches multiple times.)

I have this grep command running, but it prints the filename every time a pattern matches.

grep -Hof list_of_patterns.txt /path/to/files/*

output:

/file1:A
/file1:A
/file1:A
/file2:B
/file2:B
/file3:C
/file3:B
... and so on.

I know sort can do this when piped after the grep command, grep -Hof list_of_patterns.txt /path/to/files/* | sort -u, but it only produces output once grep has finished. In the real world, my list_of_patterns.txt has hundreds of patterns. The task sometimes takes an hour to finish.

Is there a better way to speed up the process?

UPDATE: some files have more than a hundred occurrences of a matching pattern. E.g. /file4 has 900 occurrences of pattern A. That's why grep takes an hour to finish: it prints every occurrence of the pattern match together with the filename.

E.g. output:

/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
... and so on until it reaches 900 occurrences.

I want it to print each match only once.

E.g. Desired output:

/file4:A
/file1:A
/file2:B
/file3:A
/file4:B
  • Hundreds of patterns would not make grep take an hour to process a few files. Are your files also very big or do you have many thousands of files to search in? Commented Feb 14, 2018 at 6:43
  • I think the option you are looking for is -m1 Commented Feb 14, 2018 at 6:45
  • @Kusalananda, Yeah I think the files are causing this issue. I just found a file that has 1 pattern match only but with 950+ occurrences. That's why it takes an hour to finish. Commented Feb 14, 2018 at 6:47
  • @Sundeep Would that not discard the matches for some patterns? Only the first matching pattern in the pattern file would be reported. Commented Feb 14, 2018 at 6:49
  • @Kusalananda -m1 will cause exactly one output line per file, along with whatever pattern matched... not sure if OP wants one line for each matching pattern Commented Feb 14, 2018 at 6:51
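
To make the -m1 discussion in the comments above concrete, here is a minimal sketch. The temporary directory and the demo file contents are made up for illustration, and -m is a GNU/BSD grep extension rather than POSIX. It shows that grep stops at the first matching line, so the second pattern is never reported for that file:

```shell
#!/bin/sh
# Hypothetical demo data in a temp directory (stands in for /path/to/files).
demo=$(mktemp -d)
printf 'A\nB\n' > "$demo/patterns.txt"
printf 'A\nA\nB\n' > "$demo/file1"

# -m1 stops reading the file after the first matching line,
# so only the first match (A) is printed; B is never reached.
result=$(grep -Hom1 -f "$demo/patterns.txt" "$demo/file1")
printf '%s\n' "$result"

rm -rf "$demo"
```

This is fast, but it only fits if one line per file is enough; it does not give one line per (file, pattern) pair.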

1 Answer


Is there a better way to speed up the process?

Yes, it's called GNU parallel:

parallel -j0 -k "grep -Hof list_of_patterns.txt {} | sort -u" ::: /path/to/files/*
  • -j N - number of job slots. Run up to N jobs in parallel. 0 means as many as possible.
  • -k (--keep-order) - keep the sequence of output the same as the order of input
  • ::: arguments - use arguments from the command line as the input source instead of stdin (standard input)
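
The same "one grep per file, deduplicate per file" idea can also be sketched without GNU parallel, assuming an xargs that supports -0 and -P (GNU and BSD xargs do, but neither option is POSIX). This is not the answer's command, only an approximation of it; the temp directory, the demo files, and the job count of 4 are made up for illustration:

```shell
#!/bin/sh
# Hypothetical demo data in a temp directory (stands in for /path/to/files).
demo=$(mktemp -d)
printf 'A\nB\n' > "$demo/patterns.txt"
printf 'A\n4234234\nA\n435435435\nA\n' > "$demo/file1"
printf 'B\nB\nA\n' > "$demo/file2"

# One grep per file, at most 4 running at a time (-P 4).
# sort -u runs per file, so each file's duplicates are collapsed
# as soon as that file is done, not after the whole run.
# Inside sh -c: $0 is the pattern file, $1 is the file to search.
result=$(
  printf '%s\0' "$demo"/file* |
    xargs -0 -n 1 -P 4 sh -c 'grep -Hof "$0" "$1" | sort -u' "$demo/patterns.txt"
)
printf '%s\n' "$result"

rm -rf "$demo"
```

Unlike parallel -k, this does not keep the output in input order: lines from different files may appear in any order, although each (file, pattern) pair still appears only once.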
  • The -j N number should possibly be limited to a number not too much higher than the available number of cores on the machine, especially if each individual grep against a file is slow. Commented Feb 14, 2018 at 7:27
  • What is the correct N for -j N? It depends: oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster Commented Feb 14, 2018 at 7:29
  • If mixing results is acceptable, remove -k + use --line-buffer and instead of sort -u: perl -ne '$s{$_}++ or print'. This will give results before the full job is finished. Commented Feb 14, 2018 at 7:30
  • Can I install it without sudo permission? Commented Feb 14, 2018 at 8:04
  • @WashichawbachaW, if you are ready for some manual "experiments" - you may try unix.stackexchange.com/questions/42567/… Commented Feb 14, 2018 at 8:31
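
The streaming deduplication suggested in the comments above (perl -ne '$s{$_}++ or print') can also be written in awk. A minimal sketch, with a made-up temp directory and demo files, that prints each file:pattern line the first time it is seen instead of waiting for a final sort -u:

```shell
#!/bin/sh
# Hypothetical demo data in a temp directory (stands in for /path/to/files).
demo=$(mktemp -d)
printf 'A\nB\n' > "$demo/patterns.txt"
printf 'A\n1\nA\nB\nA\n' > "$demo/file1"
printf 'B\nB\n' > "$demo/file2"

# seen[$0]++ is 0 (false) the first time a line appears, so the
# negated expression is true only then and awk prints the line once.
result=$(grep -Hof "$demo/patterns.txt" "$demo"/file* | awk '!seen[$0]++')
printf '%s\n' "$result"

rm -rf "$demo"
```

grep still scans every file in full and emits every match, so this mainly changes when results appear and how much output is kept, not how long grep runs. With GNU grep, adding --line-buffered should make lines reach awk sooner when the output is a pipe.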