When more than one file is searched, grep shows the name of the file in which each match was found along with the matching line itself, which is what's happening in your case.
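
As a quick illustration (a.txt and b.txt are hypothetical files, each containing the word apple):

grep apple a.txt b.txt     # two files searched: each match is prefixed with its filename
# a.txt:apple pie
# b.txt:green apple
grep -h apple a.txt b.txt  # -h (where supported) suppresses the filename prefix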

Instead of using grep (an inspired but slow workaround for not being able to cat all the files on the command line in one go), you may actually cat all the text files together and process the result as one big document, like this:

find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head

I've added -s to tr so that multiple consecutive newlines are squeezed into one, and I changed all non-alphanumerics to newlines ([\n*] made little sense to me). The head command produces ten lines of output by default, so -10 (or -n 10) is not needed.
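
A minimal sketch of what the tr stage does (the input string here is made up for the example):

printf 'Hello, world! Hello again.\n' | tr -cs '[:alnum:]' '\n'
# Hello
# world
# Hello
# again

Without -s, the ", " between Hello and world would become two newlines, i.e. an empty line in the output.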

The find command finds all regular files (-type f) anywhere under /data whose filenames match the pattern *.txt. For as many of those files as possible at a time, cat is invoked to concatenate them (this is what -exec cat {} + does). cat may be invoked several times if you have a huge number of files, but that does not affect the rest of the pipeline, which simply reads the concatenated output stream from find+cat.
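
For comparison, a sketch of the two -exec terminators (same find as above):

find /data -type f -name '*.txt' -exec cat {} \;   # one cat process per file found
find /data -type f -name '*.txt' -exec cat {} +    # many files per cat invocation, in batches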


To avoid counting empty lines, you may want to insert sed '/^ *$/d' just before or just after the first sort in the pipeline.
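
Inserting it just after tr (i.e. before the first sort), the full pipeline would read:

find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' |
sed '/^ *$/d' |
sort | uniq -c | sort -nr | head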
