
I am using the following Linux command to recursively count lines of text files in a folder structure:

find . -name '*.txt' | xargs -d '\n' wc -l

This outputs all found files and their number of lines:

  86 ./folder1/folder11/folder111/file1.txt
  67 ./folder1/folder11/folder112/file2.txt
7665 ./folder1/folder11/folder113/file3.txt
..., etc.
1738958 total

There are 24k+ files in total. The line count for each file is correct and all files are processed, but the total number of lines is not correct. Even for a sub-folder of this structure, the reported total is much bigger. For example:

cd folder1/folder11
find . -name '*.txt' | xargs -d '\n' wc -l

gives about 23M lines at the end:

22535346 total

The total number of all lines should be more than 100M, not 1.7M. What am I missing here?

  • Are there some other "total" lines that appear part-way through the output? Commented Oct 24, 2019 at 9:50
  • Yes, this is the issue. There is not a global total but several local totals. Commented Oct 24, 2019 at 14:16

2 Answers


If you have GNU wc, use

find . -name "*.txt" -print0 | wc -l --files0-from -

The manual section for this option explains why what you were doing doesn't work:

‘--files0-from=file’

Disallow processing files named on the command line, and instead process those named in file file; each name being terminated by a zero byte (ASCII NUL). This is useful when the list of file names is so long that it may exceed a command line length limitation. In such cases, running wc via xargs is undesirable because it splits the list into pieces and makes wc print a total for each sublist rather than for the entire list. One way to produce a list of ASCII NUL terminated file names is with GNU find, using its -print0 predicate. If file is ‘-’ then the ASCII NUL terminated file names are read from standard input.
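For illustration, here is a minimal run in a scratch directory (the `/tmp/files0-demo` path and file names are made up for the demo; GNU coreutils assumed):

```shell
mkdir -p /tmp/files0-demo && cd /tmp/files0-demo
printf 'x\n'    > a.txt   # 1 line
printf 'x\ny\n' > b.txt   # 2 lines
# A single wc process reads every NUL-terminated name from stdin,
# so it prints exactly one grand total, however many files there are:
find . -name '*.txt' -print0 | wc -l --files0-from=-
```

Because only one `wc` ever runs, the command-line length limit that forces `xargs` to split into batches never comes into play.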

If your wc doesn't support this option, you could instead send the output through a simple script to extract all the "total" lines and add them up.

... | awk '$2=="total"{t=t+$1} END{print t " total"}'
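As a quick sanity check, the awk filter can be run on a hypothetical fragment of `wc` output containing two per-batch totals:

```shell
# Hypothetical wc -l output with two per-batch "total" lines;
# the awk filter sums only the totals (14 + 10 = 24).
printf '  4 ./a.txt\n 10 ./b.txt\n 14 total\n 10 ./c.txt\n 10 total\n' |
  awk '$2=="total"{t=t+$1} END{print t " total"}'
# prints: 24 total
```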
  • Now I know that the corpus has 64027131 lines. The same count was returned by your solution and by Kusalananda's. Commented Oct 24, 2019 at 14:25
  • I've done that wc | awk totalling thing many times over the years, but there is a small problem with it: wc provides no way to distinguish between actual totals and files called total, so if there are any files called total, their lines will be counted twice. Both --files0-from and Kusalananda's -exec cat ... method work great, though. The former is good if you want a line count for each file as well as a total; cat if you only care about the final total. Commented Oct 25, 2019 at 2:36
  • In my tests, find . prints ./total for a file named total, and that's enough for awk to distinguish the genuine total. Commented Oct 25, 2019 at 7:27

Since you have so many files, what is happening is, I presume, that wc -l is being run on batches of the files by xargs. This is essentially what xargs is for: a single invocation of wc -l on all the files at once would not work, as the command line would be too long. The result that you are seeing is for the last batch only. If you scroll up a few thousand lines, you will eventually see the result for the previous batch.
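The effect can be reproduced on a handful of files by forcing tiny batches (the `/tmp/wc-batch-demo` path and file names are invented for the demo; `-n 2` stands in for the real command-line length limit):

```shell
mkdir -p /tmp/wc-batch-demo && cd /tmp/wc-batch-demo
printf 'a\n'          > one.txt    # 1 line
printf 'a\nb\n'       > two.txt    # 2 lines
printf 'a\nb\nc\n'    > three.txt  # 3 lines
printf 'a\nb\nc\nd\n' > four.txt   # 4 lines
# With at most 2 files per batch, xargs runs wc twice, and each run
# prints its own "total" line instead of one grand total:
find . -name '*.txt' | xargs -d '\n' -n 2 wc -l
```

With 24k+ files and real command-line limits, the batches are simply large enough that the intermediate totals scroll far out of view.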

If you're just after the total number of lines in all the files, you can cat them all and send that data to wc -l:

find . -type f -name '*.txt' -exec cat {} + | wc -l

This would execute cat on batches of found files, and then pass the resulting data stream to wc -l.
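For example, in a small scratch directory (the `/tmp/cat-demo` path and file names are made up for the demo), the pipeline yields a single number:

```shell
mkdir -p /tmp/cat-demo/sub && cd /tmp/cat-demo
printf 'a\n'       > one.txt        # 1 line
printf 'a\nb\n'    > two.txt        # 2 lines
printf 'a\nb\nc\n' > sub/three.txt  # 3 lines
# cat concatenates every batch into one stream, so wc -l sees a
# single input and prints one global count:
find . -type f -name '*.txt' -exec cat {} + | wc -l
# prints: 6
```

Even though find may still run cat several times for very long file lists, all the output goes through the same pipe, so the count is global.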

  • Thanks! Your solution works well, and what you describe is exactly what was happening: there were several local totals. I have accepted JigglyNaga's answer since it was chronologically first. Commented Oct 24, 2019 at 14:18
