
I am trying to count the number of lines in some large files whose length is at most 300 characters.

My current approach is the following command (but it is slow):

awk "length<=300" *.log | wc -l

Is there a better way to get only the count of those lines?

  • Does your input contain only single-byte characters, or can it contain multi-byte Unicode characters? If Unicode, do you want to count characters or bytes? Commented Jun 3, 2022 at 11:35
  • They are UTF-8 files Commented Jun 3, 2022 at 11:56
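
A note on that distinction: with GNU awk, length counts characters in a UTF-8 locale but bytes in the C locale, so the locale decides which lines pass the length<=300 test (mawk, by contrast, always counts bytes). A quick illustration:

printf 'héllo\n' | awk '{print length}'             # GNU awk: 5 characters in a UTF-8 locale
printf 'héllo\n' | LC_ALL=C awk '{print length}'    # 6 bytes (é is 2 bytes in UTF-8)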

3 Answers


Use awk to count the lines:

awk 'length<=300{c++} END { print c }' *.log

where

  • c++ increments the counter
  • END { print c } is executed after the last line and prints the value of c.

I am not sure this will be faster (but at least there is no pipe, and wc -l doesn't have to parse and count the lines a second time).
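
If in doubt, both variants are easy to time on your own data (the outcome depends entirely on the input):

time awk 'length<=300' *.log | wc -l                 # original: pipe to wc
time awk 'length<=300{c++} END {print c}' *.log      # count inside awk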


To get per-file subtotals (can be written on one line; note that ENDFILE requires GNU awk):

awk 'length<=300{t++;s++} 
     ENDFILE { printf "%s:%d\n",FILENAME,s ; s=0 ; } 
     END { printf "TOTAL:%d\n",t }' *.log
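
If your awk lacks ENDFILE, a portable sketch prints the subtotal whenever a new file begins, and once more at the end (files from which awk reads no lines simply won't appear in the output):

awk 'FNR==1 && NR>1 { printf "%s:%d\n", prev, s; s=0 }   # flush previous file
     length<=300    { t++; s++ }
                    { prev=FILENAME }
     END { printf "%s:%d\n", prev, s; printf "TOTAL:%d\n", t }' *.log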
  • On my data set this command was faster than using the solution with grep Commented Jun 3, 2022 at 12:54
  • You should mention that to get the subtotal that way requires GNU awk for ENDFILE. Also, printf "TOTAL:%s\n",t should be printf "TOTAL:%d\n",t so you get numeric output (0 vs null) even if no lines are shorter than 301 chars. Commented Jun 3, 2022 at 16:56

With grep:

cat *.log | grep -vc '^.\{301\}'

To match lines with length <= 300, we grep with -v (invert match) for any 301 characters; since a grep pattern cannot match across lines, .\{301\} only ever matches within a single line. The pattern is anchored at the beginning of the line with ^, and -c counts the selected lines.
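
One performance note: in a UTF-8 locale, . matches a (possibly multi-byte) character, which costs time on big inputs. If a 300-byte limit is acceptable for your data (for example, the lines are effectively ASCII), forcing the C locale is usually much faster; this is an assumption to verify against your own files:

cat *.log | LC_ALL=C grep -vc '^.\{301\}'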


If you want to have a basic progress indicator, you can use the pv (pipe viewer) utility:

pv *.log | grep -vc '^.\{301\}'

If you want to get the count per file:

grep -vc '^.\{301\}' *.log

and if you want to get the total from the above command:

grep -vc '^.\{301\}' *.log | awk -F':' '{c+=$NF} END {print c}'

Although we don't usually pipe grep into awk, depending on the data this could be faster than the cat and grep version when there are many very long input lines: the pipe here carries only a small amount of data (file names and counts).
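
For a very large corpus (the comments mention ~60GB of logs), if the storage can keep up, here is a sketch that runs one grep per file in parallel, assuming GNU xargs for -0 and -P (tune -P to your core count):

# each grep prints a bare count; awk sums them
printf '%s\0' *.log |
  xargs -0 -n1 -P4 grep -vc '^.\{301\}' |
  awk '{c+=$1} END {print c}'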

  • Both solutions work well. I like yours also because I can see the count per file Commented Jun 3, 2022 at 11:57
  • @thanasisp, grep would print lines like hello.txt:1, foo.txt:3 if it's given multiple filenames. cat *.log | grep ... would give the total, though Commented Jun 3, 2022 at 12:04
  • I used this: grep -vc '^.\{301\}' *.log | awk -F: '{s+=$2} END {print s}', but in case you need the count per file: grep -vc '^.\{301\}' *.log > 300grep.txt and then awk Commented Jun 3, 2022 at 12:24
  • I have ~ 60GB of logs to search in Commented Jun 3, 2022 at 12:25

Using Raku (formerly known as Perl_6)

Dependent on shell-globbing:

raku -ne 'state $i; $i++ if .chars <= 300; END say $i // 0;'

# OR

raku -ne 'state $i; if .chars <= 300 {$i++}; END say $i // 0;'
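
Invoked with the log files supplied by the shell, for example:

raku -ne 'state $i; $i++ if .chars <= 300; END say $i // 0;' *.log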

Files determined via regex (independent of shell-globbing):

raku -e 'for dir(test => / .+ \.log $ /) {state $i; $i++ if .chars <= 300 for .lines; END say $i // 0};'
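
Since the files are UTF-8, note that .chars counts characters. If you would rather apply the 300 limit to bytes (an assumption about the requirement), one sketch measures the encoded length instead:

raku -ne 'state $i; $i++ if .encode.bytes <= 300; END say $i // 0;' *.log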

https://docs.raku.org/syntax/state
https://docs.raku.org/routine/dir
https://raku.org
