Count lines ending in "*"

Question

I have several files in a directory with this kind of content:

Wood *
Nails
Large Hammer *

Some names have a star after them, some don't. I have multiple files with such content. In each file a product may or may not have a single star next to it. I need to make a bash script to count the number of star occurrences for each individual product in all the files. For example, the output needs to be like this:

Wood 12
Yellow Lamps 6
Nails 4
...

Which means that in all the files it found 12x a star next to Wood, 6x a star next to the lamps, etc...

It's pretty easy to parse it in C, but I don't want a binary to run. I want a shell script, and I'm not as versatile with grep and awk, which I'm sure I need here.

I know how to count the stars per se, but I'm not sure how to track which star count belongs to which product.

What about Wood * moretext? Should that count as another occurrence of Wood? And what about Wood buzz *? Should Wood buzz count as another occurrence of Wood too?
– Edgar Magallon, Commented Dec 8, 2022 at 0:46
After the star there is no more text. An occurrence of 'Wood buzz *' should not count toward 'Wood'. It should be its own product. 'Wood buzz' /= 'Wood'
– math101, Commented Dec 8, 2022 at 0:52
If a product had no star in any file, say 'Lemon', should it be excluded from the output or should it be present as 'Lemon 0'?
– seshoumara, Commented Dec 8, 2022 at 3:55
If you want help writing a script to count occurrences of strings then provide sample input/output that has multiple occurrences of strings. If you want help writing a script that acts on the appearance or absence of some string/character (e.g. *) then provide sample input/output that includes some lines that do and some that don't have that string/character. We can't test a potential solution using the example you provided where every string from the input appears in the output, there's only 1 occurrence of each string you want counted, and the output counts don't come from the input.
– Ed Morton, Commented Dec 8, 2022 at 12:17
Please edit your question and clarify: Can there be stars at any other position?
– Bodo, Commented Dec 8, 2022 at 18:44

Gilles Quénot · Accepted Answer · 2022-12-09 00:49:04Z

5

Like this, with one awk:

awk '$NF=="*"{$NF=""; arr[$0]++}END{for (i in arr) print i arr[i]}' ./*

$NF is the last field of the line; by default fields are separated by runs of whitespace
the main trick is an associative array keyed by the remaining words of the line, whose value is incremented on every match
in the END block we iterate over the array and print each key with its count
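
A minimal sketch of the same associative-array idea, run against two throwaway sample files. The helper name count_stars, the file names, and the sample contents are made up for illustration. Unlike the one-liner above, it trims the separator left behind by clearing $NF and pipes through sort, since for (name in count) iterates in an unspecified order.

```shell
#!/bin/sh
# Hypothetical helper: count starred product names across the given files.
count_stars() {
    awk '
        $NF == "*" {            # last field is a lone star
            $NF = ""            # drop it; awk rebuilds $0 with single spaces
            sub(/ +$/, "")      # trim the trailing separator left behind
            count[$0]++         # the product name is the array key
        }
        END { for (name in count) print name, count[name] }
    ' "$@" | LC_ALL=C sort
}

# Throwaway sample data (assumed, not taken from the question).
dir=$(mktemp -d)
printf 'Wood *\nNails\nLarge Hammer *\n' > "$dir/a.txt"
printf 'Wood *\nNails *\nWood\n'         > "$dir/b.txt"

count_stars "$dir"/*.txt
rm -rf "$dir"
```

With this sample data the sketch should report Large Hammer 1, Nails 1 and Wood 2; the unstarred "Nails" and "Wood" lines are ignored.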

With perl one-liner:

perl -anE '
    if ($F[-1] eq "*") {
        $k = join " ", @F[0..@F-2];
        $a->{$k}++
    }
    END{say "$_ $a->{$_}" for keys %$a}
' ./*

The -a switch turns on autosplit mode, which splits each input line into the @F array.

edited Dec 9, 2022 at 0:49

answered Dec 8, 2022 at 1:19


That's better! +1. In this case it should be * instead of file because the user has several files.
– Edgar Magallon, Commented Dec 8, 2022 at 1:28
Or **/* if the OP has subdirectories.
– Edgar Magallon, Commented Dec 8, 2022 at 1:30


Stéphane Chazelas · Accepted Answer · 2022-12-08 18:59:31Z

4

You could do:

sed -n 's/[[:blank:]]*\*$//p' ./* |
  LC_ALL=C sort |
  LC_ALL=C uniq -c |
  sort -rn

This removes the <blanks>* at the end of each line (and prints only the lines where that substitution happened), then uses sort | uniq -c to count the unique lines (in the C locale, so that the comparison is byte by byte).
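
The pipeline above prints the count before the name, as uniq -c does. A sketch of one way to get the "name count" layout shown in the question, assuming a final awk step to move the count to the end of the line. The wrapper name starred_counts and the sample file are illustrative only.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed | sort | uniq -c pipeline.
starred_counts() {
    sed -n 's/[[:blank:]]*\*$//p' "$@" |
      LC_ALL=C sort |
      LC_ALL=C uniq -c |
      sort -rn |
      # uniq -c emits "<padding><count> <name>"; move the count to the end.
      awk '{ n = $1; sub(/^[[:blank:]]*[0-9]+[[:blank:]]/, ""); print $0, n }'
}

# Throwaway sample data (assumed, not taken from the question).
tmp=$(mktemp)
printf 'Wood *\nNails\nWood *\nYellow Lamps *\nWood *\nYellow Lamps *\n' > "$tmp"
starred_counts "$tmp"
rm -f "$tmp"
```

The awk step strips only the leading count, so spaces inside a product name are left exactly as sed produced them.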

answered Dec 8, 2022 at 18:59



Edgar Magallon · Accepted Answer · 2022-12-09 01:00:38Z

3

I'm not sure how this performs (with very large files I would expect this command to be slow):

grep -Fh '*' | tr -s ' ' | sort | uniq -c

More portable:

grep -Fh '*' * 2>/dev/null | tr -s ' ' | sort | uniq -c

And if you have sub-directories with files you also want to search (in bash, **/* only recurses after shopt -s globstar):

grep -Fh '*' **/* 2>/dev/null | tr -s ' ' | sort | uniq -c | sed 's/.$//'

Or to avoid using 2>/dev/null:

find . -type f -exec grep -Fh '*' {} + | tr -s ' ' | sort | uniq -c | sed 's/.$//'

grep -Fh '*' matches every line that contains a literal * anywhere in it, not only at the end; that is enough here because, per the question, nothing follows the star. -h suppresses the file name prefix on matching lines, and -F treats the pattern as a fixed string (the '*' is a literal character, not a regular expression operator).
tr -s ' ' squeezes each run of repeated spaces into a single space. For example, given this:

Need *
Word   buzz *
Need *
More   *
More *
Word   *
More   *
More *
Word   *
Word   *
Need *
More *

the tr command turns it into:

Need *
Word buzz *
Need *
More *
More *
Word *
More *
More *
Word *
Word *
Need *
More *

The content above is piped to sort to have this output:

More *
More *
More *
More *
More *
Need *
Need *
Need *
Word *
Word *
Word *
Word buzz *

Finally, uniq -c prefixes each distinct line with its number of occurrences, which is what you want.

The sort step is required: uniq only merges adjacent identical lines, so without it the same product would be counted in several separate groups.

For the sorted output above, uniq -c gives:

5 More *
3 Need *
3 Word *
1 Word buzz *
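
To see why the sort step matters, here is a small sketch on a made-up three-line input in which "More *" appears twice but not on adjacent lines. It compares how many groups uniq -c reports with and without sorting first.

```shell
#!/bin/sh
# Made-up input: "More *" appears twice, but not on adjacent lines.
input='More *
Need *
More *'

# Without sort: uniq sees three separate runs of length 1.
unsorted=$(printf '%s\n' "$input" | uniq -c | wc -l | tr -d ' ')

# With sort: the two "More *" lines become adjacent and are merged.
sorted=$(printf '%s\n' "$input" | sort | uniq -c | wc -l | tr -d ' ')

echo "groups without sort: $unsorted"
echo "groups with sort:    $sorted"
```

Without sort the input yields three groups; with sort it yields two, one of which has a count of 2.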

If you want to remove the *, pipe to sed and delete either the last character or the literal *:

grep -Fh '*'  * | tr -s ' ' | sort | uniq -c | sed 's/.$//'
#or
grep -Fh '*' * | tr -s ' ' | sort | uniq -c | sed 's/\*//'

I think and hope there are better ways to achieve this, because I'm chaining several commands to get the desired output, which, as I said, may be slow.

edited Dec 9, 2022 at 1:00

answered Dec 8, 2022 at 1:02


@EdMorton thanks so much! Edited accordingly. I did not know about that; it's a very good observation. I don't know much about POSIX compliance, so I'm not sure whether the code I use is portable or not. I appreciate these comments :).
– Edgar Magallon, Commented Dec 8, 2022 at 15:07
You're welcome. FYI grep -R and tr --squeeze-repeats aren't portable (use -s instead of --squeeze-repeats, and IMHO no-one should ever use -R - use find to find files, not grep, the GNU guys really broke with UNIX philosophy by giving grep options to do the same things as find already does).
– Ed Morton, Commented Dec 8, 2022 at 15:33
@EdMorton thanks again! I edited to use tr -s instead. As for grep -R, I will keep that in mind for future answers (to avoid expanding the script here, which is already rather long for the problem). I never thought that grep -R was not portable; it surprises me.
– Edgar Magallon, Commented Dec 8, 2022 at 16:11
Yeah -R is a poorly-thought-out GNUism. The UNIX philosophy is to have tools that do 1 thing and do it well working together when necessary. The tool to find files is named find, and the tool to g/re/p (the ed commands to Globally match a Regular Expression in a file and Print the result) is named grep. Giving grep the ability to also find files absolutely breaks that philosophy - why not give find, sed, awk, cut, tr and every other tool options to find files too? Why not also give grep options to replace text or sort its output or translate characters? It's a nasty mistake.
– Ed Morton, Commented Dec 8, 2022 at 16:15
No because in cases where POSIX doesn't define behavior GNU might do X while MacOS does Y. For example the meaning of print > "foo" 17 in awk is undefined by POSIX - GNU awk, with or without --posix, will treat it like print > ("foo" 17) but MacOS would treat it like (print > "foo") 17 and report it as a syntax error. All you CAN guarantee that --posix does is disable GNU extensions e.g. gensub() or multi-char RS because those things have a different, defined meaning to POSIX.
– Ed Morton, Commented Dec 8, 2022 at 16:53


seshoumara · Accepted Answer · 2022-12-08 06:16:34Z

Using bash or just awk is recommended, but I liked the challenge of doing it in (GNU) sed.

s:  *: :g
/\*$/!s:$: :
G
s:([^\n]+) (\*?)(.*\n)\1 (\**)\n:\3\1 \4\2\n:
s:^\n::
h;$!d
s:\n$::
:u2d
    s:\*:<<123456789*01>:m
    s:(.)<.*\1(\**.).*>:\2:m
tu2d

I tested with the two input files below (vim display); first one from Edgar Magallon's answer:

Need *         |Need
Word   buzz *  |Word   buzz
Need *         |Need
More   *       |More *
More *         |More *
Word   *       |Word
More   *       |More *
More *         |More *
Word   *       |Word
Word   *       |Word
Need *         |Need
More *         |More *
~              |~
~              |~
input1          input2

Result:

~$ sed -rf script.sed input1 input2
Word 3
More 10
Word buzz 1
Need 3
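
As a cross-check of the totals above, a sketch that recreates the two test inputs in a temporary directory and counts with awk instead of sed. The file names input1 and input2 follow the answer. The awk program is a swapped-in technique, not the sed script shown, and the output is sorted, so the order differs from the sed result.

```shell
#!/bin/sh
dir=$(mktemp -d)

# input1: every line carries a star.
cat > "$dir/input1" <<'EOF'
Need *
Word   buzz *
Need *
More   *
More *
Word   *
More   *
More *
Word   *
Word   *
Need *
More *
EOF

# input2: only the "More" lines carry a star.
cat > "$dir/input2" <<'EOF'
Need
Word   buzz
Need
More *
More *
Word
More *
More *
Word
Word
Need
More *
EOF

totals=$(awk '
    /\*$/ {
        sub(/[[:blank:]]*\*$/, "")    # drop the trailing star
        gsub(/[[:blank:]]+/, " ")     # squeeze runs of blanks
        count[$0]++
    }
    END { for (name in count) print name, count[name] }
' "$dir/input1" "$dir/input2" | LC_ALL=C sort)

printf '%s\n' "$totals"
rm -rf "$dir"
```

This should agree with the sed result: More 10, Need 3, Word 3 and Word buzz 1.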
