Revisions to How to count the lines containing one of two words but not both

[Edit removed during grace period]

edited Feb 3, 2021 at 18:16

theCalcaholic

Remove deprecated note

edited Feb 3, 2021 at 18:08

theCalcaholic

The -E option enables extended regular expression (ERE) syntax for grep.
The -i option tells grep to match case-insensitively.
The -v option tells grep to invert the result (i.e. to select the lines that do not match the pattern).
The -c option tells grep to output the number of matching lines instead of the lines themselves.
The patterns:
1. \< matches the beginning of a word (thanks @glenn-jackman)
2. \> matches the end of a word (thanks @glenn-jackman)
--> That way we make sure not to match words that merely contain 'the' or 'an' (like 'pan')
3. grep -Evi -e '\<an\>.*\<the\>' thus matches all lines not containing 'an ... the'
4. Similarly, grep -Evi -e '\<the\>.*\<an\>' matches all lines not containing 'the ... an'
5. grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' is the combination of 3. and 4.
6. grep -Eci -e '\<(an|the)\>' matches all lines containing 'an' or 'the' as a whole word and prints the number of matching lines
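
To illustrate the word-boundary markers, here is a minimal sketch. It assumes GNU grep (\< and \> are extensions rather than part of POSIX ERE), and count_an_lines is a hypothetical helper name used only for this example.

```shell
# Hypothetical helper: count the lines on stdin that contain the whole word 'an'.
# Assumes a grep that supports the \< and \> word-boundary extensions (e.g. GNU grep).
count_an_lines() {
  grep -Eci -e '\<an\>'
}

# 'pan' and 'answer' contain the letters "an" but not the word 'an',
# so only 'an apple' and 'An owl' are counted.
printf '%s\n' 'a pan' 'an apple' 'the answer' 'An owl' | count_an_lines   # prints 2
```

Without the boundaries, the pattern 'an' would match all four sample lines.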

Add notice for alternative syntax

edited Feb 3, 2021 at 18:03

theCalcaholic

With grep:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eci -e '\<(an|the)\>'

This counts the matching lines. An alternative that counts the total number of matches is given further below.

Breakdown:

The first grep command filters out all lines containing both 'an' and 'the'. The second grep command counts the remaining lines that contain either 'an' or 'the'.

If you remove the c from the second grep's -Eci, grep prints the matching lines instead of the count, with the matches highlighted if colored output is enabled.
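
As a rough illustration of the two stages, the following sketch runs the same pipeline on a small made-up input instead of poem.txt. The sample lines and the function name count_xor_lines are assumptions for this example, and GNU grep is assumed for \< and \>.

```shell
# Hypothetical helper: read text on stdin and count the lines that contain
# 'an' or 'the' as whole words, but not both.
count_xor_lines() {
  grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
    | grep -Eci -e '\<(an|the)\>'
}

# Made-up sample input (not the poem from the question):
#   line 1 has only 'the'     -> counted
#   line 2 has both words     -> dropped by the first grep
#   line 3 has neither word   -> survives the first grep, not counted by the second
#   line 4 has only 'an'      -> counted
count_xor_lines <<'EOF'
The cat sat
An owl and the moon
a pan of water
an apple a day
EOF
# prints 2
```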

Details:

The -E option enables extended regular expression (ERE) syntax for grep.
The -i option tells grep to match case-insensitively.
The -v option tells grep to invert the result (i.e. to select the lines that do not match the pattern).
The -c option tells grep to output the number of matching lines instead of the lines themselves.
The patterns:
1. \< matches the beginning of a word (thanks @glenn-jackman)
2. \> matches the end of a word (thanks @glenn-jackman)
--> That way we make sure not to match words that merely contain 'the' or 'an' (like 'pan')
3. grep -Evi -e '\<an\>.*\<the\>' thus matches all lines not containing 'an ... the' ~(Note: I purposefully did not include the case "an the" ('the' directly following 'an') because it is an unlikely case and I wanted to keep the pattern simple. It could, of course, easily be added.)~
4. Similarly, grep -Evi -e '\<the\>.*\<an\>' matches all lines not containing 'the ... an'
5. grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' is the combination of 3. and 4.
6. grep -Eci -e '\<(an|the)\>' matches all lines containing 'an' or 'the' as a whole word and prints the number of matching lines
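
The combined filter can be tried in isolation. When grep is given several -e patterns, a line matches if any one of them matches, so with -v a line is dropped as soon as either ordering of the two words is found. The sketch below assumes GNU grep; drop_lines_with_both is a hypothetical name and the input lines are made up.

```shell
# Hypothetical helper: keep only the lines that do NOT contain both words,
# whichever of the two comes first.
drop_lines_with_both() {
  grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>'
}

printf '%s\n' 'an owl and the moon' 'the end of an era' 'only the cat' 'plain text' \
  | drop_lines_with_both
# prints:
#   only the cat
#   plain text
```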

EDIT 1: Use \< and \> instead of ( |^) and ( |$), as suggested by @glenn-jackman

EDIT 2: In order to count the number of matches instead of the number of matched lines, use the following expression:

cat poem.txt \
  | grep -Evi -e '\<an\>.*\<the\>' -e '\<the\>.*\<an\>' \
  | grep -Eio -e '\<(an|the)\>' \
  | wc -l

This uses grep's -o option, which prints every match on a separate line (and nothing else), and then uses wc -l to count those lines.
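
The difference between the two counts shows up on any line with more than one match. The following sketch leaves out the filtering stage to keep it short. The helper names are hypothetical, GNU grep is assumed, and tr -d ' ' is added here only to strip the padding that some wc implementations print.

```shell
# Hypothetical helpers: count matching lines vs. count individual matches.
count_matching_lines() {
  grep -Eci -e '\<(an|the)\>'
}
count_matches() {
  grep -Eio -e '\<(an|the)\>' | wc -l | tr -d ' '
}

line='the cat chased the mouse'
printf '%s\n' "$line" | count_matching_lines   # prints 1 (one line matches)
printf '%s\n' "$line" | count_matches          # prints 2 ('the' occurs twice)
```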

Add solution to the current version of the question

edited Feb 3, 2021 at 17:54

theCalcaholic

added 94 characters in body

edited Feb 3, 2021 at 14:21

theCalcaholic

added 1 character in body

edited Feb 3, 2021 at 14:14

theCalcaholic

answered Feb 3, 2021 at 14:05

theCalcaholic
