Why does grep output lines that seemingly don't match the expression?

As mentioned in my comment, this behaviour may be caused by a bug.

I am aware that different locales affect character order. I thought the -o output below confirmed that this was not the problem here, but I was wrong: adding LC_ALL=C gives the expected output.

I had this question after I saw that the locale affected the output.

[aa@bb grep-test]$ cat input.txt
aa bb
CC cc
dd ee

[aa@bb grep-test]$ LC_ALL=C grep -o [A-Z] input.txt
C
C
[aa@bb grep-test]$ grep -o [A-Z] input.txt
C
C
[aa@bb grep-test]$ LC_ALL=C grep [A-Z] input.txt
CC cc
[aa@bb grep-test]$ grep [A-Z] input.txt
aa bb
CC cc
dd ee
[aa@bb grep-test]$





[aa@bb tmp]$ cat test
aa bb
CC cc
dd ee

[aa@bb tmp]$ grep [A-Z] test
aa bb
CC cc
dd ee
[aa@bb tmp]$ grep -o [A-Z] test
C
C
[aa@bb tmp]$ grep -E [A-Z] test
aa bb
CC cc
dd ee
[aa@bb tmp]$ grep -n [A-Z] test
1:aa bb
2:CC cc
3:dd ee
[aa@bb tmp]$ echo [A-Z]
[A-Z]
[aa@bb tmp]$ grep -V
GNU grep 2.6.3
...
[aa@bb tmp]$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
...
[aa@bb grep-test]$ command -v grep
/bin/grep
[aa@bb grep-test]$ rpm -q -f $(command -v grep)
grep-2.6.3-6.el6.x86_64
[aa@bb grep-test]$ echo grep [A-Z] input.txt | xxd
0000000: 6772 6570 205b 412d 5a5d 2069 6e70 7574  grep [A-Z] input
0000010: 2e74 7874 0a                             .txt.    
[aa@bb grep-test]$ cmd='grep [A-Z] input.txt'; echo $cmd | xxd; eval $cmd
0000000: 6772 6570 205b 412d 5a5d 2069 6e70 7574  grep [A-Z] input
0000010: 2e74 7874 0a                             .txt.
aa bb
CC cc
dd ee
[aa@bb grep-test]$ xxd input.txt
0000000: 6161 2062 620a 4343 2063 630a 6464 2065  aa bb.CC cc.dd e
0000010: 650a 0a                                  e..
[aa@bb grep-test]$

3 Answers

Your 'A' isn't the Latin 'A'. Compare the hex dump of the command typed with an ASCII 'A' (byte 41) against the same command typed with a Cyrillic 'А' (bytes d0 90 in UTF-8):

cmd='grep [A-Z] test'; echo $cmd | xxd; eval $cmd
0000000: 6772 6570 205b 412d 5a5d 2074 6573 740a  grep [A-Z] test.
CC cc

cmd='grep [А-Z] test'; echo $cmd | xxd; eval $cmd
0000000: 6772 6570 205b d090 2d5a 5d20 7465 7374  grep [..-Z] test
0000010: 0a                                       .
aa bb
CC cc
dd ee
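
If you want to check a pattern for look-alike characters without reading a hex dump, something along these lines might help. It is only a sketch: has_non_ascii is a hypothetical helper name, and it flags any byte outside printable ASCII (so a tab would be reported too). The Cyrillic 'А' is built from its UTF-8 bytes d0 90 with printf octal escapes so the script itself stays pure ASCII.

```shell
# Hypothetical helper: succeeds (exit status 0) when the argument contains
# any byte outside the printable ASCII range, space (0x20) through tilde (0x7e).
has_non_ascii() {
    # LC_ALL=C makes grep compare single bytes, so the range ' -~' is
    # exactly 0x20-0x7e and a multibyte UTF-8 character falls outside it.
    printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'
}

latin_pattern='[A-Z]'
# \320\220 is octal for the bytes d0 90, the UTF-8 encoding of Cyrillic 'А'.
cyrillic_pattern="[$(printf '\320\220')-Z]"

for pattern in "$latin_pattern" "$cyrillic_pattern"; do
    if has_non_ascii "$pattern"; then
        echo "non-ASCII byte found"
    else
        echo "ASCII only"
    fi
done
```

The first pattern should be reported as ASCII only and the second as containing a non-ASCII byte.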
  • Added the output to the question. When I type A, it looks like ASCII 41. Commented Jan 18, 2017 at 20:38
  • I've edited my post and updated the command used to check the result; please try the new command. Also, please show the content of your test file with xxd test. Commented Jan 18, 2017 at 20:47
  • Done. Correct me if I'm wrong, but this doesn't seem to be my problem. It may be for others though, so good answer. Commented Jan 18, 2017 at 21:00

This looks like your locale collation rules being very ... helpful.

Try it with

LC_ALL=C grep [A-Z] input.txt

to test that idea.

I have

export LANG=en_US.UTF-8
export LC_COLLATE=C
export LC_NUMERIC=C

in my shell startup to avoid this kind of trouble while still getting my unicode goodness.
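
For a one-off command, a rough sketch of two alternatives to changing the startup files: force the C locale for a single grep call, or use the POSIX class [[:upper:]], which asks for uppercase letters directly instead of a collation range. The temporary directory and the variable names are just for illustration; the file content is the sample from the question. The pattern is quoted so the shell cannot expand it as a glob.

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf 'aa bb\nCC cc\ndd ee\n\n' > "$tmpdir/input.txt"

# Option 1: force the C locale for this one command, so [A-Z] means
# exactly the ASCII bytes 0x41-0x5a.
c_locale_matches=$(LC_ALL=C grep '[A-Z]' "$tmpdir/input.txt")

# Option 2: use a POSIX character class, which does not depend on where
# lowercase letters fall in the locale's sort order.
upper_class_matches=$(grep '[[:upper:]]' "$tmpdir/input.txt")

printf 'LC_ALL=C [A-Z]: %s\n' "$c_locale_matches"
printf '[[:upper:]]:    %s\n' "$upper_class_matches"

rm -rf "$tmpdir"
```

Both commands should print only the CC cc line for this input.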

  • It seems that the locale collation rules sort AaBbCcDdEe...Zz, which means that [A-Z] contains all uppercase letters and almost all lowercase letters except z. Commented Jan 18, 2017 at 22:28
  • It depends on the locale. With LC_ALL=en_US.UTF-8 the sorting is aAzZ, and grep only matches the A and Z. Commented Jan 18, 2017 at 22:45
  • Correction: it also matches everything in between. The point is that [A-Z] doesn't match any lowercase letter; grep's range-matching logic is not the same as its sorting logic. Commented Jan 19, 2017 at 0:31
  • As mentioned in another comment this behaviour may be caused by a bug. Commented Jan 19, 2017 at 2:09
  • It's not necessarily a bug. Collation sorting and matching are ticklish tasks that can depend on locale, user preference, and purpose. If grep wants to interpret [A-Z] as a case-sensitive search, it's completely correct not to match lowercase letters, regardless of where they might fall in the sort order. Commented Jan 19, 2017 at 3:20
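
To see what sort order your own setup uses, a small experiment like the following may help. It is only a sketch with made-up variable names. The C-locale result is byte order, so uppercase comes first; the second result depends entirely on the session's locale settings, so no particular order is assumed for it here.

```shell
# In the C locale, sort compares byte values, so uppercase ASCII letters
# (0x41-0x5a) come before lowercase ones (0x61-0x7a).
c_order=$(printf '%s\n' a B A b | LC_ALL=C sort | tr -d '\n')
printf 'C locale order:  %s\n' "$c_order"

# The same input under whatever locale the session is using. The result
# varies between systems, which is the whole point of the comparison.
session_order=$(printf '%s\n' a B A b | sort | tr -d '\n')
printf 'session order:   %s\n' "$session_order"
```

If the two lines differ, the session locale collates letters differently from plain byte order, and range expressions such as [A-Z] may be affected on older grep versions.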

I was unable to reproduce this with either BSD or GNU grep:

BSD:

$ cat test
aa bb
CC cc
dd ee
$ grep [A-Z] test
CC cc
$ grep --version
grep (BSD grep) 2.5.1-FreeBSD

GNU:

$ cat test
aa bb
CC cc
dd ee
$ grep [A-Z] test
CC cc
$ grep --version
grep (GNU grep) 2.25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
  • You should be able to reproduce it with GNU grep 2.6.3 and some earlier versions. Commented Jan 19, 2017 at 2:05
