Searching for words that start and end with the same character with the Linux grep command

Question

How can I search for words that start and end with the same character in a file by using the Linux grep command? I have tried some answers but they didn't work. Thanks!

"I have tried some answers but they didn't work" - next time please show us what you've tried. You might have been really close to a solution and we can build on your ideas to reach something that works.
– Chris Davies, Commented Oct 7, 2022 at 8:00

Kusalananda · Accepted Answer · 2022-10-07 08:25:03Z

2

Assuming the input contains a single word per line, you may use

grep -x '\(.\).*\1' file

... to extract all lines that start and end with the same character. This is done by capturing the first character on the line using \(.\), allowing the rest of the characters on the line to be anything (with .*), and then forcing a match of the captured character at the very end using the back-reference \1.

The -x option to grep tells the utility that the pattern must match across a complete line, not just a part of the line. Without -x, you would have to insert explicit anchors in the regular expression to be sure you match complete lines: ^\(.\).*\1$
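
As a quick check of the two forms described above, the following sketch runs both on a small made-up word list and compares the results. The sample words and the variable names are assumptions for illustration only.

```shell
# Hypothetical sample input, one word per line.
words=$(printf '%s\n' level apple radar a stats banana)

# Form 1: -x forces the pattern to match the whole line.
with_x=$(printf '%s\n' "$words" | grep -x '\(.\).*\1')

# Form 2: explicit ^ and $ anchors instead of -x.
with_anchors=$(printf '%s\n' "$words" | grep '^\(.\).*\1$')

# Both variables should hold the same lines.
printf '%s\n' "$with_x"
```

Note that the single-letter line "a" is not matched by either form, since the pattern needs a first character and a separate final character.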

Example run on my system's dictionary, showing only the 5 first results:

$ grep -x '\(.\).*\1' /usr/share/dict/words | head -n 5
aa
aba
abaca
abasia
abepithymia

If you're dealing with input that contains multiple space-delimited words on each line, then you may pre-process that text by splitting it into one word per line first. Here, I additionally convert all characters to lower-case with tr at the same time as replacing spaces with newlines, and I remove duplicates by means of sort -u:

tr ' [:upper:]' '\n[:lower:]' <file | sort -u | grep -x '\(.\).*\1'

Note that this ignores the fact that an "ordinary text" may contain punctuation and other characters that are not part of words.
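
One possible way to handle punctuation, offered as a sketch rather than as part of the original answer: let tr treat every run of non-alphabetic characters as a word separator by combining -c (complement) and -s (squeeze), then fold case and filter as before. The sample sentence is made up, and this approach also discards digits and splits words at apostrophes and hyphens, which may or may not be what you want.

```shell
# Hypothetical sample text with punctuation and mixed case.
text='Anna said: "Wow, that radar is neat!" Bob agreed.'

matches=$(
    printf '%s\n' "$text" |
    tr -cs '[:alpha:]' '\n' |      # each run of non-letters becomes one newline
    tr '[:upper:]' '[:lower:]' |   # fold everything to lower-case
    sort -u |                      # remove duplicate words
    grep -x '\(.\).*\1'            # keep words that start and end alike
)
printf '%s\n' "$matches"
```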

It was pointed out in comments (now deleted) that the grep command misses single-letter words, which technically start and end with the same character.

To get these too:

grep -x -e '\(.\).*\1' -e . file

This now returns lines starting and ending with the same character or lines only containing a single character.
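
A small sketch of the two-pattern command on made-up input (the sample words and variable names are assumptions). With -x in effect, a line is printed if either -e pattern matches it in full:

```shell
# Hypothetical sample input, one word per line.
sample=$(printf '%s\n' a I noon tree xx)

# First pattern: same first and last character (two or more characters).
# Second pattern: exactly one character of any kind.
result=$(printf '%s\n' "$sample" | grep -x -e '\(.\).*\1' -e .)
printf '%s\n' "$result"
```

Here "a" and "I" are picked up by the second pattern, "noon" and "xx" by the first, and "tree" by neither.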

edited Oct 7, 2022 at 8:25

answered Oct 7, 2022 at 7:33

Kusalananda♦


What about using awk and use split or substr for every line? So I can compare the first character with the last one, is it slower perhaps?

– Edgar Magallon, Commented Oct 7, 2022 at 7:44

Great answer, I thought it was not possible to use grep in this case, but I see I was wrong :).

– Edgar Magallon, Commented Oct 7, 2022 at 7:46

1

@EdgarMagallon Sure, you can do that, but it's going to be a bit more fiddly, more code. I would probably not do that, unless I had to do it as part of an existing awk program.

– Kusalananda ♦, Commented Oct 7, 2022 at 7:48

Thank you. I also tried '^(\w).*\1$'. Seems it works too. Are there many ways to capture the first character? What is the difference between \(.\) and (\w)?

– newlearner, Commented Oct 7, 2022 at 7:54

1

@newlearner Yes, you may ask that, in a separate question (if it hasn't been asked before, I haven't checked). That way we allow others to answer too.

– Kusalananda ♦, Commented Oct 7, 2022 at 8:46


Stéphane Chazelas · Accepted Answer · 2022-10-07 09:30:56Z

If by word, you mean any sequence of one or more non-whitespace characters, with GNU grep you could do:

grep -Po '(?<!\S)(?=(\S))\S*\1(?!\S)' your-file

That matches on sequences of 0 or more non-whitespace characters (\S*) that end in the same non-whitespace character (\1) as was captured ((\S)) in a look-ahead operator ((?=...)) at the start, using negative look-behind ((?<!...)) and look-ahead ((?!...)) operators on either side to make sure the found word is neither preceded nor followed by non-whitespace characters.

Run on the text of this answer itself, it finds:

'(?<!\S)(?=(\S))\S*\1(?!\S)'
sequences
0
that
a

It also finds That if you add the -i option.
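
A minimal sketch of the same command on a made-up one-line input, assuming a GNU grep built with PCRE support (the -P option is not available in every grep). The sample sentence and the variable name are assumptions for illustration.

```shell
# Hypothetical sample line with several space-delimited words.
line='that level of a test'

# -P enables Perl-compatible regular expressions;
# -o prints each match on its own line instead of the whole input line.
found=$(printf '%s\n' "$line" | grep -Po '(?<!\S)(?=(\S))\S*\1(?!\S)')
printf '%s\n' "$found"
```

Unlike the -x approach, this works directly on multi-word lines and also reports the single-character word "a", because \S* may match zero characters so that \1 matches the captured character itself.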

jubilatious1 · Accepted Answer · 2022-10-08 07:20:02Z

0

Using Raku (formerly known as Perl_6)

~$ raku -ne '.put if m:i/ ^ (.) .*? $0 $ /;' file

The file is read line by line using the -ne (non-autoprinting) command-line flags. Captures in Raku are denoted by (…) parentheses, and are numbered starting from $0. The match is made case-insensitive with the :i adverb.

~$ cat /usr/share/dict/words | raku -e 'my @a; @a.push($_) if / ^ (.) .*? $0 $ / for $*IN.lines; .put for @a.elems;'
9917

(Remove the call to elems above to return a list of matching words).

https://raku.org

edited Oct 8, 2022 at 7:20

answered Oct 8, 2022 at 7:14

jubilatious1

