How can I search for words that start and end with the same character in a file by using the Linux grep command? I have tried some answers but they didn't work. Thanks!
"I have tried some answers but they didn't work" - next time please show us what you've tried. You might have been really close to a solution and we can build on your ideas to reach something that works. – Chris Davies, Oct 7, 2022 at 8:00
3 Answers
Assuming the input contains a single word per line, you may use
grep -x '\(.\).*\1' file
... to extract all lines that start and end with the same character. This is done by capturing the first character on the line using \(.\), allowing the rest of the characters on the line to be anything (with .*) but then forcing a match of the captured character at the very end using the back-reference \1.
The -x option to grep tells the utility that the pattern must match across a complete line, not just a part of the line. Without -x, you would have to insert explicit anchors in the regular expression to be sure you match complete lines: ^\(.\).*\1$
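As a small check of the equivalence described above, the following sketch runs both forms on a handful of made-up sample words (the words and the helper name sample_words are illustrative assumptions, not part of the question):

```shell
# Hypothetical sample input: one word per line.
sample_words() {
    printf '%s\n' level apple noon grep stats a
}

# Whole-line match requested with -x.
with_x=$(sample_words | grep -x '\(.\).*\1')

# Same pattern with explicit anchors instead of -x.
with_anchors=$(sample_words | grep '^\(.\).*\1$')

# Both variables should hold the same three words: level, noon, stats.
printf '%s\n' "$with_x"
```

Dropping both -x and the anchors would also print apple, because the pp in the middle of the word satisfies the pattern on its own.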
Example run on my system's dictionary, showing only the first 5 results:
$ grep -x '\(.\).*\1' /usr/share/dict/words | head -n 5
aa
aba
abaca
abasia
abepithymia
If you're dealing with input that contains multiple space-delimited words on each line, then you may pre-process that text by splitting it into one word per line first. Here, I additionally convert all characters to lower-case with tr at the same time as replacing spaces with newlines, and I remove duplicates by means of sort -u:
tr ' [:upper:]' '\n[:lower:]' <file | sort -u | grep -x '\(.\).*\1'
Note that this ignores the fact that an "ordinary text" may contain punctuation and other characters that are not part of words.
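If punctuation matters, one possible variation is to let tr turn every run of non-letters into a single newline before filtering. This is only a sketch: the helper name same_ends and the sample sentence are made up for illustration, and treating anything outside [:alpha:] as a separator is an assumption that also splits words such as "don't" at the apostrophe.

```shell
# Hypothetical helper: print the unique lower-cased words read from
# standard input that start and end with the same letter.
# Step 1: squeeze every run of non-letters into one newline (tr -cs).
# Step 2: lower-case, de-duplicate, and keep only matching lines.
same_ends() {
    tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sort -u |
        grep -x '\(.\).*\1'
}

printf '%s\n' 'Did Anna see the radar? Yes, a kayak!' | same_ends
```

For the sample sentence this prints anna, did, kayak and radar, one per line. The single-letter word a is still skipped, as with the original pattern.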
It was pointed out in comments (now deleted) that the grep command misses single-letter words, which technically start and end with the same character.
To get these too:
grep -x -e '\(.\).*\1' -e . file
This now returns lines starting and ending with the same character or lines only containing a single character.
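A quick illustration with made-up input (the sample lines are an assumption for demonstration only):

```shell
# Hypothetical sample input that includes single-character lines.
result=$(printf '%s\n' a aa ab noon x | grep -x -e '\(.\).*\1' -e .)

# Prints a, aa, noon and x; only ab is filtered out.
printf '%s\n' "$result"
```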
What about using awk and use split or substr for every line? So I can compare the first character with the last one, is it slower perhaps? – Edgar Magallon, Oct 7, 2022 at 7:44
Great answer, I thought it was not possible to use grep in this case, but I see I was wrong :). – Edgar Magallon, Oct 7, 2022 at 7:46
@EdgarMagallon Sure, you can do that, but it's going to be a bit more fiddly, more code. I would probably not do that, unless I had to do it as part of an existing awk program. – Oct 7, 2022 at 7:48
Thank you. I also tried '^(\w).*\1$'. Seems it works too. Are there many ways to capture the first character? What is the difference between \(.\) and (\w)? – newlearner, Oct 7, 2022 at 7:54
@newlearner Yes, you may ask that, in a separate question (if it hasn't been asked before, I haven't checked). That way we allow others to answer too. – Oct 7, 2022 at 8:46
If by word, you mean any sequence of one or more non-whitespace characters, with GNU grep you could do:
grep -Po '(?<!\S)(?=(\S))\S*\1(?!\S)' your-file
That matches on sequences of 0 or more non-whitespace characters (\S*) that end in the same non-whitespace character (\1) as was captured ((\S)) in a look-ahead operator ((?=...)) at the start, using negative look-behind ((?<!...)) and look-ahead ((?!...)) operators on either side to make sure the found word is neither preceded nor followed by non-whitespace characters.
Run on the text of this answer, it finds:
'(?<!\S)(?=(\S))\S*\1(?!\S)'
sequences
0
that
a
It also finds That if you add the -i option.
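The effect of -i can be reproduced on a single made-up line. This sketch assumes a GNU grep built with PCRE support (needed for -P); the sample text is hypothetical.

```shell
# Hypothetical sample line; -P requires GNU grep with PCRE support.
line='That data was a test'

# Case-sensitive: "That" is rejected because T and t differ.
printf '%s\n' "$line" | grep -Po '(?<!\S)(?=(\S))\S*\1(?!\S)'

# Case-insensitive: the back-reference now also matches t against T.
printf '%s\n' "$line" | grep -Pio '(?<!\S)(?=(\S))\S*\1(?!\S)'
```

The first command prints a and test; the second additionally prints That. The word data is rejected in both cases because it starts with d and ends with a.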
Using Raku (formerly known as Perl 6)
~$ raku -ne '.put if m:i/ ^ (.) .*? $0 $ /;' file
The file is read line by line using the -ne (non-autoprinting) command-line flags. Captures in Raku are denoted by (…) parentheses, and are numbered starting from $0. The match is made case-insensitive with the :i adverb.
~$ cat /usr/share/dict/words | raku -e 'my @a; @a.push($_) if / ^ (.) .*? $0 $ / for $*IN.lines; .put for @a.elems;'
9917
(Remove the call to elems above to return a list of matching words.)