Regex search for word roots with common prefixes

Question

I recently had a curiosity about words in the dictionary that share both "pro-" and "con-" as a prefix. So, for example, procession/concession, produce/conduce, profess/confess, progress/congress, and so on. I'm basically looking for any words that match both ^pro(.+)$ and ^con(.+)$, where the content of the capture group is the same.

My initial caveman command was:

sed -nr 's/^con(.+)$/\1/Ip' /usr/share/dict/words | \
xargs -I SUFFIX -n1 grep -i '^proSUFFIX$' /usr/share/dict/words

It seems to work, outputting a full "con-" word as long as there exists a matching "pro-" word. Problem is, it's slooow. It invokes grep for every potential match, asking it to scan the whole dictionary each time. I could speed it up by making a temporary file that only has pro/con words in it, but it feels like there must be some efficient way to do this without writing a file.

Is there a tool in the GNU world that's well suited to this kind of intersection search?

Something like egrep '^(pro|con).*' /usr/share/dict/words might do the trick as a starting point, perhaps. You could then put the resultant list through a sed, awk, or grep meatgrinder that only keeps paired words.
– DopeGhoti, Commented Jul 28, 2015 at 20:12
Aha! egrep '^(pro|con).*' /usr/share/dict/words | sed 's/^...//' | sort | uniq -d will give you a list of all the word-bases that have both a pro and con prefix!
– DopeGhoti, Commented Jul 28, 2015 at 20:17

DopeGhoti · Accepted Answer · 2015-07-28 23:34:43Z

3

From my earlier comment to the question itself:

egrep '^(pro|con).*' /usr/share/dict/words | sed -nE 's/^(pro|con)(.*)/\2/p' | sort | uniq -d

will give you a list of all the word-bases that have both a pro and con prefix.

The initial egrep grabs all the words with pro and con prefixes. We then use sed to strip pro or con from the beginning of each word, sort the list, and use uniq -d to show only the entries that appear more than once.
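
The steps above can be tried on a tiny inline word list, without touching the dictionary. This is only a sketch: the function name shared_roots is made up for illustration, it folds the egrep and sed steps into a single sed call, and it uses (.+) rather than (.*) so the bare words "pro" and "con" do not pair up on an empty root.

```shell
# shared_roots: hypothetical helper name, not from the original answer.
# Reads a word list (one word per line) on stdin and prints every root
# that occurs with both a "pro" and a "con" prefix.
shared_roots() {
    # Keep only pro/con lines with a non-empty remainder and strip the
    # prefix, then print each remainder that appears more than once.
    sed -nE 's/^(pro|con)(.+)$/\2/p' | sort | uniq -d
}

# Small inline sample so the sketch runs without a dictionary file.
printf '%s\n' confess congress conduct profess progress promise produce |
    shared_roots
```

On this sample it prints fess and gress. Against the real dictionary the call would be shared_roots < /usr/share/dict/words. Note that uniq -d only signals a pro/con pair if each word appears once in the input; a word list with repeated entries would need a sort -u first.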

edited Jul 28, 2015 at 23:34

answered Jul 28, 2015 at 20:20

DopeGhoti


2

The combination of grep and sed is rather unnecessary, as sed can do both steps in one: sed -nr 's/^(pro|con)(.*)/\2/p' /usr/share/dict/words | sort | uniq -d

– Peter.O, Commented Jul 28, 2015 at 22:04
I'll update to reflect this - thanks for the tip! Keep in mind that it only works in GNU sed, and not BSD sed.

– DopeGhoti, Commented Jul 28, 2015 at 23:26
2

replace -r with -E and it will work with both seds ... also, Peter meant you can drop egrep and use just sed

– don_crissti, Commented Jul 28, 2015 at 23:32


glenn jackman · Accepted Answer · 2015-07-28 20:17:11Z

0

This will print out the words without the pro|con prefix:

grep '^\(pro\|con\)' /usr/share/dict/words | cut -c 4- | sort | uniq -c | awk '$1 == 2 {print $2}'

answered Jul 28, 2015 at 20:17

glenn jackman


I do like this, as it seems more extensible (for example, if it needed to be modified to match three or more prefixes).

– smitelli, Commented Jul 28, 2015 at 20:25
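
On the extensibility point: the one-liner as written assumes three-letter prefixes (cut -c 4-) and exactly two of them ($1 == 2). A rough sketch of one way to generalise it to any number of prefixes of any length follows. The function name roots_with_all_prefixes is invented for this example, and it swaps the cut/uniq -c counting for an awk array, so treat it as an illustration of the idea rather than the answer's own method.

```shell
# roots_with_all_prefixes: hypothetical helper, not part of the original answer.
# Usage: roots_with_all_prefixes PREFIX... < wordlist
# Prints each root that occurs with every one of the given prefixes.
roots_with_all_prefixes() {
    awk -v prefixes="$*" '
        BEGIN { n = split(prefixes, p, " ") }
        {
            for (i = 1; i <= n; i++) {
                len = length(p[i])
                # The line starts with this prefix and has a non-empty remainder.
                if (length($0) > len && substr($0, 1, len) == p[i]) {
                    root = substr($0, len + 1)
                    # Count each (prefix, root) pair only once.
                    if (!((i, root) in seen)) {
                        seen[i, root] = 1
                        count[root]++
                    }
                }
            }
        }
        END {
            # A root qualifies only if all n prefixes were seen with it.
            for (root in count)
                if (count[root] == n) print root
        }
    ' | sort
}

# Inline sample: only "duce" occurs with all three prefixes.
printf '%s\n' conduce deduce produce confess profess |
    roots_with_all_prefixes con pro de
```

The prefixes are compared literally with substr rather than as regular expressions, so they are assumed to be plain strings without spaces or backslashes.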


don_crissti (community wiki) · Accepted Answer · 2015-10-04 10:13:02Z

0

In this particular case, where the input is sorted so that all con... words are listed before pro... words, you could use awk to store the lines matching ^con in an array. When you reach the lines matching ^pro, replace pro with con, and if the result is in the array, print the root word (gensub is a GNU awk extension, so this needs gawk):

awk '/^con/{arr[$0]=$0}; /^pro/{c=gensub(/pro/, "con", 1)
if (c in arr) print substr(c, 4)}' /usr/share/dict/words

.....
.....
vince
vinces
vocation
vocation's
vocations
voke
voked
vokes
voking
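
If the input cannot be assumed to be sorted with con... before pro..., a variant that collects both sets of roots and compares them at the end avoids the ordering requirement, and it also avoids gensub, so it should run on non-GNU awk implementations as well. This is a sketch under those assumptions, with a made-up function name (pro_con_roots), not a replacement for the command above.

```shell
# pro_con_roots: hypothetical helper, not from the original answer.
# Reads a word list on stdin; the order of the input lines does not matter.
pro_con_roots() {
    awk '
        /^con./ { con[substr($0, 4)] = 1 }   # remember every con- root
        /^pro./ { pro[substr($0, 4)] = 1 }   # remember every pro- root
        END {
            # Print the roots present in both sets.
            for (root in pro)
                if (root in con) print root
        }
    ' | sort
}

# Deliberately unsorted sample: pro- words appear before con- words.
printf '%s\n' provoke profess convoke provide confess |
    pro_con_roots
```

On this sample it prints fess and voke. The trade-off is memory: both sets of roots are held in arrays until the end of the input, which is negligible for a dictionary-sized file.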

edited Oct 4, 2015 at 10:13

community wiki

2 revs
don_crissti
