9 events
when what by license comment
Nov 2, 2023 at 5:17 comment added Lance Pollard Got inspired after learning the Double Metaphone algorithm a little; it's not too bad, but it could use improvements. Going to try this out.
Nov 2, 2023 at 3:15 comment added J_H Oh, sorry, I meant to say this before running out of comment characters. The various Soundex derivatives are pretty much language specific, tuned for regional use cases. That is, they pay attention to what would be a homophone in the target language. So I'm skeptical that they would be a good fit for your multi-language use case, especially since you target languages from tree branches that are pretty far apart from one another.
Nov 2, 2023 at 3:12 comment added Lance Pollard Interesting, will have to think more about what you're suggesting. Digging more into the Metaphone algorithm, I'm basically now wondering why the Double Metaphone algorithm chose to substitute and merge consonants, so I've been digging into that, but not finding much.
Nov 2, 2023 at 2:50 comment added J_H Ummm, I'm looking at your revised Question, which mentions several languages. Wow, those are hard languages! In the sense that I don't see a bunch of {Germanic, Romance, Slavic} tongues. You mentioned a pair of BIDI languages which read right-to-left, which I guess is not a deal breaker. Ideograms don't fit into the whole Soundex-and-derivatives family at all. Sanskrit seems like it could plausibly work. But you might be happier with word2vec embeddings. Maybe a "sounds alike" typo homophone kind of embedding? Then the Sanskrit, Hebrew, and Chinese words for "dog" would be near one another in embedding space.
Nov 2, 2023 at 2:40 history edited J_H CC BY-SA 4.0
edited body
Nov 2, 2023 at 2:40 vote accept Lance Pollard
Nov 2, 2023 at 2:37 comment added J_H Yes, and yes. The code I offered mostly ignores the collision aspect due to the "last one wins!" overwriting. Certainly one could maintain a list of colliding words. Prolly you would want to augment that with word frequencies, leading to a "most frequent word wins!" rule. // I have a hard time seeing how "autocomplete" is a good use case for Metaphone or similar Soundex schemes. You kind of need to wait for 80% of a word's characters to be entered before you can usefully probe the dict. Autocorrect spell check? Yeah, maybe that would work OK. An indexed RDBMS table would be a good fit.
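The collision handling discussed in the comment above (keep a list of colliding words, then let the most frequent one win) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the answer: `toy_phonetic_key` is a hypothetical, simplified Soundex-style stand-in rather than a real Metaphone implementation, and the names `build_index`, `candidates`, and `best_match`, along with the word counts, are made up for the example.

```python
from collections import defaultdict

# Consonant classes borrowed from classic Soundex. ASSUMPTION: this toy key
# is a stand-in for Metaphone, which uses different, English-tuned rules.
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def toy_phonetic_key(word: str) -> str:
    """Simplified Soundex-style key (hypothetical helper, not real Metaphone)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")  # vowels and uncoded letters map to ""
        if code and code != prev:  # skip repeats of the same consonant class
            key += code
        prev = code
    return (key + "000")[:4]       # pad or truncate to four characters


def build_index(word_counts):
    """Map each key to every word sharing it, instead of last-one-wins."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[toy_phonetic_key(word)].append((word, count))
    return index


def candidates(index, query):
    """All indexed words whose key matches the query's, most frequent first."""
    bucket = index.get(toy_phonetic_key(query), [])
    return [word for word, _ in sorted(bucket, key=lambda wc: -wc[1])]


def best_match(index, query):
    """Apply the 'most frequent word wins' rule; None if nothing collides."""
    ranked = candidates(index, query)
    return ranked[0] if ranked else None


# Illustrative counts only; real frequencies would come from a corpus.
index = build_index({"smith": 120, "smyth": 7, "robert": 80, "rupert": 15})
print(candidates(index, "smithe"))   # ['smith', 'smyth']
print(best_match(index, "Ruppert"))  # robert
```

Note that "Ruppert" resolves to "robert" rather than "rupert" here, which is exactly the kind of false match raised in the thread: the frequency rule picks a winner within a bucket, but it cannot tell which collision the user intended.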
Nov 2, 2023 at 2:33 comment added Lance Pollard So is it fair to say that, if you were to use this for autocomplete or spell check, you would convert the input word to its Metaphone hash, then fetch all the words which have that same hash? Is it that easy? What I'm wondering next is whether that causes false matches. And it looks like, in your example, you have to have clean input data to start with in order to do your "cleanup" of product data, so that's a barrier to entry I guess. Updated the question with my basic use case.
Nov 2, 2023 at 2:29 history answered J_H CC BY-SA 4.0