Timeline for What do you do with the hash code after running a word through the Double Metaphone algorithm used in fuzzy text search?

Current License: CC BY-SA 4.0

9 events

when	what		by	license	comment
Nov 2, 2023 at 5:17	comment	added	Lance Pollard		Got inspired after learning the Double Metaphone algorithm a little; it's not too bad, but it could use improvements. Going to try this out.
Nov 2, 2023 at 3:15	comment	added	J_H		Oh, sorry, I meant to say this before running out of comment characters. The various Soundex derivatives are pretty much language specific, tuned for regional use cases. That is, they pay attention to what would be a homophone in the target language. So I'm skeptical that they would be a good fit for your multi-language use case, especially since you target languages from tree branches that are pretty far apart from one another.
Nov 2, 2023 at 3:12	comment	added	Lance Pollard		Interesting, I'll have to think more about what you're suggesting. Digging more into the Metaphone algorithm, I'm now basically wondering why Double Metaphone chose to substitute and merge consonants, so I've been digging into that but not finding much.
Nov 2, 2023 at 2:50	comment	added	J_H		Ummm, I'm looking at your revised Question, which mentions several languages. Wow, those are hard languages! In the sense that I don't see a bunch of {Germanic, Romance, Slavic} tongues. You mentioned a pair of BIDI languages which read right-to-left, which I guess is not a deal breaker. Ideograms don't fit into the whole Soundex-and-derivatives approach at all. Sanskrit seems like it could plausibly work. But you might be happier with word2vec embeddings. Maybe a "sounds alike" typo homophone kind of embedding? Then the Sanskrit, Hebrew, and Chinese words for "dog" would be near each other in embedding space.
Nov 2, 2023 at 2:40	history	edited	J_H	CC BY-SA 4.0	edited body
Nov 2, 2023 at 2:40	vote	accept	Lance Pollard
Nov 2, 2023 at 2:37	comment	added	J_H		Yes, and yes. The code I offered mostly ignores the collision aspect due to the "last one wins!" overwriting. Certainly one could maintain a `list` of colliding words. Probably you would want to augment that with word frequencies, leading to a "most frequent word wins!" rule. // I have a hard time seeing how "autocomplete" is a good use case for Metaphone or similar Soundex schemes. You kind of need to wait for 80% of a word's characters to be entered before you can usefully probe the `dict`. Autocorrect spell check? Yeah, maybe that would work OK. An indexed RDBMS table would be a good fit.
Nov 2, 2023 at 2:33	comment	added	Lance Pollard		So is it fair to say that, if you were to use this for autocomplete or spell check, you would convert the input word to a Metaphone hash, then fetch all the words which have that same hash? Is it that easy? Doesn't that cause false matches? That's what I'm wondering next. And it looks like you need clean input data to start with, in order to do the "cleanup" of product data in your example, so that's a barrier to entry, I guess. Updated the question with my basic use case.
Nov 2, 2023 at 2:29	history	answered	J_H	CC BY-SA 4.0
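
The idea discussed in the comments above (hash the input word, fetch every word sharing that hash, keep a `list` of colliding words and rank it so the "most frequent word wins") can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the answer: `toy_phonetic_key` is a made-up stand-in and is not Double Metaphone, and `build_index`, `lookup`, and the word counts are hypothetical names and data. In practice a real phonetic function would be passed in as `key_fn`.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def toy_phonetic_key(word: str) -> str:
    """Crude stand-in for a phonetic hash. NOT Double Metaphone (assumption)."""
    word = word.lower()
    if not word:
        return ""
    # Keep the first letter, drop later vowels and h/w/y, and skip a
    # consonant that is identical to the last character kept.
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key).upper()


def build_index(
    word_counts: Dict[str, int],
    key_fn: Callable[[str], str] = toy_phonetic_key,
) -> Dict[str, List[str]]:
    """Group words by phonetic key, keeping every colliding word."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for word in word_counts:
        buckets[key_fn(word)].append(word)
    # Rank each bucket so the most frequent word comes first
    # (ties are broken alphabetically).
    for words in buckets.values():
        words.sort(key=lambda w: (-word_counts[w], w))
    return dict(buckets)


def lookup(
    index: Dict[str, List[str]],
    query: str,
    key_fn: Callable[[str], str] = toy_phonetic_key,
) -> List[str]:
    """Return all indexed words whose key matches the query's key."""
    return index.get(key_fn(query), [])


# Hypothetical word frequencies, for illustration only.
counts = {"smith": 120, "smyth": 7, "smythe": 2, "jones": 95}
index = build_index(counts)
print(lookup(index, "smiht"))  # ['smith', 'smyth', 'smythe']
```

Every word in a bucket is a candidate, so false matches are expected; the first element is the "most frequent word wins" suggestion, and the rest could be filtered further, for example by edit distance to the query.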