2

I'm looking for fuzzy string algorithms for the following example: given a database of existing names, match inputs to either the best-matched name if the match accuracy is higher than the input threshold (say 90%), or NA otherwise

database = [James Bond, Michael Smith]

input

James L Bond->James Bond
JBondL->James Bond
Bond,James->James Bond
BandJamesk->James Bond
Jenny,Bond->N/A

Currently, most algorithms like Levenstein and phonetic based ones like Soundex can't match inverted names like BondJames. So far cosine and Jacquard yield the best results, but I'm looking for more, so that I can choose the best or possibly combine algorithms.

1 Answer 1

5

Given your examples, I would consider:

  • Separating n1 - the name in the input and n2 - a name in the database into words (by delimiters and capital letters): n1 -> {u1,u2,...}, n2 -> {v1,v2,...}
  • Finding the permutation of the order of words in n2 that minimizes s = sum(L(u, v)) where L is the Levenshtein distance.
  • Selecting the database entry that minimizes s.

When the number of words in L1 and the number of words in L2 don't match - you should 'penalize' s.

Sign up to request clarification or add additional context in comments.

3 Comments

great idea Lior, thank you. A question: how would you go about separating jamesbond, without the benefit of commas or cap letters?
A way I could think of is maybe an algorithm that allows for 1-2 swapping/relocations of groups of characters without any penalty. (Max 2 because there might be three groups -herbertbondjames, for instance) I can't think of an existing algorithm that does this, or any other algorithms that can take into account this criteria of bondjames -> jamesbond.
When the input name has one word, instead of separating it, I would concatenate the words composing the database name in different permutations, and minimize L(u, v1||v2||...)). (''||' denotes concatenation).

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.