I have a SQL (MS SQL Server) database of ~22 million companies (called target_db in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName | ReFcode | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany | 12345 |
| Apple Corporation | United States | 67899 |
| Aunt Mary Butcher | United Kingdom | 56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number based on approximate company name match. The size of my sample is 1000 companies but can be much more.
+--------------------+----------------+
| name | ReFcode |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany |
| Apple Corp. | United States |
| Butcher Aunt Mary | United Kingdom |
| Volkswagen Gmbh | Germany |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:
- Connect to the database and filter the
target_dbto only get the names that start with the same 3 letters, have similar length (+- 7 characters) and Soundex DIFFERENCE is 4 - Calculate levenshtein with all possible matches (using fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds was spent on connecting to the DB (in every iteration of the loop), running query and fetching the results. The rest was calculating the similarity. Which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT distinct input.name,
target_db.CompanyName,
input.Country,
input.ReFcode,
[dbo].[edit_distance](input.name, target_db.CompanyName) as levenshtein
FROM dbo.Sample as input
JOIN dbo.Company as target_db
on
(input.ReFcode = target_db.Refcode and
LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3) and
SOUNDEX(input.name)=SOUNDEX(target_db.CompanyName))
WHERE ABS( LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein asc
Then I would take the match with the lowest EditDistance (Levenshtein) below a certain threshold. This works but the process now is taking almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?