I have two columns of similar-sounding names in a pandas DataFrame, and I am fuzzy-matching them with the fuzzywuzzy library in Python.

import pandas as pd
from fuzzywuzzy import fuzz

datt = pd.read_csv("H:\\FuzzyMatch\\data.csv")

# add one empty column per fuzzywuzzy scoring function
datt['ratio'] = ""
datt['partial_ratio'] = ""
datt['partial_token_set_ratio'] = ""
datt['partial_token_sort_ratio'] = ""
datt['QRatio'] = ""
datt['token_set_ratio'] = ""
datt['token_sort_ratio'] = ""
datt['UQRatio'] = ""
datt['UWRatio'] = ""
datt['WRatio'] = ""

# save each score; datt.loc[i, col] avoids chained assignment, which may not write back to datt
for i in range(datt.shape[0]):
    datt.loc[i, 'ratio'] = fuzz.ratio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'partial_ratio'] = fuzz.partial_ratio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'partial_token_set_ratio'] = fuzz.partial_token_set_ratio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'partial_token_sort_ratio'] = fuzz.partial_token_sort_ratio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'QRatio'] = fuzz.QRatio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'token_set_ratio'] = fuzz.token_set_ratio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'token_sort_ratio'] = fuzz.token_sort_ratio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'UQRatio'] = fuzz.UQRatio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'UWRatio'] = fuzz.UWRatio(datt.current_company[i], datt.crm_company_name[i])
    datt.loc[i, 'WRatio'] = fuzz.WRatio(datt.current_company[i], datt.crm_company_name[i])

Is there any way I can avoid the loop and use a vectorized form of these functions? Each function in the loop takes two parameters.

Thanks!!

1 Answer

You can use row-wise apply on your dataframe. Here is a toy example:

import pandas as pd
def multiply(x,y):
    return x*y

df = pd.DataFrame({"a": range(1,10000), "b": range(1,10000)})

df["c"] = df.apply(lambda x: multiply(x.a, x.b), axis=1)

This will, in my opinion, make your code a little cleaner by avoiding the explicit loop, but I assume it will not improve performance.
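Applied to the columns from the question, the same row-wise apply could look roughly like the sketch below. To keep it runnable without fuzzywuzzy installed, it uses a stand-in scorer built on the standard library's difflib; that function, the sample rows, and the "similarity" column name are assumptions for illustration only. With fuzzywuzzy available, the dictionary would instead map column names to fuzz.ratio, fuzz.token_sort_ratio, and so on.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Stand-in scorer returning an int from 0 to 100 (hypothetical helper).

    Swap in fuzz.ratio, fuzz.WRatio, etc. when fuzzywuzzy is installed.
    """
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Made-up sample rows using the column names from the question.
datt = pd.DataFrame({
    "current_company": ["Acme Corp", "Globex Inc", "Initech"],
    "crm_company_name": ["Acme Corp", "Globex Incorporated", "Umbrella"],
})

# Map each output column to the function that fills it.
scorers = {"similarity": similarity}

# One row-wise apply per scorer replaces the ten hand-written assignments.
for column, scorer in scorers.items():
    datt[column] = datt.apply(
        lambda row: scorer(row["current_company"], row["crm_company_name"]),
        axis=1,
    )

print(datt)
```

Keeping the scorers in a dictionary means adding or removing a score is a one-line change, and the empty-string column initialisation is no longer needed.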

You could try to use numpy.vectorize:

import numpy as np
df["c"] = np.vectorize(multiply, otypes=["O"])(df.a, df.b)

For my toy example this is quite a bit faster, but I do not know what the fuzzy functions entail, so I am not sure the gain carries over.
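Here is a rough sketch of how that could look for two string columns. The NumPy documentation notes that vectorize is provided mainly for convenience and is essentially a loop internally, so a plain list comprehension over zip is shown next to it as an alternative worth timing. As before, the similarity function is a stand-in for a fuzzywuzzy scorer such as fuzz.ratio, and the sample data and output column names are made up.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def similarity(a, b):
    """Stand-in scorer returning an int from 0 to 100 (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Made-up sample rows using the column names from the question.
datt = pd.DataFrame({
    "current_company": ["Acme Corp", "Globex Inc", "Initech"],
    "crm_company_name": ["Acme Corp", "Globex Incorporated", "Umbrella"],
})

# Option 1: np.vectorize calls the scorer once per pair of elements.
vectorized = np.vectorize(similarity, otypes=[int])
datt["score_vectorize"] = vectorized(
    datt["current_company"], datt["crm_company_name"]
)

# Option 2: a list comprehension over the zipped columns does the same work.
datt["score_zip"] = [
    similarity(a, b)
    for a, b in zip(datt["current_company"], datt["crm_company_name"])
]

print(datt)
```

Both options still call the Python scorer once per row, so any speed-up over the original loop comes from avoiding per-row DataFrame indexing rather than from true vectorization.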

Hope it helps!
