1. Use itertuples() instead
Pandas DataFrames are really a collection of columns/Series objects (e.g. for x in df iterates over the column labels), so even if a loop where to be implemented, it's better if the loop over across columns. iterrows() is anti-pattern to that "native" pandas behavior because it creates a Series for each row, which slows down code so much. A better/faster option is to use itertuples(). It creates namedtuples of each row that you can access either by index or column label. There's almost no modification to the code in the OP to apply it.
Also (as @Alireza Mazochi mentioned), to assign a value to a single cell, at is faster than loc.
for row in rche_df.itertuples():
# ^^^^^^^^^^ <------ `itertuples` instead of `iterrows`
if isinstance(row.wgs1984_latitude, float):
target = row.address_chi
dict_temp = geocoding(target)
rche_df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
rche_df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
# ^^ ^^^^^^^^^ <---- `at` instead of `loc` for faster assignment
# `row.Index` is the row's index, can also use `row[0]`
As you can see, using itertuples() is almost the same syntax as using iterrows(), yet it's over 6 times faster (you can verify it with a simple timeit test).
2. to_dict() is also an option
One drawback of itertuples() is that whenever there's a space in a column label (e.g. 'Col A' etc.), it will be mangled when converted into a namedtuple, so e.g. if 'address_chi' was 'address chi', it will not be possible to access it via row.address chi. One way to solve this problem is to convert the DataFrame into a dictionary and iterate over it.
Again, the syntax is almost the same as the one used for iterrows().
for index, row in rche_df.to_dict('index').items():
# ^^^^^^^^^^^^^^^^^^^^^^^^ <---- convert to a dict
if isinstance(row['wgs1984_latitude'], float):
target = row['address_chi']
dict_temp = geocoding(target)
rche_df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
rche_df.at[index, 'wgs1984_longitude'] = dict_temp['long']
This method is also about 6 times faster than iterrows() but slightly slower than itertuples() (also it's more memory-intensive than itertuples() because it creates an explicit dictionary whereas itertuples() creates a generator).
3. Iterate over only the necessary column/rows
The main bottleneck in the particular code in the OP (and in general, why a loop is sometimes necessary in a pandas dataframe) is that the function geocoding() is not vectorized. So one way to make the code much faster is to call it only on the relevant column ('address_chi') and on the relevant rows (filtered using a boolean mask).
Note that creating the boolean mask was necessary only because there was an if-clause in the original code. If a conditional check was not needed, the boolean mask is not needed, so the necessary loop boils down to a single loop over a particular column (address_chi).
# boolean mask to filter only the relevant rows
# this is analogous to if-clause in the loop in the OP
msk = [isinstance(row, float) for row in rche_df['wgs1984_latitude'].tolist()]
# call geocoding on the relevant values
# (filtered using the boolean mask built above)
# in the address_chi column
# and create a nested list
out = []
for target in rche_df.loc[msk, 'address_chi'].tolist():
dict_temp = geocoding(target)
out.append([dict_temp['lat'], dict_temp['long']])
# assign the nested list to the relevant rows of the original frame
rche_df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out
This method is about 40 times faster than iterrows().
A working example and performance test
def geocoding(x):
return {'lat': x*2, 'long': x*2}
def iterrows_(df):
for index, row in df.iterrows():
if isinstance(row.wgs1984_latitude, float):
target = row.address_chi
dict_temp = geocoding(target)
df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
df.at[index, 'wgs1984_longitude'] = dict_temp['long']
return df
def itertuples_(df):
for row in df.itertuples():
if isinstance(row.wgs1984_latitude, float):
target = row.address_chi
dict_temp = geocoding(target)
df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
return df
def to_dict_(df):
for index, row in df.to_dict('index').items():
if isinstance(row['wgs1984_latitude'], float):
target = row['address_chi']
dict_temp = geocoding(target)
df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
df.at[index, 'wgs1984_longitude'] = dict_temp['long']
return df
def boolean_mask_loop(df):
msk = [isinstance(row, float) for row in df['wgs1984_latitude'].tolist()]
out = []
for target in df.loc[msk, 'address_chi'].tolist():
dict_temp = geocoding(target)
out.append([dict_temp['lat'], dict_temp['long']])
df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out
return df
df = pd.DataFrame({'address_chi': range(20000)})
df['wgs1984_latitude'] = pd.Series([x if x%2 else float('nan') for x in df['address_chi'].tolist()], dtype=object)
%timeit itertuples_(df.copy())
# 248 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit boolean_mask_loop(df.copy())
# 38.7 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit to_dict_(df.copy())
# 289 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterrows_(df.copy())
# 1.57 s ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
rche_df.loc[index, 'wgs1984_latitude'] = dict_temp['lat'], i.e. use the index to get at the right section of the original dataframe. Let me know if that doesn't work and I'll try work up a proper answer.