Updating a value in a pandas dataframe in an iterrows loop [duplicate]

Question

I am doing some geocoding work that I used selenium to screen scrape the x-y coordinate I need for address of a location, I imported an xls file to pandas dataframe and want to use an explicit loop to update the rows which do not have the x-y coordinate, like below:

for index, row in rche_df.iterrows():
    if isinstance(row.wgs1984_latitude, float):
        row = row.copy()
        target = row.address_chi
        dict_temp = geocoding(target)
        row.wgs1984_latitude = dict_temp['lat']
        row.wgs1984_longitude = dict_temp['long']

I have read Why doesn't this function "take" after I iterrows over a pandas DataFrame? and am fully aware that iterrows only gives us a view rather than a copy for editing, but what if I really to update the value row by row? Is lambda feasible?

I think you can do rche_df.loc[index, 'wgs1984_latitude'] = dict_temp['lat'], i.e. use the index to get at the right section of the original dataframe. Let me know if that doesn't work and I'll try work up a proper answer. — Marius
– Marius, Commented Aug 25, 2014 at 3:39
@Marius looks like it is working, thanks, another alternative is to convert the dataframe into dict and use ordinary for-loop to do the modification — lokheart
– lokheart, Commented Aug 25, 2014 at 3:43
This answer did not work for me (why on Earth not...), but this did: stackoverflow.com/questions/23330654/… — Pablo
– Pablo, Commented May 25, 2018 at 14:57

Marius · Accepted Answer · 2014-08-25 04:04:16Z

230

The rows you get back from iterrows are copies that are no longer connected to the original data frame, so edits don't change your dataframe. Thankfully, because each item you get back from iterrows contains the current index, you can use that to access and edit the relevant row of the dataframe:

for index, row in rche_df.iterrows():
    if isinstance(row.wgs1984_latitude, float):
        row = row.copy()
        target = row.address_chi        
        dict_temp = geocoding(target)
        rche_df.loc[index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.loc[index, 'wgs1984_longitude'] = dict_temp['long']

In my experience, this approach seems slower than using an approach like apply or map, but as always, it's up to you to decide how to make the performance/ease of coding tradeoff.

answered Aug 25, 2014 at 4:04

Marius

60.5k16 gold badges115 silver badges108 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Andy Hayden Over a year ago

This is not strictly true, they may not be copies. Specifically if the dtype is the same for all cols

Peter Ehrlich Over a year ago

This gave a copy warning for me. Ended up using: stackoverflow.com/questions/33518124/…

dumbledad Over a year ago

Don't you get the index back anyway? See @jpp's answer to Pandas for loop over dataframe gives too many values to unpack. The error I get from the code in this answer is ValueError: too many values to unpack (expected 2)

Gamma032 Over a year ago

I found that I had to run df = df.reset_index() to get this working without an index error because I had chopped and sliced my dataframe.

Alireza Mazochi · Accepted Answer · 2021-12-10 09:16:52Z

11

Another way based on this question:

for index, row in rche_df.iterrows():
    if isinstance(row.wgs1984_latitude, float):
        row = row.copy()
        target = row.address_chi        
        dict_temp = geocoding(target)
        
        rche_df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.at[index, 'wgs1984_longitude'] = dict_temp['long']

This link describe difference between .loc and .at. Shortly, .at faster than .loc.

answered Dec 10, 2021 at 9:16

Alireza Mazochi

9872 gold badges18 silver badges28 bronze badges

Comments

cottontail · Accepted Answer · 2023-04-02 19:53:15Z

1. Use `itertuples()` instead

Pandas DataFrames are really a collection of columns/Series objects (e.g. for x in df iterates over the column labels), so even if a loop where to be implemented, it's better if the loop over across columns. iterrows() is anti-pattern to that "native" pandas behavior because it creates a Series for each row, which slows down code so much. A better/faster option is to use itertuples(). It creates namedtuples of each row that you can access either by index or column label. There's almost no modification to the code in the OP to apply it.

Also (as @Alireza Mazochi mentioned), to assign a value to a single cell, at is faster than loc.

for row in rche_df.itertuples():
#                  ^^^^^^^^^^   <------ `itertuples` instead of `iterrows`
    if isinstance(row.wgs1984_latitude, float):
        target = row.address_chi
        dict_temp = geocoding(target)
        rche_df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
        #       ^^ ^^^^^^^^^  <---- `at` instead of `loc` for faster assignment
        #                           `row.Index` is the row's index, can also use `row[0]`

As you can see, using itertuples() is almost the same syntax as using iterrows(), yet it's over 6 times faster (you can verify it with a simple timeit test).

2. `to_dict()` is also an option

One drawback of itertuples() is that whenever there's a space in a column label (e.g. 'Col A' etc.), it will be mangled when converted into a namedtuple, so e.g. if 'address_chi' was 'address chi', it will not be possible to access it via row.address chi. One way to solve this problem is to convert the DataFrame into a dictionary and iterate over it.

Again, the syntax is almost the same as the one used for iterrows().

for index, row in rche_df.to_dict('index').items():
#                         ^^^^^^^^^^^^^^^^^^^^^^^^  <---- convert to a dict
    if isinstance(row['wgs1984_latitude'], float):
        target = row['address_chi']
        dict_temp = geocoding(target)
        rche_df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
        rche_df.at[index, 'wgs1984_longitude'] = dict_temp['long']

This method is also about 6 times faster than iterrows() but slightly slower than itertuples() (also it's more memory-intensive than itertuples() because it creates an explicit dictionary whereas itertuples() creates a generator).

3. Iterate over only the necessary column/rows

The main bottleneck in the particular code in the OP (and in general, why a loop is sometimes necessary in a pandas dataframe) is that the function geocoding() is not vectorized. So one way to make the code much faster is to call it only on the relevant column ('address_chi') and on the relevant rows (filtered using a boolean mask).

Note that creating the boolean mask was necessary only because there was an if-clause in the original code. If a conditional check was not needed, the boolean mask is not needed, so the necessary loop boils down to a single loop over a particular column (address_chi).

# boolean mask to filter only the relevant rows
# this is analogous to if-clause in the loop in the OP
msk = [isinstance(row, float) for row in rche_df['wgs1984_latitude'].tolist()]

# call geocoding on the relevant values 
# (filtered using the boolean mask built above) 
# in the address_chi column
# and create a nested list
out = []
for target in rche_df.loc[msk, 'address_chi'].tolist():
    dict_temp = geocoding(target)
    out.append([dict_temp['lat'], dict_temp['long']])

# assign the nested list to the relevant rows of the original frame
rche_df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out

This method is about 40 times faster than iterrows().

A working example and performance test

def geocoding(x):
    return {'lat': x*2, 'long': x*2}


def iterrows_(df):
    
    for index, row in df.iterrows():
        if isinstance(row.wgs1984_latitude, float):
            target = row.address_chi        
            dict_temp = geocoding(target)
            df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
            df.at[index, 'wgs1984_longitude'] = dict_temp['long']
    
    return df


def itertuples_(df):
    
    for row in df.itertuples():
        if isinstance(row.wgs1984_latitude, float):
            target = row.address_chi
            dict_temp = geocoding(target)
            df.at[row.Index, 'wgs1984_latitude'] = dict_temp['lat']
            df.at[row.Index, 'wgs1984_longitude'] = dict_temp['long']
        
    return df


def to_dict_(df):
    
    for index, row in df.to_dict('index').items():
        if isinstance(row['wgs1984_latitude'], float):
            target = row['address_chi']
            dict_temp = geocoding(target)
            df.at[index, 'wgs1984_latitude'] = dict_temp['lat']
            df.at[index, 'wgs1984_longitude'] = dict_temp['long']
            
    return df


def boolean_mask_loop(df):

    msk = [isinstance(row, float) for row in df['wgs1984_latitude'].tolist()]

    out = []
    for target in df.loc[msk, 'address_chi'].tolist():
        dict_temp = geocoding(target)
        out.append([dict_temp['lat'], dict_temp['long']])

    df.loc[msk, ['wgs1984_latitude', 'wgs1984_longitude']] = out
    
    return df


df = pd.DataFrame({'address_chi': range(20000)})
df['wgs1984_latitude'] = pd.Series([x if x%2 else float('nan') for x in df['address_chi'].tolist()], dtype=object)


%timeit itertuples_(df.copy())
# 248 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit boolean_mask_loop(df.copy())
# 38.7 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit to_dict_(df.copy())
# 289 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit iterrows_(df.copy())
# 1.57 s ± 27.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Collectives™ on Stack Overflow

Updating a value in a pandas dataframe in an iterrows loop [duplicate]

3 Answers 3

4 Comments

Comments

1. Use `itertuples()` instead

2. `to_dict()` is also an option

3. Iterate over only the necessary column/rows

A working example and performance test

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

Comments

1. Use itertuples() instead

2. to_dict() is also an option

3. Iterate over only the necessary column/rows

A working example and performance test

Comments

Linked

Related

1. Use `itertuples()` instead

2. `to_dict()` is also an option