3

As I create new data frames for each customer I'd like to also create one giant data frame of all of them appended together.

I've created a function to group user data how I need it. Now I want to iterate over another data frame containing unique user keys and use those user keys to create data frames for each user. I'd then like to aggregate all those data frames into one giant data frame.

for index, row in unique_users.iterrows():
    customer = user_df(int(index))
    print(customer)

This function works as intended and prints a df for each customer

for index, row in unique_users.iterrows():
    top_users = pd.DataFrame()
    customer = user_df(int(index))
    top_users = top_users.append(customer)
print(top_users)

This only prints out the last customer's df

I expect that as it iterates and creates a new customer df it will append that to the top_user df so at the end I have one giant top_user df. But instead it only contains that last customer's df.
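The comments below identify the bug: `top_users` is re-created on every iteration, so only the final customer survives. A minimal sketch of the fix, using hypothetical stand-ins for `unique_users` and `user_df` (note that `DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used in its place):

```python
import pandas as pd

# Hypothetical stand-ins for the question's objects: unique_users is
# indexed by user key, and user_df returns that user's grouped rows.
unique_users = pd.DataFrame(index=[101, 102, 103])

def user_df(key):
    return pd.DataFrame({"user": [key], "orders": [key % 7]})

# The fix: create top_users ONCE, before the loop. Re-creating it on
# every iteration discards all previously appended customers.
top_users = pd.DataFrame()
for index, row in unique_users.iterrows():
    customer = user_df(int(index))
    # DataFrame.append was removed in pandas 2.0; concat does the same job.
    top_users = pd.concat([top_users, customer], ignore_index=True)

print(top_users)
```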

9
  • 2
    You re-declare top_users inside your for loop. Set top_users = pd.DataFrame() before your loop and it should perform as you expect. Commented May 28, 2019 at 23:16
  • 2
    that being said, I doubt that you should be using .iterrows() to perform this aggregation, but it's impossible to tell without seeing the full code Commented May 28, 2019 at 23:17
  • I second the suggestion that likely, what you are doing can be accomplished without .iterrows. If you describe your situation more fully, some pandas wiz can probably guide you to the "pandas way" of doing things - pandonic you might say. You should consider things like .iterrows and .itertuples as last resorts. Commented May 28, 2019 at 23:23
  • Thanks that worked Commented May 28, 2019 at 23:24
  • 2
    Hi amanda! If you could edit your post to include a faked up couple of input data frames, and a faked up list of what-you-want data frames it would really help. (df1 = pd.DataFrame(....)\n df2 = pd.DataFr..., and so on. I do strongly suspect that you don't want a DataFrame for each user, fwiw. Cheers! Commented May 28, 2019 at 23:39

2 Answers

3

As advised by @unutbu: never call DataFrame.append or pd.concat inside a for-loop; it leads to quadratic copying. Instead, build a list of data frames and call pd.concat once, outside the loop.

And you can actually build the data frames with a list or dictionary comprehension, without iterrows, by using the index values directly. With either comprehension, you avoid the bookkeeping of initializing a container and assigning to it iteratively.

# LIST COMPREHENSION APPROACH
df_list = [user_df(int(idx)) for idx in unique_users.index.values]
top_users = pd.concat(df_list, ignore_index=True)

# DICTIONARY COMPREHENSION APPROACH
df_dict = {idx: user_df(int(idx)) for idx in unique_users.index.values}
top_users = pd.concat(df_dict)  # dict keys become an outer index level
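One subtlety with the dictionary version: when pd.concat receives a mapping, the dict keys become an outer level of the result's index, and passing ignore_index=True discards them, which defeats the point of keying by user. A quick sketch with hypothetical per-user frames:

```python
import pandas as pd

# Hypothetical per-user frames keyed by user id
df_dict = {
    101: pd.DataFrame({"total": [10, 20]}),
    102: pd.DataFrame({"total": [30]}),
}

# Keys become the outer index level, so each row stays labeled by user.
keyed = pd.concat(df_dict, names=["user_key"])
print(keyed.index.get_level_values("user_key").tolist())

# ignore_index=True drops those keys and leaves a plain RangeIndex.
flat = pd.concat(df_dict, ignore_index=True)
print(flat.index.tolist())
```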


1

This is what I do:

_list = []
for index, row in unique_users.iterrows():
    r = row.to_dict()   # convert the row to a dictionary
    _list.append(r)     # append the dictionary to the list

top_users = pd.DataFrame(_list)  # one DataFrame from the list of dictionaries

