4

I have a pandas dataframe with thousands of rows. It has a Names column that holds the customer names alongside their records. I want to create an individual dataframe for each customer, based on their unique names. I got the unique names into a list:

customerNames = DataFrame['customer name'].unique().tolist()

This gives the following list:

['Name1', 'Name2', 'Name3', 'Name4']

I tried looping over the unique names in the list above, creating a dataframe for each name and assigning it to a variable named after the customer. For example, when I write Name3, it should give Name3's data as a separate dataframe.

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

The lines above returned only the dataframe for Name4 and skipped the rest.

How can I solve this problem?

3 Answers

10

Your current iteration overwrites x twice every time it runs: the for loop assigns a customer name to x, and then you assign a dataframe to it.
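To make the rebinding concrete, here is a minimal sketch. The sample frame and its 'amount' column are made up for illustration; only the 'customer name' column and the loop come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the 'customer name' column is from the question.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3'],
    'amount': [10, 20, 30, 40],
})
customerNames = df['customer name'].unique().tolist()

for x in customerNames:
    # x is a string at this point; the assignment below rebinds it to a
    # dataframe, and the next iteration rebinds it to the next name.
    x = df.loc[df['customer name'] == x]

# After the loop, x refers only to the dataframe from the final iteration.
last_names = x['customer name'].unique().tolist()
print(last_names)  # ['Name3']
```

No variable called Name1 or Name2 is ever created, so the earlier dataframes are simply discarded.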

To be able to call each dataframe later by name, try storing them in a dictionary:

df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

df_dict['Name3']
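Note that this answer calls the dataframe df, where the question calls it DataFrame. As a runnable sketch of the same idea, assuming a small made-up frame (the 'amount' column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3'],
    'amount': [10, 20, 30, 40],
})
customerNames = df['customer name'].unique().tolist()

# One entry per customer: key is the name, value is that customer's rows.
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

name3_rows = df_dict['Name3']
print(sorted(df_dict))  # ['Name1', 'Name2', 'Name3']
print(name3_rows)
```

Each value is an ordinary dataframe, so the usual methods work on it, for example df_dict['Name1']['amount'].sum().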

9

To create a dataframe for all the unique values in a column, create a dict of dataframes, as follows.

  • Creates a dict, where each key is a unique value from the column of choice and the value is a dataframe.
  • Access each dataframe as you would a standard dict (e.g. df_names['Name1'])
  • .groupby() returns a GroupBy object; iterating over it yields (key, dataframe) pairs, which can be unpacked.
    • k is each unique value in the column and v is the dataframe of rows associated with that k.
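As a small illustration of what iterating over the grouped object yields, here is a sketch on made-up data (the frame and the 'amount' column are assumptions, not from the question):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    'customer name': ['Name2', 'Name1', 'Name1', 'Name3'],
    'amount': [20, 10, 30, 40],
})

grouped = df.groupby('customer name')

# Each iteration step gives a (key, sub-dataframe) pair.
pairs = [(k, len(v)) for k, v in grouped]
print(pairs)  # [('Name1', 2), ('Name2', 1), ('Name3', 1)]

# A single group can also be fetched directly, without building a dict.
name1_rows = grouped.get_group('Name1')
print(name1_rows)
```

By default groupby sorts the keys, which is why Name1 comes first even though Name2 appears first in the data.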

With a for-loop and .groupby:

df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v

With a Python Dictionary Comprehension

Using .groupby

df_names = {k: v for (k, v) in df.groupby('customer name')}
  • This comes from a conversation with rafaelc, who pointed out that using .groupby is faster than .unique.
    • With 6 unique values in the column, .groupby is faster, at 104 ms compared to 392 ms
    • With 26 unique values in the column, .groupby is faster, at 147 ms compared to 1.53 s.
  • Using a for-loop is slightly faster than a comprehension, particularly with more unique column values or many rows (e.g. 10M).

Using .unique:

df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}

Testing

  • The following data was used for testing
import pandas as pd
import string
import random

random.seed(365)

# 6 unique values
data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

# 26 unique values
data = {'class': [random.choice( list(string.ascii_lowercase)) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

df = pd.DataFrame(data)
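A comparison like the one quoted above could be reproduced along these lines. This is only a sketch: the row count is reduced to 100,000 so it runs quickly, the helper names with_groupby and with_unique are made up, and the timings will differ by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)

n_rows = 100_000  # assumption: smaller than the 1,000,000 rows used above

# 26 unique values in the 'class' column
df = pd.DataFrame({
    'class': [random.choice(string.ascii_lowercase) for _ in range(n_rows)],
    'treatment': [random.choice(['Yes', 'No']) for _ in range(n_rows)],
})


def with_groupby():
    # One pass over the data; groupby splits it into sub-dataframes.
    return {k: v for k, v in df.groupby('class')}


def with_unique():
    # One boolean-mask scan of the whole column per unique value.
    return {name: df[df['class'] == name] for name in df['class'].unique()}


# Total time for 5 calls of each approach.
t_groupby = timeit.timeit(with_groupby, number=5)
t_unique = timeit.timeit(with_unique, number=5)
print(f'groupby: {t_groupby:.3f} s, unique: {t_unique:.3f} s')
```

Both functions should produce the same dict of dataframes; only the way the rows are split differs.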

0

Maybe I misunderstand you, but when

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

gives you the right output only for the last list entry, it's because your output statement is outside the indented body of the loop, so it runs once after the loop has finished.

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                        orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']

for x in customer_list:
    x = customer_df.loc[customer_df['customer'] == x]
    print(x)
    print('now I could append the data to something new')

you get the output:

  customer country
B    James     USA
now I could append the data to something new
  customer country
A     Jean  France
now I could append the data to something new

Or, if you don't like loops, you could go with:

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                        orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']


print(customer_df[customer_df['customer'].isin(customer_list)])

Output:

  customer country
A     Jean  France
B    James     USA

isin is explained in more detail under: How to implement 'in' and 'not in' for Pandas dataframe
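As a brief sketch of the 'in' and 'not in' idea, using the same customer data as above but built with the plain DataFrame constructor (the variable names mask, in_list and not_in_list are just for illustration):

```python
import pandas as pd

customer_df = pd.DataFrame(
    {'customer': ['Jean', 'James', 'Hans'],
     'country': ['France', 'USA', 'Germany']},
    index=['A', 'B', 'C'],
)
customer_list = ['James', 'Jean']

# Boolean mask: True where the customer appears in customer_list.
mask = customer_df['customer'].isin(customer_list)

in_list = customer_df[mask]       # rows whose customer is in the list
not_in_list = customer_df[~mask]  # ~ inverts the mask: the "not in" case

print(in_list)
print(not_in_list)
```

Here in_list holds the rows for Jean and James, and not_in_list holds the row for Hans.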
