I'm trying to avoid iterrows() in pandas and find a more performant solution. This is my current code: it loops through a DataFrame and, for each record, appends three entries (sequence 0, 1 and 2) to a list:

import pandas as pd

fruit_data = pd.DataFrame({
    'fruit':  ['apple','orange','pear','orange'],
    'color':  ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

array = []

for index, row in fruit_data.iterrows():

    row2 = { 'fruit_2': row['fruit'], 'sequence': 0}
    array.append(row2)
    
    for i in range(2):
        row2 = { 'fruit_2': row['fruit'], 'sequence': i + 1}
        array.append(row2)

print(array)

My real DataFrame has millions of records. Is there a way to optimize this code and NOT use iterrows() or for loops?

  • What are you trying to achieve by adding stuff from a df into a normal list? What for? Pretty sure whatever you want to be done can be done differently... Commented Mar 14, 2022 at 17:30
  • @PatrickArtner this is a simplification of a more complex problem Commented Mar 14, 2022 at 17:31
  • @ps0604 You will want to post something that's more representative of the actual problem, then. Especially regarding Pandas performance, simplified problems will lead to bad solutions. Commented Mar 14, 2022 at 17:34
  • You could start with for i in range(3): array.append({ 'fruit_2': row['fruit'], 'sequence': i }) to begin with; it seems you dumbed it down too much. That won't get rid of iterrows, but at least your code gets more concise. Then try to describe the "what", not the "how" you tried to accomplish it. I am pretty sure that approach is already flawed. Commented Mar 14, 2022 at 17:35
  • This seems like an XY problem. What is the original problem that you are trying to solve? What are you trying to accomplish by building a list of dictionaries from the dataframe? Why don't you use the dataframe directly to solve this problem? If you don't know how to answer the last question, we can probably give suggestions once you answer the other questions. Commented Mar 14, 2022 at 17:36

2 Answers

You could use repeat to repeat each fruit 3 times; then groupby + cumcount to assign sequence numbers; finally to_dict for the final output:

tmp = fruit_data['fruit'].repeat(3).reset_index(name='fruit_2')
tmp['sequence'] = tmp.groupby('index').cumcount()
out = tmp.drop(columns='index').to_dict('records')

Output:

[{'fruit_2': 'apple', 'sequence': 0},
 {'fruit_2': 'apple', 'sequence': 1},
 {'fruit_2': 'apple', 'sequence': 2},
 {'fruit_2': 'orange', 'sequence': 0},
 {'fruit_2': 'orange', 'sequence': 1},
 {'fruit_2': 'orange', 'sequence': 2},
 {'fruit_2': 'pear', 'sequence': 0},
 {'fruit_2': 'pear', 'sequence': 1},
 {'fruit_2': 'pear', 'sequence': 2},
 {'fruit_2': 'orange', 'sequence': 0},
 {'fruit_2': 'orange', 'sequence': 1},
 {'fruit_2': 'orange', 'sequence': 2}]
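
As a sanity check, here is a self-contained sketch that wraps the steps above in a function and compares the result with the loop from the question. The helper name expand_with_sequence and its n parameter are hypothetical, introduced only for illustration. It assumes the DataFrame index is unique, because groupby('index') would otherwise merge rows that share a label.

```python
import pandas as pd

fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4]
})


def expand_with_sequence(df, n=3):
    # Hypothetical helper: repeat each fruit n times and number the copies 0..n-1.
    # Assumes df has a unique, unnamed index so reset_index creates an 'index' column.
    tmp = df['fruit'].repeat(n).reset_index(name='fruit_2')
    tmp['sequence'] = tmp.groupby('index').cumcount()
    return tmp.drop(columns='index').to_dict('records')


# Reference result built with the iterrows() loop from the question.
expected = []
for _, row in fruit_data.iterrows():
    for i in range(3):
        expected.append({'fruit_2': row['fruit'], 'sequence': i})

result = expand_with_sequence(fruit_data)
print(result == expected)
```

If the real index is not unique, calling reset_index(drop=True) on the DataFrame first should restore the assumption.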

1 Comment

Nice! One-liner: fruit_data['fruit'].repeat(3).reset_index(name='fruit_2').pipe(lambda x: x.assign(sequence=x.groupby('index').cumcount())).drop(columns='index').to_dict('records')

Try this out:

import numpy as np

array = (
    fruit_data['fruit']
    .repeat(3)
    .to_frame(name='fruit_2')
    .set_index(np.tile(np.arange(3), len(fruit_data['fruit'])))
    .reset_index()
    .rename({'index':'sequence'},axis=1)
    [['fruit_2', 'sequence']]
    .to_dict('records')
)

Output:

>>> array
[{'fruit_2': 'apple', 'sequence': 0},
 {'fruit_2': 'apple', 'sequence': 1},
 {'fruit_2': 'apple', 'sequence': 2},
 {'fruit_2': 'orange', 'sequence': 0},
 {'fruit_2': 'orange', 'sequence': 1},
 {'fruit_2': 'orange', 'sequence': 2},
 {'fruit_2': 'pear', 'sequence': 0},
 {'fruit_2': 'pear', 'sequence': 1},
 {'fruit_2': 'pear', 'sequence': 2},
 {'fruit_2': 'orange', 'sequence': 0},
 {'fruit_2': 'orange', 'sequence': 1},
 {'fruit_2': 'orange', 'sequence': 2}]
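
A possible variant of the same idea, sketched here under the assumption that every row needs the same number of copies, builds both columns directly with np.repeat and np.tile and skips the index manipulation. The names n, expanded and records are illustrative and do not come from the answer.

```python
import numpy as np
import pandas as pd

fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4]
})

n = 3  # copies per original row (assumed fixed, as in the question)

expanded = pd.DataFrame({
    # Each fruit repeated n times in a row: apple, apple, apple, orange, ...
    'fruit_2': np.repeat(fruit_data['fruit'].to_numpy(), n),
    # The pattern 0..n-1 tiled once per original row: 0, 1, 2, 0, 1, 2, ...
    'sequence': np.tile(np.arange(n), len(fruit_data)),
})

# Only needed if a list of dicts is really required downstream.
records = expanded.to_dict('records')
print(records[:4])
```

With millions of rows, the to_dict('records') step creates one Python dict per row and may end up dominating the runtime, so keeping the result as the expanded DataFrame is likely cheaper if the later processing allows it.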
