I have a dataframe (let's call it goodsdf) which consists of 240K rows and a bunch of columns:

|   id | name   | description | ... | availability | price |
|-----:|:-------|:------------|:----|-------------:|------:|
| 1001 | Item A | Frying pan  | ... |            3 |  20.1 |
| 2031 | Item B | Firewood    | ... |            0 |   5   |
| 3412 | Item C | Olive oil   | ... |           10 |  12.5 |

Now, in the next step, I'm continuously reading a stream of item updates. These updates include, among other things, new prices for items, and they are pulled every 90 seconds. The stream I'm receiving also contains some 100K additional items which are not of interest for my store.

What I'm looking to do is update the dataframe with the new prices. To do so I use the following (partially pseudo) code:

for entity in feed.entity:
    if entity.HasField('product_update'):
        if entity.product_update.id == goodsdf['id']:                  # pseudo: find the row with this id
            if goodsdf['availability'] != 0:                           # pseudo: only if that item is in stock
                goodsdf['price'] = entity.product_update.price         # pseudo: overwrite that row's price

From what I have read, there are several different ways of accessing values in dataframes, e.g. isin(), str.contains() and a couple of others. However, many of them only return True/False masks. Another way I tried was reading the new prices and item IDs into separate dataframes and merging those into my original goodsdf later, but that turned out to be costly in both time and computer resources.
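For illustration, this is a rough sketch of the boolean-mask route (with a hypothetical new_prices dict of id → price built from the feed); I'm not sure how to combine this properly with the availability check:

# hypothetical: new_prices maps item id -> updated price, built from the feed
new_prices = {1001: 999, 2031: 999}

# isin() only gives a True/False mask, but combined with the availability check
# it can drive a .loc update; the prices still need to be aligned per id via map()
mask = goodsdf['id'].isin(new_prices.keys()) & (goodsdf['availability'] != 0)
goodsdf.loc[mask, 'price'] = goodsdf.loc[mask, 'id'].map(new_prices)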

I'm not quite sure I fully understand the concept of using nested if statements in combination with updating values in a dataframe.

  • I'm also considering ditching the idea of using dataframes in favor of setting up a SQL database.

1 Answer


One approach could be to extract the id and price first to create a flattened list of dicts, load that as a new dataframe, merge it with the original on id, and then replace the prices in the original df where they meet the conditions using where(). I'm not sure if it's efficient enough for your use case, but at least it avoids looping through the data row by row:

import pandas as pd

# Example feed: two relevant product updates plus one irrelevant entity
feed = {'entity1': {'product_update': {'id': 1001, 'price': 999}},
        'entity2': {'product_update': {'id': 2031, 'price': 999}},
        'entity3': {'superfluous': 'test'}}

# Flatten the feed into a list of {'id': ..., 'price': ...} dicts, skipping entities without a product_update
extracted_feed_data = [v for val in feed.values() if (v := val.get('product_update'))]

# The original goodsdf
data = [{"id": 1001, "name": "Item A", "description": "Frying pan", "availability": 3, "price": 20.1},
        {"id": 2031, "name": "Item B", "description": "Firewood", "availability": 0, "price": 5},
        {"id": 3412, "name": "Item C", "description": "Olive oil", "availability": 10, "price": 12.5}]

df_update = pd.DataFrame(extracted_feed_data)
df = pd.DataFrame(data)

# The left merge keeps every row of df; the feed price comes in as 'price_y' (NaN for items not in the feed)
merged = df.merge(df_update, on='id', how='left')
# Take the new price only where the item is in stock and a new price exists, otherwise keep the old one
df['price'] = merged['price_y'].where((df['availability'] != 0) & (merged['price_y'].notnull()), df['price'])

Output df:

|    |   id | name   | description   |   availability |   price |
|---:|-----:|:-------|:--------------|---------------:|--------:|
|  0 | 1001 | Item A | Frying pan    |              3 |   999   |
|  1 | 2031 | Item B | Firewood      |              0 |     5   |
|  2 | 3412 | Item C | Olive oil     |             10 |    12.5 |
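If the merge turns out to be too heavy for 240K rows every 90 seconds, another variation you could benchmark (just a sketch, and it assumes the ids are unique) is to index both frames by id and let DataFrame.update do the aligned overwrite in place; update() only writes non-NA values, so items missing from the feed keep their old price:

# Sketch: align on id and update the price in place (assumes unique ids)
df_idx = df.set_index('id')
upd = df_update.set_index('id')[['price']]
# Drop updates for items that are out of stock (or not present in df at all)
upd = upd.loc[upd.index.intersection(df_idx.index[df_idx['availability'] != 0])]
df_idx.update(upd)
df = df_idx.reset_index()

Either way, it's probably worth timing both variants against the real 240K-row frame before deciding whether to move to a SQL database.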

1 Comment

Thanks for the reply! I agree with you, I don't think this is a very efficient way of handling such an amount of data. I'll give it a try and see what results I get; otherwise I think a SQL database is a much better idea.
