
I am doing a lot of SQL-to-pandas work and have run into the following challenge.

I have a dataframe that looks like

UserID, AccountNo, AccountName
123,    12345,     'Some name'
...

What I would like to do is, for each account number, add a column called TotalRevenue whose value comes from a MySQL database. So I am thinking of something like:

for accountno in df['AccountNo']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)

And I need to expand the dataframe such that

UserID, AccountNo, AccountName, TotalRevenue
123,    12345,     'Some name', df1
...

The code that I have so far (which is not working; it raises a getitem error):

sets3 = []
for i, accountno in enumerate(df5['kna1_kunnr']):
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
    df2 = pd.DataFrame([(df5['userid'][i], df5['kna1_kunnr'][i], accountno, df5['kna1_name1'][i], df1['sum'][0])], columns=['User ID', 'AccountNo', 'tjeck', 'AccountName', 'Revenue'])
    sets3.append(df2)

df6 = pd.concat(sets3)

This idea/code is not pretty, and I wonder if there is a better/nicer way to do it. Any ideas?

  • The data comes from a lot of different databases and tables; it is very much cut-and-paste procedures I am doing. Commented Aug 31, 2017 at 14:18
  • Then I think you should read all of your tables into individual dataframes, join them together on appropriate keys, and then do your summarization. Show us 5-10 rows from a representative selection of your tables. Commented Aug 31, 2017 at 14:44
  • @PaulH - or the reverse, write all data to DB and run queries in MySQL Commented Aug 31, 2017 at 16:17
  • @parfait Of course. But this question seems more focused on the pandas-end of things. Commented Aug 31, 2017 at 17:28

1 Answer


Consider exporting the pandas data to MySQL as a temp table, then running an SQL query that joins your pandas data to an aggregate query for TotalRevenue, and finally reading the resultset back into a pandas dataframe. This approach avoids any looping.

from sqlalchemy import create_engine
...

# SQL ALCHEMY CONNECTION (PREFERRED OVER RAW CONNECTION)
engine = create_engine('mysql://user:pwd@localhost/database')
# engine = create_engine("mysql+pymysql://user:pwd@hostname:port/database") # load pymysql

df1.to_sql("mypandastemptable", con=engine, if_exists='replace')

sql = """SELECT t.UserID, t.AccountNo, t.AccountName, agg.TotalRevenue
         FROM mypandastemptable t
         LEFT JOIN 
            (SELECT VBRK_KUNAG as AccountNo,
                    SUM(VBRK_NETWR) as TotalRevenue
             FROM sapdata2016.orders 
             WHERE VBAK_BSARK='ZEDI'
             GROUP BY VBRK_KUNAG) agg
         ON t.AccountNo = agg.AccountNo
"""

newdf = pd.read_sql(sql, con=engine)

Of course, the converse works as well: merge two pandas dataframes, namely the existing dataframe and the grouped aggregate query resultset:

sql = """SELECT VBRK_KUNAG as AccountNo,
                SUM(VBRK_NETWR) as TotalRevenue
         FROM sapdata2016.orders 
         WHERE VBAK_BSARK='ZEDI'
         GROUP BY VBRK_KUNAG 
"""

df2 = pd.read_sql(sql, con=engine)

newdf = df1.merge(df2, on='AccountNo', how='left')
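For reference, the same merge pattern can be exercised entirely in pandas on toy data, with `groupby` standing in for the SQL aggregate. This is only a sketch: the sample values below are made up for illustration, and the column names follow the ones used in the question.

```python
import pandas as pd

# Made-up stand-ins for the question's dataframe and the orders table
df1 = pd.DataFrame({
    'UserID': [123, 456],
    'AccountNo': [12345, 67890],
    'AccountName': ['Some name', 'Other name'],
})
orders = pd.DataFrame({
    'VBRK_KUNAG': [12345, 12345, 67890, 67890],
    'VBAK_BSARK': ['ZEDI', 'ZEDI', 'ZEDI', 'XXXX'],
    'VBRK_NETWR': [100.0, 50.0, 25.0, 999.0],
})

# Pandas equivalent of the WHERE / GROUP BY aggregate query
agg = (orders[orders['VBAK_BSARK'] == 'ZEDI']
       .groupby('VBRK_KUNAG', as_index=False)['VBRK_NETWR'].sum()
       .rename(columns={'VBRK_KUNAG': 'AccountNo',
                        'VBRK_NETWR': 'TotalRevenue'}))

# Left merge keeps every account, even ones with no ZEDI orders
newdf = df1.merge(agg, on='AccountNo', how='left')
```

Accounts with no matching rows end up with `NaN` in TotalRevenue, which you can turn into zero with `fillna(0)` if that is what the report needs.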

2 Comments

The latter seems like the thing I was looking for! Thanks a lot. Quick follow-up: I have been using the raw connection because I could not get SQLAlchemy to work, but I hope with this example I will get it to. :-)
Got it. And do note the commented-out SQLAlchemy connection that interfaces with the pymysql module.
