
Pandas newbie.

A SQL table is made of 3 columns (ID is the primary key):

> ID    VALUE1    VALUE2 
> 1       11        28 
> 2       21      (None) 
> 3       31        56 
> 4       41      (None)

With Pandas I load all the rows where VALUE2 is NULL:

import pandas as pd
from sqlalchemy import create_engine

query = "SELECT * FROM `TABLE_NAME` WHERE (`VALUE2` IS NULL)"
engine = create_engine("mysql://user:pwd@ip/db")
df = pd.read_sql(query, con=engine)
engine.dispose()

Everything ok till now.

After the load, the missing VALUE2 values are calculated according to some rules.

THE PROBLEM

If I update the database with

df.to_sql("TABLE_NAME", con=engine, if_exists="replace", index=False)

all the original rows that were not loaded into the dataframe are LOST:

> ID    VALUE1    VALUE2 
> 2       21       103 
> 4       41        72

Is there a way to update leaving the original lines untouched?

I want to obtain this:

> ID    VALUE1    VALUE2 
> 1       11        28 
> 2       21       103 
> 3       31        56 
> 4       41        72

It looks like the whole table is rewritten instead of updated...

Loading the whole table just to update a few rows would technically solve the problem, but it is highly inefficient and not acceptable.

Any idea about "why"?

2 Answers


You're using the option if_exists="replace".

From the Pandas documentation (my emboldening):

replace: If table exists, drop it, recreate it, and insert data.

So it's doing exactly what you're asking of it. You can try if_exists="append" instead, but note that append simply inserts the dataframe's rows on top of the existing ones: with ID as the primary key, re-inserting rows 2 and 4 will raise duplicate-key errors rather than update them.
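To see the drop-and-recreate behaviour concretely, here is a minimal sketch using the stdlib sqlite3 module standing in for MySQL (the to_sql call behaves the same way with either backend); the table and values are taken from the question:

```python
# Minimal demonstration (sqlite3 standing in for MySQL) that
# if_exists="replace" drops and recreates the whole table, so every
# row missing from the dataframe is lost.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_NAME (ID INTEGER PRIMARY KEY, VALUE1 INTEGER, VALUE2 INTEGER);
    INSERT INTO TABLE_NAME VALUES (1, 11, 28), (2, 21, NULL), (3, 31, 56), (4, 41, NULL);
""")

# Only the two formerly-NULL rows, with VALUE2 filled in (values from the question)
df = pd.DataFrame({"ID": [2, 4], "VALUE1": [21, 41], "VALUE2": [103, 72]})
df.to_sql("TABLE_NAME", con=conn, if_exists="replace", index=False)

result = pd.read_sql("SELECT * FROM TABLE_NAME ORDER BY ID", conn)
print(result)  # only IDs 2 and 4 survive
```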

Alternatively, you can interact with your table directly using MySQLdb, and use UPDATE.
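A sketch of that alternative, again with stdlib sqlite3 standing in for MySQLdb (with MySQL the UPDATE text is the same, just with %s placeholders); the fill-in rule VALUE2 = VALUE1 * 2 is purely hypothetical, since the question doesn't show the real rules:

```python
# Sketch: write the computed values back with per-row UPDATE statements,
# leaving every other row untouched.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_NAME (ID INTEGER PRIMARY KEY, VALUE1 INTEGER, VALUE2 INTEGER);
    INSERT INTO TABLE_NAME VALUES (1, 11, 28), (2, 21, NULL), (3, 31, 56), (4, 41, NULL);
""")

# Load only the rows with a missing VALUE2, as in the question
df = pd.read_sql("SELECT * FROM TABLE_NAME WHERE VALUE2 IS NULL", conn)

# Hypothetical fill-in rule (the question's real rules are not shown)
df["VALUE2"] = df["VALUE1"] * 2

# One UPDATE per row; int() casts the numpy scalars for the DB-API driver
conn.executemany(
    "UPDATE TABLE_NAME SET VALUE2 = ? WHERE ID = ?",
    [(int(v), int(i)) for v, i in zip(df["VALUE2"], df["ID"])],
)
conn.commit()

result = pd.read_sql("SELECT * FROM TABLE_NAME ORDER BY ID", conn)
print(result)  # all four rows kept; only rows 2 and 4 changed
```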


1 Comment

It looks like I misunderstood "replace": skimming the docs, I thought it referred to the single record, not to the whole table. Thank you.

It's a case of mixing the best of both worlds. Do what you are doing at the moment, but use a different table. This is essentially a temporary table, but AFAIK pandas doesn't support them, so let's just drop it afterwards.

df.to_sql(tmp_table_name, con=engine, if_exists="replace", index=False)

Then we make use of MySQL's INSERT ... ON DUPLICATE KEY UPDATE syntax, spelling out each non-key column:

INSERT INTO TABLE_NAME SELECT * FROM tmp_table ON DUPLICATE KEY UPDATE VALUE1 = VALUES(VALUE1), VALUE2 = VALUES(VALUE2)

This would usually be a fast operation.

2 Comments

It is a little strange that it is necessary to go through a double step for something so "simple". Pandas was very close to doing the job alone... Thank you!
Glad to have been of help.
