
Pandas newbie.

A SQL table is made of 3 columns (ID is the primary key):

> ID    VALUE1    VALUE2 
> 1       11        28 
> 2       21      (None) 
> 3       31        56 
> 4       41      (None)

With Pandas I load all the rows where VALUE2 is NULL:

import pandas as pd
from sqlalchemy import create_engine

query = "SELECT * FROM `TABLE_NAME` WHERE (`VALUE2` IS NULL)"
engine = create_engine("mysql://user:pwd@ip/db")
df = pd.read_sql(query, con=engine)
engine.dispose()

Everything ok till now.

After the load, the missing VALUE2 values are calculated according to some rules.

THE PROBLEM

If I update the database with

df.to_sql("TABLE_NAME", con=engine, if_exists="replace", index=False)

all the original rows that were not loaded into the dataframe are LOST:

> ID    VALUE1    VALUE2 
> 2       21       103 
> 4       41        72

Is there a way to update leaving the original lines untouched?

I want to obtain this:

> ID    VALUE1    VALUE2 
> 1       11        28 
> 2       21       103 
> 3       31        56 
> 4       41        72

It looks like the whole table is rewritten instead of updated...

Loading the whole table just to update a few rows would technically solve the problem, but it is highly inefficient and not acceptable.

Any idea about "why"?

2 Answers


You're using the option if_exists="replace".

From the Pandas documentation (my emboldening):

replace: If table exists, drop it, recreate it, and insert data.

So it's doing exactly what you're asking of it. You can try if_exists="append" instead, but note that append simply inserts the dataframe's rows on top of the existing ones: with ID as the primary key, re-inserting rows 2 and 4 will raise duplicate-key errors rather than update them.
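To see the drop-and-recreate behaviour concretely, here is a minimal sketch using the stdlib sqlite3 module standing in for MySQL (the to_sql call behaves the same way with either backend); the table and values are taken from the question:

```python
# Minimal demonstration (sqlite3 standing in for MySQL) that
# if_exists="replace" drops and recreates the whole table, so every
# row missing from the dataframe is lost.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_NAME (ID INTEGER PRIMARY KEY, VALUE1 INTEGER, VALUE2 INTEGER);
    INSERT INTO TABLE_NAME VALUES (1, 11, 28), (2, 21, NULL), (3, 31, 56), (4, 41, NULL);
""")

# Only the two formerly-NULL rows, with VALUE2 filled in (values from the question)
df = pd.DataFrame({"ID": [2, 4], "VALUE1": [21, 41], "VALUE2": [103, 72]})
df.to_sql("TABLE_NAME", con=conn, if_exists="replace", index=False)

result = pd.read_sql("SELECT * FROM TABLE_NAME ORDER BY ID", conn)
print(result)  # only IDs 2 and 4 survive
```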

Alternatively, you can interact with your table directly using MySQLdb, and use UPDATE.
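A sketch of that alternative, again with stdlib sqlite3 standing in for MySQLdb (with MySQL the UPDATE text is the same, just with %s placeholders); the fill-in rule VALUE2 = VALUE1 * 2 is purely hypothetical, since the question doesn't show the real rules:

```python
# Sketch: write the computed values back with per-row UPDATE statements,
# leaving every other row untouched.
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_NAME (ID INTEGER PRIMARY KEY, VALUE1 INTEGER, VALUE2 INTEGER);
    INSERT INTO TABLE_NAME VALUES (1, 11, 28), (2, 21, NULL), (3, 31, 56), (4, 41, NULL);
""")

# Load only the rows with a missing VALUE2, as in the question
df = pd.read_sql("SELECT * FROM TABLE_NAME WHERE VALUE2 IS NULL", conn)

# Hypothetical fill-in rule (the question's real rules are not shown)
df["VALUE2"] = df["VALUE1"] * 2

# One UPDATE per row; int() casts the numpy scalars for the DB-API driver
conn.executemany(
    "UPDATE TABLE_NAME SET VALUE2 = ? WHERE ID = ?",
    [(int(v), int(i)) for v, i in zip(df["VALUE2"], df["ID"])],
)
conn.commit()

result = pd.read_sql("SELECT * FROM TABLE_NAME ORDER BY ID", conn)
print(result)  # all four rows kept; only rows 2 and 4 changed
```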


1 Comment

It looks like I misunderstood "replace": skimming the docs, I thought it referred to the single record, not to the whole table. Thank you.

It's a case of mixing the best of both worlds. Do what you are doing at the moment, but use a different table. This is essentially a temporary table, but AFAIK pandas doesn't support them, so let's just drop it afterwards.

df.to_sql(tmp_table_name, con=engine, if_exists="replace", index=False)

Then we make use of MySQL's INSERT ... ON DUPLICATE KEY UPDATE syntax, spelling out each non-key column:

INSERT INTO TABLE_NAME SELECT * FROM tmp_table ON DUPLICATE KEY UPDATE VALUE1 = VALUES(VALUE1), VALUE2 = VALUES(VALUE2)

This would usually be a fast operation.

2 Comments

It is a little strange that it is necessary to go through a double step for something so "simple". Pandas was very close to doing the job alone... Thank you!
Glad to have been of help.
