35

I'd like to append to an existing table, using pandas df.to_sql() function.

I set if_exists='append', but my table has primary keys.

I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error.

Is this possible with pandas, or do I need to write an explicit query?


14 Answers

38

There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation, inserting only the rows that were not already in the database (the dataframe is named df):

from sqlalchemy.exc import IntegrityError

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
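A self-contained, runnable version of this loop, using an in-memory SQLite database to stand in for the real one (the table name t and the sample data are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

# In-memory database with a primary-key table and one pre-existing row
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)"))
    conn.execute(text("INSERT INTO t VALUES (1, 'existing')"))

df = pd.DataFrame({"id": [1, 2], "val": ["duplicate", "new"]})

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="t", if_exists="append", con=engine, index=False)
    except IntegrityError:
        pass  # duplicate primary key: skip this row

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM t")).scalar()
print(count)  # 2: the duplicate id=1 row was skipped, id=2 was inserted
```

Each row costs a separate round trip plus exception handling, which is why the comments below report it being slow on large frames.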

4 Comments

Don't forget to add if_exists='append' as a parameter.
This solves the problem, but it slows down the query very much.
For those using sqlalchemy, this is what worked for me: Adding this import: from sqlalchemy import exc and changing the exception to this: except exc.IntegrityError as e:. Like @miro said, it does slow down the process by a lot.
What if there are columns like created_at and updated_at in the table that are auto-filled? This approach doesn't work then!
31

You can do this with the method parameter of to_sql:

from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)

df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)

For older versions of SQLAlchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted)).
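The same method= hook can be exercised end-to-end without a MySQL server. Here is a sketch that swaps in SQLite's on_conflict_do_nothing instead of MySQL's on_duplicate_key_update (the table name, engine, and data are illustrative; the MySQL version above behaves analogously, except that it updates the existing row rather than ignoring the new one):

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.dialects.sqlite import insert  # SQLite analog of the MySQL dialect insert

def insert_do_nothing(table, conn, keys, data_iter):
    # table is a pandas SQLTable; table.table is the underlying SQLAlchemy Table
    stmt = insert(table.table).values(list(data_iter)).on_conflict_do_nothing()
    conn.execute(stmt)

engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE trades (id INTEGER PRIMARY KEY, qty INTEGER)"))
    conn.execute(text("INSERT INTO trades VALUES (1, 100)"))

df = pd.DataFrame({"id": [1, 2], "qty": [999, 200]})
df.to_sql("trades", engine, if_exists="append", index=False,
          chunksize=4096, method=insert_do_nothing)

with engine.connect() as conn:
    rows = conn.execute(text("SELECT id, qty FROM trades ORDER BY id")).fetchall()
print(rows)  # [(1, 100), (2, 200)]: the conflicting row was ignored
```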

9 Comments

Got an error: ValueError: update parameter must be a non-empty dictionary
@HuyTran I'm not sure why you would get that. Does the db table exist already? Do your dataframe's columns match the table's columns?
@HuyTran what version of pandas are you using?
@jayen Can you please explain your answer? For example, on how insert_stmt.inserted behaves? I intend to use your function, but want slightly different behavior. This function seems to be causing issue like this: dba.stackexchange.com/questions/60295/…
Usually I see someone modify the top answer, and then people comment that my lower-rated, later answer is a duplicate.
10

Please note that if_exists='append' relates to the existence of the table and what to do if the table does not exist; it does not relate to the content of the table. See the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

if_exists : {'fail', 'replace', 'append'}, default 'fail'
- fail: If table exists, do nothing.
- replace: If table exists, drop it, recreate it, and insert data.
- append: If table exists, insert data. Create if does not exist.
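A quick illustration of that point with an in-memory SQLite database: append never inspects the existing rows, so without a primary key constraint nothing stops duplicates.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")
df = pd.DataFrame({"id": [1], "val": ["a"]})

df.to_sql("t", engine, index=False)                      # creates the table
df.to_sql("t", engine, index=False, if_exists="append")  # appends blindly

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM t")).scalar()
print(count)  # 2: the same row now appears twice
```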


7

The for-loop method above slows things down significantly. There is a method parameter you can pass to pandas.DataFrame.to_sql to customize the SQL insertion:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql

The code below should work for Postgres and does nothing if there is a conflict on the primary key "unique_code". Change the insert dialect for your database.

def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column
    columns=[]
    for c in keys:
        columns.append(column(c))

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)

    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])

    conn.execute(do_nothing_stmt)

df.to_sql('mytable', con=sql_engine, if_exists='append', method=insert_do_nothing_on_conflicts)


5

Pandas has no option for it currently, but here is the GitHub issue. If you need this feature too, upvote it.

1 Comment

And in the meantime there is pangres: pypi.org/project/pangres
2

Pandas doesn't support editing the actual SQL syntax of the .to_sql method, so you might be out of luck. There are some experimental programmatic workarounds (say, read the DataFrame into a SQLAlchemy object with CALCHIPAN and use SQLAlchemy for the transaction), but you may be better served by writing your DataFrame to a CSV and loading it with an explicit MySQL function.

CALCHIPAN repo: https://bitbucket.org/zzzeek/calchipan/

1 Comment

pandas.pydata.org/pandas-docs/stable/whatsnew/… pandas.DataFrame.to_sql() has gained the method argument to control SQL insertion clause. See the insertion method section in the documentation. (GH8953)
1

I had trouble where I was still getting the IntegrityError.

Strange, but I just took the approach above and worked it backwards:

for i, row in df.iterrows():
    sql = "SELECT * FROM `Table_Name` WHERE `key` = '{}'".format(row.Key)
    found = pd.read_sql(sql, con=Engine)
    if len(found) == 0:
        # select by label, since iterrows yields index labels
        df.loc[[i]].to_sql(name="Table_Name", if_exists='append', con=Engine)
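The per-row SELECT above can be collapsed into a single round trip: fetch the existing keys once, filter the dataframe, then append in one call. A sketch with an illustrative table t keyed on id:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory table standing in for the real database
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)"))
    conn.execute(text("INSERT INTO t VALUES (1, 'old')"))

df = pd.DataFrame({"id": [1, 2], "val": ["dup", "new"]})

# One SELECT for all existing keys, then a vectorized filter
existing = pd.read_sql("SELECT id FROM t", con=engine)["id"]
new_rows = df[~df["id"].isin(existing)]
new_rows.to_sql("t", con=engine, if_exists="append", index=False)

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM t")).scalar()
print(count)  # 2: only the id=2 row was appended
```

Note that, unlike a true INSERT IGNORE, this is not safe against concurrent writers inserting keys between the SELECT and the append.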


1

In my case, I was trying to insert new data into an empty table, but some of the rows were duplicated, almost the same issue as here. I thought about fetching the existing data, merging it with the new data, and continuing, but that is not optimal; it may work only for small data, not for huge tables.

As pandas does not provide any handling for this situation right now, I looked for a suitable workaround and made my own. Not sure whether it will work for you, but I decided to take control of my data first instead of hoping for the best, so I remove duplicates before calling .to_sql. That way, if any error happens, I know more about my data and am sure about what is going on:

import pandas as pd


def write_to_table(table_name, data):
    df = pd.DataFrame(data)
    # Sort by price, so that dropping duplicates keeps the lowest price only
    df = df.sort_values(by='price')
    df.drop_duplicates(subset=['id_key'], keep='first', inplace=True)
    df.to_sql(table_name, engine, index=False, if_exists='append', schema='public')

So in my case, I wanted to keep the row with the lowest price (by the way, I was passing a list of dicts as data), and for that I sorted first. That is not strictly necessary, but it is an example of what I mean by controlling which data to keep.
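The sort-then-drop_duplicates pattern in isolation, with illustrative data:

```python
import pandas as pd

data = [
    {"id_key": 1, "price": 9.0},
    {"id_key": 1, "price": 5.0},
    {"id_key": 2, "price": 7.0},
]

# Sort ascending by price, then keep the first (lowest-priced) row per key
df = (pd.DataFrame(data)
        .sort_values("price")
        .drop_duplicates(subset=["id_key"], keep="first"))

print(df.sort_values("id_key").to_dict("records"))
# [{'id_key': 1, 'price': 5.0}, {'id_key': 2, 'price': 7.0}]
```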

I hope this will help someone who got almost the same as my situation.


0

When you use SQL Server, you'll get an error when you insert a duplicate value into a table that has a primary key constraint. You can fix it by altering your table:

CREATE TABLE [dbo].[DeleteMe](
    [id] [uniqueidentifier] NOT NULL,
    [Value] [varchar](max) NULL,
    CONSTRAINT [PK_DeleteMe]
        PRIMARY KEY ([id] ASC)
        WITH (IGNORE_DUP_KEY = ON));  -- add this option

Taken from https://dba.stackexchange.com/a/111771.

Now your df.to_sql() should work again.


0

The solutions by Jayen and Huy Tran helped me a lot, but they didn't work straight out of the box. The problem I faced with Jayen's code is that it requires the DataFrame columns to exactly match those of the database table. This was not true in my case, as there were some DataFrame columns that I don't write to the database.
I modified the solution so that it considers the column names.

from sqlalchemy.dialects.mysql import insert

def insertWithConflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data, whilst taking care of conflicts.
    Used to handle duplicate key errors during database population.
    This is my modification of the code snippet
    from https://stackoverflow.com/questions/30337394/pandas-to-sql-fails-on-duplicate-primary-key

    The help page from https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.values
    proved useful.

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted. It is a zip object.
                Its length is equal to the chunksize passed to df.to_sql()
    """
    # Pair the column names with each row of values
    vals = [dict(zip(keys, row)) for row in data_iter]
    insertStmt = insert(sqltable.table).values(vals)
    doNothingStmt = insertStmt.on_duplicate_key_update(dict(insertStmt.inserted))
    conn.execute(doNothingStmt)


0

I faced the same issue and I adopted the solution provided by @Huy Tran for a while, until my tables started to have schemas. I had to improve his answer a bit and this is the final result:

from sqlalchemy import table, column
from sqlalchemy.dialects.postgresql import insert

def do_nothing_on_conflicts(sql_table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sql_table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    columns = []
    for c in keys:
        columns.append(column(c))

    if sql_table.schema:
        my_table = table(sql_table.name, *columns, schema=sql_table.schema)
    else:
        my_table = table(sql_table.name, *columns)

    insert_stmt = insert(my_table).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing()

    conn.execute(do_nothing_stmt)

How to use it:

history.to_sql('history', schema=schema, con=engine, method=do_nothing_on_conflicts)


0

The idea is the same as @Nfern's, but it uses a recursive function to divide the df in half on each iteration, skipping the row or rows causing the integrity violation.

from sqlalchemy.exc import IntegrityError

def insert(df):
    try:
        # inserting into backup table
        df.to_sql("table", con=engine, if_exists='append', index=False, schema='schema')
    except IntegrityError:
        rows = df.shape[0]
        if rows > 1:
            df1 = df.iloc[:int(rows/2), :]
            df2 = df.iloc[int(rows/2):, :]
            insert(df1)
            insert(df2)
        else:
            print(f"{df} not inserted. Integrity violation, duplicate primary key/s")
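The same bisection idea as a self-contained sketch against an in-memory SQLite table (the table name and data are illustrative). Because a failed bulk insert is rolled back as a whole, each half can safely be retried:

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)"))
    conn.execute(text("INSERT INTO t VALUES (2, 'old')"))

def insert_bisect(df):
    try:
        df.to_sql("t", con=engine, if_exists="append", index=False)
    except IntegrityError:
        if len(df) > 1:
            mid = len(df) // 2          # split and retry each half
            insert_bisect(df.iloc[:mid])
            insert_bisect(df.iloc[mid:])
        # else: a single duplicate row; skip it

insert_bisect(pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]}))

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM t")).scalar()
print(count)  # 3: ids 1 and 3 were inserted, the duplicate id=2 was skipped
```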


0

Same as @Jayen's answer, but for PostgreSQL and with do-nothing-on-conflict logic (see the sqlalchemy doc):

from sqlalchemy.dialects.postgresql import insert

def insert_or_do_nothing_on_conflict(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    # you need to specify the column(s) used to infer the unique index
    on_duplicate_key_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['column_index1', 'column_index2'])
    conn.execute(on_duplicate_key_stmt)


df.to_sql(
    name="table_name",
    schema="schema_name",
    con=engine,
    if_exists="append",
    index=False,
    method=insert_or_do_nothing_on_conflict
)


0

I would explicitly search for IDs that already exist and update each as a separate step. Alternatively, get one dataframe with all the data from the T2 table and another with all the data from the T1 table, join them on ID, and issue your update statements where the matched columns differ.
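The join-based comparison this answer describes can be sketched with an indicator merge in pandas (the frames t1/t2 and column names are illustrative):

```python
import pandas as pd

t1 = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})   # existing table
t2 = pd.DataFrame({"id": [2, 3], "val": ["B", "c"]})   # incoming data

# Left merge keeps every incoming row; _merge flags whether a match exists
merged = t2.merge(t1, on="id", how="left",
                  suffixes=("", "_existing"), indicator=True)

# Rows with no match are inserts; matched rows with differing values are updates
new_rows = merged.loc[merged["_merge"] == "left_only", ["id", "val"]]
changed  = merged.loc[(merged["_merge"] == "both") &
                      (merged["val"] != merged["val_existing"]), ["id", "val"]]

print(new_rows.to_dict("records"))  # [{'id': 3, 'val': 'c'}]
print(changed.to_dict("records"))   # [{'id': 2, 'val': 'B'}]
```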

