
I am adding a single column to a Postgres table with 100+ columns via Django (a new migration). How can I update a column in a PostgreSQL table with the data from a pandas DataFrame? The pseudo-code for the Postgres UPDATE would be:

UPDATE wide_table wt
SET z = df.z
WHERE wt.date = 'todays_date'

The reason for doing it this way is that I am computing a column in the DataFrame using a CSV that lives in S3 (this becomes df.z). The Postgres UPDATE docs are straightforward to follow, but I am unsure how to do this via Django, SQLAlchemy, pyodbc, or the like.
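Since Django is mentioned, one possible route is the ORM's bulk_update. A minimal sketch, assuming a hypothetical WideTable model with identifier, z, and date fields:

from datetime import date

# WideTable is a hypothetical Django model mirroring the table below
rows = list(WideTable.objects.filter(date=date.today()))

# z_by_identifier would come from the S3 CSV computation described above
z_by_identifier = {"foo": 10.0, "bar": 19.6, "baz": 15.0}
for row in rows:
    row.z = z_by_identifier[row.identifier]

# bulk_update issues batched UPDATE statements for only the listed fields
WideTable.objects.bulk_update(rows, ["z"], batch_size=1000)

This stays in the ORM, at the cost of loading every affected row into Python; the raw-SQL approaches below avoid that round trip.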

I apologize if this is a bit convoluted. A small and incomplete example would be:

Wide Table (pre-update column z)

identifier    |      x       |      y      |      z       |      date       
foo           |      2       |      1      |     0.0      |      ...           
bar           |      2       |      8      |     0.0      |      ...      
baz           |      3       |      7      |     0.0      |      ...      
foo           |      2       |      8      |     0.0      |      ...      
foo           |      1       |      5      |     0.0      |      ...      
baz           |      2       |      8      |     0.0      |      ...      
bar           |      9       |      3      |     0.0      |      ...      
baz           |      2       |      3      |     0.0      |      ...      

Example Python snippet

import pandas as pd

def apply_function(identifier):
    # Maps baz -> 15.0, bar -> 19.6, foo -> 10.0 for a single date
    df = pd.read_csv("s3_file_path/date_file_name.csv")
    # Compute 'z' based on the identifier and the S3 CSV (elided)
    return z

postgres_query = "SELECT identifier FROM wide_table"
df = pd.read_sql(sql=postgres_query, con=engine)
df['z'] = df.identifier.apply(apply_function)

# Python / SQL Update Logic here to update Postgres Column
???

Wide Table (post-update column z)

identifier    |      x       |      y      |      z        |      date 
foo           |      2       |      1      |     10.0      |      ...     
bar           |      2       |      8      |     19.6      |      ... 
baz           |      3       |      7      |     15.0      |      ... 
foo           |      2       |      8      |     10.0      |      ... 
foo           |      1       |      5      |     10.0      |      ... 
baz           |      2       |      8      |     15.0      |      ... 
bar           |      9       |      3      |     19.6      |      ... 
baz           |      2       |      3      |     15.0      |      ... 

NOTE: The values in z will change daily, so simply creating another table to hold these z values is not a great solution. Also, I'd really prefer to avoid deleting all of the data and adding it back.


2 Answers


Ran into a similar problem, and the currently accepted solution was too slow for me. My table had 500k+ rows and I needed to update 100k+ rows. After lengthy research and trial and error, I arrived at an efficient and correct solution.

The idea is to use psycopg2 as your writer and to stage the new values in a temp table. df is your pandas DataFrame containing the values you want to set.

import psycopg2

conn = psycopg2.connect("dbname='db' user='user' host='localhost' password='test'")
cur = conn.cursor()

rows = zip(df.id, df.z)
# Stage the incoming values in a temp table that is dropped at commit
cur.execute("""CREATE TEMP TABLE codelist(id INTEGER, z INTEGER) ON COMMIT DROP""")
cur.executemany("""INSERT INTO codelist (id, z) VALUES (%s, %s)""", rows)

# Join the temp table to update only the matching rows in the target table
cur.execute("""
    UPDATE table_name
    SET z = codelist.z
    FROM codelist
    WHERE codelist.id = table_name.id;
    """)

print(cur.rowcount)  # number of rows updated
conn.commit()
cur.close()
conn.close()
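If the dataframe is large, executemany can become the bottleneck, since it sends one INSERT per row. A sketch of the same temp-table approach with the load batched via psycopg2.extras.execute_values (table and column names as above):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname='db' user='user' host='localhost' password='test'")
cur = conn.cursor()

cur.execute("CREATE TEMP TABLE codelist(id INTEGER, z INTEGER) ON COMMIT DROP")

# Cast to native Python types; psycopg2 does not adapt numpy scalars by default
rows = [(int(i), int(z)) for i, z in zip(df.id, df.z)]

# execute_values expands the single VALUES %s placeholder into multi-row batches
execute_values(cur, "INSERT INTO codelist (id, z) VALUES %s", rows, page_size=1000)

cur.execute("""
    UPDATE table_name
    SET z = codelist.z
    FROM codelist
    WHERE codelist.id = table_name.id;
    """)
conn.commit()
cur.close()
conn.close()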

4 Comments

Nice solution! Do you know why this is faster than my solution above?
For one, your solution touches rows that don't get updated as part of the data prep; in my case, that is a lot of rows. Probably the bigger optimization here is relying on SQL to build the to-be-set data rather than Python. Also, your solution keeps re-reading the CSV on every function call.
Ah, thanks. Yes, that read_csv inside the function is a very silly mistake for sure. I'll be sure to try out your solution if I need this again in the future.
Is this fast for huge numbers of rows, 1M+? If I use executemany, I find it does one UPDATE per row, I think, which for me is really slow. Thanks.

I managed to cobble together a solution myself: zip the id and z values, then run a generic UPDATE ... FROM (VALUES ...) statement against them.

Data Prep

sql_query = "SELECT id, a FROM wide_table"
df = pd.read_sql(sql=sql_query, con=engine)
df['z'] = df.a.apply(apply_function)

# Build a "(id1, z1), (id2, z2), ..." string for the VALUES clause
zipped_vals = zip(df.id, df.z)
tuple_to_str = str(tuple(zipped_vals))
entries_to_update = tuple_to_str[1:-1]  # strip the outer parentheses

SQL Query Solution:

# Update column z by matching ID from SQL Table & Pandas DataFrame
update_sql_query = f"""UPDATE wide_table t SET z = v.z
                        FROM (VALUES {entries_to_update}) AS v (id, z)
                        WHERE t.id = v.id;"""

with engine.begin() as conn:
    conn.execute(update_sql_query)

# Re-read to confirm the update took effect
pd.read_sql(sql="SELECT id, z FROM wide_table", con=engine)
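Interpolating entries_to_update into the SQL string works, but it breaks on quotes or NaN values and is open to SQL injection. A parameterized sketch (assuming SQLAlchemy 1.4+, where passing a list of dicts to execute() runs an executemany):

from sqlalchemy import text

update_stmt = text("UPDATE wide_table SET z = :z WHERE id = :id")
params = [{"id": int(i), "z": float(z)} for i, z in zip(df.id, df.z)]

with engine.begin() as conn:
    conn.execute(update_stmt, params)

Note this issues one UPDATE per row under the hood, so for very large frames the temp-table answer above will be faster.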

Answer on updating PostgreSQL table column from values

PostgreSQL update docs

