I have a large table with ca. 10 million rows in a PostgreSQL 9.4 database. It looks somewhat like this:

gid | number1 | random | result | ...
 1  |    2    |  NULL  |  NULL  | ...
 2  |   15    |  NULL  |  NULL  | ...
... |   ...   |   ...  |  ...   | ...

Now I would like to update the columns random and result as a function of number1. That means that at least random needs to be produced by a script outside of the database. Since I have limited RAM, I wonder how I could do that efficiently using psycopg2. I believe I face two problems: how to fetch the data without using too much RAM, and how to get it back into the database. The simpleton approach would look like this:

curs.execute("""SELECT gid1, number1 FROM my_table;""")
data = curs.fetchall()

result = []
for i in data:
    result.append((create_random(i[1]), i[0]))
curs.executemany("""UPDATE my_table
                    SET random = %s
                    WHERE gid = %s;""",
                 results)
curs.execute("""UPDATE my_table
                SET result = number1 * random;""")

However, this will certainly deplete all my memory quickly, and the row-by-row executemany will take forever to UPDATE my_table.

What would be a smarter strategy? The database is being accessed exclusively and can be locked. The PostgreSQL random function is unfortunately not suitable for my case.

4 Comments
  • Define “too much RAM”, because 10 million * 3 integers is 240 MB, which is more than manageable on many systems. Also, could your generating logic be implemented as a stored procedure? Commented Oct 21, 2015 at 10:16
  • Can you use [PL/Python](postgresql.org/docs/9.4/static/plpython.html)? If so, you can write your own replacement for Postgres's random function and run it all as one UPDATE statement inside the database (see the sketch after these comments). Commented Oct 21, 2015 at 10:40
  • @foz Oh - interesting! But I assume this wouldn't work in my case because I need the numpy package...? Commented Oct 21, 2015 at 11:38
  • @n1000 well I haven't tried it myself, but there is an example here that imports numpy inside a PL/Python function: github.com/ihuston/plpython_examples/blob/master/… Commented Oct 21, 2015 at 14:19
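
As the comments suggest, the whole job could stay inside the database. Here is a minimal, untested sketch of such a PL/Python function, assuming the plpythonu extension can be installed and numpy is importable by the server's Python interpreter; the function name my_random is hypothetical:

CREATE EXTENSION IF NOT EXISTS plpythonu;

-- Hypothetical replacement for random(), scaled by its argument
CREATE OR REPLACE FUNCTION my_random(n integer) RETURNS double precision AS $$
    import numpy.random
    return numpy.random.random() * n
$$ LANGUAGE plpythonu;

-- Everything then runs server-side, with no client round trip:
UPDATE my_table SET random = my_random(number1);
UPDATE my_table SET result = number1 * random;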

1 Answer


unnest the data to do it all in a single UPDATE:

from random import random  # stand-in for the actual generator

def create_random(i):
    return random() * i

curs.execute("select gid, number from t;")
data = curs.fetchall()

results = []
for gid, number in data:
    results.append((create_random(number), gid))

# one UPDATE joined to the unnested array of (rnd, gid) records
curs.execute("""
    update t
    set
        rnd = s.rnd,
        result = number * s.rnd
    from unnest(%s) s(rnd numeric, gid integer)
    where t.gid = s.gid;
""", (results,))

con.commit()
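
This works because psycopg2 adapts a Python list of tuples to a PostgreSQL array of records, so the parameter arrives as something like ARRAY[(0.42, 1), (0.87, 2), ...]; the column definition list s(rnd numeric, gid integer) tells the server how to type each record. The write-back is therefore one statement and one client/server round trip instead of 10 million.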

Table t:

create table t (
    gid integer,
    number integer,
    rnd float,
    result float
);

2 Comments

Wow - the UPDATE is 500% faster than using executemany. Thanks! So you think fetching all into RAM and putting it back with unnest is OK otherwise? I'm just afraid I might run out of RAM for larger tables and more columns. No need for named cursors or something like that?
@n1000 It is OK as long as you think it is OK to take that client/server round trip. A PL/Python function would make it much simpler than dealing with cursors and client/server round trips.
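
For the named-cursor question, here is an untested sketch of a batched variant that bounds client RAM by streaming rows through a server-side cursor; the connection string, the batch size of 10000, and random() as a stand-in for the real generator are all assumptions:

import psycopg2
from random import random  # stand-in for the real numpy-based generator

con = psycopg2.connect("dbname=mydb")  # hypothetical connection string
read = con.cursor(name='reader')       # named cursor = server-side, streams rows
read.execute("SELECT gid, number1 FROM my_table;")

write = con.cursor()
while True:
    rows = read.fetchmany(10000)       # only one batch in client RAM at a time
    if not rows:
        break
    batch = [(random() * n, gid) for gid, n in rows]
    write.execute("""
        UPDATE my_table t
        SET random = s.rnd,
            result = t.number1 * s.rnd
        FROM unnest(%s) s(rnd numeric, gid integer)
        WHERE t.gid = s.gid;
    """, (batch,))

con.commit()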
