cursor.execute vs cursor.executemany minimal synthetic benchmark
main.py
from pathlib import Path
import sqlite3
import csv
f = 'tmp.sqlite'
n = 10000000
Path(f).unlink(missing_ok=True)
connection = sqlite3.connect(f)
#connection.set_trace_callback(print)
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
#for i in range(n):
# cursor.execute(f"INSERT INTO t VALUES ({i})")
cursor.executemany(f"INSERT INTO t VALUES (?)", ((str(i),) for i in range(n)))
connection.commit()
connection.close()
Results according to time main.py:
| method |
storage |
time (s) |
executemany |
SSD |
8.6 |
executemany |
:memory: |
9.8 |
executemany |
HDD |
10.4 |
execute |
SSD |
31 |
execute |
:memory: |
29 |
Baseline time of for i in range(n): pass: 0.3s.
Conclusions:
executemany is about 3x faster
- we are not I/O bound on
execute vs executemany, we are likely memory bound. This might not be very surprising given that the final tmp.sqlite file is only 115 MB, and that my SSD can handle 3 GB/s
If I enable connection.set_trace_callback(print) to log queries as per: How can I log queries in Sqlite3 with Python? and reduce n = 5 I see the exact same queries for both execute and executemany:
CREATE TABLE t (x integer)
BEGIN
INSERT INTO t VALUES ('0')
INSERT INTO t VALUES ('1')
INSERT INTO t VALUES ('2')
INSERT INTO t VALUES ('3')
INSERT INTO t VALUES ('4')
COMMIT
so it does not seem that the speed difference is linked to transactions, as both appear to execute within a single transaction. There are some comments on automatic transaction control at: https://docs.python.org/3/library/sqlite3.html#transaction-control but they are not super clear, I hope the logs are correct.
Insertion time baseline
I'll be looking out for the fastest possible method I can find to compare it to Python. So far, this generate_series approach is the winner:
f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(1,10000000)'
finished as fast as 1.4s on SSD, therefore substantially faster than any Python method so far, and well beyond being SSD-bound.
Tested on Ubuntu 23.04, Python 3.11.2, Lenovo ThinkPad P51
- SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed
- HDD: Seagate ST1000LM035-1RK1 1TB, 140 MB/s nominal speed