Optimizing the insertion of 400 000 records in sqlite

Question

I have the python script that creates a sqlite database and fills it with data (originally coming from a json file). At the time I execute the code below, my word table has about 400 000 entries.

con = sqlite3.connect("words.db")
cur = con.cursor()
form_of_words_to_add: "list[tuple(int, str)]" = []
# other code fills this list with also about 400 000 entries

index = 0
for form_of_entry in form_of_words_to_add:
    base_word = form_of_entry[1]
    word_id = form_of_entry[0]
    unaccented_word = unaccentify(form_of_entry[1])
    index += 1

    cur.execute("INSERT INTO form_of_word (word_id, base_word_id) \
SELECT ?, COALESCE ( \
(SELECT w.word_id FROM word w WHERE w.word = ?), \
(SELECT w.word_id FROM word w WHERE w.canonical_form = ?), \
(SELECT w.word_id FROM word w WHERE w.word = ?) \
)", (word_id, base_word, base_word, unaccented_word))

    if index == 1000:
        print(index)
        con.commit()
        index = 0

The code works, but it is very slow and only achieves about 15 insertions per second. I am looking for ideas to optimize it. The bottleneck appears to be the sql query, the rest of the loop takes almost no time in comparison once I comment the SQL out. Is there anything obvious I could do to optimize this very slow process? I am having a hard time thinking of a simpler query. I already tried using PyPy, but this did not increase performance.

The relevant entries in the database are as follows:

CREATE TABLE word 
(
    word_id INTEGER NOT NULL PRIMARY KEY,
    pos VARCHAR, --here had been pos_id
    canonical_form VARCHAR,
    romanized_form VARCHAR,
    genitive_form VARCHAR,
    adjective_form VARCHAR,
    nominative_plural_form VARCHAR,
    genitive_plural_form VARCHAR,
    ipa_pronunciation VARCHAR,
    lang VARCHAR,
    word VARCHAR,
    lang_code VARCHAR
);

CREATE TABLE form_of_word
(
    word_id INTEGER NOT NULL,
    base_word_id INTEGER,
    FOREIGN KEY(word_id) REFERENCES word(word_id),
    FOREIGN KEY(base_word_id) REFERENCES word(word_id)
);

@PavloSlavynskyy Ooh thank you, that was very much the answer. Now the entire code ran in seven seconds, instead of the seven hours it would have taken at original speed. — Pux
– Pux, Commented Jul 19, 2021 at 11:23
@PavloSlavynskyy Do you think you could add an explanation of why indexes help and expand your comment into a full answer? — pacmaninbw
– pacmaninbw ♦, Commented Jul 19, 2021 at 12:02
Yes, but I don't have time right now. Maybe in 10 hours or so — Pavlo Slavynskyy
– Pavlo Slavynskyy, Commented Jul 19, 2021 at 12:04
There was an article posted about this topic on Hacker News recently, in case you're curious. Article and discussion. — Greg Sadetsky
– Greg Sadetsky, Commented Jul 20, 2021 at 17:40

Reinderien · Accepted Answer · 2021-07-19 16:44:34Z

Other than @PavloSlavynskyy's correct recommendation to add indices:

Whereas autocommit is off by default for the Python library, I doubt your code is doing what you think it's doing because as soon as your first commit occurs autocommit is going to be re-enabled. Explicitly start a transaction at the beginning of your 1000-row pages to avoid this issue.
Rather than manually incrementing your loop index, use enumerate
SQLite connection objects are context managers - so use a with.
Rather than

    base_word = form_of_entry[1]
    word_id = form_of_entry[0]

you should unpack:

word_id, base_word = form_of_entry

Your code is going to miss a commit for the last segment of words in all cases except the unlikely one that the total number of words is a multiple of 1000. A trailing commit() should fix this.
Rather than modulating your index, consider just having a nested loop - which also would need neither an outside commit nor enumerate. There are many ways to do this, but basically:

for index in range(0, len(form_of_words_to_add), 1000):
    cur.execute('begin')
    for word_id, base_word in form_of_words_to_add[index: index+1000]:
        cur.execute('insert ...')
    con.commit()

Rather than doing individual, filtered inserts into the destination table, consider doing your inserts unfiltered into a temporary table, keeping them to the "literal" data (no where etc.). After all of the inserts are done, follow it with one single statement using proper joins, roughly looking like

insert into form_of_word(word_id, base_word_id)
select tw.word_id, w.word_id
from word w
join temp_words tw on (
    w.canonical_form = tw.base_word
    or w.word = tw.base_word
    or w.word = tw.unaccented_word
)

This temporary table can be told to live in memory via pragma.

Thank you very much for the answer! I implemented all of your suggestions except the temporary tables so far and got from 7 seconds (achieved after adding the indexing) to 4 seconds. I think the committing worked fine in my version of sqlite, when I accessed it through another DB navigator it updated exactly in the 1000-intervals. — Pux
– Pux, Commented Jul 19, 2021 at 18:26
You could also try enabling wal mode. That can sometimes speed things up significantly. See sqlite.org/wal.html — Blake
– Blake, Commented Jul 19, 2021 at 20:02

Pavlo Slavynskyy · Accepted Answer · 2021-07-20 04:24:34Z

Indexes!

Tables basically has array-like access: to find something, an engine should check all the values in a table. If you have a big table, this can be slow. But there's a solution: indexes. Database indexes work pretty much like book indexes, helping to find information quickly. But indexes are not free, they use some time and space too. So you should add indexes only for fields (or expressions) you're using to find data, in this case - word and canonical_form:

CREATE INDEX word_word_index ON word(word);
CREATE INDEX word_canonical_form_index ON word(canonical_form);

Keys are indexed by default. You can find more information about indexes on SQLite site.

Peter Badida · Accepted Answer · 2021-07-20 10:58:47Z

Split the data

When you are inserting, use multiprocessing.Pool() to distribute the workload by parallelising it = you push more data within the same time

Why 1000?

Don't use own 1000 limit, SQLite and other DBs allow much much more. Instead check the SQLITE_MAX_SQL_LENGTH and other limits to assemble queries efficiently as in push as much data as possible so it's:

quick on your network (if non-SQLite)
quick with your computer

You'll need to measure the time with e.g. inserting 1k-10k rows while adjusting the number of processes and inserted rows in a single query.

Stack Exchange Network

Optimizing the insertion of 400 000 records in sqlite

3 Answers 3

Indexes!

Split the data

Why 1000?

You must log in to answer this question.

Hot Network Questions

Optimizing the insertion of 400 000 records in sqlite

3 Answers 3

Indexes!

Split the data

Why 1000?

You must log in to answer this question.

Related

Hot Network Questions