I have a Python script that creates a SQLite database and fills it with data (originally coming from a JSON file). At the time I execute the code below, my word table has about 400,000 entries.
import sqlite3

con = sqlite3.connect("words.db")
cur = con.cursor()

form_of_words_to_add: "list[tuple[int, str]]" = []
# other code fills this list with about 400,000 entries

index = 0
for form_of_entry in form_of_words_to_add:
    word_id = form_of_entry[0]
    base_word = form_of_entry[1]
    unaccented_word = unaccentify(base_word)  # helper defined elsewhere
    index += 1
    cur.execute("INSERT INTO form_of_word (word_id, base_word_id) \
        SELECT ?, COALESCE( \
            (SELECT w.word_id FROM word w WHERE w.word = ?), \
            (SELECT w.word_id FROM word w WHERE w.canonical_form = ?), \
            (SELECT w.word_id FROM word w WHERE w.word = ?) \
        )", (word_id, base_word, base_word, unaccented_word))
    if index == 1000:
        print(index)
        con.commit()
        index = 0
con.commit()  # commit the last partial batch
The code works, but it is very slow, achieving only about 15 insertions per second. I am looking for ideas to optimize it. The bottleneck appears to be the SQL query itself: once I comment it out, the rest of the loop takes almost no time in comparison. Is there anything obvious I could do to speed this process up? I am having a hard time thinking of a simpler query. I already tried PyPy, but it did not improve performance.
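The only concrete idea I have had so far is that word.word and word.canonical_form have no indexes, so (as far as I understand) each of the three subqueries has to scan the whole 400,000-row word table. A minimal sketch of what I mean, run once before the loop (I have not benchmarked whether this actually helps; the index names are just my own):

    # Hypothetical: index the columns the subqueries filter on, so the
    # lookups become index seeks instead of full table scans.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_word_word ON word (word)")
    cur.execute("CREATE INDEX IF NOT EXISTS idx_word_canonical_form ON word (canonical_form)")
    con.commit()

Would that be the right direction, or is there something better?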
The relevant parts of the schema are as follows:
CREATE TABLE word
(
    word_id INTEGER NOT NULL PRIMARY KEY,
    pos VARCHAR, -- here had been pos_id
    canonical_form VARCHAR,
    romanized_form VARCHAR,
    genitive_form VARCHAR,
    adjective_form VARCHAR,
    nominative_plural_form VARCHAR,
    genitive_plural_form VARCHAR,
    ipa_pronunciation VARCHAR,
    lang VARCHAR,
    word VARCHAR,
    lang_code VARCHAR
);

CREATE TABLE form_of_word
(
    word_id INTEGER NOT NULL,
    base_word_id INTEGER,
    FOREIGN KEY (word_id) REFERENCES word (word_id),
    FOREIGN KEY (base_word_id) REFERENCES word (word_id)
);
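For completeness, if batching is part of the answer, this is the kind of rewrite I imagine: replacing the per-row cur.execute() with a single executemany() call in one transaction. This is only a sketch, untested; unaccentify is my own helper from the loop above, and I am not sure how much it would help if the subqueries themselves are the bottleneck.

    # Hypothetical batched version of the loop above: build all parameter
    # tuples first, then let sqlite3 run the statement once per tuple.
    params = [
        (word_id, base_word, base_word, unaccentify(base_word))
        for word_id, base_word in form_of_words_to_add
    ]
    cur.executemany(
        "INSERT INTO form_of_word (word_id, base_word_id) "
        "SELECT ?, COALESCE("
        "(SELECT w.word_id FROM word w WHERE w.word = ?), "
        "(SELECT w.word_id FROM word w WHERE w.canonical_form = ?), "
        "(SELECT w.word_id FROM word w WHERE w.word = ?))",
        params,
    )
    con.commit()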