## Setup
Two tables: schools and students. The keys (the pandas index, and what I'd index on in SQLite) will be (id, time) for the students table and (school, time) for the schools table. My dataset is about something different, but I think the school-student example is easier to understand.
import pandas as pd
import numpy as np
import sqlite3
df_students = pd.DataFrame(
{'id': list(range(0,4)) + list(range(0,4)),
'time': [0]*4 + [1]*4, 'school': ['A']*2 + ['B']*2 + ['A']*2 + ['B']*2,
'satisfaction': np.random.rand(8)} )
df_students.set_index(['id', 'time'], inplace=True)
df_students
         satisfaction school
id time
0  0         0.863023      A
1  0         0.929337      A
2  0         0.705265      B
3  0         0.160457      B
0  1         0.208302      A
1  1         0.029397      A
2  1         0.266651      B
3  1         0.646079      B
df_schools = pd.DataFrame(
    {'school': ['A']*2 + ['B']*2,
     'time': [0, 1]*2,
     'mean_scores': np.random.rand(4)} )
df_schools.set_index(['school', 'time'], inplace=True)
df_schools
             mean_scores
school time
A      0        0.358154
       1        0.142589
B      0        0.260951
       1        0.683727
## Send to SQLite3
conn = sqlite3.connect('schools_students.sqlite')
df_students.to_sql('students', conn)
df_schools.to_sql('schools', conn)
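Pulling data back out is the mirror image. A minimal sketch of the query step, using `pd.read_sql_query` against the tables created above (`to_sql` stored the dataframe indexes as ordinary columns, so they can be restored with `index_col`):

# Read selected columns back into indexed dataframes
students = pd.read_sql_query(
    'SELECT id, time, school, satisfaction FROM students',
    conn, index_col=['id', 'time'])
schools = pd.read_sql_query(
    'SELECT school, time, mean_scores FROM schools',
    conn, index_col=['school', 'time'])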
## What do I need to do?
I have a bunch of functions that operate on pandas dataframes and create new columns that should then be inserted into either the schools or the students table (depending on what I'm constructing). A typical function does, in order:

- Queries columns from both SQL tables
- Uses pandas functions such as `groupby`, `apply` of custom functions, `rolling_mean`, etc. (many of them not available in SQL, or difficult to write there) to construct a new column; the return type is either `pd.Series` or `np.array`
- Adds the new column to the appropriate dataframe (`schools` or `students`)
These functions were written when I had a small database that fit in memory, so they are pure pandas.
Here's an example in pseudo-code:
def example_f(satisfaction, mean_scores):
    """Silly function that divides mean satisfaction per school by mean score"""
    # here go the pandas functions I already wrote
    mean_satisfaction = satisfaction.mean()
    return mean_satisfaction / mean_scores

satisf_div_score = example_f(df_students['satisfaction'], df_schools['mean_scores'])
# Here push satisf_div_score to the `schools` table
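Since the new column has to land in an existing SQL table rather than replace it, the "push" step needs more than `to_sql`. A minimal sketch of one way to do it (the helper name `push_column` is made up; it assumes the Series has a MultiIndex whose level names match the table's key columns):

def push_column(conn, table, col_name, series):
    """Add or refresh a column on an existing SQLite table from a
    pandas Series indexed by the table's key columns."""
    try:
        conn.execute(f'ALTER TABLE {table} ADD COLUMN {col_name} REAL')
    except sqlite3.OperationalError:
        pass  # column already exists
    key_cols = list(series.index.names)          # e.g. ['school', 'time']
    where = ' AND '.join(f'{k} = ?' for k in key_cols)
    sql = f'UPDATE {table} SET {col_name} = ? WHERE {where}'
    conn.executemany(sql, [(float(v), *key) for key, v in series.items()])
    conn.commit()

push_column(conn, 'schools', 'satisf_div_score', satisf_div_score)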
Because my dataset is really large, I can't run these functions over the whole database in memory. Imagine that schools are located in different districts. Originally I only had one district, so I know these functions can work on the data for each district separately.
A workflow that I think would work is (sketched in code after this list):

- Query relevant data for district i
- Apply function to the data for district i and produce new columns as `np.array` or `pd.Series`
- Insert this column into the appropriate table (filling the rows of that column for district i)
- Repeat for districts i = 1 to K
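Putting the pieces together, here is a minimal sketch of that loop. The `district` column on `schools` is hypothetical (the sample tables above don't have one), and `push_column` is the helper sketched earlier:

def process_district(conn, district):
    # 1. Query only the rows belonging to this district
    schools = pd.read_sql_query(
        'SELECT * FROM schools WHERE district = ?',
        conn, params=(district,), index_col=['school', 'time'])
    students = pd.read_sql_query(
        'SELECT * FROM students WHERE school IN '
        '(SELECT DISTINCT school FROM schools WHERE district = ?)',
        conn, params=(district,), index_col=['id', 'time'])
    # 2. Apply the existing pure-pandas function to this slice
    new_col = example_f(students['satisfaction'], schools['mean_scores'])
    # 3. Insert the result; only this district's rows are updated
    push_column(conn, 'schools', 'satisf_div_score', new_col)

# 4. Repeat for all districts
for (district,) in conn.execute('SELECT DISTINCT district FROM schools'):
    process_district(conn, district)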
Although my dataset is in SQLite (and I'd prefer it to stay that way!) I'm open to migrating it to something else if the benefits are large.
I realize there are different reasonable answers, but it would be great to hear something that has proved useful and simple for you. Thanks!
## Answer

You mention `apply`, which is generally bad for performance, although sometimes necessary. If you are really running out of memory, your code might not be optimized: look for variables that hold intermediate results and get rid of them by assigning them to the final columns as soon as possible. Otherwise your suggested workflow sounds reasonable, if that is the lowest level where you can slice it:

for s_id in schools:
    data = get_data_for_school(s_id)
    calc(data)
    write_to_sql(s_id, data)
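On the intermediate-results point, a tiny illustration (the dataframe `df` and column names here are placeholders, and I'm using the modern `.rolling(...).mean()` spelling of `rolling_mean`):

# Wasteful: `tmp` keeps an extra full-length copy alive
tmp = df['satisfaction'].rolling(4).mean()
df['rolling_satisfaction'] = tmp

# Better: assign straight to the final column so nothing lingers
df['rolling_satisfaction'] = df['satisfaction'].rolling(4).mean()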