
## Setup

Two tables: schools and students. The keys (the SQLite index) will be (id, time) for the students table and (school, time) for the schools table. My dataset is about something different, but I think the school-student example is easier to understand.

import pandas as pd
import numpy as np
import sqlite3

df_students = pd.DataFrame(
    {'id': list(range(4)) + list(range(4)),
     'time': [0]*4 + [1]*4,
     'school': ['A']*2 + ['B']*2 + ['A']*2 + ['B']*2,
     'satisfaction': np.random.rand(8)})
df_students.set_index(['id', 'time'], inplace=True)
df_students

        satisfaction    school
id  time        
0   0   0.863023    A
1   0   0.929337    A
2   0   0.705265    B
3   0   0.160457    B
0   1   0.208302    A
1   1   0.029397    A
2   1   0.266651    B
3   1   0.646079    B

df_schools = pd.DataFrame(
    {'school': ['A']*2 + ['B']*2,
     'time': [0, 1]*2,
     'mean_scores': np.random.rand(4)})
df_schools.set_index(['school', 'time'], inplace=True)
df_schools


             mean_scores
school time
A      0        0.358154
       1        0.142589
B      0        0.260951
       1        0.683727

## Send to SQLite3

conn = sqlite3.connect('schools_students.sqlite')

df_students.to_sql('students', conn)
df_schools.to_sql('schools', conn)
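
For reference, either table can be read back with its MultiIndex restored (index_col does the re-indexing):

pd.read_sql_query("SELECT * FROM students", conn, index_col=['id', 'time'])
pd.read_sql_query("SELECT * FROM schools", conn, index_col=['school', 'time'])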

## What do I need to do?

I have a bunch of functions that operate over pandas dataframes and create new columns that should then be inserted into either the schools or the students table (depending on what I'm constructing). A typical function does, in order:

  1. Queries columns from both SQL tables
  2. Uses pandas functions such as groupby, apply of custom functions, rolling_mean, etc. (many of them not available in SQL, or difficult to write) to construct a new column. The return type is either pd.Series or np.array
  3. Adds the new column to the appropriate dataframe (schools or students)

These functions were written when I had a small database that fit in memory, so they are pure pandas.

Here's a toy example:

def example_f(df_students, df_schools):
    """Silly function that divides mean satisfaction per school and time by mean score"""
    # here go the pandas functions I already wrote
    mean_satisfaction = (df_students.reset_index()
                         .groupby(['school', 'time'])['satisfaction'].mean())
    return mean_satisfaction / df_schools['mean_scores']

satisf_div_score = example_f(df_students, df_schools)
# Here push satisf_div_score to the `schools` table

Because my dataset is really large, I can't run these functions over the whole dataset in memory. Imagine that schools are located in different districts. Originally I had only one district, so I know these functions can work with the data for each district separately.

A workflow that I think would work is (sketched below):

  • Query the relevant data for district i
  • Apply the function to the data for district i and produce the new column as an np.array or pd.Series
  • Insert this column into the appropriate table (filling in the rows of that column for district i)
  • Repeat for districts i = 1 to K
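
Here's a minimal sketch of that loop. The district column on schools, the extra satisf_div_score column, and reusing example_f as the per-chunk computation are stand-ins for my real schema and functions:

# conn is the sqlite3 connection from the "Send to SQLite3" section above
conn.execute("ALTER TABLE schools ADD COLUMN satisf_div_score REAL")  # one-off; fails if the column exists

# hypothetical `district` column on the schools table
districts = [row[0] for row in conn.execute("SELECT DISTINCT district FROM schools")]

for d in districts:
    # 1. Query only the rows belonging to district d
    schools_d = pd.read_sql_query(
        "SELECT * FROM schools WHERE district = ?",
        conn, params=(d,), index_col=['school', 'time'])
    students_d = pd.read_sql_query(
        "SELECT st.* FROM students st"
        " JOIN schools sc ON st.school = sc.school AND st.time = sc.time"
        " WHERE sc.district = ?",
        conn, params=(d,), index_col=['id', 'time'])

    # 2. Run the existing pandas logic on this chunk only
    new_col = example_f(students_d, schools_d)  # pd.Series indexed by (school, time)

    # 3. Write the values back into the schools table for this district only
    conn.executemany(
        "UPDATE schools SET satisf_div_score = ? WHERE school = ? AND time = ?",
        # cast numpy scalars to plain Python types for sqlite3
        [(float(value), school, int(time)) for (school, time), value in new_col.items()])
    conn.commit()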

Although my dataset is in SQLite (and I'd prefer it to stay that way!) I'm open to migrating it to something else if the benefits are large.


I realize there are different reasonable answers, but it would be great to hear something that has proved useful and simple for you. Thanks!

Comments

  • I'm a little curious: how big are your tables? Usually pandas can handle MANY (millions of) entries before breaking down. You mention using apply, which is generally bad for performance, although sometimes necessary. If you are really running out of memory, your code might not be optimized. Look for variables that hold intermediate results and get rid of them by assigning to the final columns as soon as possible. Otherwise your suggested workflow sounds reasonable, if that is the lowest level where you can slice it. Commented Nov 15, 2016 at 11:34
  • If you do not want to change tools, you may split your dataset into parts, for example by district and/or school number. That way you obtain all the derived values (mean, average and so on) for a small piece of data, which can easily fit in memory and will be calculated quickly, in a pseudo-loop like for s_id in schools: data = get_data_for_school(s_id); calc(data); write_to_sql(s_id, data). Commented Nov 15, 2016 at 12:33
  • Also, if your data is really big, I'd consider using another database, for example PostgreSQL. It works well with huge data that doesn't fit in memory, and has special window functions to calculate rolling averages, rankings and so on. Maybe it can solve all your tasks without pandas. Feel free to ask questions. Commented Nov 15, 2016 at 12:39
  • @EugeneLisitsky: if you want to expand your ideas (with as much detail as you can), then I can give you the bounty. I put up example data to make that process easier. I think a more detailed answer would benefit other people too! Commented Nov 21, 2016 at 17:34
  • @AskeDoerge: same goes for you if you're interested. Commented Nov 21, 2016 at 17:34

1 Answer


There are several approaches; you can pick whichever suits your particular task best:

  1. Move all the data to a "bigger" database. Personally I prefer PostgreSQL - it plays very well with big datasets. Fortunately pandas supports SQLAlchemy, a cross-database SQL toolkit, so you may use the same queries with different databases.
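
    For instance, a quick sketch (connection strings here are illustrative) - only the engine changes, the read_sql_query call stays the same:

    from sqlalchemy import create_engine
    sqlite_engine = create_engine('sqlite:///schools_students.sqlite')
    pg_engine = create_engine('postgresql://user@localhost:5432/database')

    query = "SELECT school, time, mean_scores FROM schools"
    df_now = pd.read_sql_query(query, con=sqlite_engine)
    # after migration, the identical call:
    # df_later = pd.read_sql_query(query, con=pg_engine)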

  2. Split the data into chunks and calculate for each chunk separately. I'll demo it with PostgreSQL, but you may use any DB.

    from sqlalchemy import create_engine
    import psycopg2  # installed as the PostgreSQL driver that SQLAlchemy uses
    mydb = create_engine('postgresql://user@localhost:5432/database')
    # let's select some groups of data into the first dataframe;
    # you may use school ids instead of my sections
    df = pd.read_sql_query('''SELECT sections, count(id) FROM table
                              WHERE created_at < '2016-01-01'
                              GROUP BY sections ORDER BY 2 DESC LIMIT 10''', con=mydb)
    print(df)  # don't worry about the strange output - sections have type int[] and it's supported well!
    
       sections     count
    0  [121, 227]  104583
    1  [296, 227]   48905
    2  [121]        43599
    3  [302, 227]   29684 
    4  [298, 227]   26814
    5  [294, 227]   24071
    6  [297, 227]   23038
    7  [292, 227]   22019
    8  [282, 227]   20369
    9  [283, 227]   19908
    
    # Now we have some sections and we can select only data related to them
    for section in df['sections']:
        df2 = pd.read_sql_query('''SELECT sections, name, created_at, updated_at, status
                                   FROM table
                                   WHERE created_at < '2016-01-01'
                                       AND sections = %(section)s
                                   ORDER BY created_at''',
                                con=mydb, params=dict(section=section))
        print(section, df2.std())
    
    [121, 227] status    0.478194
    dtype: float64
    [296, 227] status    0.544706
    dtype: float64
    [121] status    0.499901
    dtype: float64
    [302, 227] status    0.504573
    dtype: float64
    [298, 227] status    0.518472
    dtype: float64
    [294, 227] status    0.46254
    dtype: float64
    [297, 227] status    0.525619
    dtype: float64
    [292, 227] status    0.627244
    dtype: float64
    [282, 227] status    0.362891
    dtype: float64
    [283, 227] status    0.406112
    dtype: float64
    

    Of course this example is synthetic - it's quite ridiculous to calculate the average status of articles :) But it demonstrates how to split a large amount of data and process it in portions.

  3. Use database-specific statistics features of PostgreSQL (or Oracle, MS SQL Server, or whichever you like). There is excellent documentation on Window Functions in PostgreSQL. Luckily you can perform some calculations in the DB and move the prefabricated data to a DataFrame as above.
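
    As a sketch against the schools table from the question (the 3-row window is just an assumption), a per-school rolling mean can be computed entirely in the database and pulled into a DataFrame:

    rolling = pd.read_sql_query('''
        SELECT school, time, mean_scores,
               AVG(mean_scores) OVER (
                   PARTITION BY school
                   ORDER BY time
                   ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
               ) AS rolling_mean_scores
        FROM schools
        ORDER BY school, time''', con=mydb)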

UPDATE: How to load information back into the database.

Fortunately, DataFrame supports the to_sql method, which makes this process easy:

from sqlalchemy import create_engine
mydb = create_engine('postgresql://user@localhost:5432/database')
df2.to_sql('tablename', mydb, if_exists='append', chunksize=100)

You can specify the action you need: if_exists='append' adds rows to the table, and if you have a lot of rows, chunksize splits them into batches so the DB can insert them gradually.
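
If appending could create duplicate rows, one pattern (a sketch, assuming PostgreSQL 9.5+, a unique key on (school, time), and a chunk DataFrame df_chunk with matching columns) is to stage the chunk and merge it with plain SQL:

from sqlalchemy import create_engine, text

mydb = create_engine('postgresql://user@localhost:5432/database')

# df_chunk is a stand-in for whatever chunk you just computed
df_chunk.to_sql('schools_staging', mydb, if_exists='replace')

with mydb.begin() as con:
    con.execute(text("""
        INSERT INTO schools (school, time, mean_scores)
        SELECT school, time, mean_scores FROM schools_staging
        ON CONFLICT (school, time)
        DO UPDATE SET mean_scores = EXCLUDED.mean_scores
    """))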


Comments

Eugene, thanks for your answer. Do you have any recommendation on a workflow for creating a new column and filling it section by section? I was thinking of something like query = '''insert or replace into NewTable (ID, Name, Age) values (?,?,?)''' followed by conn.executemany(query, df2.to_records(index=False)). If you can edit your response to include that, I'll definitely accept it, thanks.
I've updated the recipe. Also please have a look at the if_exists parameter - it defines the action if the table already exists. I think it's better to avoid duplicates in the first place than to try to resolve them in the DB.
Thanks! I'm a little worried about if_exists='append' in case those records already exist. That's why I tried insert or replace into.
There's also the option replace: if the table exists, drop it, recreate it, and insert the data.
SQLAlchemy supports the "UPSERT" feature of PostgreSQL 9.5+, so you can try to insert data into the DB and automatically update existing rows. But I've never tested it. More info: docs.sqlalchemy.org/en/latest/dialects/…