
I am trying to load the results of an SQLAlchemy query into a pandas DataFrame.

When I do:

df = pd.DataFrame(LPRRank.query.all())

I get

>>> df
0        <M. Misty || 1 || 18>
1        <P. Patch || 2 || 18>
...
...

But, what I want is each column in the database to be a column in the dataframe:

0        M. Misty  1  18
1        P. Patch  2  18
...
...

and when I try:

dff = pd.read_sql_query(LPRRank.query.all(), db.session())

I get an Attribute Error:

AttributeError: 'SignallingSession' object has no attribute 'cursor'

and

dff = pd.read_sql_query(LPRRank.query.all(), db.session)

also gives an error:

AttributeError: 'scoped_session' object has no attribute 'cursor'

What I'm using to generate the list of objects is:

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
db = SQLAlchemy(app)

class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False) 
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))

    def __repr__(self):
        return '<{} || {} || {}>'.format(self.candid, self.rank, self.user_id)

This question: How to convert SQL Query result to PANDAS Data Structure? is error free, but gives each row as an object, which is not what I want. I can access the individual columns in the returned object, but it seems like there should be a better way to do it.

The documentation at pandas.pydata.org is great if you already understand what is going on and just need to review syntax. The documentation from April 20, 2016 (the 1319 page pdf) identifies a pandas connection as still experimental on p.872.

Now, SQLALCHEMY/PANDAS - SQLAlchemy reading column as CLOB for Pandas to_sql is about specifying the SQL type; mine uses the SQLAlchemy defaults.

And, sqlalchemy pandas to_sql OperationalError, Writing to MySQL database with pandas using SQLAlchemy, to_sql, and SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db are about writing to the SQL database, which produces an operational error, a database error, and a 'create table' error, none of which is my problem.

This one, SQLAlchemy Pandas read_sql from jsonb wants a jsonb attribute to columns: not my cup 'o tea.

This previous question, SQLAlchemy ORM conversion to pandas DataFrame, addresses my issue, but the solution (using query.session.bind) is not my solution. I'm opening/closing sessions with db.session.add() and db.session.commit(), but when I use db.session.bind as specified in the second answer there, I get an AttributeError:

AttributeError: 'list' object has no attribute '_execute_on_connection'

3 Answers


Simply add an __init__ method to your model and call the class object before building the DataFrame. Specifically, the code below creates an iterable of tuples bound into columns with pandas.DataFrame().

class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False) 
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))

    def __init__(self, candid=None, rank=None, user_id=None):
        self.data = (candid, rank, user_id)

    def __repr__(self):
        # __repr__ must return a string, not a tuple
        return '<{} || {} || {}>'.format(self.candid, self.rank, self.user_id)

data = db.session.query(LPRRank).all()
df = pd.DataFrame([(d.candid, d.rank, d.user_id) for d in data], 
                  columns=['candid', 'rank', 'user_id'])

Alternatively, use the SQLAlchemy ORM based on your defined Model class, LPRRank, to run read_sql:

df = pd.read_sql(sql = db.session.query(LPRRank)\
                         .with_entities(LPRRank.candid,
                                        LPRRank.rank,
                                        LPRRank.user_id).statement, 
                 con = db.session.bind)
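Following up on the redundant-id point raised in the comments: read_sql accepts an index_col argument, which makes the primary key the DataFrame index instead of a duplicate data column. The sketch below is self-contained for illustration, using plain SQLAlchemy and an in-memory SQLite database (with made-up sample rows) rather than the Flask-SQLAlchemy app from the question:

```python
# Self-contained sketch: plain SQLAlchemy + in-memory SQLite stand in for
# the question's Flask-SQLAlchemy setup; the sample rows are invented.
import pandas as pd
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class LPRRank(Base):
    __tablename__ = 'lprrank'
    id = Column(Integer, primary_key=True)
    candid = Column(String(40))
    rank = Column(Integer)
    user_id = Column(Integer)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

session = Session(engine)
session.add_all([LPRRank(candid='M. Misty', rank=1, user_id=18),
                 LPRRank(candid='P. Patch', rank=2, user_id=18)])
session.commit()

# index_col='id' turns the primary key into the DataFrame index,
# so it does not appear again as a regular column
df = pd.read_sql(sql=session.query(LPRRank).statement,
                 con=session.get_bind(),
                 index_col='id')
```

In the Flask-SQLAlchemy setup from the question, the equivalent would be passing index_col='id' alongside con=db.session.bind.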

5 Comments

thanks for pointing out the formatting. I took that directly from Miguel Grinberg's Flask tutorial, which is very good. His formatting mirrors the SQLAlchemy documentation, and that is consistent with the error I get with df = pd.DataFrame(LPRRank.query.all(), columns=['candid', 'rank', 'user_id']):
... the error is ValueError: Shape of passed values is (1, 113), indices imply (3, 113). It's looking for a string, but getting a tuple. Regarding the second solution, I get an: sqlalchemy.exc.ObjectNotExecutableError: Not an executable object: <flask_sqlalchemy.BaseQuery object at 0x1134dc860>
See update using __init__ in above call and add .statement to sql arg of below call. Both tested with no errors.
They both work. Thanks! I particularly like being able to do it without looping, even in a list comprehension. However the read_sql includes the SQLAlchemy ID thereby creating a redundancy with the indexing of the DataFrame. I can eliminate the redundancy by manipulating the DataFrame, but can it be withheld in the read_sql command?
Awesome! And indeed you can select specific columns using with_entities. See update. Or use index_col='id' arg in read_sql to select primary key field as index.

The Parfait answer is good, but it can have two problems:

  1. Efficiency: each object creation implies duplicating the data into a DataFrame, so a list of DataFrames can take time to create.
  2. It does not mirror a DataFrame as a collection of rows.

The example below therefore provides a parent class assimilated to a DataFrame representation, and a child class assimilated to a row of that DataFrame.

The code provides two ways to get a DataFrame; the DataFrame object is created only on demand, so as not to waste CPU and memory.

If the DataFrame is needed at creation time, you only have to add a constructor (def __init__(self, rows: List[MyDataFrameRow] = None) ...), create a new attribute, and assign it the result of self.data_frame.

from typing import Any, Tuple

from pandas import DataFrame, read_sql
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import relationship, Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MyDataFrame(Base):
    __tablename__ = 'my_data_frame'
    id = Column(Integer, primary_key=True)
    rows = relationship('MyDataFrameRow', cascade='all,delete')

    @property
    def data_frame(self) -> DataFrame:
        columns = MyDataFrameRow.data_frame_columns()
        return DataFrame([[getattr(row, column) for column in columns] for row in self.rows],
                         columns=columns)

    @staticmethod
    def to_data_frame(identifier: int, session: Session) -> DataFrame:
        query = session.query(MyDataFrameRow).join(MyDataFrame).filter(MyDataFrame.id == identifier)
        return read_sql(query.statement, session.get_bind())


class MyDataFrameRow(Base):

    __tablename__ = 'my_data_row'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)
    number_of_children = Column(Integer)
    height = Column(Integer)
    parent_id = Column(Integer, ForeignKey('my_data_frame.id'))

    @staticmethod
    def data_frame_columns() -> Tuple[Any, ...]:
        return tuple(column.name for column in MyDataFrameRow.__table__.columns
                     if len(column.foreign_keys) == 0 and column.primary_key is False)
...
session = Session(...)
df1 = MyDataFrame.to_data_frame(1,session)
my_table_obj = session.query(MyDataFrame).filter(MyDataFrame.id == 1).one()
df2 = my_table_obj.data_frame


0

I'm using flask-sqlalchemy with reflection to build my models but this worked for me:

import pandas as pd
from app.models import Runs
from app import db

def get_db_data_df():
    df_runs = pd.read_sql(Runs.__table__.name, con=db.get_engine(), index_col=None)
    return df_runs
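The same whole-table read can be sketched without the Flask-SQLAlchemy app; everything here (the runs table and its columns) is invented for illustration. When pd.read_sql is given a bare table name rather than a query, it reads the entire table, which requires a SQLAlchemy connectable:

```python
# Hypothetical stand-in for the reflected model: a plain 'runs' table
# in an in-memory SQLite database.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('sqlite://')
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE runs (id INTEGER PRIMARY KEY, duration REAL)'))
    conn.execute(text('INSERT INTO runs (duration) VALUES (1.5), (2.5)'))

# A bare table name makes pd.read_sql read the whole table,
# analogous to passing Runs.__table__.name with con=db.get_engine()
df_runs = pd.read_sql('runs', con=engine, index_col=None)
```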

