Problem
I want to read data from multiple sources into a dictionary of this shape:
```python
person = {
    'name': 'John Doe',
    'email': '[email protected]',
    'age': 50,
    'connected': False
}
```
The data comes in different formats:
Format A.
```python
dict_a = {
    'name': {
        'first_name': 'John',
        'last_name': 'Doe'
    },
    'workEmail': '[email protected]',
    'age': 50,
    'connected': False
}
```
Format B.
```python
dict_b = {
    'fullName': 'John Doe',
    'workEmail': '[email protected]',
    'age': 50,
    'connected': False
}
```
Additional sources with new structures will be added in the future.
Background
For this specific case, I'm building a Scrapy spider that scrapes the data from different APIs and web pages. Scrapy's recommended way would be to use their Item or ItemLoader, but it's ruled out in my case.
There could potentially be 5-10 different structures from which the data will be read.
Implementation
/database/models.py
"""
Database mapping declarations for SQLAlchemy
"""
from sqlalchemy import Column, Integer, String, Boolean
from database.connection import Base
class PersonModel(Base):
__tablename__ = 'Person'
id = Column(Integer, primary_key=True)
name = Column(String)
email = Column(String)
age = Column(Integer)
connected = Column(Boolean)
/mappers/person.py
"""
Data mappers for Person
"""
# Abstract class for mapper
class Mapper(object):
def __init__(self, data):
self.data = data
# Data mapper for format A, maps the fields from dict_a to Person
class MapperA(Mapper):
def __init__(self, data):
self.name = ' '.join(data.get('name', {}).get(key) for key in ('first_name', 'last_name'))
self.email = data.get('workEmail')
self.age = data.get('age')
self.connected = data.get('connected')
@classmethod
def is_mapper_for(cls, data):
needed = {'name', 'workEmail'}
return needed.issubset(set(data))
# Data mapper for format B, maps the fields from dict_b to Person
class MapperB(Mapper):
def __init__(self, data):
self.name = data.get('fullName')
self.email = data.get('workEmail')
self.age = data.get('age')
self.connected = data.get('connected')
@classmethod
def is_mapper_for(cls, data):
needed = {'fullName', 'workEmail'}
return needed.issubset(set(data))
# Creates a Person instance base on the input data mapping
def Person(data):
for cls in Mapper.__subclasses__():
if cls.is_mapper_for(data):
return cls(data)
raise NotImplementedError
if __name__ == '__main__':
from database.connection import make_session
from database.models import PersonModel
# Sample data for example
dict_a = {
'name': {
'first_name': 'John',
'last_name': 'Doe'
},
'workEmail': '[email protected]',
'age': 50,
'connected': False
}
dict_b = {
'fullName': 'John Doe',
'workEmail': '[email protected]',
'age': 50,
'connected': False
}
# Instantiate Person from data
persons = [PersonModel(**Person(data).__dict__ for data in (dict_a, dict_b)]
with make_session() as session:
session.add_all(persons)
session.commit()
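The mapper dispatch can be sanity-checked without the database. The following is a self-contained sketch that repeats the mapper classes from above so it runs standalone (the SQLAlchemy parts are omitted, and the `example.com` emails are placeholder sample data):

```python
# Standalone sanity check of the mapper dispatch (no database needed).

class Mapper(object):
    def __init__(self, data):
        self.data = data


class MapperA(Mapper):
    def __init__(self, data):
        self.name = ' '.join(data.get('name', {}).get(key)
                             for key in ('first_name', 'last_name'))
        self.email = data.get('workEmail')

    @classmethod
    def is_mapper_for(cls, data):
        return {'name', 'workEmail'}.issubset(data)


class MapperB(Mapper):
    def __init__(self, data):
        self.name = data.get('fullName')
        self.email = data.get('workEmail')

    @classmethod
    def is_mapper_for(cls, data):
        return {'fullName', 'workEmail'}.issubset(data)


def Person(data):
    # First subclass whose required keys are present wins
    for cls in Mapper.__subclasses__():
        if cls.is_mapper_for(data):
            return cls(data)
    raise NotImplementedError


dict_a = {'name': {'first_name': 'John', 'last_name': 'Doe'},
          'workEmail': 'john.a@example.com'}
dict_b = {'fullName': 'John Doe', 'workEmail': 'john.b@example.com'}

# Both formats should normalize to the same name field
assert isinstance(Person(dict_a), MapperA)
assert isinstance(Person(dict_b), MapperB)
assert Person(dict_a).name == Person(dict_b).name == 'John Doe'
```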
Question
I have limited experience in Python and I'm building my first scraper application for a data engineering project that needs to scale to storing hundreds of thousands of Persons from tens of different structures. I was wondering:
- Is this a good solution? What could be the drawbacks and problems down the line?
- Currently I've implemented a separate subclass for each mapping. Is there a convention or industry standard for these types of situations?
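For context on the second question: I'm aware the same dispatch could be expressed without subclasses, as plain mapping functions in a registry. This is a hypothetical sketch for comparison, not part of my current code (the names `mapper`, `MAPPERS`, and `to_person` are made up for illustration):

```python
# Hypothetical alternative: a registry of plain mapping functions,
# keyed by the fields that identify each source format.

MAPPERS = []


def mapper(*required_keys):
    """Register a mapping function for inputs containing required_keys."""
    def register(func):
        MAPPERS.append((set(required_keys), func))
        return func
    return register


@mapper('name', 'workEmail')
def map_format_a(data):
    name = data['name']
    return {
        'name': f"{name['first_name']} {name['last_name']}",
        'email': data['workEmail'],
        'age': data.get('age'),
        'connected': data.get('connected'),
    }


@mapper('fullName', 'workEmail')
def map_format_b(data):
    return {
        'name': data['fullName'],
        'email': data['workEmail'],
        'age': data.get('age'),
        'connected': data.get('connected'),
    }


def to_person(data):
    # First registered mapper whose required keys are present wins
    for required, func in MAPPERS:
        if required.issubset(data):
            return func(data)
    raise LookupError(f'No mapper for keys: {sorted(data)}')
```

The resulting plain dict could then be passed straight to the model, e.g. `PersonModel(**to_person(data))`, without going through `__dict__`.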
Update
- For question 2, I found this question to be useful, but I would still like to know whether this approach is good in general.
- Added style improvement suggestions from @Reinderien