I have a dataclass which is frequently used but is slow because it processes multiple large datasets. I have tried to speed it up by making it lazily evaluated, so that the data is only read when requested and the result is cached for subsequent calls.
Below is a (simplified) implementation for the variables x, y and z:
    import time, timeit
    from functools import cache

    class LazyDataStore:
        def __init__(self): pass

        @property
        def x(self): return self.load_xy()["x"]

        @property
        def y(self): return self.load_xy()["y"]

        @property
        def z(self): return self.load_z()

        @cache
        def load_xy(self):
            time.sleep(1)  # simulate slow loading data
            return {"x": 1, "y": 2}  # simulate data

        @cache
        def load_z(self):
            time.sleep(2)  # simulate slow loading data
            return 3  # simulate data

    if __name__ == "__main__":
        print(f'Time taken to access x, y and z once {timeit.timeit("my_data.x; my_data.y; my_data.z", setup="from __main__ import LazyDataStore; my_data = LazyDataStore()", number=1)}')
        print(f'Time taken to access x 5 times {timeit.timeit("my_data.x", setup="from __main__ import LazyDataStore; my_data = LazyDataStore()", number=5)}')
        print(f'Time taken to access x and z 100 times {timeit.timeit("my_data.x; my_data.z", setup="from __main__ import LazyDataStore; my_data = LazyDataStore()", number=100)}')
With the results printed below:
    Time taken to access x, y and z once 3.0019894000142813
    Time taken to access x 5 times 1.0117400998715311
    Time taken to access x and z 100 times 3.0195651999674737
Is there a better/neater/cleaner way to do this? Any comments are welcome.
Some thoughts:
- I think self.load_xy()["x"] isn't ideal; however, it is concise, which helps because as more variables are added the class starts to fill with boilerplate code, making it less readable.
- Should I be using the @dataclass decorator in some way?
- I use this same format in several files, so is there a clear/clean/useful way to make a superclass/subclass?
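To illustrate the last point, here is a rough sketch of the kind of split I have in mind (the names here are placeholders, not my real code): a base class holds the shared property boilerplate, and each subclass supplies its own cached loaders.

```python
import time
from functools import cache


class LazyDataStore:
    """Shared base: exposes fields as properties, delegates to loaders."""

    @property
    def x(self):
        return self.load_xy()["x"]

    @property
    def y(self):
        return self.load_xy()["y"]

    def load_xy(self):
        raise NotImplementedError  # each subclass loads its own data


class ExampleDataStore(LazyDataStore):
    """One concrete store; each file would define one of these."""

    @cache
    def load_xy(self):
        time.sleep(0.1)  # simulate a slow read specific to this source
        return {"x": 1, "y": 2}  # simulate data


store = ExampleDataStore()
print(store.x)  # slow on first access
print(store.y)  # served from the cache
```

Something along these lines would remove the duplicated loaders across files, but the property definitions are still boilerplate, which is part of what I am asking about.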