5
\$\begingroup\$

I have a data class which is used frequently but is slow because it processes multiple large datasets. I have tried to speed it up with lazy evaluation: the data is only read when requested, and subsequent calls are cached.

Below is a (simplified) implementation for the variables x, y and z:

import time, timeit
from functools import cache

class LazyDataStore:
    def __init__(self): pass

    @property
    def x(self): return self.load_xy()["x"]

    @property
    def y(self): return self.load_xy()["y"]

    @property
    def z(self): return self.load_z()

    @cache
    def load_xy(self):
        time.sleep(1)  # simulate slow loading data
        return {"x":1,"y":2}  # simulate data

    @cache
    def load_z(self):
        time.sleep(2)  # simulate slow loading data
        return 3  # simulate data

if __name__ == "__main__":
    print(f'Time taken to access x, y and z once {timeit.timeit("my_data.x; my_data.y; my_data.z", setup="from __main__ import LazyDataStore; my_data = LazyDataStore()", number=1)}')
    print(f'Time taken to access x 5 times {timeit.timeit("my_data.x", setup="from __main__ import LazyDataStore; my_data = LazyDataStore()", number=5)}')
    print(f'Time taken to access x and z 100 times {timeit.timeit("my_data.x; my_data.z", setup="from __main__ import LazyDataStore; my_data = LazyDataStore()", number=100)}')

With the results printed below:

Time taken to access x, y and z once 3.0019894000142813
Time taken to access x 5 times 1.0117400998715311
Time taken to access x and z 100 times 3.0195651999674737

Is there a better/neater/cleaner way to do this? Any comments welcomed.

Some thoughts:

  • I think self.load_xy()["x"] isn't ideal. It is concise, though, which helps: as more variables are added, the class starts to fill with boilerplate code, making it less readable.
  • Should I be using the @dataclass decorator in some way?
  • I do use this same format in several files, so is there a clear/clean/useful way to make a Superclass/Subclass?
\$\endgroup\$
3
  • 1
    \$\begingroup\$ Welcome to Code Review@SE. If you presented some actual resource hungry methods, you might get advice tackling that misfeature directly. Do you know about The Future? \$\endgroup\$ Commented Feb 7, 2022 at 4:29
  • 1
    \$\begingroup\$ The Code Review community operates on different principles than the Stack Overflow community. On Stack Overflow they want to see the minimum reproducible case to help debug. Since we assume the code is working as expected and want to help solve performance issues, we need to see more code; we don't like simplified versions because they aren't the actual code. \$\endgroup\$ Commented Feb 7, 2022 at 15:09
  • 1
    \$\begingroup\$ @pacmaninbw thank you for the feedback. While I would definitely appreciate feedback on the full code, the time.sleep sections in this question represent io-bound operations on multi-GB (sometimes TB) offline data and I have yet to come up with a reasonable way to share a reproducer online. \$\endgroup\$ Commented Feb 7, 2022 at 21:19

1 Answer

8
\$\begingroup\$

Did you consider functools.cached_property? Seems like it was designed for this use case. It doesn't appear to be any faster, but maybe the intent of the code is clearer.

import time
from functools import cached_property

class LazyDataStore:
    def __init__(self): pass

    @cached_property
    def x(self):
        self.load_xy()
        return self.x

    @cached_property
    def y(self):
        self.load_xy()
        return self.y

    @cached_property
    def z(self):
        self.load_z()
        return self.z

    def load_xy(self):
        time.sleep(1)  # simulate slow loading data
        self.x = 1     # simulate data
        self.y = 2

    def load_z(self):
        time.sleep(2)  # simulate slow loading data
        self.z = 3     # simulate data
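
As a rough check of the pattern above, the sketch below swaps the sleeps for a call counter. The `CountingStore` name and its `calls` attribute are additions for illustration only, not part of the original code. It relies on `cached_property` being a non-data descriptor, so the values that `load_xy` assigns land in the instance `__dict__` and are found there on later lookups.

```python
from functools import cached_property


class CountingStore:
    """Same shape as the class above, with sleeps replaced by a counter."""

    def __init__(self):
        self.calls = 0  # hypothetical bookkeeping, for illustration only

    @cached_property
    def x(self):
        self.load_xy()
        return self.x  # now found in the instance __dict__, so no recursion

    @cached_property
    def y(self):
        self.load_xy()
        return self.y

    def load_xy(self):
        self.calls += 1  # stands in for the slow load
        self.x = 1
        self.y = 2


store = CountingStore()
values = (store.x, store.y, store.x)
print(values, store.calls)
```

Under these assumptions, reading `x`, then `y`, then `x` again runs `load_xy` only once, because the first load fills in both attributes.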
\$\endgroup\$
4
  • \$\begingroup\$ This will be slower if you need both x and y, unless you @cache the method load_xy again. \$\endgroup\$ Commented Feb 7, 2022 at 6:21
  • 4
    \$\begingroup\$ @Graipher, When you do a get on a cached_property, it checks to see if an attribute with the same name (e.g., self.y) has been set to a value. If it has been set, the value is returned. If it hasn't been set, the decorated method is called. Because load_xy() sets both self.x and self.y, accessing x then y only calls load_xy() once. \$\endgroup\$ Commented Feb 7, 2022 at 7:03
  • 1
    \$\begingroup\$ Worth noting that because of this feature of @cached_property, attributes defined with it do not preserve the sometimes desirable read-only nature of @property. For such cases, it's better to retain OP's method. (In many cases, you could alternatively layer @property atop @cache. However, in this case, it would encounter the issue that Graipher raises: the single loading function would be run twice.) \$\endgroup\$ Commented Feb 7, 2022 at 7:55
  • \$\begingroup\$ Thanks @RootTwo, I think this is a clearer way of doing it. As other comments point out, it does have write-access, however I don't foresee any (unintentional) cases where this would be an issue. \$\endgroup\$ Commented Feb 7, 2022 at 21:02
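
The read-only point raised in the comments can be illustrated with a small sketch. The class names `Writable` and `ReadOnly` and the `_load_xy` helper are made up for this example; `ReadOnly` follows the shape of the question's code (a plain `@property` over a `@cache`-decorated loader).

```python
from functools import cache, cached_property


class Writable:
    @cached_property
    def x(self):
        return 1


class ReadOnly:
    @property
    def x(self):
        return self._load_xy()["x"]

    @cache
    def _load_xy(self):  # hypothetical loader, mirrors load_xy in the question
        return {"x": 1, "y": 2}


w = Writable()
w.x = 99  # accepted: the value is stored in the instance __dict__
writable_value = w.x

r = ReadOnly()
try:
    r.x = 99  # a property without a setter rejects assignment
    read_only_rejected = False
except AttributeError:
    read_only_rejected = True

print(writable_value, read_only_rejected, r.x)
```

In this sketch the `cached_property` attribute silently accepts the overwrite, while the plain property raises `AttributeError` and keeps returning the loaded value.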
