helper defaults
def create_lag_and_delta_features(df_input, features_to_lag, temp_features_for_delta,
                                  lag_period=1):
That last parameter is great and self-explanatory: there's no question about what kinds of inputs it accepts, or what a sensible input value would be.
Consider making two other params similarly self-explanatory:
_default_features_to_lag = ("precipitation", "temp_max", "temp_min", "wind", "weather_encoded")

def create_lag_and_delta_features(
    df_input,
    features_to_lag=_default_features_to_lag,
    temp_features_for_delta=("temp_max", "temp_min"),
    lag_period=1,
):
    ...
This simplifies the call site, and makes the body of the helper a little easier to read, since the expected param values are visible right next to it.
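For instance, the common case at the call site (borrowing the df_for_lagged_processing name that appears elsewhere in this review) shrinks to a single argument:

# Before: every argument spelled out.
df_with_lags = create_lag_and_delta_features(
    df_for_lagged_processing,
    ["precipitation", "temp_max", "temp_min", "wind", "weather_encoded"],
    ["temp_max", "temp_min"],
    1,
)

# After: the defaults carry the common case.
df_with_lags = create_lag_and_delta_features(df_for_lagged_processing)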
As a fine point, we might adopt a _private name for the _create_lag_and_delta_features function, showing that it's not part of our public API.
As an even finer point, I used an immutable sequence of feature names where a list would have been more natural. Why? To avoid the "mutable default arg" gotcha that lies in wait for the poor unsuspecting engineer, and to ensure the code lints cleanly.
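For anyone who hasn't hit that gotcha, here is a minimal sketch (the functions are invented for illustration): a mutable default is created once, at def time, and then shared across every call.

def append_feature(feature, features=[]):  # one list, created at def time
    features.append(feature)
    return features

append_feature("wind")      # ['wind']
append_feature("temp_max")  # ['wind', 'temp_max'] -- state leaked between calls

# A tuple default cannot be mutated in place, so there is nothing to leak.
def append_feature_safe(feature, features=()):
    return [*features, feature]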
overly eager helper
Thank you for adding that helper -- the code is much better off for it. But there's an "and" in the name, which suggests it might be doing too much, that it is not doing just one thing.
The current call site is nice, but this might be slightly nicer.
df_with_lags = _create_delta_feat(
    _create_lag_feat(df_for_lagged_processing))
Notice that a pair of unit tests could now exercise the two bits of functionality independently of one another.
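Here is a minimal sketch of what the split might look like, assuming pandas and the column naming from the earlier snippet; _create_lag_feat and _create_delta_feat follow the suggested call site above, and the exact lag/delta logic is a guess at the original helper's behavior:

import pandas as pd

def _create_lag_feat(df, features_to_lag=_default_features_to_lag, lag_period=1):
    """Return a copy of df with one lagged column per feature, e.g. temp_max_lag1."""
    df = df.copy()
    for feature in features_to_lag:
        df[f"{feature}_lag{lag_period}"] = df[feature].shift(lag_period)
    return df

def _create_delta_feat(df, temp_features_for_delta=("temp_max", "temp_min"), lag_period=1):
    """Return a copy of df with deltas between each feature and its lagged value."""
    df = df.copy()
    for feature in temp_features_for_delta:
        df[f"{feature}_delta"] = df[feature] - df[f"{feature}_lag{lag_period}"]
    return df

Each function can now get its own focused test, e.g. asserting that _create_lag_feat leaves the original columns untouched, and that _create_delta_feat produces temp_max minus temp_max_lag1.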
train / test split
Imagine a weather model that was fairly accurate, but whose poorest performance happened during cold months. We would want to know about that weakness before deploying it to production.
Of the four years of available examples, you held out 20% to assess accuracy. Weather behavior is of course very seasonal, and 20% of four years is only about ten months, so roughly ten weeks of the seasonal cycle go unassessed. Prefer a 75 / 25 split, so we're evaluating against a full year's worth of data.
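A minimal sketch of that split, assuming scikit-learn, a date-sorted frame df, and a hypothetical target column named "weather"; shuffle=False keeps the split chronological, so the 25% test set is the final, contiguous year and every season appears in the evaluation:

from sklearn.model_selection import train_test_split

# df is assumed to be the four-year daily frame, sorted by date.
X = df.drop(columns=["weather"])
y = df["weather"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)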