Grouped pmdarima
The Grouped pmdarima API is a multi-series orchestration framework for building multiple individual models of related, but isolated, series data. For example, a project that required the forecasting of inventory demand at regional warehouses around the world would historically require individual orchestration of data acquisition, hyperparameter definitions, model training, metric validation, serialization, and registration of tens of thousands of individual models based on the permutations of SKU and warehouse location.
This API consolidates the many thousands of models that would otherwise need to be implemented, trained individually, and managed throughout their frequent retraining and forecasting lifecycles into a single high-level API for the common use cases that rely on the pmdarima forecasting library.
Grouped pmdarima API
The following sections provide a basic overview of using the GroupedPmdarima API, covering fitting of the grouped models, predicting forecasted data, saving, loading, and customization of the underlying pmdarima instances.
For working end-to-end examples, see Tutorials and Examples. The examples show the data structures required for training, how to extract forecasts for each group, and how to save and load trained models.
Base Estimators and API interface
The usage of the GroupedPmdarima API is slightly different from the other grouped forecasting library wrappers within Diviner. This is due to the ability of pmdarima to support multiple modes of configuration.
The modes that are available to construct a model are:
- Passing an ARIMA model template (wrapper around statsmodels ARIMA)
- Using the native pmdarima AutoARIMA model template
- Constructing a pmdarima Pipeline template
The GroupedPmdarima implementation requires the submission of one of these 3 model templates to set the base configured model architecture for each group.
For example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
# Define the base ARIMA with a preset ordering parameter
base_arima_model = ARIMA(order=(1, 0, 2))
# Define the model template in the GroupedPmdarima constructor
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
The above example is intended only to showcase the interface between a base estimator (base_arima_model) and the instance constructor for GroupedPmdarima. For a more in-depth and realistic example of utilizing an ARIMA model manually, see the additional statistical validation steps that would be required for this in the Tutorials and Examples section of the docs.
Model fitting
In order to fit a GroupedPmdarima model instance, the fit method is used. Calling this method will process the input DataFrame to create a grouped execution collection, fit a pmdarima model type on each individual series, and persist the trained state of each group's model to the object instance.
The arguments for the fit method are:
- df
A 'normalized' DataFrame that contains an endogenous regressor column (the 'y' column), a date (or datetime) column that defines the ordering, periodicity, and frequency of each series (if this column is a string, the frequency will be inferred), and grouping column(s) that define the discrete series to be modeled. For further information on the structure of this DataFrame, see the quickstart guide.
- group_key_columns
The names of the columns within df that, when combined (in the order supplied), define distinct series. See the quickstart guide for further information.
- y_col
Name of the endogenous regressor term within the DataFrame argument df. This column contains the values of the series that are used during training.
- datetime_col
Name of the column within the df argument DataFrame that defines the datetime ordering of the series data.
- exog_cols
[Optional] A collection of column names within the submitted data that contain exogenous regressor elements to use as part of model fitting and predicting. The data within each column will be assembled into a 2D array for use in the regression.
Note
pmdarima currently has exogenous regressor support marked as a future deprecated feature. Usage of this functionality is not recommended except for existing legacy implementations.
- ndiffs
[Optional] A dictionary of {<group_key>: <d value>} for the differencing term for each group. This is intended to function alongside the output from the diviner.PmdarimaAnalyzer.calculate_ndiffs() method, serving to reduce the search space for AutoARIMA by supplying fixed d values to each group's model.
- nsdiffs
[Optional] A dictionary of {<group_key>: <D value>} for the seasonal differencing term for each group. This is intended to function alongside the output from the diviner.PmdarimaAnalyzer.calculate_nsdiffs() method, serving to reduce the search space for AutoARIMA by supplying fixed D values to each group's model.
Note
These values will only be used if the models being fit are seasonal models. The value m must be set on the underlying ARIMA or AutoARIMA model for seasonality order components to be used.
- silence_warnings
[Optional] Whether to silence stdout reporting of the underlying pmdarima fit process. Default: False.
- fit_kwargs
[Optional] fit_kwargs for pmdarima ARIMA, AutoARIMA, or Pipeline stage overrides. For more information, see the pmdarima docs.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_arima_model = ARIMA(order=(1, 0, 2))
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
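To make the role of group_key_columns concrete, the following is a minimal standard-library sketch of how a normalized table can be split into one ordered series per group. The rows, the helper name split_into_series, and the use of plain dictionaries instead of a DataFrame are all assumptions for illustration; this is not how Diviner implements grouping internally.

```python
from collections import defaultdict

# Hypothetical rows in the 'normalized' layout: one row per (group, date).
rows = [
    {"country": "US", "region": "NY", "date": "2021-10-02", "sales": 12.0},
    {"country": "US", "region": "NY", "date": "2021-10-01", "sales": 10.0},
    {"country": "FR", "region": "Paris", "date": "2021-10-01", "sales": 7.5},
]


def split_into_series(rows, group_key_columns, y_col, datetime_col):
    """Return {group_key_tuple: [(datetime, y), ...]} sorted by datetime."""
    series = defaultdict(list)
    for row in rows:
        # The key is built in the order the columns were supplied.
        key = tuple(row[col] for col in group_key_columns)
        series[key].append((row[datetime_col], row[y_col]))
    return {key: sorted(points) for key, points in series.items()}


grouped = split_into_series(rows, ["country", "region"], "sales", "date")
```

Each key in grouped is a tuple ordered like group_key_columns, which is why the order of values matters later when individual groups are requested for prediction.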
Predict
The predict method generates forecast data for each grouped series within the meta diviner.GroupedPmdarima model.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_arima_model = ARIMA(order=(1, 0, 2))
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
forecasts = grouped_arima_model.predict(n_periods=30)
The arguments for the predict method are:
- n_periods
The number of future periods to generate from the end of each group's series. The first value of the forecast series begins one period after the end of the training series. For example, if the training series was daily data from 2019-10-01 to 2021-10-02, the prediction output would start on 2021-10-03 and continue for n_periods days from that point.
- predict_col
[Optional] The name to use for the generated column containing forecasted data. Default: "yhat"
- alpha
[Optional] Confidence interval significance value for error estimates. Default: 0.05.
Note
alpha is only used if the boolean flag return_conf_int is set to True.
- return_conf_int
[Optional] Boolean flag for whether or not to calculate confidence intervals for the predicted forecasts. If True, the columns "yhat_upper" and "yhat_lower" will be added to the output DataFrame for the upper and lower confidence intervals for the predictions.
- inverse_transform
[Optional] Used exclusively for Pipeline based models that include an endogenous transformer such as BoxCoxEndogTransformer or LogEndogTransformer. Default: True (although it only applies if the model_template type passed in is a Pipeline that contains a transformer). An inversion of the endogenous regression term can be helpful for distributions that are highly non-normal. For further reading on the purpose of these functions, why they are used, and how they might be applicable to a given time series, see data transformation.
- exog
[Optional] If the original model was trained with exogenous regressor elements, the prediction will require these 2D arrays at prediction time. This argument holds the 2D array of future exogenous regressor values to be used in generating the prediction.
- predict_kwargs
[Optional] Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of the sklearn Pipeline format (i.e., <stage_name>__<arg name>=<value>). For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}.
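The n_periods rule described above (forecasts start one period after the final training observation) can be sketched with the standard library. The helper forecast_dates is hypothetical and assumes a fixed daily period; it only illustrates the date arithmetic, not how the library builds its output index.

```python
from datetime import date, timedelta


def forecast_dates(last_training_date, n_periods, period=timedelta(days=1)):
    """Dates for n_periods forecasts, starting one period after training ends."""
    return [last_training_date + period * (i + 1) for i in range(n_periods)]


# Training data ending on 2021-10-02 yields forecasts from 2021-10-03 onward.
dates = forecast_dates(date(2021, 10, 2), n_periods=3)
```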
Predict Groups
The predict_groups method generates forecast data for a subset of groups that a diviner.GroupedPmdarima model was trained upon.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_model = ARIMA(order=(2, 1, 2))
grouped_arima = GroupedPmdarima(model_template=base_model)
model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
subset_forecasts = model.predict_groups(groups=[("US", "NY"), ("FR", "Paris"), ("UA", "Kyiv")], n_periods=90)
The arguments for the predict_groups method are:
- groups
A collection of one or more groups for which to generate a forecast. The collection of groups must be submitted as a List[Tuple[str]] to identify the order-specific group values to retrieve the correct model. For instance, if the model was trained with the specified group_key_columns of ["country", "city"], a valid groups entry would be: [("US", "LosAngeles"), ("CA", "Toronto")]. Changing the order within the tuples will not resolve (e.g. [("NewYork", "US")] would not find the appropriate model).
Note
Groups that are submitted for prediction that are not present in the trained model will, by default, cause an Exception to be raised. This behavior can be changed to a warning or ignore status with the argument on_error.
- n_periods
The number of future periods to generate from the end of each group's series. The first value of the forecast series begins one period after the end of the training series. For example, if the training series was daily data from 2019-10-01 to 2021-10-02, the prediction output would start on 2021-10-03 and continue for n_periods days from that point.
- predict_col
[Optional] The name to use for the generated column containing forecasted data. Default: "yhat"
- alpha
[Optional] Confidence interval significance value for error estimates. Default: 0.05.
Note
alpha is only used if the boolean flag return_conf_int is set to True.
- return_conf_int
[Optional] Boolean flag for whether or not to calculate confidence intervals for the predicted forecasts. If True, the columns "yhat_upper" and "yhat_lower" will be added to the output DataFrame for the upper and lower confidence intervals for the predictions.
- inverse_transform
[Optional] Used exclusively for Pipeline based models that include an endogenous transformer such as BoxCoxEndogTransformer or LogEndogTransformer. Default: False, as given in the predict_groups signature (it only applies if the model_template type passed in is a Pipeline that contains a transformer). An inversion of the endogenous regression term can be helpful for distributions that are highly non-normal. For further reading on the purpose of these functions, why they are used, and how they might be applicable to a given time series, see data transformation.
- exog
[Optional] If the original model was trained with exogenous regressor elements, the prediction will require these 2D arrays at prediction time. This argument holds the 2D array of future exogenous regressor values to be used in generating the prediction.
- on_error
[Optional] [Default -> "raise"] Dictates the behavior for handling group keys that have been submitted in the groups argument that do not match a group identified and registered during training (fit). The modes are:
"raise"
A DivinerException is raised if any supplied groups do not match the fitted groups.
"warn"
A warning is emitted (printed) and logged for any groups that do not match those that the model was fit with.
"ignore"
Invalid groups will silently fail prediction.
Note
A DivinerException will still be raised even in "ignore" mode if none of the groups provided to this method match a fitted group.
- predict_kwargs
[Optional] Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of the sklearn Pipeline format (i.e., <stage_name>__<arg name>=<value>). For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}.
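The three on_error modes can be summarized with a small standard-library sketch. The function resolve_groups and the exception class GroupLookupError are hypothetical stand-ins (the latter for DivinerException); the sketch mirrors the behavior described above rather than Diviner's actual code.

```python
import warnings


class GroupLookupError(Exception):
    """Hypothetical stand-in for DivinerException."""


def resolve_groups(fitted_groups, requested_groups, on_error="raise"):
    """Return the requested groups that have a fitted model."""
    if on_error not in {"raise", "warn", "ignore"}:
        raise ValueError(f"Unknown on_error mode: {on_error!r}")
    valid = [g for g in requested_groups if g in fitted_groups]
    missing = [g for g in requested_groups if g not in fitted_groups]
    if missing and on_error == "raise":
        raise GroupLookupError(f"Groups not found in fitted model: {missing}")
    if missing and on_error == "warn":
        warnings.warn(f"Groups not found in fitted model: {missing}")
    if not valid:
        # Even "ignore" cannot proceed when nothing matches.
        raise GroupLookupError("No requested groups match the fitted groups.")
    return valid


fitted = {("US", "NY"), ("FR", "Paris")}
# ("NY", "US") has the values in the wrong order, so it does not match.
matched = resolve_groups(fitted, [("US", "NY"), ("NY", "US")], on_error="ignore")
```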
Save
Saves a GroupedPmdarima instance that has been fit.
The serialization of the model instance uses a base64 encoding of the pickle serialization of each model instance within the grouped structure.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_arima_model = ARIMA(order=(1, 0, 2))
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
save_location = "/path/to/saved/model"
grouped_arima_model.save(save_location)
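The "base64 encoding of the pickle serialization" idea can be illustrated as a round trip using only the standard library. The dictionary of per-group objects below is a hypothetical stand-in for fitted models, and the sketch says nothing about the actual file layout that save writes.

```python
import base64
import pickle

# Hypothetical stand-in for the per-group fitted models.
group_models = {
    ("US", "NY"): {"order": (1, 0, 2)},
    ("FR", "Paris"): {"order": (2, 1, 2)},
}

# Pickle each object to bytes, then base64-encode to text-safe ASCII.
encoded = {
    key: base64.b64encode(pickle.dumps(model)).decode("ascii")
    for key, model in group_models.items()
}

# Reverse the two steps to restore each object.
decoded = {
    key: pickle.loads(base64.b64decode(text)) for key, text in encoded.items()
}
```

Because unpickling can execute arbitrary code, serialized models of this kind should only be loaded from trusted locations.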
Load
Loads a serialized GroupedPmdarima model from a storage location.
Example:
from diviner import GroupedPmdarima
load_location = "/path/to/saved/model"
loaded_model = GroupedPmdarima.load(load_location)
Note
The load method is a class method. As such, the initialization argument model_template does not need to be provided. It will be set on the loaded object based on the template that was provided during initial training before serialization.
Utilities
Parameter Extraction
To extract the parameters that are either explicitly (or, in the case of AutoARIMA, selectively) set during the fitting of each individual model contained within the grouped collection, use the get_model_params method. It returns the per-group parameters for each model in a Pandas DataFrame.
Note
The parameters can only be extracted from a GroupedPmdarima model that has been fit.
Example:
from pmdarima.arima.auto import AutoARIMA
from diviner import GroupedPmdarima
base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
fit_parameters = grouped_arima_model.get_model_params()
Metrics Extraction
This functionality allows for the retrieval of fitting metrics that are attached to the underlying SARIMA model. These are not the typical loss metrics that can be calculated through cross validation backtesting.
The metrics that are returned from fitting are:
hqic (Hannan-Quinn information criterion)
aicc (Corrected Akaike information criterion; aic for small sample sizes)
oob (out of bag error)
bic (Bayesian information criterion)
aic (Akaike information criterion)
Note
The out of bag error metric (oob) is only calculated if the underlying ARIMA model has a value set for the argument out_of_sample_size. See out-of-bag-error for more information.
Example:
from pmdarima.arima.auto import AutoARIMA
from diviner import GroupedPmdarima
base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
fit_metrics = grouped_arima_model.get_metrics()
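For readers unfamiliar with the information criteria listed above, the following sketch computes them from their standard textbook definitions. The function name and input values are hypothetical, and the values reported by get_metrics come from the fitted model itself, so they may be calculated with different conventions for counting parameters or observations.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Compute AIC, AICc, BIC and HQIC from a fitted model's log-likelihood."""
    k, n = n_params, n_obs
    aic = 2 * k - 2 * log_likelihood
    return {
        "aic": aic,
        # Small-sample correction added on top of AIC.
        "aicc": aic + (2 * k * (k + 1)) / (n - k - 1),
        "bic": k * math.log(n) - 2 * log_likelihood,
        "hqic": 2 * k * math.log(math.log(n)) - 2 * log_likelihood,
    }


criteria = information_criteria(log_likelihood=-120.0, n_params=4, n_obs=100)
```

For all four criteria, lower values indicate a better trade-off between fit and model complexity when comparing models fit to the same series.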
Cross Validation Backtesting
Cross validation utilizing backtesting is the primary means of evaluating whether a given model will generate robust forecasts, by scoring forecasts over horizon windows taken throughout the historical series.
In order to use the cross validation functionality in the method diviner.GroupedPmdarima.cross_validate(), one of two windowing split objects must be passed into the method signature: RollingForecastCV or SlidingWindowForecastCV.
Arguments to the diviner.GroupedPmdarima.cross_validate() method:
- df
The original source DataFrame that was used during diviner.GroupedPmdarima.fit() that contains the endogenous series data. This DataFrame must contain the columns that define the constructed groups (i.e., missing group data will not be scored and groups that are not present in the model object will raise an Exception).
- metrics
A collection of metric names to be used for evaluation. Submitted metrics must be one or more of the supported metric names.
Default: {"smape", "mean_absolute_error", "mean_squared_error"}
- cross_validator
The cross validation object (either RollingForecastCV or SlidingWindowForecastCV). See the example below for how to submit this object to the diviner.GroupedPmdarima.cross_validate() method.
- error_score
The default value to assign to a window evaluation if an error occurs during loss calculation.
Default: np.nan
In order to throw an Exception, a str value of "raise" can be provided. Otherwise, supply a float.
- verbosity
Level of verbosity to print during training and cross validation. The lower the integer value, the fewer lines of debugging text are printed to stdout.
Default: 0 (no printing)
Example:
import numpy as np
from pmdarima.arima.auto import AutoARIMA
from pmdarima.model_selection import SlidingWindowForecastCV
from diviner import GroupedPmdarima
base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
# Score 28-period forecasts from 365-period training windows that advance 180 periods at a time
cv_window = SlidingWindowForecastCV(h=28, step=180, window_size=365)
grouped_arima_cv = grouped_arima_model.cross_validate(
    df=df,
    metrics=["mean_squared_error", "smape"],
    cross_validator=cv_window,
    error_score=np.nan,
    verbosity=1,
)
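To show what the h, step, and window_size values in the example imply, here is a standard-library sketch of sliding-window splitting. The generator sliding_window_splits and the series length of 1000 are assumptions for illustration; it approximates the documented behavior of a sliding window validator and is not pmdarima's implementation.

```python
def sliding_window_splits(n_samples, h, step, window_size):
    """Yield (train_range, test_range) index ranges for a sliding window."""
    start = 0
    # Stop once a full training window plus forecast horizon no longer fits.
    while start + window_size + h <= n_samples:
        train = range(start, start + window_size)
        test = range(start + window_size, start + window_size + h)
        yield train, test
        start += step


# Hypothetical series of 1000 observations with the example's settings.
splits = list(sliding_window_splits(n_samples=1000, h=28, step=180, window_size=365))
```

Smaller step values produce more windows, and each window requires a model fit per group, which is why low h or step settings increase execution time.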
Class Signature of GroupedPmdarima
- class diviner.GroupedPmdarima(model_template)[source]
- cross_validate(df, metrics, cross_validator, error_score=nan, verbosity=0)[source]
Method for performing cross validation on each group of the fit model. The cross_validator supplied to this method will be used to perform either rolling or shifting window prediction validation throughout the data set. Windowing behavior for the cross validation must be defined and configured through the cross_validator that is submitted. See: https://alkaline-ml.com/pmdarima/modules/classes.html#cross-validation-split-utilities for details on the underlying implementation of cross validation with pmdarima.
- Parameters
df – A DataFrame that contains the endogenous series and the grouping key columns that were defined during training. Any missing key entries will not be scored. Note that each group defined within the model will be retrieved from this DataFrame. Keys that do not exist will raise an Exception.
metrics – A list of metric names, or a string containing a single metric name, to use for cross validation metric calculation.
cross_validator – A cross validator instance from pmdarima.model_selection (RollingForecastCV or SlidingWindowForecastCV). Note: setting low values of h or step will dramatically increase execution time.
error_score – Default value to assign to a score calculation if an error occurs in a given window iteration.
Default: np.nan (a silent ignore of the failure)
verbosity – Print verbosity level for pmdarima's cross validation stages.
Default: 0 (no printing to stdout)
- Returns
Pandas DataFrame containing the group information and calculated cross validation metrics for each group.
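As a reference for the "smape" metric name, the sketch below implements one common definition of the symmetric mean absolute percentage error on a 0 to 200 scale. The function and sample values are illustrative assumptions; the library's own calculation may differ in detail (for example, in how zero denominators are handled).

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, on a 0-200 scale."""
    terms = [
        200.0 * abs(p - t) / (abs(t) + abs(p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)


# Hypothetical actual and forecast values for one validation window.
score = smape([100.0, 200.0], [110.0, 180.0])
```

This sketch raises ZeroDivisionError when an actual value and its forecast are both zero, so such points would need separate handling.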
- fit(df, group_key_columns, y_col: str, datetime_col: str, exog_cols: Optional[List[str]] = None, ndiffs: Optional[Dict] = None, nsdiffs: Optional[Dict] = None, silence_warnings: bool = False, **fit_kwargs)[source]
Fit method for training a pmdarima model on the submitted normalized DataFrame. When fit is called, the input DataFrame will be split into an iterable collection of grouped data sets based on the group_key_columns arguments, which is then used to fit individual pmdarima models (or a supplied Pipeline) upon the templated object supplied as a class instance argument model_template. For API information for pmdarima's ARIMA, AutoARIMA, and Pipeline APIs, see: https://alkaline-ml.com/pmdarima/modules/classes.html#api-ref
- Parameters
df –
A normalized group data set consisting of a datetime column that defines ordering of the series, an endogenous regressor column that specifies the series data for training (e.g. y_col), and column(s) that define the grouping of the series data.
An example normalized data set:
region       zone  country  ds            y
'northeast'  1     "US"     "2021-10-01"  1234.5
'northeast'  2     "US"     "2021-10-01"  3255.6
'northeast'  1     "US"     "2021-10-02"  1255.9
Wherein the group_key_columns could be one, some, or all of ['region', 'zone', 'country'], the datetime_col would be the 'ds' column, and the series y_col (endogenous regressor) would be 'y'.
group_key_columns – The columns in the df argument that define, in aggregate, a unique time series entry. For example, with the DataFrame referenced in the df param, group_key_columns could be: ('region', 'zone') or ('region') or ('country', 'region', 'zone')
y_col – The name of the column within the DataFrame input to any method within this class that contains the endogenous regressor term (the raw data that will be used to train and use as a basis for forecasting).
datetime_col – The name of the column within the DataFrame input that defines the datetime or date values associated with each row of the endogenous regressor (y_col) data.
exog_cols –
An optional collection of column names within the submitted data to class methods that contain exogenous regressor elements to use as part of model fitting and predicting.
Default: None
ndiffs –
Optional overrides to the d ARIMA differencing term for stationarity enforcement. The structure of this argument is a dictionary in the form of: {<group_key>: <d_term>}. To calculate, use diviner.PmdarimaAnalyzer.calculate_ndiffs()
Default: None
nsdiffs –
Optional overrides to the D SARIMAX seasonal differencing term for seasonal stationarity enforcement. The structure of this argument is a dictionary in the form of: {<group_key>: <D_term>}. To calculate, use diviner.PmdarimaAnalyzer.calculate_nsdiffs()
Default: None
silence_warnings –
If True, removes SARIMAX and underlying optimizer warning messages from stdout printing. With a sufficiently large number of groups to process, the volume of these messages to stdout may become very large.
Default: False
fit_kwargs – fit_kwargs for pmdarima's ARIMA, AutoARIMA, or Pipeline stage overrides. For more information, see the pmdarima docs: https://alkaline-ml.com/pmdarima/index.html
- Returns
Object instance of GroupedPmdarima with the persisted fit model attached.
- get_metrics()[source]
Retrieve the ARIMA fit metrics that are generated during the AutoARIMA or ARIMA training event. Note: These metrics are not validation metrics. Use the cross_validate() method for retrieving back-testing error metrics.
- Returns
Pandas DataFrame with metrics provided as columns and a row entry per group.
- get_model_params()[source]
Retrieve the parameters from the fit model_template that was passed in and return them in a denormalized Pandas DataFrame. Parameters in the returned DataFrame are columns, with a row for each group defined during fit().
- Returns
Pandas DataFrame with fit parameters for each group.
- classmethod load(path: str)[source]
Load a GroupedPmdarima instance from a saved serialized version. Note: This is a class method and as such, a GroupedPmdarima instance does not need to be initialized in order to load a saved model. For example: loaded_model = GroupedPmdarima.load(<location>)
- Parameters
path – The path to a serialized instance of GroupedPmdarima
- Returns
The GroupedPmdarima instance that was saved.
- predict(n_periods: int, predict_col: str = 'yhat', alpha: float = 0.05, return_conf_int: bool = False, inverse_transform: bool = True, exog=None, **predict_kwargs)[source]
Prediction method for generating forecasts for each group that has been trained as part of a call to fit(). Note that pmdarima's API does not support predictions outside of the defined datetime frequency that was validated during training (i.e., if the series endogenous data is at an hourly frequency, the generated predictions will be at an hourly frequency and cannot be modified from within this method).
- Parameters
n_periods – The number of future periods to generate. The start of the generated predictions will be 1 frequency period after the maximum datetime value per group during training. For example, a data set used for training that has a datetime frequency in days that ends on 7/10/2021 will, with a value of n_periods=7, start its prediction on 7/11/2021 and generate daily predicted values up to and including 7/17/2021.
predict_col –
The name to be applied to the column containing predicted data.
Default: 'yhat'
alpha –
Optional value for setting the confidence intervals for error estimates. Note: this is only utilized if return_conf_int is set to True.
Default: 0.05 (representing a 95% CI)
return_conf_int –
Boolean flag for whether to calculate confidence interval error estimates for predicted values. The intervals of yhat_upper and yhat_lower are based on the alpha parameter.
Default: False
inverse_transform –
Optional argument used only for Pipeline models that include either a BoxCoxEndogTransformer or a LogEndogTransformer.
Default: True
exog –
Exogenous regressor components as a 2-D array. Note: if the model is trained with exogenous regressor components, this argument is required.
Default: None
predict_kwargs – Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of the sklearn Pipeline format (i.e., <stage_name>__<arg name>=<value>). For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}.
- Returns
A consolidated (unioned) single DataFrame of predictions per group.
- predict_groups(groups: List[Tuple[str]], n_periods: int, predict_col: str = 'yhat', alpha: float = 0.05, return_conf_int: bool = False, inverse_transform: bool = False, exog=None, on_error: str = 'raise', **predict_kwargs)[source]
This is a prediction method that allows for generating a subset of forecasts based on the collection of keys. By specifying individual groups in the groups argument, a limited scope forecast can be performed without incurring the runtime costs associated with predicting all groups.
- Parameters
groups –
List[Tuple[str]] the collection of group(s) for which to generate forecast predictions. The group definitions must be the values within the group_key_columns that were used during the fit of the model in order to return valid forecasts.
Note
The positional ordering of the values is important and must match the order of group_key_columns for the fit argument to provide correct prediction forecasts.
n_periods – The number of row events to forecast
predict_col – The name of the column in the output DataFrame that contains the forecasted series data. Default: "yhat"
alpha –
Optional value for setting the confidence intervals for error estimates. Note: this is only utilized if return_conf_int is set to True.
Default: 0.05 (representing a 95% CI)
return_conf_int –
Boolean flag for whether to calculate confidence interval error estimates for predicted values. The intervals of yhat_upper and yhat_lower are based on the alpha parameter.
Default: False
inverse_transform –
Optional argument used only for Pipeline models that include either a BoxCoxEndogTransformer or a LogEndogTransformer.
Default: False
exog –
Exogenous regressor components as a 2-D array. Note: if the model is trained with exogenous regressor components, this argument is required.
Default: None
predict_kwargs – Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of the sklearn Pipeline format (i.e., <stage_name>__<arg name>=<value>). For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}.
on_error –
Alert level setting for handling mismatched group keys. Default: "raise"
The valid modes are:
"ignore" - no logging or exception raising will occur if a submitted group key in the groups argument is not present in the model object.
Note
This is a silent failure mode and will not present any indication of a failure to generate forecast predictions.
"warn" - any keys that are not present in the fit model will be recorded as logged warnings.
"raise" - any keys that are not present in the fit model will cause a DivinerException to be raised.
- Returns
A consolidated (unioned) single DataFrame of forecasts for all groups specified in the groups argument.
- save(path: str)[source]
Serialize and write the instance of this class (if it has been fit) to the path specified. Note: The serialized model is base64 encoded for top-level items and pickle'd for pmdarima individual group models and any Pandas DataFrame.
- Parameters
path – Path to write this model's instance to.
- Returns
None
Grouped pmdarima Analysis tools
Warning
The PmdarimaAnalyzer module is in experimental mode. The methods and signatures are subject to change in the future with no deprecation warnings.
As a companion to Diviner's diviner.GroupedPmdarima class, an analysis toolkit class, PmdarimaAnalyzer, is provided. The sections below give a brief description of each of the utility methods that are available for group processing through the PmdarimaAnalyzer API.
Object instantiation:
from diviner import PmdarimaAnalyzer
analyzer = PmdarimaAnalyzer(
df=df,
group_key_columns=["country", "region"],
y_col="orders",
datetime_col="date"
)
Decompose Trends
The diviner.PmdarimaAnalyzer.decompose_groups() method will decompose each series into its component parts:
- trend
- seasonal
- random (also known as 'residuals')
The output of this method is a union of each group's decomposed trends in a single DataFrame that retains the group key information in columns along with the extracted components from the series data.
This method is mainly used for validation of a new project.
Example:
decomposed_trends = analyzer.decompose_groups(m=7, type_="additive")
Arguments to the diviner.PmdarimaAnalyzer.decompose_groups() method:
- m
The frequency value of the endogenous series data. The integer supplied is a measure of the repeatable pattern of the estimated seasonality effect. For instance, 7 would be appropriate for daily measured data, 24 would be a good starting point for hourly data, and 52 would be a good initial validation value for weekly data.
- type ('type_')
The type of decomposition to perform. One of: "additive" or "multiplicative". A good rule of thumb for choosing between them: if the seasonality effects stay constant regardless of the trend, use "additive"; if the seasonality effect scales with the baseline trend value, "multiplicative" is more appropriate. For further explanation, see here.
- filter ('filter_')
[Optional] Reverse-sorted Array for performing convolution on the coefficients of either the MA terms or the AR terms.
Default: None
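To illustrate what the trend, seasonal, and random components represent, the following is a minimal sketch of a classical additive decomposition using only the standard library. The function decompose_additive is hypothetical, assumes an odd seasonal period m and a series spanning several cycles, and is not the algorithm used by decompose_groups.

```python
def decompose_additive(y, m):
    """Classical additive decomposition for an odd seasonal period m."""
    half = m // 2
    n = len(y)
    # Centered moving average as the trend; undefined at the edges.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(y[t - half : t + half + 1]) / m
    # Average the detrended values at each position in the cycle.
    buckets = [[] for _ in range(m)]
    for t in range(half, n - half):
        buckets[t % m].append(y[t] - trend[t])
    seasonal_pattern = [sum(b) / len(b) for b in buckets]
    seasonal = [seasonal_pattern[t % m] for t in range(n)]
    # Whatever remains is the random (residual) component.
    resid = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, resid


# Synthetic series: a linear trend plus a repeating three-step pattern.
pattern = [1.0, -1.0, 0.0]
y = [float(t) + pattern[t % 3] for t in range(12)]
trend, seasonal, resid = decompose_additive(y, m=3)
```

In the additive form the series is the sum of the three components; a multiplicative decomposition would instead divide by the trend and model the series as their product.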
Calculate Differencing Term
Isolating the differencing term 'd' can provide significant performance improvements if AutoARIMA is used as the underlying estimator for each series. This method provides a means of estimating these per-group differencing terms.
The output is returned as a dictionary of {<group_key>: d}
Note
This utility method is intended to be used as an input to the diviner.GroupedPmdarima.fit() method when using AutoARIMA as a base group estimator. It will set per-group values of d so that the AutoARIMA optimizer does not need to search for values of the differencing term, saving a great deal of computation time.
Example:
diffs = analyzer.calculate_ndiffs(alpha=0.1, test="kpss", max_d=5)
Arguments to the diviner.PmdarimaAnalyzer.calculate_ndiffs()
method:
- alpha
The significance value used in determining whether the p-value for a test of an estimated d term is significant. Default: 0.05
- test
The stationarity test used to determine significance for a tested d term.
Allowable values: "kpss", "adf", "pp"
Default: "kpss"
- max_d
The maximum allowable differencing term to test. Default: 2
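Because the result is a plain dictionary of {<group_key>: d}, it can be reviewed with standard Python before it is used for fitting. The sketch below works on a hand-written dictionary shaped like that output. The group keys (shown here as tuples), the values, and the max_d of 2 are all hypothetical, not results from a real run.

```python
from collections import Counter

# Hypothetical result shaped like the documented {<group_key>: d} output.
diffs = {
    ("US", "warehouse-1"): 1,
    ("US", "warehouse-2"): 1,
    ("DE", "warehouse-1"): 0,
    ("JP", "warehouse-1"): 2,
}
max_d = 2  # assumed to be the max_d value that was passed to calculate_ndiffs

# How many groups landed on each differencing order.
d_counts = Counter(diffs.values())

# Groups that reached the upper bound may deserve a closer look at their data.
at_upper_bound = sorted(key for key, d in diffs.items() if d >= max_d)

print(dict(d_counts))
print(at_upper_bound)
```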
Calculate Seasonal Differencing Term
Isolating the seasonal differencing term D
can provide a significant performance improvement to seasonal models
which are activated by setting the m
term in the base group estimator. The functionality of this
diviner.PmdarimaAnalyzer.calculate_nsdiffs()
method is similar to that of calculate_ndiffs
, except for the
seasonal differencing term.
Example:
seasonal_diffs = analyzer.calculate_nsdiffs(m=7, test="ocsb", max_D=5)
Arguments to the calculate_nsdiffs
method:
- m
The frequency of seasonal periods within the endogenous series. The integer supplied is a measure of the repeatable pattern of the estimated seasonality effect. For instance, 7 would be appropriate for daily measured data, 24 would be a good starting point for hourly data, and 52 would be a good initial validation value for weekly data.
- test
The seasonality test used to determine an optimal seasonal differencing D term.
Allowable tests: "ocsb", "ch"
Default: "ocsb"
- max_D
The maximum allowable seasonal differencing term to test.
Default: 2
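To make the role of D and m concrete, the sketch below applies one seasonal difference by hand: each value has the value from m steps earlier subtracted from it. It is plain Python for illustration only, with a made-up weekly pattern and a hypothetical helper name. It does not reproduce the OCSB or CH tests that calculate_nsdiffs relies on.

```python
def seasonal_difference(series, m):
    """Subtract the observation from m steps earlier (one seasonal difference)."""
    return [series[i] - series[i - m] for i in range(m, len(series))]


weekly_pattern = [5, 3, 0, -2, -4, 1, 6]  # hypothetical day-of-week effect
series = [50 + weekly_pattern[t % 7] for t in range(28)]

# With a perfectly stable weekly pattern, a single seasonal difference (D=1, m=7)
# removes the seasonality entirely.
differenced = seasonal_difference(series, m=7)

print(len(series), len(differenced))
print(set(differenced))
```

Each seasonal difference shortens the series by m observations, which is one reason to keep max_D small for short series.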
Calculate Constancy
The constancy check is a data validation utility that operates on each grouped series, determining whether or not it can be modeled.
The output of the diviner.PmdarimaAnalyzer.calculate_is_constant() method is a dictionary of {<group_key>: <Boolean constancy check>}. Any group with a True result is ineligible for modeling, as this indicates that the series holds the same constant value for every datetime period.
Example:
constancy_checks = analyzer.calculate_is_constant()
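A common follow-up is to separate the groups that can be modeled from those that cannot. The sketch below imitates the documented output shape in plain Python. The is_constant helper, the group keys, and the series values are illustrative assumptions, not Diviner's implementation.

```python
def is_constant(values):
    """True when every element of the series is the same value."""
    return len(set(values)) <= 1


# Hypothetical grouped series keyed the same way as the analyzer output.
grouped_series = {
    ("US", "warehouse-1"): [12.0, 15.0, 11.0, 14.0],
    ("US", "warehouse-2"): [0.0, 0.0, 0.0, 0.0],
    ("DE", "warehouse-1"): [7.0, 7.0, 8.0, 7.0],
}

# Same shape as the documented {<group_key>: <Boolean constancy check>} result.
constancy_checks = {key: is_constant(values) for key, values in grouped_series.items()}

# Keep only the groups that are eligible for modeling.
modelable_groups = [key for key, constant in constancy_checks.items() if not constant]

print(constancy_checks)
print(modelable_groups)
```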
Calculate Auto Correlation Function
The diviner.PmdarimaAnalyzer.calculate_acf() method is used for calculating the auto-correlation function for each series group. The auto-correlation function values can be used (in conjunction with the partial auto-correlation function results) to select restrictive search values for the ordering terms for AutoARIMA or to manually set the ordering terms ((p, d, q)) for ARIMA.
Note
The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA
or AutoARIMA
is
as follows:
ACF decays gradually, PACF drops to insignificance after lag 1 -> AR model
ACF drops to insignificance after lag 1, PACF decays gradually -> MA model
ACF decays gradually, PACF decays gradually -> ARMA model
These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.
Arguments to the calculate_acf
method:
- unbiased
Auto-covariance denominator flag with values of:
True -> denominator = n - k
False -> denominator = n
Default: False
- nlags
The number of auto-correlation lags to calculate and return.
Default: 40
- qstat
Boolean flag to calculate and return the Q statistic from the Ljung-Box test.
Default: False
- fft
Whether to perform a fast Fourier transformation of the series to calculate the auto-correlation function. For large time series, it is highly recommended to set this to True. Allowable values: True, False, or None.
Default: None
- alpha
If specified as a float, calculates and returns confidence intervals at this certainty level for the auto-correlation function values. For example, if alpha=0.1, 90% confidence intervals are calculated and returned wherein the standard deviation is computed according to Bartlett’s formula.
Default: None
- missing
Handling of NaN values in series data. Available options are:
'none' - no validation checks are performed.
'raise' - an Exception is raised if a missing value is detected.
'conservative' - NaN values are removed from the mean and cross-product calculations but are not removed from the series data.
'drop' - NaN values are removed from the series data.
Default: 'none'
- adjusted
Deprecation handler for the underlying statsmodels argument that replaces the unbiased argument. This is a duplicated value for the denominator mode of calculation for the autocovariance of the series.
Default: False
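The effect of the unbiased flag on the denominator can be seen in a small hand-rolled calculation. The sketch below is a plain-Python illustration of the textbook autocorrelation formula, with a hypothetical function name and toy data. It is not the statsmodels or pmdarima code path that Diviner calls.

```python
def autocorrelation(series, nlags, unbiased=False):
    """Autocorrelation for lags 0..nlags using an n or n - k denominator."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]

    def autocovariance(k):
        cross_products = sum(centered[t] * centered[t + k] for t in range(n - k))
        denominator = (n - k) if unbiased else n
        return cross_products / denominator

    gamma_0 = autocovariance(0)
    return [autocovariance(k) / gamma_0 for k in range(nlags + 1)]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
acf_n = autocorrelation(series, nlags=3)                         # denominator = n
acf_n_minus_k = autocorrelation(series, nlags=3, unbiased=True)  # denominator = n - k

print(acf_n)
print(acf_n_minus_k)
```

With the n - k denominator, higher lags are scaled up relative to the n denominator, which matters most for short series.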
Calculate Partial Auto Correlation Function
The diviner.PmdarimaAnalyzer.calculate_pacf()
method is used for determining the partial auto-correlation function
for each series group. When combined with Calculate Auto Correlation Function results, ordering values can be estimated (or controlled
in search space scope for AutoARIMA
). See the notes in Calculate Auto Correlation Function for how to use the results from these two methods.
Arguments to the calculate_pacf
method:
- nlags
The number of partial auto-correlation lags to calculate and return.
Default:
40
- method
The method employed for calculating the partial auto-correlation function. Methods and their explanations are listed in the pmdarima docs.
Default:
'ywadjusted'
- alpha
If specified as a float, calculates and returns confidence intervals at this certainty level for the partial auto-correlation function values. For example, if alpha=0.1, 90% confidence intervals are calculated and returned wherein the standard deviation is computed as 1/sqrt(n) for a series of n observations.
Default: None
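The relationship between the two functions can be illustrated by deriving partial auto-correlations from auto-correlation values with the Durbin-Levinson recursion. This is a plain-Python sketch with a hypothetical function name. It is a different technique from the 'ywadjusted' default listed above and is shown only to make the AR pattern from the earlier note tangible.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations for lags 0..len(rho)-1 from ACF values (rho[0] == 1)."""
    pacf = [1.0]
    previous = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, len(rho)):
        numerator = rho[k] - sum(previous[j] * rho[k - 1 - j] for j in range(k - 1))
        denominator = 1.0 - sum(previous[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = numerator / denominator
        # Update the lower-order coefficients, then append the new one.
        current = [previous[j] - phi_kk * previous[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        previous = current
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.5: rho_k = 0.5 ** k.
ar1_acf = [0.5 ** k for k in range(4)]
ar1_pacf = pacf_from_acf(ar1_acf)

print(ar1_acf)
print(ar1_pacf)
```

Here the ACF decays gradually while the PACF is zero beyond lag 1, which is the signature of an AR model.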
Generate Diff
The utility method diviner.PmdarimaAnalyzer.generate_diff()
will generate lag differences for each group.
While not applicable to most timeseries modeling problems, it can prove to be useful in certain situations or as a
diagnostic tool to investigate why a particular series is not fitting properly.
Arguments for this method:
- lag
The magnitude of the lag used in calculating the differencing. Default:
1
- differences
The order of the differencing to be performed. Default:
1
For an illustrative example, see the diff example.
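The meaning of lag and differences can also be shown on a single short series. The sketch below is plain Python with a hypothetical helper name. It illustrates lag differencing in general and is not the routine that generate_diff() runs internally.

```python
def lag_diff(series, lag=1, differences=1):
    """Apply a lag difference to the series, repeated `differences` times."""
    result = list(series)
    for _ in range(differences):
        result = [result[i] - result[i - lag] for i in range(lag, len(result))]
    return result


squares = [1, 4, 9, 16, 25, 36]

first_order = lag_diff(squares)                   # lag=1, differences=1
second_order = lag_diff(squares, differences=2)   # difference of the differences
lag_two = lag_diff(squares, lag=2)                # compare values two steps apart

print(first_order)
print(second_order)
print(lag_two)
```

Each pass shortens the series by lag elements, so higher orders return fewer values than the input.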
Generate Diff Inversion
The utility method diviner.PmdarimaAnalyzer.generate_diff_inversion()
will invert a previously differenced
grouped series.
Arguments for this method:
- group_diff_data
The differenced data from the usage of diviner.PmdarimaAnalyzer.generate_diff().
- lag
The magnitude of the lag that was used in the differencing function in order to revert the diff.
Default: 1
- differences
The order of the differencing that was performed using diviner.PmdarimaAnalyzer.generate_diff() so that the series data can be reverted.
Default: 1
- recenter
If True and 'series_start' exists in group_diff_data dict, will restore the original series range for each group based on the series start value calculated through the generate_diff() method. If the group_diff_data does not contain the starting values, the data will not be re-centered.
Default: False
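The role of the starting value in re-centering can be sketched for the simplest case, lag=1 and differences=1. The helper below is a hypothetical plain-Python illustration. Starting from zero when no series start is available is an assumption of this sketch, not a statement about Diviner's exact output.

```python
from itertools import accumulate


def invert_first_difference(diff, series_start=None):
    """Rebuild a series from its lag-1, first-order differences."""
    start = 0 if series_start is None else series_start
    # A running total of the differences, seeded with the starting value.
    return list(accumulate(diff, initial=start))


diff = [3, 5, 7, 9]  # lag-1 differences of [1, 4, 9, 16, 25]

recentered = invert_first_difference(diff, series_start=1)
not_recentered = invert_first_difference(diff)

print(recentered)
print(not_recentered)
```

Without the starting value, the shape of the series is recovered but its level is shifted, which is why the 'series_start' entry is needed for a faithful inversion.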
Class Signature of PmdarimaAnalyzer
- class diviner.PmdarimaAnalyzer(df, group_key_columns, y_col, datetime_col)[source]
- calculate_acf(unbiased=False, nlags=None, qstat=False, fft=None, alpha=None, missing='none', adjusted=False)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility for calculating the autocorrelation function for each group. Combined with a partial autocorrelation function calculation, the return values can greatly assist in setting AR, MA, or ARMA terms for a given model.
The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA (or AutoARIMA) is as follows:
ACF decays gradually, PACF drops to insignificance after lag 1 -> AR model
ACF drops to insignificance after lag 1, PACF decays gradually -> MA model
ACF decays gradually, PACF decays gradually -> ARMA model
These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.
- Parameters
unbiased –
Boolean flag that sets the autocovariance denominator to 'n-k' if True and n if False.
Note: This argument is deprecated and removed in versions of pmdarima > 2.0.0
Default: False
nlags –
The count of autocorrelation lags to calculate and return.
Default: 40
qstat –
Boolean flag to calculate and return the Ljung-Box statistic for each lag.
Default: False
fft –
Boolean flag for whether to use fast Fourier transformation (fft) for computing the autocorrelation function. FFT is recommended for large time series data sets.
Default: None
alpha –
If specified, calculates and returns the confidence intervals for the acf values at the level set (i.e., for 90% confidence, an alpha of 0.1 would be set)
Default: None
missing –
Handling of NaN values in the series data.
Available options: ['none', 'raise', 'conservative', 'drop'].
none: no checks are performed.
raise: an Exception is raised if NaN values are in the series.
conservative: the autocovariance is calculated with NaN values removed from the mean and cross-product calculations, but the NaN values are not eliminated from the series.
drop: NaN values are removed from the series and values adjacent to NaNs are treated as contiguous (which may invalidate the results in certain situations).
Default: 'none'
adjusted – Deprecation handler for the underlying statsmodels argument that replaces the unbiased argument. This is a duplicated value for the denominator mode of calculation for the autocovariance of the series.
- Returns
Dictionary of
{<group_key>: {<acf terms>: <values as array>}}
- calculate_is_constant()[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method for determining whether a series is composed entirely of the same value (e.g., a series of {1, 2, 3, 4, 5, 1, 2, 3} will return False, while a series of {1, 1, 1, 1, 1, 1, 1, 1, 1} will return True).
- Returns
Dictionary of
{<group_key>: <Boolean constancy check>}
- calculate_ndiffs(alpha=0.05, test='kpss', max_d=2)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method for determining the optimal d value for ARIMA ordering. Calculating this as a fixed value can dramatically decrease the tuning time for pmdarima models.
- Parameters
alpha –
Significance level for determining whether the p-value used for testing a value of 'd' is significant.
Default: 0.05
test –
Type of stationarity test to use. Supported values: ['kpss', 'adf', 'pp']
See:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.KPSSTest.html#pmdarima.arima.KPSSTest
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.PPTest.html#pmdarima.arima.PPTest
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.ADFTest.html#pmdarima.arima.ADFTest
Default: 'kpss'
max_d –
The max value for d to test.
Default: 2
- Returns
Dictionary of
{<group_key>: <optimal 'd' value>}
- calculate_nsdiffs(m, test='ocsb', max_D=2)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method for determining the optimal D value for seasonal SARIMAX ordering of ('P', 'D', 'Q').
- Parameters
m – The number of seasonal periods in the series.
test –
Type of unit test for seasonality. Supported tests: ['ocsb', 'ch']
See:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.OCSBTest.html#pmdarima.arima.OCSBTest
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.CHTest.html#pmdarima.arima.CHTest
Default: 'ocsb'
max_D –
Maximum number of seasonal differences to test for.
Default: 2
- Returns
Dictionary of {<group_key>: <optimal 'D' value>}
- calculate_pacf(nlags=None, method='ywadjusted', alpha=None)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility for calculating the partial autocorrelation function for each group. In conjunction with the autocorrelation function calculate_acf, the values returned from a pacf calculation can assist in setting values or bounds on AR, MA, and ARMA terms for an ARIMA model.
The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA (or AutoARIMA) is as follows:
ACF decays gradually, PACF drops to insignificance after lag 1 -> AR model
ACF drops to insignificance after lag 1, PACF decays gradually -> MA model
ACF decays gradually, PACF decays gradually -> ARMA model
These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.
- Parameters
nlags –
The count of partial autocorrelation lags to calculate and return.
Default:
40
method –
The method used for pacf calculation. See the pmdarima docs for full listing of methods:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.utils.pacf.html
Default:
'ywadjusted'
alpha –
If specified, returns confidence intervals based on the alpha value supplied.
Default:
None
- Returns
Dictionary of
{<group_key>: {<pacf terms>: <values as array>}}
- decompose_groups(m, type_, filter_=None)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method that wraps pmdarima.arima.decompose() for each group within the passed-in DataFrame. Note: decomposition works best if the total number of entries within the series being decomposed is a multiple of the m parameter value.
- Parameters
m – The frequency of the endogenous series. (i.e., for daily data, an m value of 7 would be appropriate for estimating a weekly seasonality, while setting m to 365 would be effective for yearly seasonality effects.)
type_ –
The type of decomposition to perform. One of: ['additive', 'multiplicative']
See: https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.decompose.html
filter_ –
Optional Array for performing convolution. This is specified as a filter for coefficients (the Moving Average and/or Auto Regressor coefficients) in reverse time order in order to filter out a seasonal component.
Default: None
- Returns
Pandas DataFrame with the decomposed trends for each group.
- generate_diff(lag=1, differences=1)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
A utility for generating the array diff (lag differences) for each group. To support invertibility, this method will return the starting value of each array as well as the differenced values.
- Parameters
lag –
Determines the magnitude of the lag to calculate the differencing function for.
Default:
1
differences –
The order of the differencing to be performed. Note that values > 1 will generate n fewer results.
Default:
1
- Returns
Dictionary of
{<group_key>: {"series_start": <float>, "diff": <diff_array>}}
- static generate_diff_inversion(group_diff_data, lag=1, differences=1, recenter=False)[source]
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
A utility for inverting a previously differenced group of timeseries data. This utility supports returning each group’s series data to the original range of the data if the recenter argument is set to True and the start conditions are contained within the group_diff_data argument’s dictionary structure.
- Parameters
group_diff_data – Differenced payload consisting of a dictionary of {<group_key>: {'diff': <differenced data>, [optional]'series_start': float}}
lag –
The lag to use to perform the differencing inversion.
Default: 1
differences –
The order of differencing to be used during the inversion.
Default: 1
recenter –
If True and 'series_start' exists in group_diff_data dict, will restore the original series range for each group based on the series start value calculated through the generate_diff() method. If the group_diff_data does not contain the starting values, the data will not be re-centered.
Default: False
- Returns
Dictionary of
{<group_key>: <series_inverted_data>}