Grouped pmdarima
The Grouped pmdarima API is a multi-series orchestration framework for building multiple individual models
of related, but isolated series data. For example, a project that required the forecasting of inventory demand at
regional warehouses around the world would historically require individual orchestration of data acquisition, hyperparameter
definitions, model training, metric validation, serialization, and registration of tens of thousands of individual
models based on the permutations of SKU and warehouse location.
This API consolidates the many thousands of models that would otherwise need to be implemented, trained, and managed individually throughout their frequent retraining and forecasting lifecycles into a single high-level API, simplifying these common use cases for projects that rely on the pmdarima forecasting library.
Grouped pmdarima API
The following sections provide a basic overview of using the GroupedPmdarima API,
from fitting of the grouped models, predicting forecasted data, saving, loading, and customization of the underlying
pmdarima instances.
To see working end-to-end examples, you can go to Tutorials and Examples. The examples will allow you to explore the data structures required for training, how to extract forecasts for each group, and demonstrations of the saving and loading of trained models.
Base Estimators and API interface
The usage of the GroupedPmdarima API is slightly different from the other grouped
forecasting library wrappers within Diviner. This is due to the ability of pmdarima to support
multiple modes of configuration.
The modes available for constructing a model are:
- Passing an ARIMA model template (a wrapper around statsmodels ARIMA)
- Using the native pmdarima AutoARIMA model template
- Constructing a pmdarima Pipeline template
The GroupedPmdarima implementation requires the submission of one of these 3
model templates to set the base configured model architecture for each group.
For example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
# Define the base ARIMA with a preset ordering parameter
base_arima_model = ARIMA(order=(1, 0, 2))
# Define the model template in the GroupedPmdarima constructor
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
The above example is intended only to showcase the interface between a base estimator (base_arima_model) and the
instance constructor for GroupedPmdarima. For a more in-depth and realistic example of utilizing an ARIMA model manually,
see the additional statistical validation steps that would be required for this in the Tutorials and Examples section
of the docs.
Model fitting
In order to fit a GroupedPmdarima model instance, the fit
method is used. Calling this method will process the input DataFrame to create a grouped execution collection,
fit a pmdarima model type on each individual series, and persist the trained state of each group’s model to the
object instance.
The arguments for the fit method are:
- df
A ‘normalized’ DataFrame that contains an endogenous regressor column (the ‘y’ column), a date (or datetime) column that defines the ordering, periodicity, and frequency of each series (if this column is a string, the frequency will be inferred), and grouping column(s) that define the discrete series to be modeled. For further information on the structure of this DataFrame, see the quickstart guide.
- group_key_columns
The names of the columns within df that, when combined (in the order supplied), define distinct series. See the quickstart guide for further information.
- y_col
Name of the endogenous regressor term within the DataFrame argument df. This column contains the values of the series that are used during training.
- datetime_col
Name of the column within the df argument DataFrame that defines the datetime ordering of the series data.
- exog_cols
[Optional] A collection of column names within the submitted data that contain exogenous regressor elements to use as part of model fitting and predicting. The data within each column will be assembled into a 2D array for use in the regression.
Note
pmdarima currently has exogenous regressor support marked for future deprecation. Usage of this
functionality is not recommended except for existing legacy implementations.
- ndiffs
[Optional] A dictionary of {<group_key>: <d value>} for the differencing term for each group. This is intended to function alongside the output from the diviner.PmdarimaAnalyzer.calculate_ndiffs() method, serving to reduce the search space for AutoARIMA by supplying fixed d values to each group’s model.
- nsdiffs
[Optional] A dictionary of {<group_key>: <D value>} for the seasonal differencing term for each group. This is intended to function alongside the output from the diviner.PmdarimaAnalyzer.calculate_nsdiffs() method, serving to reduce the search space for AutoARIMA by supplying fixed D values to each group’s model.
Note
These values will only be used if the models being fit are seasonal models. The value m must be set on the
underlying ARIMA or AutoARIMA model for seasonality order components to be used.
- silence_warnings
[Optional] Whether to silence stdout reporting of the underlying pmdarima fit process. Default: False.
- fit_kwargs
[Optional] fit_kwargs for pmdarima ARIMA, AutoARIMA, or Pipeline stage overrides. For more information, see the pmdarima docs.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_arima_model = ARIMA(order=(1, 0, 2))
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
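The df passed to fit in the example above is not defined there. The following is a minimal sketch of the normalized (long-format) layout that the arguments describe, built with the standard library only. The helper name build_rows, the group values, the date range, and the sales figures are all invented for illustration and are not part of Diviner.

```python
import random
from datetime import date, timedelta


def build_rows(groups, start, n_days, seed=42):
    """Return one row per (group, day): grouping columns, a date column, and a y column."""
    rng = random.Random(seed)
    rows = []
    for country, region in groups:
        level = rng.uniform(100, 500)  # invented baseline level for this group
        for offset in range(n_days):
            rows.append(
                {
                    "country": country,  # first grouping column
                    "region": region,  # second grouping column
                    "date": start + timedelta(days=offset),  # datetime_col
                    "sales": round(level + rng.gauss(0, 10), 2),  # y_col
                }
            )
    return rows


groups = [("US", "NY"), ("FR", "Paris")]
rows = build_rows(groups, start=date(2021, 1, 1), n_days=90)
# With pandas installed, pd.DataFrame(rows) would give a DataFrame in this layout.
```

Each distinct (country, region) pair forms one series, so the data above would produce two fitted models.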
Predict
The predict method generates forecast data for each grouped series within
the meta diviner.GroupedPmdarima model.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_arima_model = ARIMA(order=(1, 0, 2))
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
forecasts = grouped_arima_model.predict(n_periods=30)
The arguments for the predict method are:
- n_periods
The number of future periods to generate from the end of each group’s series. The first value of the prediction forecast series will begin at one periodicity value after the end of the training series. For example, if the training series was of daily data from 2019-10-01 to 2021-10-02, the start of the prediction series output would be 2021-10-03 and continue for n_periods days from that point.
- predict_col
[Optional] The name to use for the generated column containing forecasted data. Default: "yhat"
- alpha
[Optional] Confidence interval significance value for error estimates. Default: 0.05.
Note
alpha is only used if the boolean flag return_conf_int is set to True.
- return_conf_int
[Optional] Boolean flag for whether or not to calculate confidence intervals for the predicted forecasts. If True, the columns "yhat_upper" and "yhat_lower" will be added to the output DataFrame for the upper and lower confidence intervals for the predictions.
- inverse_transform
[Optional] Used exclusively for Pipeline based models that include an endogenous transformer such as BoxCoxEndogTransformer or LogEndogTransformer. Default: True (although it only applies if the model_template type passed in is a Pipeline that contains a transformer). An inversion of the endogenous regression term can be helpful for distributions that are highly non-normal. For further reading on the purpose of these functions, why they are used, and how they might be applicable to a given time series, see data transformation.
- exog
[Optional] If the original model was trained with exogenous regressor elements, the prediction will require these 2D arrays at prediction time. This argument is used to hold the 2D array of future exogenous regressor values to be used in generating the prediction for the regressor.
- predict_kwargs
[Optional] Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of sklearn Pipeline format, i.e., <stage_name>__<arg name>=<value>. For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}
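The n_periods description above can be checked with a few lines of date arithmetic. This is only a sketch that assumes a fixed daily frequency; forecast_dates is a hypothetical helper, not a Diviner function.

```python
from datetime import date, timedelta


def forecast_dates(last_training_date, n_periods, period=timedelta(days=1)):
    """Dates covered by a forecast that starts one period after training ends."""
    return [last_training_date + period * i for i in range(1, n_periods + 1)]


# Training data ends on 2021-10-02, as in the n_periods description.
dates = forecast_dates(date(2021, 10, 2), n_periods=30)
first_forecast, last_forecast = dates[0], dates[-1]
```

Here first_forecast is the day after the final training observation, and the list holds exactly n_periods entries.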
Predict Groups
The predict_groups method generates forecast data for a subset of
groups that a diviner.GroupedPmdarima model was trained upon.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_model = ARIMA(order=(2, 1, 2))
grouped_arima = GroupedPmdarima(model_template=base_model)
model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
subset_forecasts = model.predict_groups(groups=[("US", "NY"), ("FR", "Paris"), ("UA", "Kyiv")], n_periods=90)
The arguments for the predict_groups method are:
- groups
A collection of one or more groups for which to generate a forecast. The collection of groups must be submitted as a List[Tuple[str]] to identify the order-specific group values to retrieve the correct model. For instance, if the model was trained with the specified group_key_columns of ["country", "city"], a valid groups entry would be: [("US", "LosAngeles"), ("CA", "Toronto")]. Changing the order within the tuples will not resolve (e.g. [("NewYork", "US")] would not find the appropriate model).
Note
Groups that are submitted for prediction that are not present in the trained model will, by default, cause an Exception to be raised. This behavior can be changed to a warning or ignore status with the argument on_error.
- n_periods
The number of future periods to generate from the end of each group’s series. The first value of the prediction forecast series will begin at one periodicity value after the end of the training series. For example, if the training series was of daily data from 2019-10-01 to 2021-10-02, the start of the prediction series output would be 2021-10-03 and continue for n_periods days from that point.
- predict_col
[Optional] The name to use for the generated column containing forecasted data. Default: "yhat"
- alpha
[Optional] Confidence interval significance value for error estimates. Default: 0.05.
Note
alpha is only used if the boolean flag return_conf_int is set to True.
- return_conf_int
[Optional] Boolean flag for whether or not to calculate confidence intervals for the predicted forecasts. If True, the columns "yhat_upper" and "yhat_lower" will be added to the output DataFrame for the upper and lower confidence intervals for the predictions.
- inverse_transform
[Optional] Used exclusively for Pipeline based models that include an endogenous transformer such as BoxCoxEndogTransformer or LogEndogTransformer. Default: False, per the predict_groups signature (it only applies if the model_template type passed in is a Pipeline that contains a transformer). An inversion of the endogenous regression term can be helpful for distributions that are highly non-normal. For further reading on the purpose of these functions, why they are used, and how they might be applicable to a given time series, see data transformation.
- exog
[Optional] If the original model was trained with exogenous regressor elements, the prediction will require these 2D arrays at prediction time. This argument is used to hold the 2D array of future exogenous regressor values to be used in generating the prediction for the regressor.
- on_error
[Optional] Default: "raise". Dictates the behavior for handling group keys submitted in the groups argument that do not match a group identified and registered during training (fit). The modes are:
"raise": A DivinerException is raised if any supplied groups do not match the fitted groups.
"warn": A warning is emitted (printed) and logged for any groups that do not match those that the model was fit with.
"ignore": Invalid groups will silently fail prediction.
Note
A DivinerException will still be raised even in "ignore" mode if none of the groups provided to this method match a fitted group.
- predict_kwargs
[Optional] Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of sklearn Pipeline format, i.e., <stage_name>__<arg name>=<value>. For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}
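The three on_error modes, and the order-sensitive tuple lookup described for groups, can be sketched as follows. This is an illustration of the documented behavior, not Diviner’s implementation: resolve_groups and GroupLookupError are hypothetical names, with GroupLookupError standing in for DivinerException.

```python
import warnings


class GroupLookupError(Exception):
    """Stand-in for DivinerException in this sketch."""


def resolve_groups(requested, fitted, on_error="raise"):
    """Return the requested group keys that exist in the fitted collection."""
    if on_error not in {"raise", "warn", "ignore"}:
        raise ValueError(f"Unknown on_error mode: {on_error}")
    valid = [group for group in requested if group in fitted]
    missing = [group for group in requested if group not in fitted]
    if missing and on_error == "raise":
        raise GroupLookupError(f"Groups not found in the fit model: {missing}")
    if missing and on_error == "warn":
        warnings.warn(f"Groups not found in the fit model and skipped: {missing}")
    # Even in "ignore" mode, having nothing left to predict is an error.
    if not valid:
        raise GroupLookupError("None of the requested groups match a fitted group.")
    return valid


# Keys are tuples in group_key_columns order, here ("country", "region").
fitted = {("US", "NY"): "model_a", ("FR", "Paris"): "model_b"}
found = resolve_groups([("US", "NY"), ("NY", "US")], fitted, on_error="ignore")
```

The reversed tuple ("NY", "US") does not match any fitted key, so only ("US", "NY") is kept.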
Save
Saves a GroupedPmdarima instance that has been fit.
The serialization of the model instance uses a base64 encoding of the pickle serialization of each model instance within
the grouped structure.
Example:
from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima
base_arima_model = ARIMA(order=(1, 0, 2))
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
save_location = "/path/to/saved/model"
grouped_arima_model.save(save_location)
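The serialization approach described above (pickling each group’s model, then base64 encoding the bytes) can be illustrated with the standard library. This is a sketch of the general technique only; the file layout Diviner actually writes is not shown in this document, and encode_models, decode_models, and the placeholder model dictionaries are hypothetical.

```python
import base64
import json
import pickle


def encode_models(models):
    """Pickle each per-group object and base64-encode it so the payload is JSON-safe."""
    return {
        json.dumps(list(key)): base64.b64encode(pickle.dumps(model)).decode("utf-8")
        for key, model in models.items()
    }


def decode_models(payload):
    """Reverse encode_models: restore tuple keys and unpickle each object."""
    return {
        tuple(json.loads(key)): pickle.loads(base64.b64decode(value))
        for key, value in payload.items()
    }


# Placeholder objects stand in for fitted per-group models.
models = {("US", "NY"): {"order": (1, 0, 2)}, ("FR", "Paris"): {"order": (2, 1, 2)}}
serialized = json.dumps(encode_models(models))
restored = decode_models(json.loads(serialized))
```

Base64 encoding turns the binary pickle output into plain text, which is what allows it to be stored inside a text format such as JSON.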
Load
Loads a GroupedPmdarima serialized model from a storage location.
Example:
from diviner import GroupedPmdarima
load_location = "/path/to/saved/model"
loaded_model = GroupedPmdarima.load(load_location)
Note
The load method is a class method. As such, the initialization argument
model_template does not need to be provided. It will be set on the loaded object based on the template that was
provided during initial training before serialization.
Utilities
Parameter Extraction
To extract the parameters that were set during the fitting of each individual model within the grouped collection
(either explicitly or, in the case of AutoARIMA, chosen by the search), use the method get_model_params.
It returns the per-group parameters for each model in a Pandas DataFrame.
Note
The parameters can only be extracted from a GroupedPmdarima model that has been fit.
Example:
from pmdarima.arima.auto import AutoARIMA
from diviner import GroupedPmdarima
base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
fit_parameters = grouped_arima_model.get_model_params()
Metrics Extraction
This functionality allows for the retrieval of fitting metrics that are attached to the underlying SARIMA model.
These are not the typical loss metrics that can be calculated through cross validation backtesting.
The metrics that are returned from fitting are:
hqic (Hannan-Quinn information criterion)
aicc (Corrected Akaike information criterion; aic for small sample sizes)
oob (out of bag error)
bic (Bayesian information criterion)
aic (Akaike information criterion)
Note
Out of bag error metric (oob) is only calculated if the underlying ARIMA model has a value set for the argument
out_of_sample_size. See out-of-bag-error for more information.
Example:
from pmdarima.arima.auto import AutoARIMA
from diviner import GroupedPmdarima
base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
fit_metrics = grouped_arima_model.get_metrics()
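For readers unfamiliar with the information criteria listed above, the textbook definitions can be computed from a model’s log-likelihood, its number of estimated parameters, and the number of observations. The sketch below uses those standard formulas; the exact parameter count used by the underlying library may differ, and the input values are invented.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Textbook AIC, AICc, BIC, and HQIC for a fitted model."""
    k, n = n_params, n_obs
    aic = 2 * k - 2 * log_likelihood
    return {
        "aic": aic,
        # Small-sample correction: the penalty grows as k approaches n.
        "aicc": aic + (2 * k * (k + 1)) / (n - k - 1),
        "bic": k * math.log(n) - 2 * log_likelihood,
        "hqic": 2 * k * math.log(math.log(n)) - 2 * log_likelihood,
    }


criteria = information_criteria(log_likelihood=-120.0, n_params=4, n_obs=100)
```

For all four criteria lower values are better, and they are only comparable between models fit to the same series.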
Cross Validation Backtesting
Cross validation utilizing backtesting is the primary means of evaluating whether a given model will perform robustly in generating forecasts based on time period horizon events throughout the historical series.
In order to use the cross validation functionality in the method diviner.GroupedPmdarima.cross_validate(),
one of two windowing split objects, RollingForecastCV or SlidingWindowForecastCV, must be passed into the method.
Arguments to the diviner.GroupedPmdarima.cross_validate() method:
- df
The original source DataFrame that was used during diviner.GroupedPmdarima.fit() and that contains the endogenous series data. This DataFrame must contain the columns that define the constructed groups (i.e., missing group data will not be scored and groups that are not present in the model object will raise an Exception).
- metrics
A collection of metric names to be used for evaluation. Default: {"smape", "mean_absolute_error", "mean_squared_error"}
- cross_validator
The cross validation object (either RollingForecastCV or SlidingWindowForecastCV). See the example below for how to submit this object to the diviner.GroupedPmdarima.cross_validate() method.
- error_score
The default value to assign to a window evaluation if an error occurs during loss calculation. Default: np.nan. In order to throw an Exception, a str value of “raise” can be provided. Otherwise, supply a float.
- verbosity
Level of verbosity to print during training and cross validation. The lower the integer value, the fewer lines of debugging text are printed to stdout. Default: 0 (no printing)
Example:
import numpy as np
from pmdarima.arima.auto import AutoARIMA
from pmdarima.model_selection import SlidingWindowForecastCV
from diviner import GroupedPmdarima
base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)
grouped_arima = GroupedPmdarima(model_template=base_arima_model)
grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")
# Train on 365 observations, forecast the next 28, then advance the window by 180
cv_window = SlidingWindowForecastCV(h=28, step=180, window_size=365)
grouped_arima_cv = grouped_arima_model.cross_validate(
    df=df,
    metrics=["mean_squared_error", "smape"],
    cross_validator=cv_window,
    error_score=np.nan,
    verbosity=1,
)
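To make the windowing parameters in the example above concrete, the sketch below enumerates train and test index ranges for a sliding window. It illustrates the general idea and is not pmdarima’s implementation; sliding_window_splits is a hypothetical helper and the series length of 730 observations is assumed.

```python
def sliding_window_splits(n_samples, h, step, window_size):
    """Enumerate (train, test) index ranges for a fixed-size sliding window."""
    splits = []
    start = 0
    # Stop once the training window plus the forecast horizon no longer fits.
    while start + window_size + h <= n_samples:
        train = range(start, start + window_size)
        test = range(start + window_size, start + window_size + h)
        splits.append((train, test))
        start += step
    return splits


# Two years of daily data with the settings from the example above.
splits = sliding_window_splits(n_samples=730, h=28, step=180, window_size=365)
n_windows = len(splits)
```

Smaller values of h or step produce more windows, and each window requires fitting a model for every group, which is why low values increase execution time.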
Class Signature of GroupedPmdarima
- class diviner.GroupedPmdarima(model_template)
- cross_validate(df, metrics, cross_validator, error_score=nan, verbosity=0)
Method for performing cross validation on each group of the fit model. The cross_validator supplied to this method will be used to perform either rolling or sliding window prediction validation throughout the data set. Windowing behavior for the cross validation must be defined and configured through the cross_validator that is submitted. See https://alkaline-ml.com/pmdarima/modules/classes.html#cross-validation-split-utilities for details on the underlying implementation of cross validation with pmdarima.
- Parameters
df – A DataFrame that contains the endogenous series and the grouping key columns that were defined during training. Any missing key entries will not be scored. Note that each group defined within the model will be retrieved from this DataFrame; keys that do not exist will raise an Exception.
metrics – A list of metric names, or a string containing a single metric name, to use for cross validation metric calculation.
cross_validator – A cross validator instance from pmdarima.model_selection (RollingForecastCV or SlidingWindowForecastCV). Note: setting low values of h or step will dramatically increase execution time.
error_score – Default value to assign to a score calculation if an error occurs in a given window iteration. Default: np.nan (a silent ignore of the failure)
verbosity – Print verbosity level for pmdarima’s cross validation stages. Default: 0 (no printing to stdout)
- Returns
Pandas DataFrame containing the group information and calculated cross validation metrics for each group.
- fit(df, group_key_columns, y_col: str, datetime_col: str, exog_cols: Optional[List[str]] = None, ndiffs: Optional[Dict] = None, nsdiffs: Optional[Dict] = None, silence_warnings: bool = False, **fit_kwargs)
Fit method for training a pmdarima model on the submitted normalized DataFrame. When initialized, the input DataFrame will be split into an iterable collection of grouped data sets based on the group_key_columns arguments, which is then used to fit individual pmdarima models (or a supplied Pipeline) upon the templated object supplied as a class instance argument model_template. For API information for pmdarima’s ARIMA, AutoARIMA, and Pipeline APIs, see: https://alkaline-ml.com/pmdarima/modules/classes.html#api-ref
- Parameters
df – A normalized group data set consisting of a datetime column that defines ordering of the series, an endogenous regressor column that specifies the series data for training (e.g. y_col), and column(s) that define the grouping of the series data.
An example normalized data set:
region       zone  country  ds            y
'northeast'  1     "US"     "2021-10-01"  1234.5
'northeast'  2     "US"     "2021-10-01"  3255.6
'northeast'  1     "US"     "2021-10-02"  1255.9
Wherein the group_key_columns could be one, some, or all of ['region', 'zone', 'country'], the datetime_col would be the ‘ds’ column, and the series y_col (endogenous regressor) would be ‘y’.
group_key_columns – The columns in the df argument that define, in aggregate, a unique time series entry. For example, with the DataFrame referenced in the df param, group_key_columns could be: ('region', 'zone') or ('region',) or ('country', 'region', 'zone')
y_col – The name of the column within the DataFrame input to any method within this class that contains the endogenous regressor term (the raw data that will be used to train and use as a basis for forecasting).
datetime_col – The name of the column within the DataFrame input that defines the datetime or date values associated with each row of the endogenous regressor (y_col) data.
exog_cols – An optional collection of column names within the submitted data to class methods that contain exogenous regressor elements to use as part of model fitting and predicting. Default: None
ndiffs – Optional overrides to the d ARIMA differencing term for stationarity enforcement. The structure of this argument is a dictionary in the form of: {<group_key>: <d_term>}. To calculate, use diviner.PmdarimaAnalyzer.calculate_ndiffs(). Default: None
nsdiffs – Optional overrides to the D SARIMAX seasonal differencing term for seasonal stationarity enforcement. The structure of this argument is a dictionary in the form of: {<group_key>: <D_term>}. To calculate, use diviner.PmdarimaAnalyzer.calculate_nsdiffs(). Default: None
silence_warnings – If True, removes SARIMAX and underlying optimizer warning messages from stdout printing. With a sufficiently large number of groups to process, the volume of these messages to stdout may become very large. Default: False
fit_kwargs – fit_kwargs for pmdarima’s ARIMA, AutoARIMA, or Pipeline stage overrides. For more information, see the pmdarima docs: https://alkaline-ml.com/pmdarima/index.html
- Returns
Object instance of GroupedPmdarima with the persisted fit model attached.
- get_metrics()
Retrieve the ARIMA fit metrics that are generated during the AutoARIMA or ARIMA training event. Note: These metrics are not validation metrics. Use the cross_validate() method for retrieving back-testing error metrics.
- Returns
Pandas DataFrame with metrics provided as columns and a row entry per group.
- get_model_params()
Retrieve the parameters from the fit model_template that was passed in and return them in a denormalized Pandas DataFrame. Parameters in the returned DataFrame are columns with a row for each group defined during fit().
- Returns
Pandas DataFrame with fit parameters for each group.
- classmethod load(path: str)
Load a GroupedPmdarima instance from a saved serialized version. Note: This is a class method and, as such, a GroupedPmdarima instance does not need to be initialized in order to load a saved model. For example: loaded_model = GroupedPmdarima.load(<location>)
- Parameters
path – The path to a serialized instance of GroupedPmdarima
- Returns
The GroupedPmdarima instance that was saved.
- predict(n_periods: int, predict_col: str = 'yhat', alpha: float = 0.05, return_conf_int: bool = False, inverse_transform: bool = True, exog=None, **predict_kwargs)
Prediction method for generating forecasts for each group that has been trained as part of a call to fit(). Note that pmdarima’s API does not support predictions outside of the defined datetime frequency that was validated during training (i.e., if the series endogenous data is at an hourly frequency, the generated predictions will be at an hourly frequency and cannot be modified from within this method).
- Parameters
n_periods – The number of future periods to generate. The start of the generated predictions will be 1 frequency period after the maximum datetime value per group during training. For example, a data set used for training that has a datetime frequency in days that ends on 7/10/2021 will, with a value of n_periods=7, start its prediction on 7/11/2021 and generate daily predicted values up to and including 7/17/2021.
predict_col – The name to be applied to the column containing predicted data. Default: 'yhat'
alpha – Optional value for setting the confidence intervals for error estimates. Note: this is only utilized if return_conf_int is set to True. Default: 0.05 (representing a 95% CI)
return_conf_int – Boolean flag for whether to calculate confidence interval error estimates for predicted values. The intervals of yhat_upper and yhat_lower are based on the alpha parameter. Default: False
inverse_transform – Optional argument used only for Pipeline models that include either a BoxCoxEndogTransformer or a LogEndogTransformer. Default: True
exog – Exogenous regressor components as a 2-D array. Note: if the model is trained with exogenous regressor components, this argument is required. Default: None
predict_kwargs – Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of sklearn Pipeline format, i.e., <stage_name>__<arg name>=<value>. For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}
- Returns
A consolidated (unioned) single DataFrame of predictions per group.
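The <stage_name>__<arg name> naming convention used by predict_kwargs can be illustrated with a small routing function. This sketch only shows how such names can be split into per-stage arguments; it is not how pmdarima or Diviner process them internally, and split_stage_kwargs and the some_predict_arg key are hypothetical.

```python
def split_stage_kwargs(kwargs):
    """Separate '<stage>__<arg>' overrides from plain keyword arguments."""
    stage_kwargs, model_kwargs = {}, {}
    for name, value in kwargs.items():
        stage, separator, arg = name.partition("__")
        if separator and stage and arg:
            # Double underscore found: route the argument to the named stage.
            stage_kwargs.setdefault(stage, {})[arg] = value
        else:
            model_kwargs[name] = value
    return stage_kwargs, model_kwargs


stage_kwargs, model_kwargs = split_stage_kwargs(
    {"fourier__n_periods": 45, "some_predict_arg": True}
)
```

The stage name on the left of the double underscore must match the name given to that stage when the Pipeline was constructed.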
- predict_groups(groups: List[Tuple[str]], n_periods: int, predict_col: str = 'yhat', alpha: float = 0.05, return_conf_int: bool = False, inverse_transform: bool = False, exog=None, on_error: str = 'raise', **predict_kwargs)
This is a prediction method that allows for generating a subset of forecasts based on the collection of keys. By specifying individual groups in the groups argument, a limited scope forecast can be performed without incurring the runtime costs associated with predicting all groups.
- Parameters
groups – List[Tuple[str]] the collection of group(s) for which to generate forecast predictions. The group definitions must be the values within the group_key_columns that were used during the fit of the model in order to return valid forecasts.
Note
The positional ordering of the values is important and must match the order of group_key_columns for the fit argument to provide correct prediction forecasts.
n_periods – The number of row events to forecast
predict_col – The name of the column in the output DataFrame that contains the forecasted series data. Default: "yhat"
alpha – Optional value for setting the confidence intervals for error estimates. Note: this is only utilized if return_conf_int is set to True. Default: 0.05 (representing a 95% CI)
return_conf_int – Boolean flag for whether to calculate confidence interval error estimates for predicted values. The intervals of yhat_upper and yhat_lower are based on the alpha parameter. Default: False
inverse_transform – Optional argument used only for Pipeline models that include either a BoxCoxEndogTransformer or a LogEndogTransformer. Default: False
exog – Exogenous regressor components as a 2-D array. Note: if the model is trained with exogenous regressor components, this argument is required. Default: None
predict_kwargs – Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of sklearn Pipeline format, i.e., <stage_name>__<arg name>=<value>. For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}
on_error – Alert level setting for handling mismatched group keys. Default: "raise". The valid modes are:
"ignore" - no logging or exception raising will occur if a submitted group key in the groups argument is not present in the model object.
Note
This is a silent failure mode and will not present any indication of a failure to generate forecast predictions.
"warn" - any keys that are not present in the fit model will be recorded as logged warnings.
"raise" - any keys that are not present in the fit model will cause a DivinerException to be raised.
- Returns
A consolidated (unioned) single DataFrame of forecasts for all groups specified in the groups argument.
- save(path: str)
Serialize and write the instance of this class (if it has been fit) to the path specified. Note: The serialized model is base64 encoded for top-level items and pickle’d for pmdarima individual group models and any Pandas DataFrame.
- Parameters
path – Path to write this model’s instance to.
- Returns
None
Grouped pmdarima Analysis tools
Warning
The PmdarimaAnalyzer module is in experimental mode. The methods and signatures are subject to change in the
future with no deprecation warnings.
As a companion to Diviner’s diviner.GroupedPmdarima class, an analysis toolkit class, PmdarimaAnalyzer, is provided.
The utility methods that this class makes available for group processing are described briefly below.
Object instantiation:
from diviner import PmdarimaAnalyzer
analyzer = PmdarimaAnalyzer(
df=df,
group_key_columns=["country", "region"],
y_col="orders",
datetime_col="date"
)
Decompose Trends
The diviner.PmdarimaAnalyzer.decompose_groups() method will decompose each series into its component parts:
trend
seasonal
random (also known as ‘residuals’)
The output of this method is a union of each group’s decomposed trends in a single DataFrame that retains the group
key information in columns along with the extracted components from the series data.
This method is mainly used for validation of a new project.
Example:
decomposed_trends = analyzer.decompose_groups(m=7, type_="additive")
Arguments to the diviner.PmdarimaAnalyzer.decompose_groups() method:
- m
The frequency value of the endogenous series data. The integer supplied is a measure of the repeatable pattern of the estimated seasonality effect. For instance, 7 would be appropriate for daily measured data, 24 would be a good starting point for hourly data, and 52 would be a good initial validation value for weekly data.
- type ('type_')
The type of decomposition to perform. One of: "additive" or "multiplicative". A good rule of thumb for choosing between them is to check whether the size of the seasonality effect stays constant regardless of the trend level (“additive”) or scales with the baseline trend value (“multiplicative”). For further explanation, see here.
- filter ('filter_')
[Optional] Reverse-sorted array for performing convolution on the coefficients of either the MA terms or the AR terms. Default: None
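The additive versus multiplicative rule of thumb can be seen on synthetic data. The sketch below builds one series of each kind from the same trend and weekly pattern, then measures the size of the seasonal swing in each week after removing the known trend. All values are invented, detrended_swing is a hypothetical helper, and this is not the decomposition algorithm used by decompose_groups.

```python
import math


def detrended_swing(values, trend, m):
    """Max minus min of (value - trend) within each consecutive block of m points."""
    residual = [value - level for value, level in zip(values, trend)]
    return [
        round(max(residual[i:i + m]) - min(residual[i:i + m]), 6)
        for i in range(0, len(residual) - m + 1, m)
    ]


m = 7  # weekly pattern in daily data
n = 4 * m
trend = [100.0 + 2.0 * t for t in range(n)]
pattern = [math.sin(2 * math.pi * (t % m) / m) for t in range(n)]

# Additive: the seasonal effect is a fixed amount added to the trend.
additive = [level + 10.0 * s for level, s in zip(trend, pattern)]
# Multiplicative: the seasonal effect is a fixed percentage of the trend.
multiplicative = [level * (1.0 + 0.1 * s) for level, s in zip(trend, pattern)]

additive_swing = detrended_swing(additive, trend, m)
multiplicative_swing = detrended_swing(multiplicative, trend, m)
```

The additive swing is the same in every week, while the multiplicative swing widens as the trend rises.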
Calculate Differencing Term
Isolating the differencing term 'd' can provide significant performance improvements if AutoARIMA is used as the
underlying estimator for each series. This method provides a means of estimating these per-group differencing terms.
The output is returned as a dictionary of {<group_key>: d}
Note
This utility method is intended to be used as an input to the diviner.GroupedPmdarima.fit() method when using
AutoARIMA as a base group estimator. It will set per-group values of d so that the AutoARIMA optimizer
does not need to search for values of the differencing term, saving a great deal of computation time.
Example:
diffs = analyzer.calculate_ndiffs(alpha=0.1, test="kpss", max_d=5)
Arguments to the diviner.PmdarimaAnalyzer.calculate_ndiffs() method:
- alpha
The significance value used in determining whether the p-value for a test of an estimated d term is significant. Default: 0.05
- test
The stationarity (unit root) test used to determine significance for a tested d term. Default: "kpss"
- max_d
The maximum allowable differencing term to test. Default: 2
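As background for what the d term controls, the sketch below applies differencing of order d to a series. A quadratic series still trends after one difference and becomes constant after two, which is the kind of result a stationarity test looks for when estimating d. The difference helper is hypothetical and is not the routine that pmdarima or Diviner use.

```python
def difference(values, d=1):
    """Apply first-order differencing d times."""
    for _ in range(d):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return list(values)


quadratic = [t * t for t in range(8)]
once = difference(quadratic, d=1)  # still increasing, so not yet stationary
twice = difference(quadratic, d=2)  # constant, so no trend remains
```

Each round of differencing shortens the series by one observation, which is one reason to keep max_d small.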
Calculate Seasonal Differencing Term
Isolating the seasonal differencing term D can provide a significant performance improvement to seasonal models,
which are activated by setting the m term in the base group estimator. The
diviner.PmdarimaAnalyzer.calculate_nsdiffs() method works like calculate_ndiffs, except that it estimates the
seasonal differencing term.
Example:
seasonal_diffs = analyzer.calculate_nsdiffs(m=7, test="ocsb", max_D=5)
Arguments to the calculate_nsdiffs method:
- m
The frequency of seasonal periods within the endogenous series. The integer supplied is a measure of the repeatable pattern of the estimated seasonality effect. For instance, 7 would be appropriate for daily measured data, 24 would be a good starting point for hourly data, and 52 would be a good initial validation value for weekly data.
- test
The seasonality unit test used to determine an optimal seasonal differencing D term.
Allowable tests: "ocsb", "ch"
Default: "ocsb"
- max_D
The maximum allowable seasonal differencing term to test.
Default: 2
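Seasonal differencing subtracts the observation from m periods earlier instead of the immediately preceding one. The following sketch, using a hypothetical `seasonal_difference` helper and made-up values, shows why the choice of m matters; it is not the OCSB or CH test used by calculate_nsdiffs.

```python
def seasonal_difference(series, m):
    """Subtract the value observed m periods earlier from each observation."""
    return [series[t] - series[t - m] for t in range(m, len(series))]


weekly_pattern = [10, 12, 15, 14, 13, 20, 25]   # one week of daily values
series = weekly_pattern * 4                      # four identical weeks

matched = seasonal_difference(series, m=7)      # the weekly pattern cancels out
mismatched = seasonal_difference(series, m=5)   # a wrong m leaves structure behind

print(matched)
print(mismatched)
```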
Calculate Constancy
The constancy check is a data set validation utility that operates on each grouped series, determining whether or not it can be modeled.
The output of this validation check method diviner.PmdarimaAnalyzer.calculate_is_constant() is a dictionary of
{<group_key>: <Boolean constancy check>}. Any group with a True result is ineligible for modeling as this
indicates that the group has only a single constant value for each datetime period.
Example:
constancy_checks = analyzer.calculate_is_constant()
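One way to act on the result is to separate constant groups from those that can be modeled. The sketch below assumes a result in the documented {<group_key>: <Boolean constancy check>} shape; the tuple group keys, the values, and the `series_is_constant` helper are hypothetical stand-ins, not Diviner's implementation.

```python
# Hypothetical result in the documented {<group_key>: <bool>} shape.
# Tuple group keys are used for illustration only.
constancy_checks = {
    ("warehouse_a", "sku_1"): False,
    ("warehouse_a", "sku_2"): True,
    ("warehouse_b", "sku_1"): False,
}

# Groups flagged True are constant and are excluded before fitting.
constant_groups = [key for key, flag in constancy_checks.items() if flag]
modelable_groups = [key for key, flag in constancy_checks.items() if not flag]


def series_is_constant(series):
    """Illustrative single-series analogue: True when every value is identical."""
    return len(set(series)) <= 1


print(constant_groups)
print(modelable_groups)
```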
Calculate Auto Correlation Function
The diviner.PmdarimaAnalyzer.calculate_acf() method is used for calculating the auto-correlation function for
each series group. The auto-correlation function values can be used (in conjunction with partial
auto-correlation function results) to select restrictive search values for the ordering
terms for AutoARIMA or to manually set the ordering terms (p, d, q) for ARIMA.
Note
The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA or AutoARIMA is
as follows:
ACF gradually trends to significance, PACF significance achieved after 1 lag -> AR model
ACF significance after 1 lag, PACF gradually trends to significance -> MA model
ACF gradually trends to significance, PACF gradually trends to significance -> ARMA model
These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.
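The rule above can be approximated in code by counting how many leading lags exceed a significance bound. This is a rough heuristic sketch, not part of Diviner or pmdarima: the `cutoff_lag` and `suggest_family` helpers, the `sharp` threshold, and the sample values are all hypothetical, and the bound would normally come from the confidence intervals of the series under study.

```python
def cutoff_lag(values, bound):
    """Count consecutive lags, starting at lag 1, whose magnitude exceeds bound.

    values[0] is the lag-0 term and is skipped.
    """
    count = 0
    for value in values[1:]:
        if abs(value) <= bound:
            break
        count += 1
    return count


def suggest_family(acf_values, pacf_values, bound, sharp=1):
    """Map ACF/PACF cut-off behavior to a model family (heuristic only).

    A function is treated as cutting off sharply when at most `sharp`
    leading lags are significant; any other combination is reported as ARMA.
    """
    acf_cut = cutoff_lag(acf_values, bound)
    pacf_cut = cutoff_lag(pacf_values, bound)
    if acf_cut == 0 and pacf_cut == 0:
        return "no significant lags"
    if pacf_cut <= sharp < acf_cut:
        return "AR"
    if acf_cut <= sharp < pacf_cut:
        return "MA"
    return "ARMA"


# ACF decays slowly while PACF drops out after lag 1.
ar_like_acf = [0.8 ** k for k in range(11)]
ar_like_pacf = [1.0, 0.8] + [0.0] * 9

# ACF drops out after lag 1 while PACF decays slowly.
ma_like_acf = [1.0, 0.5] + [0.0] * 9
ma_like_pacf = [1.0, 0.5, -0.33, 0.25, -0.21, 0.15]

print(suggest_family(ar_like_acf, ar_like_pacf, bound=0.2))
print(suggest_family(ma_like_acf, ma_like_pacf, bound=0.2))
```

The cut-off counts themselves can also serve as candidate upper limits for the p and q search space when AutoARIMA is the base estimator.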
Arguments to the calculate_acf method:
- unbiased
Auto-covariance denominator flag with values of:
True -> denominator = n - k
False -> denominator = n
- nlags
The number of auto-correlation lags to calculate and return.
Default: 40
- qstat
Boolean flag to calculate and return the Q statistic from the Ljung-Box test.
Default: False
- fft
Whether to perform a fast Fourier transformation of the series to calculate the auto-correlation function. For large time series, it is highly recommended to set this to True. Allowable values: True, False, or None.
Default: None
- alpha
If specified as a float, calculates and returns confidence intervals at this certainty level for the auto-correlation function values. For example, if alpha=0.1, 90% confidence intervals are calculated and returned wherein the standard deviation is computed according to Bartlett’s formula.
Default: None
- missing
Handling of NaN values in series data. Available options are:
'none' - no validation checks are performed.
'raise' - an Exception is raised if a missing value is detected.
'conservative' - NaN values are removed from the mean and cross-product calculations but are not removed from the series data.
'drop' - NaN values are removed from the series data.
Default: 'none'
- adjusted
Deprecation handler for the underlying statsmodels argument that replaces the unbiased argument. This is a duplicated value for the denominator mode of calculation for the autocovariance of the series.
Default: False
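The effect of the denominator flag can be seen in a small plain-Python sketch. The `autocovariance` and `acf` helpers below are hypothetical and written for clarity; they are not the implementation used by Diviner, pmdarima, or statsmodels.

```python
def autocovariance(series, lag, unbiased=False):
    """Autocovariance at the given lag, with denominator n - k or n."""
    n = len(series)
    mean = sum(series) / n
    total = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    denominator = n - lag if unbiased else n
    return total / denominator


def acf(series, nlags, unbiased=False):
    """Autocorrelation values for lags 0..nlags."""
    variance = autocovariance(series, 0, unbiased)
    return [autocovariance(series, k, unbiased) / variance for k in range(nlags + 1)]


series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
biased = acf(series, nlags=3)
unbiased = acf(series, nlags=3, unbiased=True)

print(biased)
print(unbiased)
```

With n observations, the n - k denominator scales the lag-k value by n / (n - k) relative to the n denominator, so the two modes diverge most at high lags and on short series.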
Calculate Partial Auto Correlation Function
The diviner.PmdarimaAnalyzer.calculate_pacf() method is used for determining the partial auto-correlation function
for each series group. When combined with Calculate Auto Correlation Function results, ordering values can be estimated (or controlled
in search space scope for AutoARIMA). See the notes in Calculate Auto Correlation Function for how to use the results from these two methods.
Arguments to the calculate_pacf method:
- nlags
The number of partial auto-correlation lags to calculate and return.
Default: 40
- method
The method employed for calculating the partial auto-correlation function. Methods and their explanations are listed in the pmdarima docs.
Default: 'ywadjusted'
- alpha
If specified as a float, calculates and returns confidence intervals at this certainty level for the partial auto-correlation function values. For example, if alpha=0.1, 90% confidence intervals are calculated and returned, where the standard deviation is computed as 1/sqrt(n) for a series of n observations.
Default: None
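To show what the partial auto-correlation values represent, the sketch below derives them from auto-correlation values with the Durbin-Levinson recursion. This is a swapped-in textbook technique chosen for brevity, not necessarily what the default 'ywadjusted' method computes, and the `pacf_from_acf` helper is hypothetical.

```python
def pacf_from_acf(rho, nlags):
    """Partial autocorrelations for lags 0..nlags via the Durbin-Levinson recursion.

    rho[k] is the autocorrelation at lag k, with rho[0] == 1.0.
    """
    pacf = [1.0]
    prev = []  # prev[j - 1] holds the coefficient phi_{k-1, j}
    for k in range(1, nlags + 1):
        if k == 1:
            phi_kk = rho[1]
            current = [phi_kk]
        else:
            numerator = rho[k] - sum(prev[j - 1] * rho[k - j] for j in range(1, k))
            denominator = 1.0 - sum(prev[j - 1] * rho[j] for j in range(1, k))
            phi_kk = numerator / denominator
            current = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
            current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.6: rho[k] = 0.6 ** k.
phi = 0.6
ar1_acf = [phi ** k for k in range(6)]
ar1_pacf = pacf_from_acf(ar1_acf, nlags=5)

print(ar1_pacf)
```

For this AR(1) input the partial auto-correlation is nonzero only at lag 1, which matches the AR rule described in Calculate Auto Correlation Function.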
Generate Diff
The utility method diviner.PmdarimaAnalyzer.generate_diff() will generate lag differences for each group.
While not applicable to most timeseries modeling problems, it can prove to be useful in certain situations or as a
diagnostic tool to investigate why a particular series is not fitting properly.
Arguments for this method:
- lag
The magnitude of the lag used in calculating the differencing.
Default: 1
- differences
The order of the differencing to be performed.
Default: 1
For an illustrative example, see the diff example.
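As a rough picture of what lag and differences control, the following plain-Python sketch builds a payload in the shape that generate_diff() is documented to return. The `lagged_diff` helper, the tuple group keys, and the data are hypothetical; the library's own output types may differ.

```python
def lagged_diff(series, lag=1, differences=1):
    """Difference a series at the given lag, repeated `differences` times."""
    values = list(series)
    for _ in range(differences):
        values = [values[t] - values[t - lag] for t in range(lag, len(values))]
    return values


# Hypothetical grouped data; tuple keys are for illustration only.
grouped_series = {
    ("region_1",): [10.0, 12.0, 15.0, 19.0, 24.0],
    ("region_2",): [5.0, 5.0, 6.0, 8.0, 11.0],
}

# Mirrors the documented {"series_start": ..., "diff": ...} structure per group.
group_diff_data = {
    key: {"series_start": values[0], "diff": lagged_diff(values, lag=1, differences=1)}
    for key, values in grouped_series.items()
}

print(group_diff_data)
print(lagged_diff(grouped_series[("region_1",)], lag=2))
print(lagged_diff(grouped_series[("region_1",)], differences=2))
```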
Generate Diff Inversion
The utility method diviner.PmdarimaAnalyzer.generate_diff_inversion() will invert a previously differenced
grouped series.
Arguments for this method:
- group_diff_data
The differenced data from the usage of diviner.PmdarimaAnalyzer.generate_diff().
- lag
The magnitude of the lag that was used in the differencing function in order to revert the diff.
Default: 1
- differences
The order of the differencing that was performed using diviner.PmdarimaAnalyzer.generate_diff() so that the series data can be reverted.
Default: 1
- recenter
If True and 'series_start' exists in the group_diff_data dict, will restore the original series range for each group based on the series start value calculated through the generate_diff() method. If the group_diff_data does not contain the starting values, the data will not be re-centered.
Default: False
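The role of recenter can be sketched for the simplest case of lag=1 and differences=1, where inversion is a cumulative sum. The `invert_first_difference` helper and the payload values below are hypothetical, and the exact length and leading values of the library's output may differ from this simplified version.

```python
from itertools import accumulate


def invert_first_difference(diff, series_start=None, recenter=False):
    """Undo a lag-1, first-order difference with a cumulative sum.

    Without recentering, the rebuilt series starts at 0.0.
    """
    rebuilt = [0.0] + list(accumulate(diff))
    if recenter and series_start is not None:
        rebuilt = [value + series_start for value in rebuilt]
    return rebuilt


# Hypothetical payload in the documented group_diff_data shape.
group_diff_data = {
    ("region_1",): {"series_start": 10.0, "diff": [2.0, 3.0, 4.0, 5.0]},
    ("region_2",): {"diff": [0.0, 1.0, 2.0, 3.0]},  # no starting value stored
}

inverted = {
    key: invert_first_difference(
        payload["diff"], payload.get("series_start"), recenter=True
    )
    for key, payload in group_diff_data.items()
}

print(inverted)
```

The second group has no 'series_start' entry, so its shape is recovered but it is not shifted back to the original range.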
Class Signature of PmdarimaAnalyzer
- class diviner.PmdarimaAnalyzer(df, group_key_columns, y_col, datetime_col)
- calculate_acf(unbiased=False, nlags=None, qstat=False, fft=None, alpha=None, missing='none', adjusted=False)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility for calculating the autocorrelation function for each group. Combined with a partial autocorrelation function calculation, the return values can greatly assist in setting AR, MA, or ARMA terms for a given model.
The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA (or AutoARIMA) is as follows:
ACF gradually trends to significance, PACF significance achieved after 1 lag -> AR model
ACF significance after 1 lag, PACF gradually trends to significance -> MA model
ACF gradually trends to significance, PACF gradually trends to significance -> ARMA model
These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.
- Parameters
unbiased –
Boolean flag that sets the autocovariance denominator to 'n-k' if True and n if False.
Note: This argument is deprecated and removed in versions of pmdarima > 2.0.0
Default: False
nlags –
The count of autocorrelation lags to calculate and return.
Default: 40
qstat –
Boolean flag to calculate and return the Ljung-Box statistic for each lag.
Default: False
fft –
Boolean flag for whether to use fast Fourier transformation (fft) for computing the autocorrelation function. FFT is recommended for large time series data sets.
Default: None
alpha –
If specified, calculates and returns the confidence intervals for the acf values at the level set (e.g., for 90% confidence, an alpha of 0.1 would be set).
Default: None
missing –
Handling of NaN values in the series data.
Available options: ['none', 'raise', 'conservative', 'drop'].
none: no checks are performed.
raise: an Exception is raised if NaN values are in the series.
conservative: the autocovariance is calculated by removing NaN values from the mean and cross-product calculations, but the values are not eliminated from the series.
drop: NaN values are removed from the series and values adjacent to NaNs are treated as contiguous (which may invalidate the results in certain situations).
Default: 'none'
adjusted –
Deprecation handler for the underlying statsmodels argument that replaces the unbiased argument. This is a duplicated value for the denominator mode of calculation for the autocovariance of the series.
- Returns
Dictionary of
{<group_key>: {<acf terms>: <values as array>}}
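For the alpha parameter of calculate_acf, the sketch below shows one common way confidence bands are derived from auto-correlation values using Bartlett's formula. It is an assumption-laden illustration: the `bartlett_half_widths` helper and the sample values are hypothetical, and the exact computation inside the underlying libraries may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def bartlett_half_widths(acf_values, n_obs, alpha):
    """Half-width of the (1 - alpha) confidence band for lags 1..len(acf_values) - 1.

    acf_values[0] is the lag-0 autocorrelation (1.0).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_widths = []
    running = 0.0  # sum of squared autocorrelations at lags below the current one
    for lag in range(1, len(acf_values)):
        variance = (1.0 + 2.0 * running) / n_obs
        half_widths.append(z * sqrt(variance))
        running += acf_values[lag] ** 2
    return half_widths


acf_values = [1.0, 0.5, 0.25, 0.125]
bands = bartlett_half_widths(acf_values, n_obs=100, alpha=0.1)  # 90% confidence
intervals = [(r - h, r + h) for r, h in zip(acf_values[1:], bands)]

print(bands)
print(intervals)
```

A smaller alpha widens the bands, so fewer lags are flagged as significant.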
- calculate_is_constant()
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method for determining whether a series is composed entirely of the same element (e.g., a series of {1, 2, 3, 4, 5, 1, 2, 3} will return ‘False’, while a series of {1, 1, 1, 1, 1, 1, 1, 1, 1} will return ‘True’).
- Returns
Dictionary of
{<group_key>: <Boolean constancy check>}
- calculate_ndiffs(alpha=0.05, test='kpss', max_d=2)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method for determining the optimal d value for ARIMA ordering. Calculating this as a fixed value can dramatically reduce the tuning time for pmdarima models.
- Parameters
alpha –
Significance level for determining if a p-value used for testing a value of 'd' is significant or not.
Default: 0.05
test –
Type of unit test for stationarity determination to use. Supported values: ['kpss', 'adf', 'pp']
See:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.KPSSTest.html#pmdarima.arima.KPSSTest
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.PPTest.html#pmdarima.arima.PPTest
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.ADFTest.html#pmdarima.arima.ADFTest
Default: 'kpss'
max_d –
The max value for d to test.
- Returns
Dictionary of
{<group_key>: <optimal 'd' value>}
- calculate_nsdiffs(m, test='ocsb', max_D=2)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method for determining the optimal D value for seasonal SARIMAX ordering of ('P', 'D', 'Q').
- Parameters
m –
The number of seasonal periods in the series.
test –
Type of unit test for seasonality. Supported tests: ['ocsb', 'ch']
See:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.OCSBTest.html#pmdarima.arima.OCSBTest
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.CHTest.html#pmdarima.arima.CHTest
Default: 'ocsb'
max_D –
Maximum number of seasonal differences to test for.
Default: 2
- Returns
Dictionary of
{<group_key>: <optimal 'D' value>}
- calculate_pacf(nlags=None, method='ywadjusted', alpha=None)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility for calculating the partial autocorrelation function for each group. In conjunction with the autocorrelation function calculate_acf, the values returned from a pacf calculation can assist in setting values or bounds on AR, MA, and ARMA terms for an ARIMA model.
The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA (or AutoARIMA) is as follows:
ACF gradually trends to significance, PACF significance achieved after 1 lag -> AR model
ACF significance after 1 lag, PACF gradually trends to significance -> MA model
ACF gradually trends to significance, PACF gradually trends to significance -> ARMA model
These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.
- Parameters
nlags –
The count of partial autocorrelation lags to calculate and return.
Default: 40
method –
The method used for pacf calculation. See the pmdarima docs for full listing of methods:
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.utils.pacf.html
Default: 'ywadjusted'
alpha –
If specified, returns confidence intervals based on the alpha value supplied.
Default: None
- Returns
Dictionary of
{<group_key>: {<pacf terms>: <values as array>}}
- decompose_groups(m, type_, filter_=None)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
Utility method that wraps pmdarima.arima.decompose() for each group within the passed-in DataFrame. Note: decomposition works best if the total number of entries within the series being decomposed is a multiple of the m parameter value.
- Parameters
m –
The frequency of the endogenous series. (e.g., for daily data, an m value of '7' would be appropriate for estimating a weekly seasonality, while setting m to '365' would be effective for yearly seasonality effects.)
type_ –
The type of decomposition to perform. One of: ['additive', 'multiplicative']
See: https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.decompose.html
filter_ –
Optional Array for performing convolution. This is specified as a filter for coefficients (the Moving Average and/or Auto Regressor coefficients) in reverse time order in order to filter out a seasonal component.
Default: None
- Returns
Pandas DataFrame with the decomposed trends for each group.
- generate_diff(lag=1, differences=1)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
A utility for generating the array diff (lag differences) for each group. To support invertibility, this method will return the starting value of each array as well as the differenced values.
- Parameters
lag –
Determines the magnitude of the lag to calculate the differencing function for.
Default: 1
differences –
The order of the differencing to be performed. Note that values > 1 will generate n fewer results.
Default: 1
- Returns
Dictionary of
{<group_key>: {"series_start": <float>, "diff": <diff_array>}}
- static generate_diff_inversion(group_diff_data, lag=1, differences=1, recenter=False)
Note
Experimental: This method may change, be moved, or removed in a future release with no prior warning.
A utility for inverting a previously differenced group of timeseries data. This utility supports returning each group’s series data to the original range of the data if the recenter argument is set to True and the start conditions are contained within the group_diff_data argument’s dictionary structure.
- Parameters
group_diff_data –
Differenced payload consisting of a dictionary of {<group_key>: {'diff': <differenced data>, [optional]'series_start': float}}
lag –
The lag to use to perform the differencing inversion.
Default: 1
differences –
The order of differencing to be used during the inversion.
Default: 1
recenter –
If True and 'series_start' exists in the group_diff_data dict, will restore the original series range for each group based on the series start value calculated through the generate_diff() method. If the group_diff_data does not contain the starting values, the data will not be re-centered.
Default: False
- Returns
Dictionary of
{<group_key>: <series_inverted_data>}