Grouped pmdarima

The Grouped pmdarima API is a multi-series orchestration framework for building multiple individual models of related, but isolated series data. For example, a project that required the forecasting of inventory demand at regional warehouses around the world would historically require individual orchestration of data acquisition, hyperparameter definitions, model training, metric validation, serialization, and registration of tens of thousands of individual models based on the permutations of SKU and warehouse location.

This API consolidates the many thousands of models that would otherwise need to be implemented, trained individually, and managed throughout their frequent retraining and forecasting lifecycles into a single high-level API that simplifies these common use cases for the pmdarima forecasting library.

Grouped pmdarima API 

The following sections provide a basic overview of using the GroupedPmdarima API: fitting the grouped models, generating forecasts, saving and loading, and customizing the underlying pmdarima instances.

To see working end-to-end examples, you can go to Tutorials and Examples. The examples will allow you to explore the data structures required for training, how to extract forecasts for each group, and demonstrations of the saving and loading of trained models.

Base Estimators and API interface 

The usage of the GroupedPmdarima API is slightly different from the other grouped forecasting library wrappers within Diviner. This is due to the ability of pmdarima to support multiple modes of configuration.

The modes available for constructing a model are:

Passing an ARIMA model template (wrapper around statsmodels ARIMA)
Using the native pmdarima AutoARIMA model template
Constructing a pmdarima Pipeline template

The GroupedPmdarima implementation requires one of these three model templates to be supplied in order to set the base configured model architecture for each group.

For example:

from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima

# Define the base ARIMA with a preset ordering parameter
base_arima_model = ARIMA(order=(1, 0, 2))

# Define the model template in the GroupedPmdarima constructor
grouped_arima = GroupedPmdarima(model_template=base_arima_model)

The above example is intended only to showcase the interface between a base estimator (base_arima_model) and the instance constructor for GroupedPmdarima. For a more in-depth and realistic example of utilizing an ARIMA model manually, see the additional statistical validation steps that would be required for this in the Tutorials and Examples section of the docs.

Model fitting 

In order to fit a GroupedPmdarima model instance, the fit method is used. Calling this method will process the input DataFrame to create a grouped execution collection, fit a pmdarima model type on each individual series, and persist the trained state of each group’s model to the object instance.

The arguments for the fit method are:

df: A ‘normalized’ DataFrame that contains an endogenous regressor column (the ‘y’ column), a date (or datetime) column that defines the ordering, periodicity, and frequency of each series (if this column is a string, the frequency will be inferred), and grouping column(s) that define the discrete series to be modeled. For further information on the structure of this DataFrame, see the quickstart guide.
group_key_columns: The names of the columns within df that, when combined (in the order supplied), define distinct series. See the quickstart guide for further information.
y_col: Name of the endogenous regressor term within the DataFrame argument df. This column contains the values of the series that are used during training.
datetime_col: Name of the column within the df argument DataFrame that defines the datetime ordering of the series data.
exog_cols: [Optional] A collection of column names within the submitted data that contain exogenous regressor elements to use as part of model fitting and predicting. The data within each column will be assembled into a 2D array for use in the regression.

Note

pmdarima currently has exogenous regressor support marked for future deprecation. Usage of this functionality is not recommended except for existing legacy implementations.

ndiffs: [Optional] A dictionary of {<group_key>: <d value>} for the differencing term for each group. This is intended to function alongside the output from the diviner.PmdarimaAnalyzer.calculate_ndiffs() method, serving to reduce the search space for AutoARIMA by supplying fixed d values to each group’s model.
nsdiffs: [Optional] A dictionary of {<group_key>: <D value>} for the seasonal differencing term for each group. This is intended to function alongside the output from the diviner.PmdarimaAnalyzer.calculate_nsdiffs() method, serving to reduce the search space for AutoARIMA by supplying fixed D values to each group’s model.

Note

These values will only be used if the models being fit are seasonal models. The value m must be set on the underlying ARIMA or AutoARIMA model for seasonality order components to be used.

silence_warnings: [Optional] Whether to silence stdout reporting of the underlying pmdarima fit process. Default: False.
fit_kwargs: [Optional] Keyword-argument overrides passed to the fit of the underlying pmdarima ARIMA, AutoARIMA, or Pipeline stages. For more information, see the pmdarima docs.
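As a rough illustration of the ‘normalized’ structure described for the df argument, the following sketch assembles a small long-format DataFrame with pandas. The column names (country, region, date, sales) mirror the ones used in the examples on this page, the values are synthetic, and the helper build_normalized_frame is hypothetical; it is not part of Diviner.

```python
import numpy as np
import pandas as pd


def build_normalized_frame(groups, start="2021-01-01", periods=60, seed=0):
    """Stack one synthetic daily series per group into a single long DataFrame."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range(start=start, periods=periods, freq="D")
    frames = []
    for country, region in groups:
        frames.append(
            pd.DataFrame(
                {
                    "country": country,  # first grouping column
                    "region": region,  # second grouping column
                    "date": dates,  # datetime column
                    "sales": 100.0 + rng.normal(0.0, 5.0, periods).cumsum(),  # 'y' column
                }
            )
        )
    return pd.concat(frames, ignore_index=True)


df = build_normalized_frame([("US", "NY"), ("FR", "Paris")])
print(df.head())
```

Each row holds one observation of one series, so the grouping columns repeat for every date within a group.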

Example:

from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima

base_arima_model = ARIMA(order=(1, 0, 2))

grouped_arima = GroupedPmdarima(model_template=base_arima_model)

grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

Predict 

The predict method generates forecast data for each grouped series within the meta diviner.GroupedPmdarima model.

Example:

from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima

base_arima_model = ARIMA(order=(1, 0, 2))

grouped_arima = GroupedPmdarima(model_template=base_arima_model)

grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

forecasts = grouped_arima_model.predict(n_periods=30)

The arguments for the predict method are:

n_periods: The number of future periods to generate from the end of each group’s series. The first value of the prediction forecast series will begin at one periodicity value after the end of the training series. For example, if the training series was of daily data from 2019-10-01 to 2021-10-02, the start of the prediction series output would be 2021-10-03 and continue for n_periods days from that point.
predict_col: [Optional] The name to use for the generated column containing forecasted data. Default: "yhat"
alpha: [Optional] Confidence interval significance value for error estimates. Default: 0.05.

Note

alpha is only used if the boolean flag return_conf_int is set to True.

return_conf_int: [Optional] Boolean flag for whether or not to calculate confidence intervals for the predicted forecasts. If True, the columns "yhat_upper" and "yhat_lower" will be added to the output DataFrame for the upper and lower confidence intervals for the predictions.
inverse_transform: [Optional] Used exclusively for Pipeline-based models that include an endogenous transformer such as BoxCoxEndogTransformer or LogEndogTransformer. Default: True (although it only applies if the model_template passed in is a Pipeline that contains a transformer). Transforming the endogenous term can be helpful for distributions that are highly non-normal; setting this flag inverts the transformation so that forecasts are returned on the original scale. For further reading on the purpose of these transformers, why they are used, and how they might apply to a given time series, see data transformation.
exog: [Optional] If the original model was trained with exogenous regressor elements, the prediction will require these 2D arrays at prediction time. This argument holds the 2D array of future exogenous regressor values to be used in generating the forecast.
predict_kwargs: [Optional] Extra keyword arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the sklearn Pipeline format, i.e., <stage_name>__<arg name>=<value>. For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}
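To make the n_periods date arithmetic concrete, here is a minimal pandas sketch that reproduces the daily example above (training data ending on 2021-10-02). The helper forecast_index is hypothetical and only illustrates where a forecast horizon starts and ends; it does not call Diviner or pmdarima.

```python
import pandas as pd


def forecast_index(last_training_date, n_periods, freq="D"):
    """Return the timestamps a forecast would cover, starting one period after training ends."""
    offset = pd.tseries.frequencies.to_offset(freq)
    start = pd.Timestamp(last_training_date) + offset
    return pd.date_range(start=start, periods=n_periods, freq=freq)


idx = forecast_index("2021-10-02", n_periods=30)
print(idx[0].date(), idx[-1].date())  # 2021-10-03 2021-11-01
```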

Predict Groups 

The predict_groups method generates forecast data for a subset of groups that a diviner.GroupedPmdarima model was trained upon.

Example:

from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima

base_model = ARIMA(order=(2, 1, 2))

grouped_arima = GroupedPmdarima(model_template=base_model)

model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

subset_forecasts = model.predict_groups(groups=[("US", "NY"), ("FR", "Paris"), ("UA", "Kyiv")], n_periods=90)

The arguments for the predict_groups method are:

groups: A collection of one or more groups for which to generate a forecast. The collection of groups must be submitted as a List[Tuple[str]] to identify the order-specific group values to retrieve the correct model. For instance, if the model was trained with the specified group_key_columns of ["country", "city"], a valid groups entry would be: [("US", "LosAngeles"), ("CA", "Toronto")]. Changing the order within the tuples will not resolve (e.g. [("NewYork", "US")] would not find the appropriate model).

Note

Groups that are submitted for prediction that are not present in the trained model will, by default, cause an Exception to be raised. This behavior can be changed to a warning or ignore status with the argument on_error.

n_periods: The number of future periods to generate from the end of each group’s series. The first value of the prediction forecast series will begin at one periodicity value after the end of the training series. For example, if the training series was of daily data from 2019-10-01 to 2021-10-02, the start of the prediction series output would be 2021-10-03 and continue for n_periods days from that point.
predict_col: [Optional] The name to use for the generated column containing forecasted data. Default: "yhat"
alpha: [Optional] Confidence interval significance value for error estimates. Default: 0.05.

Note

alpha is only used if the boolean flag return_conf_int is set to True.

return_conf_int

[Optional] Boolean flag for whether or not to calculate confidence intervals for the predicted forecasts. If True, the columns "yhat_upper" and "yhat_lower" will be added to the output DataFrame for the upper and lower confidence intervals for the predictions.

inverse_transform

[Optional] Used exclusively for Pipeline-based models that include an endogenous transformer such as BoxCoxEndogTransformer or LogEndogTransformer. Default: True (although it only applies if the model_template passed in is a Pipeline that contains a transformer). Transforming the endogenous term can be helpful for distributions that are highly non-normal; setting this flag inverts the transformation so that forecasts are returned on the original scale. For further reading on the purpose of these transformers, why they are used, and how they might apply to a given time series, see this link.

exog

[Optional] If the original model was trained with exogenous regressor elements, the prediction will require these 2D arrays at prediction time. This argument holds the 2D array of future exogenous regressor values to be used in generating the forecast.

on_error

[Optional] [Default -> "raise"] Dictates the behavior for handling group keys that have been submitted in the groups argument that do not match with a group identified and registered during training (fit). The modes are:

"raise"
A DivinerException is raised if any supplied groups do not match to the fitted groups.
"warn"
A warning is emitted (printed) and logged for any groups that do not match to those that the model was fit with.
"ignore"
Invalid groups will silently fail prediction.

Note

A DivinerException will still be raised even in "ignore" mode if none of the groups provided to this method match a fitted group.

predict_kwargs

[Optional] Extra keyword arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the sklearn Pipeline format, i.e., <stage_name>__<arg name>=<value>. For example, to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45}
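The order-sensitive group lookup and the three on_error modes can be pictured with a plain dictionary keyed by tuples. This is only a sketch of the behavior described above, not Diviner's implementation: select_groups and GroupLookupError are hypothetical names, with GroupLookupError standing in for DivinerException.

```python
import warnings


class GroupLookupError(Exception):
    """Hypothetical stand-in for DivinerException in this sketch."""


def select_groups(fitted, groups, on_error="raise"):
    """Return the fitted entries for the requested group keys."""
    if on_error not in ("raise", "warn", "ignore"):
        raise ValueError(f"Unknown on_error mode: {on_error!r}")
    missing = [g for g in groups if g not in fitted]
    found = {g: fitted[g] for g in groups if g in fitted}
    if missing and on_error == "raise":
        raise GroupLookupError(f"Groups not found: {missing}")
    if not found:
        # Even in "warn" or "ignore" mode, there is nothing left to forecast.
        raise GroupLookupError("None of the submitted groups match a fitted group.")
    if missing and on_error == "warn":
        warnings.warn(f"Groups not found and skipped: {missing}")
    return found


# Keys are tuples ordered like group_key_columns=["country", "city"].
fitted = {("US", "LosAngeles"): "model_a", ("CA", "Toronto"): "model_b"}

subset = select_groups(fitted, [("US", "LosAngeles"), ("NewYork", "US")], on_error="ignore")
print(subset)  # {('US', 'LosAngeles'): 'model_a'}
```

Because the keys are tuples, ("NewYork", "US") never matches a model registered under a (country, city) ordering, which is why the order of values within each tuple matters.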

Save 

Saves a GroupedPmdarima instance that has been fit. The serialization of the model instance uses a base64 encoding of the pickle serialization of each model instance within the grouped structure.

Example:

from pmdarima.arima.arima import ARIMA
from diviner import GroupedPmdarima

base_arima_model = ARIMA(order=(1, 0, 2))

grouped_arima = GroupedPmdarima(model_template=base_arima_model)

grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

save_location = "/path/to/saved/model"

grouped_arima_model.save(save_location)
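The serialization approach mentioned above (pickle, then base64) can be sketched with the standard library alone. The functions encode_models and decode_models are hypothetical and the dictionaries stand in for fitted per-group models; the actual on-disk layout used by Diviner is not shown here.

```python
import base64
import json
import pickle


def encode_models(models):
    """Pickle each per-group object and base64-encode it so it can be stored as text."""
    return {
        json.dumps(list(key)): base64.b64encode(pickle.dumps(obj)).decode("utf-8")
        for key, obj in models.items()
    }


def decode_models(payload):
    """Reverse encode_models, restoring tuple keys and the original objects."""
    return {
        tuple(json.loads(key)): pickle.loads(base64.b64decode(value))
        for key, value in payload.items()
    }


models = {("US", "NY"): {"order": (1, 0, 2)}, ("FR", "Paris"): {"order": (2, 1, 2)}}

# Round trip through a JSON string, as a text-based file format would require.
restored = decode_models(json.loads(json.dumps(encode_models(models))))
print(restored == models)  # True
```

Base64 turns the binary pickle payload into plain text, which is what allows it to sit inside a text-based container. As with any pickle-based format, saved models should only be loaded from trusted locations.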

Load 

Loads a GroupedPmdarima serialized model from a storage location.

Example:

from diviner import GroupedPmdarima

load_location = "/path/to/saved/model"

loaded_model = GroupedPmdarima.load(load_location)

Note

The load method is a class method. As such, the initialization argument model_template does not need to be provided. It will be set on the loaded object based on the template that was provided during initial training before serialization.

Utilities 

Parameter Extraction 

To extract the parameters that are set during the fitting of each individual model in the grouped collection, whether explicitly or, in the case of AutoARIMA, through automated selection, use the get_model_params method. It returns the per-group parameters for each model in a Pandas DataFrame.

Note

The parameters can only be extracted from a GroupedPmdarima model that has been fit.

Example:

from pmdarima.arima.auto import AutoARIMA
from diviner import GroupedPmdarima

base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)

grouped_arima = GroupedPmdarima(model_template=base_arima_model)

grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

fit_parameters = grouped_arima_model.get_model_params()

Metrics Extraction 

This functionality allows for the retrieval of fitting metrics that are attached to the underlying SARIMA model. These are not the typical loss metrics that can be calculated through cross validation backtesting.

The metrics that are returned from fitting are:

hqic (Hannan-Quinn information criterion)
aicc (Corrected Akaike information criterion; aic for small sample sizes)
oob (out of bag error)
bic (Bayesian information criterion)
aic (Akaike information criterion)

Note

Out of bag error metric (oob) is only calculated if the underlying ARIMA model has a value set for the argument out_of_sample_size. See out-of-bag-error for more information.
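For orientation, the information criteria listed above can be computed from a model's log-likelihood using their textbook definitions. The sketch below assumes those standard formulas; the values reported by the underlying model may differ slightly depending on how parameters are counted, and information_criteria is a hypothetical helper.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Compute textbook AIC, AICc, BIC, and HQIC from a fitted model's log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    hqic = 2 * n_params * math.log(math.log(n_obs)) - 2 * log_likelihood
    return {"aic": aic, "aicc": aicc, "bic": bic, "hqic": hqic}


criteria = information_criteria(log_likelihood=-250.0, n_params=4, n_obs=100)
print(criteria)
```

Lower values indicate a better trade-off between fit and complexity, and the criteria differ only in how heavily they penalize additional parameters.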

Example:

from pmdarima.arima.auto import AutoARIMA
from diviner import GroupedPmdarima

base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)

grouped_arima = GroupedPmdarima(model_template=base_arima_model)

grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

fit_metrics = grouped_arima_model.get_metrics()

Cross Validation Backtesting 

Cross validation utilizing backtesting is the primary means of evaluating whether a given model will perform robustly in generating forecasts based on time period horizon events throughout the historical series.

In order to use the cross validation functionality in the method diviner.GroupedPmdarima.cross_validate(), one of two windowing split objects (RollingForecastCV or SlidingWindowForecastCV) must be passed to the method.

Arguments to the diviner.GroupedPmdarima.cross_validate() method:

df

The original source DataFrame that was used during diviner.GroupedPmdarima.fit() that contains the endogenous series data. This DataFrame must contain the columns that define the constructed groups (i.e., missing group data will not be scored and groups that are not present in the model object will raise an Exception).

metrics

A collection of metric names to be used for evaluation. Submitted metrics must be one or more of:

Default: {"smape", "mean_absolute_error", "mean_squared_error"}

cross_validator

The cross validation object (either RollingForecastCV or SlidingWindowForecastCV). See the example below for how to submit this object to the diviner.GroupedPmdarima.cross_validate() method.

error_score

The default value to assign to a window evaluation if an error occurs during loss calculation.

Default: np.nan

In order to throw an Exception, a str value of “raise” can be provided. Otherwise, supply a float.

verbosity

Level of verbosity to print during training and cross validation. The lower the integer value, the fewer lines of debugging text are printed to stdout.

Default: 0 (no printing)

Example:

import numpy as np
from pmdarima.arima.auto import AutoARIMA
from pmdarima.model_selection import SlidingWindowForecastCV
from diviner import GroupedPmdarima

base_arima_model = AutoARIMA(max_order=7, d=1, m=7, max_iter=1000)

grouped_arima = GroupedPmdarima(model_template=base_arima_model)

grouped_arima_model = grouped_arima.fit(df, ["country", "region"], "sales", "date")

cv_window = SlidingWindowForecastCV(h=28, step=180, window_size=365)

grouped_arima_cv = grouped_arima_model.cross_validate(df=df,
                                                      metrics=["mean_squared_error", "smape"],
                                                      cross_validator=cv_window,
                                                      error_score=np.nan,
                                                      verbosity=1
                                                     )
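To give a feel for what a sliding window with h=28, step=180, and window_size=365 produces, the following sketch enumerates train and test index ranges for a series of 1000 observations. It is an illustration of the windowing idea written from scratch, not pmdarima's implementation, and the series length is an assumption.

```python
def sliding_window_splits(n_samples, window_size, h, step):
    """Yield (train_range, test_range) pairs for a fixed-size sliding window."""
    start = 0
    while start + window_size + h <= n_samples:
        train = range(start, start + window_size)
        test = range(start + window_size, start + window_size + h)
        yield train, test
        start += step


splits = list(sliding_window_splits(n_samples=1000, window_size=365, h=28, step=180))
for train, test in splits:
    print(f"train [{train.start}, {train.stop}) -> test [{test.start}, {test.stop})")
```

With these settings only four windows fit into 1000 observations; smaller values of step or h yield more windows and therefore more model fits per group.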

Class Signature of GroupedPmdarima 

class diviner.GroupedPmdarima(model_template)

cross_validate(df, metrics, cross_validator, error_score=nan, verbosity=0)

Method for performing cross validation on each group of the fit model. The supplied cross_validator to this method will be used to perform either rolling or shifting window prediction validation throughout the data set. Windowing behavior for the cross validation must be defined and configured through the cross_validator that is submitted. See: https://alkaline-ml.com/pmdarima/modules/classes.html#cross-validation-split-utilities for details on the underlying implementation of cross validation with pmdarima.

Parameters

df – A DataFrame that contains the endogenous series and the grouping key columns that were defined during training. Any missing key entries will not be scored. Note that each group defined within the model will be retrieved from this DataFrame; keys that do not exist will raise an Exception.
metrics – A list of metric names or string of single metric name to use for cross validation metric calculation.
cross_validator – A cross validator instance from pmdarima.model_selection (RollingForecastCV or SlidingWindowForecastCV). Note: setting low values of h or step will dramatically increase execution time.
error_score –
Default value to assign to a score calculation if an error occurs in a given window iteration.

Default: np.nan (a silent ignore of the failure)
verbosity –
print verbosity level for pmdarima’s cross validation stages.

Default: 0 (no printing to stdout)

Returns

Pandas DataFrame containing the group information and calculated cross validation metrics for each group.

fit(df, group_key_columns, y_col: str, datetime_col: str, exog_cols: Optional[List[str]] = None, ndiffs: Optional[Dict] = None, nsdiffs: Optional[Dict] = None, silence_warnings: bool = False, **fit_kwargs)

Fit method for training a pmdarima model on the submitted normalized DataFrame. When initialized, the input DataFrame will be split into an iterable collection of grouped data sets based on the group_key_columns arguments, which is then used to fit individual pmdarima models (or a supplied Pipeline) upon the templated object supplied as a class instance argument model_template. For API information for pmdarima’s ARIMA, AutoARIMA, and Pipeline APIs, see: https://alkaline-ml.com/pmdarima/modules/classes.html#api-ref

Parameters

df –
A normalized group data set consisting of a datetime column that defines ordering of the series, an endogenous regressor column that specifies the series data for training (e.g. y_col), and column(s) that define the grouping of the series data.

An example normalized data set:

region       zone  country  ds            y
-----------  ----  -------  ------------  ------
'northeast'  1     "US"     "2021-10-01"  1234.5
'northeast'  2     "US"     "2021-10-01"  3255.6
'northeast'  1     "US"     "2021-10-02"  1255.9

Wherein the group_key_columns could be one, some, or all of ['region', 'zone', 'country'], the datetime_col would be the ‘ds’ column, and the series y_col (endogenous regressor) would be ‘y’.
group_key_columns – The columns in the df argument that define, in aggregate, a unique time series entry. For example, with the DataFrame referenced in the df param, group_key_columns could be: ('region', 'zone') or ('region') or ('country', 'region', 'zone')
y_col – The name of the column within the DataFrame input to any method within this class that contains the endogenous regressor term (the raw data that will be used to train and use as a basis for forecasting).
datetime_col – The name of the column within the DataFrame input that defines the datetime or date values associated with each row of the endogenous regressor (y_col) data.
exog_cols –
An optional collection of column names within the submitted data to class methods that contain exogenous regressor elements to use as part of model fitting and predicting.

Default: None
ndiffs –
optional overrides to the d ARIMA differencing term for stationarity enforcement. The structure of this argument is a dictionary in the form of: {<group_key>: <d_term>}. To calculate, use diviner.PmdarimaAnalyzer.calculate_ndiffs()

Default: None
nsdiffs –
optional overrides to the D SARIMAX seasonal differencing term for seasonal stationarity enforcement. The structure of this argument is a dictionary in the form of: {<group_key>: <D_term>}. To calculate, use diviner.PmdarimaAnalyzer.calculate_nsdiffs()

Default: None
silence_warnings –
If True, removes SARIMAX and underlying optimizer warning messages from stdout printing. With a sufficiently large number of groups to process, the volume of these messages to stdout may become very large.

Default: False
fit_kwargs – fit_kwargs for pmdarima’s ARIMA, AutoARIMA, or Pipeline stage overrides. For more information, see the pmdarima docs: https://alkaline-ml.com/pmdarima/index.html

Returns

object instance of GroupedPmdarima with the persisted fit model attached.

get_metrics()

Retrieve the ARIMA fit metrics that are generated during the AutoARIMA or ARIMA training event. Note: These metrics are not validation metrics. Use the cross_validate() method for retrieving back-testing error metrics.

Returns: Pandas DataFrame with metrics provided as columns and a row entry per group.

get_model_params()

Retrieve the parameters from the fit model_template that was passed in and return them in a denormalized Pandas DataFrame. Parameters in the return DataFrame are columns with a row for each group defined during fit().

Returns: Pandas DataFrame with fit parameters for each group.

classmethod load(path: str)

Load a GroupedPmdarima instance from a saved serialized version. Note: This is a class method and as such, a GroupedPmdarima instance does not need to be initialized in order to load a saved model. For example: loaded_model = GroupedPmdarima.load(<location>)

Parameters: path – The path to a serialized instance of GroupedPmdarima
Returns: The GroupedPmdarima instance that was saved.

predict(n_periods: int, predict_col: str = 'yhat', alpha: float = 0.05, return_conf_int: bool = False, inverse_transform: bool = True, exog=None, **predict_kwargs)

Prediction method for generating forecasts for each group that has been trained as part of a call to fit(). Note that pmdarima’s API does not support predictions outside of the defined datetime frequency that was validated during training (i.e., if the series endogenous data is at an hourly frequency, the generated predictions will be at an hourly frequency and cannot be modified from within this method).

Parameters

n_periods – The number of future periods to generate. The start of the generated predictions will be 1 frequency period after the maximum datetime value per group during training. For example, a data set used for training that has a datetime frequency in days that ends on 7/10/2021 will, with a value of n_periods=7, start its prediction on 7/11/2021 and generate daily predicted values up to and including 7/17/2021.
predict_col –
The name to be applied to the column containing predicted data.

Default: 'yhat'
alpha –
Optional value for setting the confidence intervals for error estimates. Note: this is only utilized if return_conf_int is set to True.

Default: 0.05 (representing a 95% CI)
return_conf_int –
Boolean flag for whether to calculate confidence interval error estimates for predicted values. The intervals of yhat_upper and yhat_lower are based on the alpha parameter.

Default: False
inverse_transform –
Optional argument used only for Pipeline models that include either a BoxCoxEndogTransformer or a LogEndogTransformer.

Default: True
exog –
Exogenous regressor components as a 2-D array. Note: if the model is trained with exogenous regressor components, this argument is required.

Default: None
predict_kwargs – Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of sklearn Pipeline format (i.e., <stage_name>__<arg name>=<value>. e.g., to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45})

Returns

A consolidated (unioned) single DataFrame of predictions per group.
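The <stage_name>__<arg name> convention used by predict_kwargs can be illustrated by splitting a kwargs dictionary on the double underscore. This is a sketch of the naming convention only, not how Diviner or pmdarima route arguments internally; split_stage_kwargs and the plain key "verbose" are hypothetical.

```python
def split_stage_kwargs(predict_kwargs):
    """Separate '<stage_name>__<arg name>' overrides from plain keyword arguments."""
    stage_kwargs = {}
    model_kwargs = {}
    for key, value in predict_kwargs.items():
        stage, sep, arg = key.partition("__")
        if sep and stage and arg:
            stage_kwargs.setdefault(stage, {})[arg] = value
        else:
            model_kwargs[key] = value
    return stage_kwargs, model_kwargs


stage_kwargs, model_kwargs = split_stage_kwargs({"fourier__n_periods": 45, "verbose": True})
print(stage_kwargs)  # {'fourier': {'n_periods': 45}}
print(model_kwargs)  # {'verbose': True}
```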

predict_groups(groups: List[Tuple[str]], n_periods: int, predict_col: str = 'yhat', alpha: float = 0.05, return_conf_int: bool = False, inverse_transform: bool = False, exog=None, on_error: str = 'raise', **predict_kwargs)

This is a prediction method that allows for generating a subset of forecasts based on the collection of keys. By specifying individual groups in the groups argument, a limited scope forecast can be performed without incurring the runtime costs associated with predicting all groups.

Parameters

groups –
List[Tuple[str]]: the collection of group(s) for which to generate forecast predictions. The group definitions must be the values within the group_key_columns that were used during the fit of the model in order to return valid forecasts.

Note

The positional ordering of the values is important and must match the order of group_key_columns supplied to fit in order to provide correct prediction forecasts.
n_periods – The number of row events to forecast
predict_col – The name of the column in the output DataFrame that contains the forecasted series data. Default: "yhat"
alpha –
Optional value for setting the confidence intervals for error estimates. Note: this is only utilized if return_conf_int is set to True.

Default: 0.05 (representing a 95% CI)
return_conf_int –
Boolean flag for whether to calculate confidence interval error estimates for predicted values. The intervals of yhat_upper and yhat_lower are based on the alpha parameter.

Default: False
inverse_transform –
Optional argument used only for Pipeline models that include either a BoxCoxEndogTransformer or a LogEndogTransformer.

Default: False
exog –
Exogenous regressor components as a 2-D array. Note: if the model is trained with exogenous regressor components, this argument is required.

Default: None
predict_kwargs – Extra kwarg arguments for any of the transform stages of a Pipeline or for additional predict kwargs to the model instance. Pipeline kwargs are specified in the manner of sklearn Pipeline format (i.e., <stage_name>__<arg name>=<value>. e.g., to change the values of a fourier transformer at prediction time, the override would be: {'fourier__n_periods': 45})
on_error –
Alert level setting for handling mismatched group keys. Default: "raise". The valid modes are:
- "ignore" - no logging or exception raising will occur if a submitted group key in the groups argument is not present in the model object.
  
  Note
  
  This is a silent failure mode and will not present any indication of a failure to generate forecast predictions.
- "warn" - any keys that are not present in the fit model will be recorded as logged warnings.
- "raise" - any keys that are not present in the fit model will cause a DivinerException to be raised.

Returns

A consolidated (unioned) single DataFrame of forecasts for all groups specified in the groups argument.

save(path: str)

Serialize and write the instance of this class (if it has been fit) to the path specified. Note: The serialized model is base64 encoded for top-level items and pickle’d for pmdarima individual group models and any Pandas DataFrame.

Parameters: path – Path to write this model’s instance to.
Returns: None

Grouped pmdarima Analysis tools 

Warning

The PmdarimaAnalyzer module is in experimental mode. The methods and signatures are subject to change in the future with no deprecation warnings.

As a companion to Diviner’s diviner.GroupedPmdarima class, an analysis toolkit class, PmdarimaAnalyzer, is provided. See below for a brief description of each of the utility methods that are available for group processing through the PmdarimaAnalyzer API.

Object instantiation:

from diviner import PmdarimaAnalyzer

analyzer = PmdarimaAnalyzer(
    df=df,
    group_key_columns=["country", "region"],
    y_col="orders",
    datetime_col="date"
)

Decompose Trends 

The diviner.PmdarimaAnalyzer.decompose_groups() method will decompose each series into its component parts:

trend
seasonal
random (also known as ‘residuals’)

The output of this method is a union of each group’s decomposed trends in a single DataFrame that retains the group key information in columns along with the extracted components from the series data.

This method is mainly used for validation of a new project.

Example:

decomposed_trends = analyzer.decompose_groups(m=7, type_="additive")

Arguments to the diviner.PmdarimaAnalyzer.decompose_groups() method:

m

The frequency value of the endogenous series data. The integer supplied is a measure of the repeatable pattern of the estimated seasonality effect. For instance, 7 would be appropriate for daily measured data, 24 would be a good starting point for hourly data, and 52 would be a good initial validation value for weekly data.

type ('type_')

The type of decomposition to perform. One of: "additive" or "multiplicative". A good rule of thumb for choosing between them: if the seasonality effects stay constant regardless of the level of the trend, use "additive"; if the magnitude of the seasonality effect is a function of the baseline trend value, "multiplicative" is more appropriate. For further explanation, see here.

filter ('filter_')

[Optional] Reverse-sorted Array for performing convolution on the coefficients of either the MA terms or the AR terms.

Default: None
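The additive versus multiplicative rule of thumb can be checked on synthetic data. The sketch below builds one series of each kind with numpy and compares the within-week swing at the start and end of the series; all values and the helper weekly_swing are made up for illustration.

```python
import numpy as np

t = np.arange(28 * 7)  # 28 weeks of daily observations
trend = 100.0 + 0.5 * t
weekly = np.sin(2 * np.pi * t / 7)  # repeating pattern with m=7

additive = trend + 10.0 * weekly  # seasonal swing is independent of the trend
multiplicative = trend * (1.0 + 0.1 * weekly)  # seasonal swing scales with the trend


def weekly_swing(series):
    """Return the max-min range within each consecutive 7-day block."""
    blocks = series.reshape(-1, 7)
    return blocks.max(axis=1) - blocks.min(axis=1)


print(weekly_swing(additive)[[0, -1]])
print(weekly_swing(multiplicative)[[0, -1]])
```

The additive series shows the same swing in the first and last week, while the multiplicative series shows a swing that grows as the trend rises.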

Calculate Differencing Term 

Isolating the differencing term 'd' can provide significant performance improvements if AutoARIMA is used as the underlying estimator for each series. This method provides a means of estimating these per-group differencing terms.

The output is returned as a dictionary of {<group_key>: d}

Note

This utility method is intended to be used as an input to the diviner.GroupedPmdarima.fit() method when using AutoARIMA as a base group estimator. It will set per-group values of d so that the AutoARIMA optimizer does not need to search for values of the differencing term, saving a great deal of computation time.

Example:

diffs = analyzer.calculate_ndiffs(alpha=0.1, test="kpss", max_d=5)

Arguments to the diviner.PmdarimaAnalyzer.calculate_ndiffs() method:

alpha

The significance level used to determine whether the p-value for a test of an estimated d term is significant. Default: 0.05

test

The unit root test for stationarity used to determine significance for a tested d term.

Allowable values: "kpss", "adf", "pp"

Default: "kpss"

max_d

The maximum allowable differencing term to test. Default: 2

Calculate Seasonal Differencing Term 

Isolating the seasonal differencing term D can provide a significant performance improvement to seasonal models which are activated by setting the m term in the base group estimator. The functionality of this diviner.PmdarimaAnalyzer.calculate_nsdiffs() method is similar to that of calculate_ndiffs, except for the seasonal differencing term.

Example:

seasonal_diffs = analyzer.calculate_nsdiffs(m=7, test="ocsb", max_D=5)

Arguments to the calculate_nsdiffs method:

m

The frequency of seasonal periods within the endogenous series. The integer supplied is a measure of the repeatable pattern of the estimated seasonality effect. For instance, 7 would be appropriate for daily measured data, 24 would be a good starting point for hourly data, and 52 would be a good initial validation value for weekly data.

test

The seasonal unit root test used to determine an optimal seasonal differencing term D.

Allowable tests: "ocsb", "ch"

Default: "ocsb"

max_D

The maximum allowable seasonal differencing term to test.

Default: 2

Calculate Constancy 

The constancy check is a validation utility that operates on each grouped series, determining whether or not it can be modeled.

The output of this validation check method diviner.PmdarimaAnalyzer.calculate_is_constant() is a dictionary of {<group_key>: <Boolean constancy check>}. Any group with a True result is ineligible for modeling as this indicates that the group has only a single constant value for each datetime period.

Example:

constancy_checks = analyzer.calculate_is_constant()
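
As a rough illustration of what the returned dictionary represents and how it might be used to filter groups before fitting, here is a pure-Python sketch. The is_constant helper, the tuple-shaped group keys, and the sample values are assumptions made for this example; this is not Diviner's implementation.

```python
def is_constant(values):
    """Return True when every observation in the series equals the first one."""
    return all(v == values[0] for v in values)


# Hypothetical grouped series keyed by (country, region) tuples.
grouped_series = {
    ("US", "northeast"): [1, 2, 3, 4, 5, 1, 2, 3],
    ("US", "southwest"): [1, 1, 1, 1, 1, 1, 1, 1, 1],
}

# Same shape as the documented output: {<group_key>: <Boolean constancy check>}
constancy_checks = {key: is_constant(vals) for key, vals in grouped_series.items()}

# Groups flagged True are constant and would be excluded from modeling.
modelable_groups = [key for key, flag in constancy_checks.items() if not flag]
print(modelable_groups)
```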

Calculate Auto Correlation Function 

The diviner.PmdarimaAnalyzer.calculate_acf() method is used for calculating the auto-correlation function for each series group. The auto-correlation function values can be used (in conjunction with partial auto-correlation function results) to select restrictive search values for the ordering terms for AutoARIMA or to manually set the ordering terms (p, d, q) for ARIMA.

Note

The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA or AutoARIMA is as follows:

ACF decays gradually toward insignificance, PACF drops to insignificance after 1 lag -> AR model
ACF drops to insignificance after 1 lag, PACF decays gradually toward insignificance -> MA model
ACF decays gradually toward insignificance, PACF decays gradually toward insignificance -> ARMA model

These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.

Arguments to the calculate_acf method:

unbiased

Auto-covariance denominator flag with values of:

True -> denominator = n - k
False -> denominator = n

Default: False

nlags

The number of auto-correlation lags to calculate and return.

Default: 40

qstat

Boolean flag to calculate and return the Q statistic from the Ljung-Box test.

Default: False

fft

Whether to use a fast Fourier transform (FFT) of the series to calculate the auto-correlation function. For large time series, it is highly recommended to set this to True. Allowable values: True, False, or None.

Default: None

alpha

If specified as a float, calculates and returns confidence intervals at this certainty level for the auto-correlation function values. For example, if alpha=0.1, 90% confidence intervals are calculated and returned wherein the standard deviation is computed according to Bartlett’s formula.

Default: None

missing

Handling of NaN values in series data. Available options are:

'none' - no validation checks are performed.
'raise' - an Exception is raised if a missing value is detected.
'conservative' - NaN values are removed from the mean and cross-product calculations but are not removed from the series data.
'drop' - NaN values are removed from the series data.

Default: 'none'

adjusted

Replacement for the deprecated unbiased argument of the underlying statsmodels function. It controls the same denominator mode (n - k when True, n when False) used in calculating the autocovariance of the series.

Default: False
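
The effect of the denominator flag (unbiased, or adjusted in newer versions) can be seen in a small pure-Python sketch of the autocorrelation calculation. The autocorrelation function below is a hypothetical helper written for this illustration, following the n versus n - k denominators described above; it is not the code that Diviner, pmdarima, or statsmodels executes.

```python
def autocorrelation(series, nlags, adjusted=False):
    """Autocorrelation for lags 0..nlags.

    adjusted=False divides each autocovariance by n,
    adjusted=True divides the lag-k autocovariance by n - k.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]

    def autocovariance(k):
        total = sum(centered[t] * centered[t + k] for t in range(n - k))
        return total / ((n - k) if adjusted else n)

    variance = autocovariance(0)
    return [autocovariance(k) / variance for k in range(nlags + 1)]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf_default = autocorrelation(series, nlags=2)
acf_adjusted = autocorrelation(series, nlags=2, adjusted=True)
print(acf_default[1], acf_adjusted[1])
```

For short series the two denominators can give noticeably different values at higher lags; for long series and small lags the difference becomes negligible.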

Calculate Partial Auto Correlation Function 

The diviner.PmdarimaAnalyzer.calculate_pacf() method is used for determining the partial auto-correlation function for each series group. When combined with Calculate Auto Correlation Function results, ordering values can be estimated (or controlled in search space scope for AutoARIMA). See the notes in Calculate Auto Correlation Function for how to use the results from these two methods.

Arguments to the calculate_pacf method:

nlags

The number of partial auto-correlation lags to calculate and return.

Default: 40

method

The method employed for calculating the partial auto-correlation function. Methods and their explanations are listed in the pmdarima docs.

Default: 'ywadjusted'

alpha

If specified as a float, calculates and returns confidence intervals at this certainty level for the partial auto-correlation function values. For example, if alpha=0.1, 90% confidence intervals are calculated and returned wherein the standard deviation is computed according to 1/sqrt(len(x)).

Default: None
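
To show how partial auto-correlations relate to auto-correlations, and why an AR(1)-like series has a PACF that cuts off after the first lag, the sketch below applies the Durbin-Levinson recursion to a list of autocorrelation values. The pacf_from_acf helper and its input values are assumptions made for illustration; the default 'ywadjusted' method is a different estimator and may produce slightly different numbers on real data.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations for lags 0..K from autocorrelations rho[0..K].

    Uses the Durbin-Levinson recursion; rho[0] is expected to be 1.0.
    """
    max_lag = len(rho) - 1
    pacf = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1} from the previous iteration
    for k in range(1, max_lag + 1):
        numerator = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        denominator = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = numerator / denominator
        # Update the lower-order coefficients, then append the new one.
        current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# Geometrically decaying ACF, as produced by an AR(1) process with coefficient 0.5.
ar1_acf = [1.0, 0.5, 0.25, 0.125]
ar1_pacf = pacf_from_acf(ar1_acf)
print(ar1_pacf)
```

Here the ACF decays gradually while the PACF is zero beyond lag 1, which is the AR pattern described in the Calculate Auto Correlation Function notes.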

Generate Diff 

The utility method diviner.PmdarimaAnalyzer.generate_diff() will generate lag differences for each group. While not applicable to most timeseries modeling problems, it can prove to be useful in certain situations or as a diagnostic tool to investigate why a particular series is not fitting properly.

Arguments for this method:

lag: The magnitude of the lag used in calculating the differencing. Default: 1
differences: The order of the differencing to be performed. Default: 1

For an illustrative example, see the diff example.
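
The meaning of the lag and differences arguments can also be sketched in a few lines of pure Python. The lag_diff helper below is a hypothetical stand-in written for this illustration, not the function Diviner calls, and the sample series is made up.

```python
def lag_diff(values, lag=1, differences=1):
    """Apply lagged differencing `differences` times.

    Each pass replaces the series with values[i] - values[i - lag],
    which shortens it by `lag` elements.
    """
    out = list(values)
    for _ in range(differences):
        out = [out[i] - out[i - lag] for i in range(lag, len(out))]
    return out


series = [1, 4, 9, 16, 25]
first_order = lag_diff(series)                  # lag=1, differences=1
second_order = lag_diff(series, differences=2)  # difference of the differences
lag_two = lag_diff(series, lag=2)               # compare values two steps apart

# Mirrors the documented per-group payload shape, for a single group.
group_payload = {"series_start": float(series[0]), "diff": first_order}
print(first_order, second_order, lag_two)
```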

Generate Diff Inversion 

The utility method diviner.PmdarimaAnalyzer.generate_diff_inversion() will invert a previously differenced grouped series.

Arguments for this method:

group_diff_data

The differenced data from the usage of diviner.PmdarimaAnalyzer.generate_diff().

lag

The magnitude of the lag that was used in the differencing function in order to revert the diff.

Default: 1

differences

The order of the differencing that was performed using diviner.PmdarimaAnalyzer.generate_diff() so that the series data can be reverted.

Default: 1

recenter

If True and 'series_start' exists in group_diff_data dict, will restore the original series range for each group based on the series start value calculated through the generate_diff() method. If the group_diff_data does not contain the starting values, the data will not be re-centered.

Default: False
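
A minimal sketch of what inversion and re-centering mean, restricted to the simplest case of lag=1 and differences=1. The invert_first_difference helper is an assumption made for this example and is not Diviner's implementation; it only illustrates why the starting value is needed to recover the original range.

```python
from itertools import accumulate


def invert_first_difference(diff_values, series_start=None, recenter=False):
    """Undo a lag-1, first-order difference with a cumulative sum.

    Without re-centering the restored series starts at 0; with re-centering
    the stored starting value is added back to every element.
    """
    restored = list(accumulate(diff_values, initial=0))
    if recenter and series_start is not None:
        restored = [value + series_start for value in restored]
    return restored


# Hypothetical payload for one group, shaped like the generate_diff() output.
payload = {"series_start": 1, "diff": [3, 5, 7, 9]}

relative = invert_first_difference(payload["diff"])
original = invert_first_difference(
    payload["diff"], series_start=payload["series_start"], recenter=True
)
print(relative, original)
```

Without the starting value, the inverted series has the right shape but is offset from the original data, which is why recenter only has an effect when 'series_start' is present.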

Class Signature of PmdarimaAnalyzer 

class diviner.PmdarimaAnalyzer(df, group_key_columns, y_col, datetime_col)

calculate_acf(unbiased=False, nlags=None, qstat=False, fft=None, alpha=None, missing='none', adjusted=False)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

Utility for calculating the autocorrelation function for each group. Combined with a partial autocorrelation function calculation, the return values can greatly assist in setting AR, MA, or ARMA terms for a given model.

The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA (or AutoARIMA) is as follows:

ACF decays gradually toward insignificance, PACF drops to insignificance after 1 lag -> AR model
ACF drops to insignificance after 1 lag, PACF decays gradually toward insignificance -> MA model
ACF decays gradually toward insignificance, PACF decays gradually toward insignificance -> ARMA model

These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.

Parameters

unbiased –
Boolean flag that sets the autocovariance denominator to 'n-k' if True and n if False.

Note: This argument is deprecated and removed in versions of pmdarima > 2.0.0

Default: False
nlags –
The count of autocorrelation lags to calculate and return.

Default: 40
qstat –
Boolean flag to calculate and return the Ljung-Box statistic for each lag.

Default: False
fft –
Boolean flag for whether to use a fast Fourier transform (FFT) for computing the autocorrelation function. FFT is recommended for large time series data sets.

Default: None
alpha –
If specified, calculates and returns the confidence intervals for the acf values at the level set (i.e., for 90% confidence, an alpha of 0.1 would be set)

Default: None
missing –
Handling of NaN values in the series data.

Available options:

['none', 'raise', 'conservative', 'drop'].

none: no checks are performed.

raise: an Exception is raised if NaN values are in the series.

conservative: the autocovariance is calculated with NaN values removed from the mean and cross-product calculations, but the NaN values are not eliminated from the series.

drop: NaN values are removed from the series and adjacent values to NaN’s are treated as contiguous (which may invalidate the results in certain situations).

Default: 'none'
adjusted – Replacement for the deprecated unbiased argument of the underlying statsmodels function. It controls the same denominator mode used in calculating the autocovariance of the series.

Returns

Dictionary of {<group_key>: {<acf terms>: <values as array>}}

calculate_is_constant()

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

Utility method for determining whether a series is composed entirely of the same value (e.g., a series of {1, 2, 3, 4, 5, 1, 2, 3} will return ‘False’, while a series of {1, 1, 1, 1, 1, 1, 1, 1, 1} will return ‘True’).

Returns: Dictionary of {<group_key>: <Boolean constancy check>}

calculate_ndiffs(alpha=0.05, test='kpss', max_d=2)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

Utility method for determining the optimal d value for ARIMA ordering. Calculating this ahead of time and supplying it as a fixed value can dramatically reduce the tuning time for pmdarima models.

Parameters

alpha –
Significance level for determining whether the p-value used for testing a value of 'd' is significant.

Default: 0.05
test –
Type of unit root test to use for determining stationarity. Supported values: ['kpss', 'adf', 'pp'] See:

https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.KPSSTest.html#pmdarima.arima.KPSSTest

https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.PPTest.html#pmdarima.arima.PPTest

https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.ADFTest.html#pmdarima.arima.ADFTest

Default: 'kpss'
max_d –
The max value for d to test.

Default: 2

Returns

Dictionary of {<group_key>: <optimal 'd' value>}

calculate_nsdiffs(m, test='ocsb', max_D=2)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

Utility method for determining the optimal D value for seasonal SARIMAX ordering of: ('P', 'D', 'Q').

Parameters

m – The number of seasonal periods in the series.
test –
Type of seasonal unit root test to use. Supported tests: ['ocsb', 'ch'] See:

https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.OCSBTest.html#pmdarima.arima.OCSBTest

https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.CHTest.html#pmdarima.arima.CHTest

Default: 'ocsb'
max_D –
Maximum number of seasonal differences to test for.

Default: 2

Returns

Dictionary of {<group_key>: <optimal 'D' value>}

calculate_pacf(nlags=None, method='ywadjusted', alpha=None)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

Utility for calculating the partial autocorrelation function for each group. In conjunction with the autocorrelation function calculate_acf, the values returned from a pacf calculation can assist in setting values or bounds on AR, MA, and ARMA terms for an ARIMA model.

The general rule to determine whether to use an AR, MA, or ARMA configuration for ARIMA (or AutoARIMA) is as follows:

ACF decays gradually toward insignificance, PACF drops to insignificance after 1 lag -> AR model
ACF drops to insignificance after 1 lag, PACF decays gradually toward insignificance -> MA model
ACF decays gradually toward insignificance, PACF decays gradually toward insignificance -> ARMA model

These results can help to set the order terms of an ARIMA model (p and q) or, for AutoARIMA, set restrictions on maximum search space terms to assist in faster optimization of the model.

Parameters

nlags –
The count of partial autocorrelation lags to calculate and return.

Default: 40
method –
The method used for pacf calculation. See the pmdarima docs for full listing of methods:

https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.utils.pacf.html

Default: 'ywadjusted'
alpha –
If specified, returns confidence intervals based on the alpha value supplied.

Default: None

Returns

Dictionary of {<group_key>: {<pacf terms>: <values as array>}}

decompose_groups(m, type_, filter_=None)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

Utility method that wraps pmdarima.arima.decompose() for each group within the passed-in DataFrame. Note: decomposition works best if the total number of entries within the series being decomposed is a multiple of the m parameter value.

Parameters

m – The frequency of the endogenous series. (e.g., for daily data, an m value of '7' would be appropriate for estimating a weekly seasonality, while setting m to '365' would be effective for yearly seasonality effects.)
type_ –
The type of decomposition to perform. One of: ['additive', 'multiplicative']

See: https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.decompose.html
filter_ –
Optional Array for performing convolution. This is specified as a filter for coefficients (the Moving Average and/or Auto Regressor coefficients) in reverse time order in order to filter out a seasonal component.

Default: None

Returns

Pandas DataFrame with the decomposed trends for each group.

generate_diff(lag=1, differences=1)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

A utility for generating the array diff (lag differences) for each group. To support invertibility, this method will return the starting value of each array as well as the differenced values.

Parameters

lag –
Determines the magnitude of the lag to calculate the differencing function for.

Default: 1
differences –
The order of the differencing to be performed. Note that values > 1 will generate n fewer results.

Default: 1

Returns

Dictionary of {<group_key>: {"series_start": <float>, "diff": <diff_array>}}

static generate_diff_inversion(group_diff_data, lag=1, differences=1, recenter=False)

Note

Experimental: This method may change, be moved, or removed in a future release with no prior warning.

A utility for inverting a previously differenced group of timeseries data. This utility supports returning each group’s series data to the original range of the data if the recenter argument is set to True and the start conditions are contained within the group_diff_data argument’s dictionary structure.

Parameters

group_diff_data – Differenced payload consisting of a dictionary of {<group_key>: {'diff': <differenced data>, [optional]'series_start': float}}
lag –
The lag to use to perform the differencing inversion.

Default: 1
differences –
The order of differencing to be used during the inversion.

Default: 1
recenter –
If True and 'series_start' exists in group_diff_data dict, will restore the original series range for each group based on the series start value calculated through the generate_diff() method. If the group_diff_data does not contain the starting values, the data will not be re-centered.

Default: False

Returns

Dictionary of {<group_key>: <series_inverted_data>}

region	zone	country	ds	y
'northeast'	1	"US"	"2021-10-01"	1234.5
'northeast'	2	"US"	"2021-10-01"	3255.6
'northeast'	1	"US"	"2021-10-02"	1255.9