Data Processing

The data manipulation APIs are a key component of this library. Although they are largely hidden behind the main entry point APIs for each forecasting library, they can be useful for custom implementations and for validating source data sets.

Pandas DataFrame Group Processing API

Class signature:

class diviner.data.pandas_group_generator.PandasGroupGenerator(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]

This class converts a normalized collection of time series data held in a single DataFrame into per-group series. For example, given:

region	zone	ds	y
‘northeast’	1	“2021-10-01”	1234.5
‘northeast’	2	“2021-10-01”	3255.6
‘northeast’	1	“2021-10-02”	1255.9

Here the grouping keys ['region', 'zone'] define the unique series of the target y indexed by ds.

This class will:

Generate a master group key that is a tuple zip of the grouping key arguments specified by the user, preserving the order in which these keys were declared.
Group the DataFrame by these master grouping keys and generate a collection of tuples of the form (master_grouping_key, <series DataFrame>), which is iterated over to build an individual forecasting model for each master key group.

__init__(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]

Parameters

group_key_columns – The grouping columns. Each unique combination of their values identifies a distinct series of ds and y values.
datetime_col – The name of the column that contains the datetime values for each series.
y_col – The endogenous regressor element of the series. This is the value that is used for training and is the element that is intended to be forecast.

_get_df_with_master_key_column(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Method for creating the ‘master_group_key’ column that defines a unique group. The master_group_key column is generated from the concatenation (within a tuple) of the values in each of the individual _group_key_columns, serving as an aggregation grouping key to define a unique collection of datetime series values. For example:

region	zone	ds	y
‘northeast’	1	“2021-10-01”	1234.5
‘northeast’	2	“2021-10-01”	3255.6
‘northeast’	1	“2021-10-02”	1255.9

With the above dataset, the group_key_columns passed in would be: ('region', 'zone'). The method returns the DataFrame with the master_group_key column added, as follows:

region	zone	ds	y	grouping_key
‘northeast’	1	“2021-10-01”	1234.5	(‘northeast’, 1)
‘northeast’	2	“2021-10-01”	3255.6	(‘northeast’, 2)
‘northeast’	1	“2021-10-02”	1255.9	(‘northeast’, 1)

Parameters: df – The normalized DataFrame
Returns: A copy of the passed-in DataFrame with a master grouping key column added that contains the group definitions per row of the input DataFrame.
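
The key construction described above can be sketched with plain pandas. This is only an illustration under stated assumptions, not the library's implementation: the helper name add_master_key_column and its key_col argument are hypothetical, and the column name grouping_key is taken from the example table.

```python
import pandas as pd


def add_master_key_column(df, group_key_columns, key_col="grouping_key"):
    """Hypothetical helper: return a copy of df with a tuple-valued key column."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # Zip the key columns row by row, preserving their declared order.
    out[key_col] = list(zip(*(out[col] for col in group_key_columns)))
    return out


df = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": ["2021-10-01", "2021-10-01", "2021-10-02"],
        "y": [1234.5, 3255.6, 1255.9],
    }
)

keyed = add_master_key_column(df, ("region", "zone"))
print(keyed)
```

Rows one and three share the key ('northeast', 1), so they belong to the same series. The input df has no grouping_key column after the call.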

generate_prediction_groups(df: pandas.core.frame.DataFrame)[source]

Method for generating the data set collection required to run a manual per-datetime prediction for arbitrary datetime and key groupings.

Parameters: df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema and the dates for prediction within the datetime_col field.
Returns: List(tuple(master_group_key, df)), a collection of DataFrames, each paired with its group identifier.
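
As a rough sketch of the input this method expects, the snippet below builds a normalized prediction DataFrame with plain pandas and splits it per group. The group keys, the future dates, and the variable names are all assumptions chosen for illustration. The split uses pandas groupby directly rather than the library's code.

```python
from itertools import product

import pandas as pd

group_key_columns = ("region", "zone")
datetime_col = "ds"

# Assumed inputs: the groups to predict for and the dates to predict at.
keys = [("northeast", 1), ("northeast", 2)]
future_dates = pd.date_range("2021-10-03", periods=2, freq="D")

# One row per (group, date) pair, with the key columns and datetime column in the schema.
prediction_df = pd.DataFrame(
    [(*key, ds) for key, ds in product(keys, future_dates)],
    columns=[*group_key_columns, datetime_col],
)

# Split into [(master_group_key, <group DataFrame>), ...].
prediction_groups = [
    (key, frame.reset_index(drop=True))
    for key, frame in prediction_df.groupby(list(group_key_columns))
]
```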

generate_processing_groups(df: pandas.core.frame.DataFrame)[source]

Method for generating the collection of [(master_grouping_key, <group DataFrame>)]

This method will call _get_df_with_master_key_column() to generate a column containing the tuple of the values within the _group_key_columns fields, then generate an iterable collection of key -> DataFrame pairs.

For example, after adding the grouping_key column from _get_df_with_master_key_column(), the DataFrame will look like this:

region	zone	ds	y	grouping_key
‘northeast’	1	“2021-10-01”	1234.5	(‘northeast’, 1)
‘northeast’	2	“2021-10-01”	3255.6	(‘northeast’, 2)
‘northeast’	1	“2021-10-02”	1255.9	(‘northeast’, 1)

This method will translate this structure to:

[(('northeast', 1),

ds	y
“2021-10-01”	1234.5
“2021-10-02”	1255.9

), (('northeast', 2),

ds	y
“2021-10-01”	3255.6

)]

Parameters: df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema.
Returns: List(tuple(master_group_key, df)), a collection of DataFrames, each paired with its group identifier.
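
A minimal sketch of this split, assuming plain pandas and the example data above. It groups on the key columns directly, which yields the same tuple keys as a separate key column would. Keeping only ds and y in each group mirrors the illustrated output and is an assumption, not a statement about what the method returns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": ["2021-10-01", "2021-10-01", "2021-10-02"],
        "y": [1234.5, 3255.6, 1255.9],
    }
)

group_key_columns = ("region", "zone")

# Grouping on several columns makes pandas yield each key as a tuple.
processing_groups = [
    (key, frame[["ds", "y"]].reset_index(drop=True))
    for key, frame in df.groupby(list(group_key_columns), sort=True)
]

for key, frame in processing_groups:
    print(key, len(frame))
```

Each element of the list can then be handed to a separate model fit.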

Developer API for Data Processing

Abstract Base Class for grouped processing of a fully normalized DataFrame:

Abstract Base Class for defining the API contract for group generator operations. This base class is a template for package-specific implementations that function to convert a normalized representation of grouped time series into per-group collections of discrete time series so that forecasting models can be trained on each group.

class diviner.data.base_group_generator.BaseGroupGenerator(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]

Abstract class for defining the basic elements of performing a group processing collection generation operation.

__init__(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]

Grouping key columns must be defined so that a single consolidated key can be constructed to identify each unique time series. The unique combinations of values in these columns define and control how the univariate series data is grouped, so that a model can be trained (fit) on each unique series.

The primary purpose of the children of this class is to generate a dictionary of: {<group_key> : <DataFrame with unique univariate series>}. The group_key element is constructed as a tuple of the values within the columns specified by _group_key_columns in this class constructor.

For example, given the following normalized data set:

ds	y	group1	group2
2021-09-02	11.1	“a”	“z”
2021-09-03	7.33	“a”	“z”
2021-09-02	31.1	“b”	“q”
2021-09-03	44.1	“b”	“q”

There are two separate univariate series: ("a", "z") and ("b", "q"). The group generator’s function is to convert this unioned DataFrame into the following:

{ ("a", "z"):

ds	y	group1	group2
2021-09-02	11.1	“a”	“z”
2021-09-03	7.33	“a”	“z”

,("b", "q"):

ds	y	group1	group2
2021-09-02	31.1	“b”	“q”
2021-09-03	44.1	“b”	“q”

}

This grouping allows for a model to be fit to each of these series in isolation.
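
The mapping above can be approximated with plain pandas, as in the sketch below. It is not the library's code. The naive "last observed value" forecast is only a stand-in for a real model fit and is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ds": ["2021-09-02", "2021-09-03", "2021-09-02", "2021-09-03"],
        "y": [11.1, 7.33, 31.1, 44.1],
        "group1": ["a", "a", "b", "b"],
        "group2": ["z", "z", "q", "q"],
    }
)

# {group_key: <DataFrame holding that group's univariate series>}
grouped = {key: frame for key, frame in df.groupby(["group1", "group2"])}

# Stand-in for fitting one model per series: carry the last observation forward.
naive_forecasts = {key: frame["y"].iloc[-1] for key, frame in grouped.items()}
print(naive_forecasts)
```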

Parameters

group_key_columns – Tuple[str] of column names whose combined values determine the uniqueness of a particular time series within the submitted DataFrame.
datetime_col – The name of the column that contains the datetime values for each series.
y_col – The endogenous regressor element of the series. This is the value that is used for training and is the element that is intended to be forecast.

abstract generate_prediction_groups(df)[source]

Abstract method for generating the data set collection required for manual prediction for arbitrary datetime and key groupings.

Parameters: df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema and the dates for prediction within the datetime_col field.
Returns: List(tuple(master_group_key, df)), a collection of DataFrames, each paired with its group identifier.

abstract generate_processing_groups(df)[source]

Abstract method for the generation of processing execution groups for individual models. Implementations of this method should generate a processing collection that maps each unique combination of _group_key_columns values, expressed as a _master_group_key entry, to the specific datetime series to be forecast.

For example, with a normalized DataFrame input of:

ds	region	country	y
2020-01-01	SW	USA	42
2020-01-02	SW	USA	11
2020-01-01	NE	USA	31
2020-01-01	Ontario	CA	12

With the group_keys value specified as ("country", "region"), the output structure should be:

[{ ("USA", "SW"):

ds	region	country	y
2020-01-01	SW	USA	42
2020-01-02	SW	USA	11

}, {("USA", "NE"):

ds	region	country	y
2020-01-01	NE	USA	31

}, {("CA", "Ontario"):

ds	region	country	y
2020-01-01	Ontario	CA	12

}]

Wrapping the dictionaries in a list allows for multiprocessing support: each key and value pair can be processed on its own, without having to encapsulate the entire dictionary.

Parameters: df – The user-input normalized DataFrame with _group_key_columns
Returns: A list of dictionaries of {group_key: <group's univariate series data>} structure for isolated processing by the model APIs.
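
To make the contract concrete, here is a hypothetical sketch of an abstract class and one subclass that returns the list-of-dictionaries structure described above. GroupGeneratorSketch and ListOfDictsGenerator are illustrative names, not Diviner classes. The sketch covers only generate_processing_groups and uses pandas groupby as one possible way to split the data.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

import pandas as pd


class GroupGeneratorSketch(ABC):
    """Hypothetical stand-in for the group generator contract."""

    def __init__(self, group_key_columns: Tuple[str, ...], datetime_col: str, y_col: str):
        self._group_key_columns = group_key_columns
        self._datetime_col = datetime_col
        self._y_col = y_col

    @abstractmethod
    def generate_processing_groups(self, df: pd.DataFrame) -> List[Dict[tuple, pd.DataFrame]]:
        """Return one single-entry dictionary per unique group key."""


class ListOfDictsGenerator(GroupGeneratorSketch):
    def generate_processing_groups(self, df):
        # sort=False keeps groups in order of first appearance in df.
        return [
            {key: frame}
            for key, frame in df.groupby(list(self._group_key_columns), sort=False)
        ]


df = pd.DataFrame(
    {
        "ds": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"],
        "region": ["SW", "SW", "NE", "Ontario"],
        "country": ["USA", "USA", "USA", "CA"],
        "y": [42, 11, 31, 12],
    }
)

generator = ListOfDictsGenerator(("country", "region"), datetime_col="ds", y_col="y")
groups = generator.generate_processing_groups(df)
```

Because every list element holds exactly one key and its DataFrame, each element could be passed to a worker process on its own.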