Data Processing
The data manipulation APIs are a key component of this library's utility. Although they are largely hidden behind the main entry point APIs of each forecasting library, they can be useful for custom implementations and for validating source data sets.
Pandas DataFrame Group Processing API
Class signature:
- class diviner.data.pandas_group_generator.PandasGroupGenerator(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
This class is used to convert a normalized collection of time series data within a single DataFrame, e.g.:
region       zone  ds            y
'northeast'  1     "2021-10-01"  1234.5
'northeast'  2     "2021-10-01"  3255.6
'northeast'  1     "2021-10-02"  1255.9
Here, the grouping keys ['region', 'zone'] define the unique series of the target y indexed by ds.
This class will:
1. Generate a master group key that is a tuple zip of the grouping key arguments specified by the user, preserving the order of declaration of these keys.
2. Group the DataFrame by these master grouping keys and generate a collection of tuples of the form (master_grouping_key, <series DataFrame>), which is iterated over to generate the individual forecasting models for each master key group.
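The effect of declaration order on the master key can be illustrated with plain pandas. This is a minimal sketch rather than the library's implementation: the helper master_keys is hypothetical, and the data is the example table above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": ["2021-10-01", "2021-10-01", "2021-10-02"],
        "y": [1234.5, 3255.6, 1255.9],
    }
)


def master_keys(frame, group_key_columns):
    """Hypothetical helper: one tuple per row, in declared column order."""
    return list(zip(*(frame[col] for col in group_key_columns)))


# Declaration order carries through to the key: ("region", "zone") and
# ("zone", "region") produce differently ordered tuples.
region_first = master_keys(df, ("region", "zone"))
zone_first = master_keys(df, ("zone", "region"))

# Grouping on the same columns yields (key, DataFrame) pairs, one per series.
groups = [(key, group) for key, group in df.groupby(["region", "zone"])]
for key, group in groups:
    print(key, group["y"].tolist())
```

Because the key is an ordinary tuple, it can be used directly as a dictionary key or compared against a literal such as ("northeast", 1).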
- __init__(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
- Parameters
group_key_columns – The grouping columns. Each unique combination of their values designates a distinct series of ds and y.
datetime_col – The name of the column that contains the datetime values for each series.
y_col – The endogenous regressor element of the series. This is the value that is used for training and is the element that is intended to be forecast.
- _get_df_with_master_key_column(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]
Method for creating the ‘master_group_key’ column that defines a unique group. The master_group_key column is generated from the concatenation (within a tuple) of the values in each of the individual _group_key_columns, serving as an aggregation grouping key to define a unique collection of datetime series values. For example:
region       zone  ds            y
'northeast'  1     "2021-10-01"  1234.5
'northeast'  2     "2021-10-01"  3255.6
'northeast'  1     "2021-10-02"  1255.9
With the above dataset, the group_key_columns passed in would be: ('region', 'zone')
This method adds the master group key column (grouping_key in this example) to a copy of the input DataFrame, as follows:
region       zone  ds            y       grouping_key
'northeast'  1     "2021-10-01"  1234.5  ('northeast', 1)
'northeast'  2     "2021-10-01"  3255.6  ('northeast', 2)
'northeast'  1     "2021-10-02"  1255.9  ('northeast', 1)
- Parameters
df – The normalized DataFrame
- Returns
A copy of the passed-in DataFrame with a master grouping key column added that contains the group definitions per row of the input DataFrame.
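The copy-then-add behavior described above can be sketched as follows. The function add_master_key and its key_name argument are hypothetical stand-ins, not the library's method; the sketch assumes only that the key column holds one tuple per row.

```python
import pandas as pd


def add_master_key(frame, group_key_columns, key_name="grouping_key"):
    """Hypothetical stand-in for the master-key step; returns a new frame."""
    keyed = frame.copy()  # leave the caller's DataFrame untouched
    # Build one tuple per row from the key columns, in declared order.
    keyed[key_name] = keyed[list(group_key_columns)].apply(tuple, axis=1)
    return keyed


df = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": ["2021-10-01", "2021-10-01", "2021-10-02"],
        "y": [1234.5, 3255.6, 1255.9],
    }
)
keyed = add_master_key(df, ("region", "zone"))
print(keyed)
print("grouping_key" in df.columns)  # the original frame has no key column
```

Working on a copy keeps the caller's DataFrame free of helper columns, which matters when the same frame is reused for several grouping configurations.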
- generate_prediction_groups(df: pandas.core.frame.DataFrame)[source]
Method for generating the data set collection required to run a manual per datetime prediction for arbitrary datetime and key groupings.
- Parameters
df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema and the dates for prediction within the datetime_col field.
- Returns
List(tuple(master_group_key, df)) the processing collection of DataFrame coupled with their group identifier.
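As an illustration of the input this method expects, the sketch below builds a prediction request for arbitrary groups and dates and splits it into (key, DataFrame) pairs with plain pandas. The keys, dates, and the choice to keep only the datetime column in each group frame are assumptions made for the example; the library's own output may carry additional columns.

```python
from itertools import product

import pandas as pd

group_key_columns = ("region", "zone")
datetime_col = "ds"

# Hypothetical prediction request: every listed group for every listed date.
keys = [("northeast", 1), ("northeast", 2)]
dates = pd.to_datetime(["2021-10-03", "2021-10-04"])

prediction_df = pd.DataFrame(
    [(*key, date) for key, date in product(keys, dates)],
    columns=[*group_key_columns, datetime_col],
)

# Group the request by its key columns; multi-column groupby keys are tuples.
prediction_groups = [
    (key, group[[datetime_col]].reset_index(drop=True))
    for key, group in prediction_df.groupby(list(group_key_columns))
]

for key, group in prediction_groups:
    print(key, group[datetime_col].dt.strftime("%Y-%m-%d").tolist())
```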
- generate_processing_groups(df: pandas.core.frame.DataFrame)[source]
Method for generating the collection of [(master_grouping_key, <group DataFrame>)]
This method will call _get_df_with_master_key_column() to generate a column containing the tuple of the values within the _group_key_columns fields, then generate an iterable collection of key -> DataFrame representations.
For example, after adding the grouping_key column from _get_df_with_master_key_column(), the DataFrame will look like this:
region       zone  ds            y       grouping_key
'northeast'  1     "2021-10-01"  1234.5  ('northeast', 1)
'northeast'  2     "2021-10-01"  3255.6  ('northeast', 2)
'northeast'  1     "2021-10-02"  1255.9  ('northeast', 1)
This method will translate this structure to
[(('northeast', 1),
  ds            y
  "2021-10-01"  1234.5
  "2021-10-02"  1255.9
 ),
 (('northeast', 2),
  ds            y
  "2021-10-01"  3255.6
 )]
- Parameters
df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema.
- Returns
List(tuple(master_group_key, df)) the processing collection of DataFrame coupled with their group identifier.
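The translation from a keyed DataFrame to per-group ds/y frames can be approximated as below. This is a sketch under stated assumptions, not the library's code: the column name grouping_key follows the example table, and the per-group mean is only a placeholder for fitting a real forecasting model.

```python
import pandas as pd

keyed = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": pd.to_datetime(["2021-10-01", "2021-10-01", "2021-10-02"]),
        "y": [1234.5, 3255.6, 1255.9],
    }
)
keyed["grouping_key"] = list(zip(keyed["region"], keyed["zone"]))

# One (key, DataFrame) pair per series, keeping only the ds and y columns.
processing_groups = [
    (key, group[["ds", "y"]].sort_values("ds").reset_index(drop=True))
    for key, group in keyed.groupby("grouping_key")
]

# Each entry can now be handed to a per-series model; a mean stands in here.
naive_forecasts = {key: group["y"].mean() for key, group in processing_groups}
print(naive_forecasts)
```

Each row of the input lands in exactly one group, so ('northeast', 1) receives two rows and ('northeast', 2) receives one.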
Developer API for Data Processing
Abstract Base Class for grouped processing of a fully normalized DataFrame:
Abstract Base Class for defining the API contract for group generator operations. This base class is a template for package-specific implementations that function to convert a normalized representation of grouped time series into per-group collections of discrete time series so that forecasting models can be trained on each group.
- class diviner.data.base_group_generator.BaseGroupGenerator(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
Abstract class for defining the basic elements of performing a group processing collection generation operation.
- __init__(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
Grouping key columns must be defined to serve in the construction of a consolidated single unique key that is used to identify a particular unique time series. The unique combinations of these provided fields define and control the grouping of univariate series data in order to train (fit) a particular model upon each of the unique series (that are defined by the combination of the values within these supplied columns).
The primary purpose of the children of this class is to generate a dictionary of: {<group_key> : <DataFrame with unique univariate series>}. The group_key element is constructed as a tuple of the values within the columns specified by _group_key_columns in this class constructor.
For example, with a normalized data set provided of:
ds          y      group1  group2
2021-09-02  11.1   "a"     "z"
2021-09-03  7.33   "a"     "z"
2021-09-02  31.1   "b"     "q"
2021-09-03  44.1   "b"     "q"
There are two separate univariate series: ("a", "z") and ("b", "q"). The group generator's function is to convert this unioned DataFrame into the following:
{ ("a", "z"):
  ds          y      group1  group2
  2021-09-02  11.1   "a"     "z"
  2021-09-03  7.33   "a"     "z"
  ,
  ("b", "q"):
  ds          y      group1  group2
  2021-09-02  31.1   "b"     "q"
  2021-09-03  44.1   "b"     "q"
}
This grouping allows for a model to be fit to each of these series in isolation.
- Parameters
group_key_columns – Tuple[str] of column names that determine which elements of the submitted DataFrame determine uniqueness of a particular time series.
datetime_col – The name of the column that contains the datetime values for each series.
y_col – The endogenous regressor element of the series. This is the value that is used for training and is the element that is intended to be forecast.
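The {group_key: DataFrame} mapping described above can be sketched with a multi-column groupby, which already produces tuple keys. This is an illustration with plain pandas under the example's column names, not the behavior of any particular subclass.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ds": pd.to_datetime(
            ["2021-09-02", "2021-09-03", "2021-09-02", "2021-09-03"]
        ),
        "y": [11.1, 7.33, 31.1, 44.1],
        "group1": ["a", "a", "b", "b"],
        "group2": ["z", "z", "q", "q"],
    }
)

# Each unique (group1, group2) combination becomes one dictionary entry
# holding that series' rows, ready for a model to be fit in isolation.
grouped = {key: group for key, group in df.groupby(["group1", "group2"])}

for key, group in grouped.items():
    print(key, group["y"].tolist())
```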
- abstract generate_prediction_groups(df)[source]
Abstract method for generating the data set collection required for manual prediction for arbitrary datetime and key groupings.
- Parameters
df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema and the dates for prediction within the datetime_col field.
- Returns
List(tuple(master_group_key, df)) the processing collection of DataFrame coupled with their group identifier.
- abstract generate_processing_groups(df)[source]
Abstract method for the generation of processing execution groups for individual models. Implementations of this method should generate a processing collection that maps each unique combination of _group_key_columns values, represented as a _master_group_key entry, to the specific datetime series to be forecast.
For example, with a normalized DataFrame input of
ds          region   country  y
2020-01-01  SW       USA      42
2020-01-02  SW       USA      11
2020-01-01  NE       USA      31
2020-01-01  Ontario  CA       12
With the group_keys value specified as ("country", "region"), the output structure should be:
[{ ("USA", "SW"):
   ds          region   country  y
   2020-01-01  SW       USA      42
   2020-01-02  SW       USA      11
 },
 { ("USA", "NE"):
   ds          region   country  y
   2020-01-01  NE       USA      31
 },
 { ("CA", "Ontario"):
   ds          region   country  y
   2020-01-01  Ontario  CA       12
 }]
The list wrapper around the dictionaries allows for multiprocessing support without having to encapsulate the entire dictionary in order to process a single key and value pair.
- Parameters
df – The user-input normalized DataFrame with _group_key_columns
- Returns
A list of dictionaries of {group_key: <group's univariate series data>} structure for isolated processing by the model APIs.
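A concrete implementation of this contract might look like the sketch below. The class names GroupGeneratorSketch and PandasSketch and the function summarize are hypothetical and exist only for illustration; the built-in map() stands in for a worker pool's map, on the assumption that each single-entry dictionary is processed independently.

```python
from abc import ABC, abstractmethod

import pandas as pd


class GroupGeneratorSketch(ABC):
    """Hypothetical stand-in for the abstract contract described above."""

    def __init__(self, group_key_columns, datetime_col, y_col):
        self._group_key_columns = tuple(group_key_columns)
        self._datetime_col = datetime_col
        self._y_col = y_col

    @abstractmethod
    def generate_processing_groups(self, df):
        """Return [{group_key: group_df}, ...]."""


class PandasSketch(GroupGeneratorSketch):
    def generate_processing_groups(self, df):
        grouped = df.groupby(list(self._group_key_columns), sort=False)
        # One single-entry dict per group, so each work item is self-contained.
        return [{key: group} for key, group in grouped]


df = pd.DataFrame(
    {
        "ds": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"]
        ),
        "region": ["SW", "SW", "NE", "Ontario"],
        "country": ["USA", "USA", "USA", "CA"],
        "y": [42, 11, 31, 12],
    }
)

generator = PandasSketch(("country", "region"), "ds", "y")
groups = generator.generate_processing_groups(df)


def summarize(entry):
    """Process one {key: DataFrame} work item independently of the others."""
    ((key, group),) = entry.items()
    return key, int(group["y"].sum())


# map() stands in for a worker pool's map here.
results = list(map(summarize, groups))
print(results)
```

Keeping each group in its own dictionary means a worker receives only the data for its series, rather than the full mapping of every group.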