Data Processing
The data manipulation APIs are a key component of this library's utility. Although they are largely hidden behind the main entry point APIs for each forecasting library, they can be useful for custom implementations and for validating source data sets.
Pandas DataFrame Group Processing API
Class signature:
- class diviner.data.pandas_group_generator.PandasGroupGenerator(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
This class is used to convert a normalized collection of time series data within a single DataFrame, e.g.:

region         zone    ds              y
'northeast'    1       "2021-10-01"    1234.5
'northeast'    2       "2021-10-01"    3255.6
'northeast'    1       "2021-10-02"    1255.9

The grouping keys ['region', 'zone'] define the unique series of the target y indexed by ds.
This class will:
- Generate a master group key that is a tuple zip of the grouping key arguments specified by the user, preserving the order of declaration of these keys.
- Group the DataFrame by these master grouping keys and generate a collection of tuples of the form (master_grouping_key, <series DataFrame>), which is iterated over to generate the individualized forecasting models for each master key group.
- __init__(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
- Parameters
group_key_columns – The grouping columns whose combined values designate a set of ds and y values that represents a distinct series.
datetime_col – The name of the column that contains the datetime values for each series.
y_col – The endogenous regressor element of the series. This is the value that is used for training and is the element that is intended to be forecast.
- _get_df_with_master_key_column(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame [source]
Method for creating the ‘master_group_key’ column that defines a unique group. The master_group_key column is generated from the concatenation (within a tuple) of the values in each of the individual _group_key_columns, serving as an aggregation grouping key to define a unique collection of datetime series values. For example:
region         zone    ds              y
'northeast'    1       "2021-10-01"    1234.5
'northeast'    2       "2021-10-01"    3255.6
'northeast'    1       "2021-10-02"    1255.9

With the above dataset, the group_key_columns passed in would be: ('region', 'zone')
This method will add the master_group_key to a copy of the input DataFrame as follows:

region         zone    ds              y         grouping_key
'northeast'    1       "2021-10-01"    1234.5    ('northeast', 1)
'northeast'    2       "2021-10-01"    3255.6    ('northeast', 2)
'northeast'    1       "2021-10-02"    1255.9    ('northeast', 1)
- Parameters
df – The normalized DataFrame
- Returns
A copy of the passed-in DataFrame with a master grouping key column added that contains the group definitions per row of the input DataFrame.
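The key-column step described above can be sketched with plain pandas. This is an illustration only, not the library's implementation: the helper name add_master_key_column and its key_col argument are hypothetical, and the column name "grouping_key" is taken from the example table.

```python
import pandas as pd


def add_master_key_column(df, group_key_columns, key_col="grouping_key"):
    """Hypothetical helper: return a copy of df with a tuple-valued key column."""
    keyed = df.copy()  # work on a copy so the caller's DataFrame is unchanged
    # Zip the key columns row-wise; each tuple keeps the declared column order.
    keyed[key_col] = list(zip(*(keyed[col] for col in group_key_columns)))
    return keyed


df = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": ["2021-10-01", "2021-10-01", "2021-10-02"],
        "y": [1234.5, 3255.6, 1255.9],
    }
)
keyed = add_master_key_column(df, ("region", "zone"))
```

Rows one and three share the key ('northeast', 1), so they belong to the same series.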
- generate_prediction_groups(df: pandas.core.frame.DataFrame)[source]
Method for generating the data set collection required to run a manual per-datetime prediction for arbitrary datetime and key groupings.
- Parameters
df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema and the dates for prediction within the datetime_col field.
- Returns
List(tuple(master_group_key, df)), the processing collection of DataFrames coupled with their group identifier.
- generate_processing_groups(df: pandas.core.frame.DataFrame)[source]
Method for generating the collection of [(master_grouping_key, <group DataFrame>)].
This method will call _get_df_with_master_key_column() to generate a column containing the tuple of the values within the _group_key_columns fields, then generate an iterable collection of key -> DataFrame representation.
For example, after adding the grouping_key column from _get_df_with_master_key_column(), the DataFrame will look like this:

region         zone    ds              y         grouping_key
'northeast'    1       "2021-10-01"    1234.5    ('northeast', 1)
'northeast'    2       "2021-10-01"    3255.6    ('northeast', 2)
'northeast'    1       "2021-10-02"    1255.9    ('northeast', 1)
This method will translate this structure to:

[(('northeast', 1),
    ds              y
    "2021-10-01"    1234.5
    "2021-10-02"    1255.9
 ),
 (('northeast', 2),
    ds              y
    "2021-10-01"    3255.6
 )]
- Parameters
df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema.
- Returns
List(tuple(master_group_key, df)), the processing collection of DataFrames coupled with their group identifier.
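As a rough sketch of the list-of-tuples shape described above, the following uses pandas groupby directly on the key columns instead of a materialized key column. The helper to_processing_groups is hypothetical and is not the library's method; restricting each group to the datetime and target columns is an assumption based on the example output.

```python
import pandas as pd


def to_processing_groups(df, group_key_columns, datetime_col, y_col):
    """Hypothetical helper: return [(key_tuple, series_df), ...]."""
    return [
        # Grouping by a list of columns yields tuple keys in the declared order.
        (key, group[[datetime_col, y_col]].reset_index(drop=True))
        for key, group in df.groupby(list(group_key_columns))
    ]


df = pd.DataFrame(
    {
        "region": ["northeast", "northeast", "northeast"],
        "zone": [1, 2, 1],
        "ds": ["2021-10-01", "2021-10-01", "2021-10-02"],
        "y": [1234.5, 3255.6, 1255.9],
    }
)
groups = to_processing_groups(df, ("region", "zone"), "ds", "y")
```

Each element of groups pairs a key tuple with the rows of exactly one series, so a caller can iterate over the list and train one model per entry.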
Developer API for Data Processing
Abstract Base Class for grouped processing of a fully normalized DataFrame:
Abstract Base Class for defining the API contract for group generator operations. This base class is a template for package-specific implementations that function to convert a normalized representation of grouped time series into per-group collections of discrete time series so that forecasting models can be trained on each group.
- class diviner.data.base_group_generator.BaseGroupGenerator(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
Abstract class for defining the basic elements of performing a group processing collection generation operation.
- __init__(group_key_columns: Tuple, datetime_col: str, y_col: str)[source]
Grouping key columns must be defined to serve in the construction of a consolidated single unique key that is used to identify a particular unique time series. The unique combinations of these provided fields define and control the grouping of univariate series data in order to train (fit) a particular model upon each of the unique series (that are defined by the combination of the values within these supplied columns).
The primary purpose of the children of this class is to generate a dictionary of: {<group_key> : <DataFrame with unique univariate series>}. The group_key element is constructed as a tuple of the values within the columns specified by _group_key_columns in this class constructor.
For example, with a normalized data set provided of:

ds            y       group1    group2
2021-09-02    11.1    "a"       "z"
2021-09-03    7.33    "a"       "z"
2021-09-02    31.1    "b"       "q"
2021-09-03    44.1    "b"       "q"
There are two separate univariate series: ("a", "z") and ("b", "q"). The group generator's function is to convert this unioned DataFrame into the following:

{("a", "z"):
    ds            y       group1    group2
    2021-09-02    11.1    "a"       "z"
    2021-09-03    7.33    "a"       "z"
 ,
 ("b", "q"):
    ds            y       group1    group2
    2021-09-02    31.1    "b"       "q"
    2021-09-03    44.1    "b"       "q"
}
This grouping allows for a model to be fit to each of these series in isolation.
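A minimal sketch of this dictionary shape, assuming plain pandas and using the mean of y as a stand-in for model fitting. The names grouped and fitted are illustrative and do not come from the library.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ds": ["2021-09-02", "2021-09-03", "2021-09-02", "2021-09-03"],
        "y": [11.1, 7.33, 31.1, 44.1],
        "group1": ["a", "a", "b", "b"],
        "group2": ["z", "z", "q", "q"],
    }
)
group_key_columns = ("group1", "group2")

# Grouping by a list of columns yields tuple keys in the declared column order.
grouped = {
    key: group.reset_index(drop=True)
    for key, group in df.groupby(list(group_key_columns))
}

# Placeholder "model": the mean of y for each series. A real forecaster
# would be fit here, once per key, with no data shared between series.
fitted = {key: series_df["y"].mean() for key, series_df in grouped.items()}
```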
- Parameters
group_key_columns – Tuple[str] of column names that determine which elements of the submitted DataFrame define the uniqueness of a particular time series.
datetime_col – The name of the column that contains the datetime values for each series.
y_col – The endogenous regressor element of the series. This is the value that is used for training and is the element that is intended to be forecast.
- abstract generate_prediction_groups(df)[source]
Abstract method for generating the data set collection required for manual prediction for arbitrary datetime and key groupings.
- Parameters
df – Normalized DataFrame that contains the columns defined in instance attribute _group_key_columns within its schema and the dates for prediction within the datetime_col field.
- Returns
List(tuple(master_group_key, df)), the processing collection of DataFrames coupled with their group identifier.
- abstract generate_processing_groups(df)[source]
Abstract method for the generation of processing execution groups for individual models. Implementations of this method should generate a processing collection that relates each unique combination of _group_key_columns values, generated as a _master_group_key entry, to the specific datetime series it defines for forecasting.
For example, with a normalized DataFrame input of:

ds            region     country    y
2020-01-01    SW         USA        42
2020-01-02    SW         USA        11
2020-01-01    NE         USA        31
2020-01-01    Ontario    CA         12
With the group_keys value specified as ("country", "region"), the output structure should be:

[{("USA", "SW"):
    ds            region     country    y
    2020-01-01    SW         USA        42
    2020-01-02    SW         USA        11
 },
 {("USA", "NE"):
    ds            region     country    y
    2020-01-01    NE         USA        31
 },
 {("CA", "Ontario"):
    ds            region     country    y
    2020-01-01    Ontario    CA         12
 }]
The list wrapper around dictionaries is to allow for multiprocessing support without having to contend with encapsulating the entire dictionary for the processing of a single key and value pair.
- Parameters
df – The user-input normalized DataFrame with _group_key_columns
- Returns
A list of dictionaries of {group_key: <group's univariate series data>} structure for isolated processing by the model APIs.
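The contract above can be sketched as a small abstract base class with one concrete child. Both class names below are hypothetical stand-ins and are not the library's code; the sketch only assumes the constructor arguments and the list-of-single-entry-dictionaries return shape described in this section.

```python
from abc import ABC, abstractmethod

import pandas as pd


class GroupGeneratorSketch(ABC):
    """Hypothetical stand-in for the group generator contract."""

    def __init__(self, group_key_columns, datetime_col, y_col):
        self._group_key_columns = group_key_columns
        self._datetime_col = datetime_col
        self._y_col = y_col

    @abstractmethod
    def generate_processing_groups(self, df):
        """Return a list of {group_key: <group DataFrame>} dictionaries."""


class ListOfDictsGenerator(GroupGeneratorSketch):
    def generate_processing_groups(self, df):
        # One single-entry dict per group, so each item can be handed to a
        # worker on its own. sort=False keeps first-appearance order.
        return [
            {key: group.reset_index(drop=True)}
            for key, group in df.groupby(list(self._group_key_columns), sort=False)
        ]


df = pd.DataFrame(
    {
        "ds": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"],
        "region": ["SW", "SW", "NE", "Ontario"],
        "country": ["USA", "USA", "USA", "CA"],
        "y": [42, 11, 31, 12],
    }
)
generator = ListOfDictsGenerator(("country", "region"), "ds", "y")
groups = generator.generate_processing_groups(df)
```

Because each list element holds exactly one key and its DataFrame, the list can be mapped over by a process pool without passing the full collection to every worker.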