Uncertainty Quantification#

Computational predictions always carry uncertainty, which may stem from the sparsity of observed instances or from the choice of models and hyperparameters. While a perfectly accurate point prediction is impossible, we can try to minimise the prediction error rate by reducing uncertainty. Uncertainty quantification is a statistical technique that accounts for uncertainty in the model training process, so that the output is more informative and reliable for users.

With the default NeuralProphet configuration, a single value is predicted for each instance. This output is just a point estimate that takes no account of uncertainty. Prediction intervals instead quantify the uncertainty by providing a range of plausible values for each instance.

In this section, we introduce the two statistical techniques available in NeuralProphet: (1) quantile regression and (2) conformal prediction. The two modules are not mutually exclusive; you may apply both on top of any model.

The quantile regression module allows the algorithm to learn specific quantiles of the output variable for each instance. The conformal prediction module adds a calibration process on top of the model to quantify uncertainty for both point estimators and prediction intervals. You can find more information about the concepts of quantile regression and conformal prediction by following their respective links.

We illustrate both quantification modules using the hospital electric load dataset, which records the hourly electricity consumption of a hospital in SF in 2015.

[1]:

# much faster using the following code, but may not have the latest upgrades/bugfixes
# pip install neuralprophet

if "google.colab" in str(get_ipython()):
    # uninstall preinstalled packages from Colab to avoid conflicts
    !pip uninstall -y torch notebook notebook_shim tensorflow tensorflow-datasets prophet torchaudio torchdata torchtext torchvision
    !pip install git+https://github.com/ourownstory/neural_prophet.git  # may take a while

[2]:

import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet, uncertainty_evaluate, set_log_level, set_random_seed

data_location = "http://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df = pd.read_csv(data_location + "energy/SF_hospital_load.csv")

Data splitting#

NeuralProphet provides a data splitting function that divides an input dataset into two subsets. You configure the function by specifying the time-series frequency and the splitting ratio. A list of frequency aliases can be found here.

For our hospital electric load dataset, we hold out the last \(1/16\) of the observations as the testing set and keep the rest as the training set.

[3]:

# Create NeuralProphet object
m = NeuralProphet()

# The data splitting function splits one time-series dataframe into two
# Configure the hourly frequency by assigning 'H' to parameter freq
# Configure the splitting ratio with a value between 0 and 1 for valid_p
train_df, test_df = m.split_df(df, freq="H", valid_p=1.0 / 16)

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.989% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

After this split, we have 8213 and 547 instances, extracted in sequence, as the training and testing sets respectively.

[4]:

train_df.shape, test_df.shape

[4]:

((8213, 2), (547, 2))
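
To make the split sizes concrete, here is a minimal sketch of a chronological hold-out split in plain Python. It is an illustration only and not NeuralProphet's implementation: the helper name `chronological_split` is hypothetical, and truncating the hold-out size to an integer is an assumption chosen because it reproduces the row counts reported above.

```python
def chronological_split(rows, valid_p):
    """Split an ordered sequence into (earlier, later) parts.

    Hypothetical helper for illustration; assumes the hold-out size is
    the truncated product of the number of rows and valid_p.
    """
    if not 0 < valid_p < 1:
        raise ValueError("valid_p must be between 0 and 1")
    n_holdout = max(1, int(len(rows) * valid_p))
    split_at = len(rows) - n_holdout
    # No shuffling: the hold-out part is always the most recent data
    return rows[:split_at], rows[split_at:]


hours = list(range(8760))  # stand-in for the 8760 hourly rows of the dataset
train_part, test_part = chronological_split(hours, 1.0 / 16)
print(len(train_part), len(test_part))  # 8213 547
```

Because the data form a time series, the hold-out part is taken from the end rather than sampled at random, so the model is always evaluated on observations that come after the ones it was trained on.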

1. Quantile Regression#

With the default NeuralProphet forecasting settings, you get only a single output as the point estimator for each instance. The point estimator is calculated based on a single 50th percentile regression. To generate a prediction interval, a NeuralProphet object needs a list containing at least one pair of lower and upper quantiles as a parameter. You may, however, specify as many quantiles as you wish in a NeuralProphet model.

Back to forecasting our hospital electric load dataset: assuming we want the true value to fall within the prediction interval 90% of the time (i.e., a 90% confidence level), we create a three-quantile regression model that outputs the 5th, 50th and 95th percentile values.

[5]:

# NeuralProphet only accepts quantile values between 0 and 1
# Parameter for quantile regression
confidence_lv = 0.9
quantile_list = [round(((1 - confidence_lv) / 2), 2), round((confidence_lv + (1 - confidence_lv) / 2), 2)]

# Create NeuralProphet object with list of quantile as parameter
qr_model = NeuralProphet(quantiles=quantile_list)
qr_model.set_plotting_backend("plotly-static")

Once the quantile regression module is added to a model, NeuralProphet uses the pinball loss (also known as quantile loss) function to assess the goodness-of-fit of the trained model, similar to how the log-likelihood loss function is used for Gaussian linear regression.

Instead of using the plain absolute error, the pinball loss function weights the error differently for each quantile. We usually take an upper quantile above the 50th percentile as the upper bound of the prediction interval and a lower quantile below the 50th percentile as the lower bound. When the actual value lies outside the prediction interval, the loss function assigns a heavier weight to the absolute error, and a lighter weight otherwise. The loss is then minimised by iteratively adjusting the parameters of each quantile line.

Let’s see how the weighting differs between two lower quantiles (the 10th vs. the 25th percentile). A percentile indicates the probability that the true value falls below the estimated value. Compared with the 25th percentile, the 10th percentile has a smaller expected probability of the true value lying below the line. When the actual value lies outside the prediction interval (i.e., the actual value is smaller than the predicted one), the error is more problematic for the 10th percentile, which expects 90% of the true values to be above the line, than for the 25th percentile, which expects only 75% of them to be above the line.
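
The asymmetric weighting described above can be sketched with the textbook form of the pinball loss for a single observation. This is a plain-Python illustration, not NeuralProphet's loss code; the function name `pinball_loss` and the example values are made up for this sketch.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Textbook pinball (quantile) loss for a single observation."""
    error = y_true - y_pred
    if error >= 0:
        # Actual value is on or above the quantile line: weight = quantile
        return quantile * error
    # Actual value is below the quantile line: weight = 1 - quantile
    return (quantile - 1) * error


# The actual value lies 10 units below the predicted quantile line
loss_q10 = pinball_loss(y_true=90.0, y_pred=100.0, quantile=0.10)
loss_q25 = pinball_loss(y_true=90.0, y_pred=100.0, quantile=0.25)
print(loss_q10, loss_q25)  # 9.0 7.5
```

For the same 10-unit miss below the line, the 10th percentile is penalised with weight 0.9 and the 25th percentile with weight 0.75, which matches the reasoning in the paragraph above.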

[6]:

# Fit the model with the observations
metrics = qr_model.fit(df, freq="H")

# Create a new dataframe for the results
# Including 100 historical values and 30 value points for the future
future = qr_model.make_future_dataframe(df, periods=30, n_historic_predictions=100)

# Perform prediction with the trained models
forecast = qr_model.predict(df=future)

WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale.
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.989% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling yearly seasonality. Run NeuralProphet with yearly_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 106

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.989% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.231% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.231% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

The prediction for the hospital electric load is visualised below. yhat1 shows the prediction at the 50th percentile, while the prediction interval is shaded in light blue. With the quantiles configured above, the lower bound is the prediction for the 5th percentile and the upper bound is the prediction for the 95th percentile.

[7]:

qr_model.plot(forecast)

../../_images/how-to-guides_feature-guides_uncertainty_quantification_14_0.svg

2. Conformal prediction#

While there are different ways to carry out conformal prediction, NeuralProphet adopts split conformal prediction, which requires a holdout (calibration) set. To carry out split conformal prediction, the dataset has to be split into three distinct sets for training, calibration and testing. An initial prediction interval is created with the base model trained on the training dataset. Uncertainty is quantified by comparing the target values in the calibration set against the predicted values. The final conformal prediction interval is then formed by adding the quantified uncertainty to both tails of the predicted value.

You can select either Naive (absolute residual) or Conformalized Quantile Regression (CQR) as the conformal prediction method in NeuralProphet. We discuss these two options in detail in the following subsections.

Calibration and validation set#

At least three subsets (i.e., training, calibration and testing) are needed for the conformal prediction feature in NeuralProphet. You may optionally add a validation subset. If you add a validation subset for training the base model, make sure that its period lies between those of the training and calibration subsets. We do not cover the validation procedure in detail here; refer to the Train, Validate and Test procedure tutorial to learn how to build a NeuralProphet model with a validation set.

Here, we carve the calibration set out of the training set by holding out the last \(1/11\) of the training observations for calibration.

[8]:

# Add calibration set using the data splitting function
train_df, cal_df = m.split_df(train_df, freq="H", valid_p=1.0 / 11)

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.988% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

After this split, we have 7467, 746 and 547 instances, extracted in sequence, as the training, calibration and testing sets respectively.

[9]:

train_df.shape, cal_df.shape, test_df.shape

[9]:

((7467, 2), (746, 2), (547, 2))

Base model training#

You can build any NeuralProphet model you deem fit as the base model. The calibration process of conformal prediction is later added on top of the base model to quantify the uncertainty in the final estimation.

We are interested in how conformal prediction affects different models. In our example, we compare the conformal prediction results of a simple quantile regression model and a more complex 4-layer autoregression model. Refer to the quantile regression section above and the standalone Autoregression tutorial for the logic and applications of these features.

[10]:

# Parameter for autoregression
# Predict the value in the next hour based on the last three days in hourly steps
n_lags = 3 * 24

[11]:

# Create a simple quantile regression model
cp_model1 = NeuralProphet(quantiles=quantile_list)
cp_model1.set_plotting_backend("plotly-static")

# Create a 4-layer autoregression model as the base
cp_model2 = NeuralProphet(
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=False,
    n_lags=n_lags,
    ar_layers=[32, 32, 32, 32],
    learning_rate=0.003,
    quantiles=quantile_list,
)
cp_model2.set_plotting_backend("plotly-static")

After configuring the models, we fit them on the training set. If you have further split the training dataset into training and validation sets, you can either (i) concatenate the two into one dataset for training or (ii) pass the training and validation datasets as two separate parameters.

[12]:

# Feed the training subset in the configured NeuralProphet models
# Configure the hourly frequency by assigning 'H' to parameter freq
set_random_seed(0)
metrics1 = cp_model1.fit(train_df, freq="H")
set_random_seed(0)
metrics2 = cp_model2.fit(train_df, freq="H")

WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale.
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.987% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling yearly seasonality. Run NeuralProphet with yearly_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 111
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (234) is too small than the required number                     for the learning rate finder (246). The results might not be optimal.

WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale.
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.987% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 111

We use the fitted base model to forecast both the point prediction and the quantile regression prediction intervals for the testing dataset.

[13]:

# Perform estimation for the testing data with the trained model
forecast1 = cp_model1.predict(test_df)[n_lags:]
forecast2 = cp_model2.predict(test_df)[n_lags:]

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.818% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

Option 1: Naive Conformal Prediction#

After training the base model, we carry out the calibration process using the naive module. The steps are as follows:
i. predict the output value for the instances in the calibration set;
ii. calculate the absolute residual by comparing the actual and predicted values for each observation in the calibration set;
iii. sort all absolute residuals in ascending order;
iv. find the quantified uncertainty (\(\hat{q}\)) at the desired confidence level; and
v. use the quantified uncertainty (\(\hat{q}\)) to form the final prediction intervals.
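
The steps above can be sketched in a few lines of plain Python. This is a hedged illustration rather than NeuralProphet's internal code: the function name `naive_qhat` and the toy numbers are made up, and the \((n + 1)\) finite-sample correction used to pick the rank is a common convention in split conformal prediction that the library may or may not apply in the same way.

```python
import math


def naive_qhat(y_cal, yhat_cal, alpha):
    """Return q-hat from absolute residuals on a calibration set."""
    # Steps ii and iii: absolute residuals, sorted in ascending order
    scores = sorted(abs(y - yhat) for y, yhat in zip(y_cal, yhat_cal))
    n = len(scores)
    # Step iv: rank of the (1 - alpha) quantile with the (n + 1)
    # finite-sample correction (an assumption), capped at the largest score
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    return scores[rank - 1]


# Toy calibration data (made up); step i is assumed to have produced yhat_cal
y_cal = [10, 12, 9, 15, 11, 8, 13, 10, 14, 9]
yhat_cal = [10] * 10
qhat = naive_qhat(y_cal, yhat_cal, alpha=0.2)

# Step v: symmetric interval around a new point prediction
yhat_new = 11.0
interval = (yhat_new - qhat, yhat_new + qhat)
print(qhat, interval)  # 4 (7.0, 15.0)
```

Because the same \(\hat{q}\) is added and subtracted everywhere, the resulting interval has a constant width of \(2\hat{q}\) regardless of the timestamp.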

Going back to our example, on top of the pre-trained models above we need to specify the calibration set and the significance level (alpha) for conformal prediction.

[14]:

# Parameter for naive conformal prediction
method = "naive"
alpha = 1 - confidence_lv

# Enable conformal prediction on the pre-trained models
# Performance evaluation is covered in the "Evaluate Performance" section below
naive_forecast1 = cp_model1.conformal_predict(
    test_df,
    calibration_df=cal_df,
    alpha=alpha,
    method=method,
    plotting_backend="plotly-static",
    show_all_PI=True,
)
naive_forecast2 = cp_model2.conformal_predict(
    test_df,
    calibration_df=cal_df,
    alpha=alpha,
    method=method,
    plotting_backend="plotly-static",
    show_all_PI=True,
)

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.818% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


The plots above demonstrate how the quantified uncertainty (\(\hat{q}\)) changes with the confidence level (1-alpha).

Below are the \(\hat{q}\) values for each model. As they are all positive, the naive interval extends beyond that of simple QR. The weaker model (cp_model1) has a large gap between these two intervals since it has a larger \(\hat{q}\) value, while the bounds are shifted much less for the complex model (cp_model2).

[15]:

naive_forecast1

[15]:

	ds	y	yhat1	yhat1 - qhat1	yhat1 + qhat1	trend	season_weekly	season_daily	yhat1 5.0%	yhat1 95.0%
0	2015-12-09 06:00:00	803.410865	988.585815	827.912791	1149.258840	1031.427246	-35.461987	-7.379447	909.404419	1087.498291
1	2015-12-09 07:00:00	868.264194	1089.076294	928.403269	1249.749319	1031.444336	-43.485428	101.117432	994.053711	1189.946045
2	2015-12-09 08:00:00	975.752982	1146.444092	985.771067	1307.117117	1031.461304	-51.675148	166.657928	1057.043823	1236.407227
3	2015-12-09 09:00:00	983.268943	1155.003296	994.330271	1315.676321	1031.478271	-59.986122	183.511185	1078.018799	1243.008301
4	2015-12-09 10:00:00	1095.825986	1145.466553	984.793528	1306.139578	1031.495239	-68.370621	182.341919	1073.590576	1236.537354
...	...	...	...	...	...	...	...	...	...	...
542	2015-12-31 20:00:00	845.563081	833.224609	672.551584	993.897634	1040.634766	-103.066475	-104.343666	757.201721	912.330200
543	2015-12-31 21:00:00	827.530521	795.515076	634.842051	956.188101	1040.651733	-94.571983	-150.564713	711.605408	872.624634
544	2015-12-31 22:00:00	829.256300	771.299255	610.626230	931.972280	1040.668701	-85.883865	-183.485596	677.626831	873.617920
545	2015-12-31 23:00:00	813.937205	762.211975	601.538950	922.885000	1040.685791	-77.059326	-201.414413	669.592285	869.896973
546	2016-01-01 00:00:00	815.588584	765.104858	604.431834	925.777883	1040.702759	-68.155449	-207.442413	675.509155	861.297852

547 rows × 10 columns

[16]:

naive_qhat1 = naive_forecast1.iloc[-1]["yhat1"] - naive_forecast1.iloc[-1]["yhat1 - qhat1"]  # cp_model1
naive_qhat2 = naive_forecast2.iloc[-1]["yhat1"] - naive_forecast2.iloc[-1]["yhat1 - qhat1"]  # cp_model2
naive_qhat1, naive_qhat2

[16]:

(160.67302487812503, 20.35876320312491)

We can then plot the prediction intervals (5th, 50th and 95th percentile values) to compare the performance of the models. The quantile regression prediction intervals are shown in blue, while the conformal prediction intervals, which include the quantified uncertainty, are shown in red.

With the same quantile parameters, the simple model has a much wider quantile regression prediction interval (in blue) than the complex model. The same holds for the conformal prediction intervals (in red): the weaker model has the wider interval, which therefore captures more actual values than its quantile regression prediction interval does.

[17]:

# Date range shown in the plots (optional)
cutoff = 7 * 24

fig1 = cp_model1.highlight_nth_step_ahead_of_each_forecast(1).plot(
    naive_forecast1[-cutoff:], plotting_backend="plotly-static"
)
fig2 = cp_model2.highlight_nth_step_ahead_of_each_forecast(1).plot(
    naive_forecast2[-cutoff:], plotting_backend="plotly-static"
)

WARNING - (NP.forecaster.plot) - highlight_forecast_step_n is ignored since auto-regression not enabled.

../../_images/how-to-guides_feature-guides_uncertainty_quantification_34_1.svg

../../_images/how-to-guides_feature-guides_uncertainty_quantification_34_2.svg

Option 2: Conformalized Quantile Regression#

In Conformalized Quantile Regression (the cqr module), the method runs as follows:
i. calculate non-conformity scores as the differences between the data points in the calibration dataset and their nearest prediction quantile, which measures how well the data fit the current quantile regression model. The non-conformity scores are negative for data points within the quantile regression interval and positive for those outside it;
ii. sort the list of non-conformity scores;
iii. find the value of \(\hat{q}\) such that the proportion of scores in the list greater than \(\hat{q}\) equals the error rate (alpha); and
iv. adjust the quantiles from the regression model by that amount (\(\hat{q}\)).

There are two scenarios for how CQR adjusts the intervals, depending on the sign of \(\hat{q}\). If the one-sided prediction interval width adjustment is positive, CQR extends beyond the QR intervals, as it deems the QR interval too confident. Conversely, if the adjustment is negative, CQR contracts the QR intervals, as it deems the QR interval too conservative.
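
Both scenarios can be reproduced with a small plain-Python sketch of the CQR calibration steps. As with any toy example, the names (`cqr_qhat`, the toy arrays) are hypothetical, the rank rule with the \((n + 1)\) correction is an assumed convention, and the code is not taken from NeuralProphet.

```python
import math


def cqr_qhat(y_cal, lower_cal, upper_cal, alpha):
    """Return q-hat from CQR non-conformity scores on a calibration set."""
    # Step i: signed distance to the nearest quantile bound;
    # negative inside the interval, positive outside it
    # Step ii: sort the scores in ascending order
    scores = sorted(
        max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lower_cal, upper_cal)
    )
    n = len(scores)
    # Step iii: rank of the (1 - alpha) quantile (assumed n + 1 correction)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    return scores[rank - 1]


# Toy calibration targets and QR bounds (made up for illustration)
y_cal = [5, 7, 12, 9, 3]
lower_cal = [4, 6, 8, 8, 4]
upper_cal = [8, 10, 11, 12, 7]

qhat_extend = cqr_qhat(y_cal, lower_cal, upper_cal, alpha=0.2)
qhat_contract = cqr_qhat(y_cal, lower_cal, upper_cal, alpha=0.5)
print(qhat_extend, qhat_contract)  # 1 -1

# Step iv: shift both bounds of a new QR interval by q-hat
new_lower, new_upper = 6.0, 10.0
extended = (new_lower - qhat_extend, new_upper + qhat_extend)
contracted = (new_lower - qhat_contract, new_upper + qhat_contract)
print(extended, contracted)  # (5.0, 11.0) (7.0, 9.0)
```

A positive \(\hat{q}\) widens the QR interval and a negative one narrows it, while the interval width still varies with the underlying QR bounds.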

[18]:

# Parameter for Conformalized Quantile Regression
method = "cqr"

# Enable conformal prediction on the pre-trained models
# Performance evaluation is covered in the "Evaluate Performance" section below
cqr_forecast1 = cp_model1.conformal_predict(
    test_df, calibration_df=cal_df, alpha=alpha, method=method, plotting_backend="plotly-static"
)
cqr_forecast2 = cp_model2.conformal_predict(
    test_df, calibration_df=cal_df, alpha=alpha, method=method, plotting_backend="plotly-static"
)

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.818% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H

INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column


The \(\hat{q}\) values for both models are positive, so the CQR interval extends beyond that of simple QR. cp_model1 has a large gap between these two intervals since it has a larger \(\hat{q}\) value, while the bounds are shifted much less for cp_model2.

We again plot the prediction intervals to examine how this CQR method affects the result.

[19]:

fig1 = cp_model1.highlight_nth_step_ahead_of_each_forecast(1).plot(
    cqr_forecast1[-cutoff:], plotting_backend="plotly-static"
)
fig2 = cp_model2.highlight_nth_step_ahead_of_each_forecast(1).plot(
    cqr_forecast2[-cutoff:], plotting_backend="plotly-static"
)

WARNING - (NP.forecaster.plot) - highlight_forecast_step_n is ignored since auto-regression not enabled.

../../_images/how-to-guides_feature-guides_uncertainty_quantification_39_1.svg

../../_images/how-to-guides_feature-guides_uncertainty_quantification_39_2.svg

Evaluate Performance#

We use interval width and miscoverage rate as the performance metrics.
- interval_width: the average prediction interval width, or q_hat multiplied by 2 when the interval is static (non-adaptive); this is also known as the efficiency metric.
- miscoverage_rate: the actual miscoverage error rate on the out-of-sample (OOS) test set; this is also known as the validity metric.

The smaller the metrics are, the better the models are performing.
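
As a rough illustration of what the two metrics measure, the sketch below computes them from lists of actual values and interval bounds. It is not the implementation behind `uncertainty_evaluate`; the helper name `interval_metrics` and the toy values are assumptions made for this example.

```python
def interval_metrics(y_true, lower, upper):
    """Return (interval_width, miscoverage_rate) for a set of intervals."""
    # Efficiency: average distance between upper and lower bounds
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    # Validity: share of actual values falling outside their interval
    misses = [not (lo <= y <= hi) for y, lo, hi in zip(y_true, lower, upper)]
    interval_width = sum(widths) / len(widths)
    miscoverage_rate = sum(misses) / len(misses)
    return interval_width, miscoverage_rate


# Toy test-set values and prediction interval bounds (made up)
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.0, 2.5, 2.0, 3.0]
upper = [2.0, 3.5, 4.0, 3.5]
width, miscoverage = interval_metrics(y_true, lower, upper)
print(width, miscoverage)  # 1.375 0.5
```

The two metrics pull in opposite directions: widening every interval lowers the miscoverage rate but raises the interval width, so they are best read together.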

Let’s test with the models we trained above. We first build a dataframe with two rows for the two models we trained and then we will concatenate the Naive and CQR performance metrics in the dataframe for the evaluation.

[20]:

# Create evaluation dataframe skeleton with the 2 models (denoted as m1 and m2)
models = ["m1", "m2"]
eval_df = pd.DataFrame(models, columns=pd.MultiIndex.from_tuples([("model", "", "")]))

We now aggregate the performance metrics for the naive method. The metrics are calculated by passing each conformal forecast to the uncertainty_evaluate function.

[21]:

# Extract the naive performance metrics from their respective forecast datasets
naive_eval1 = uncertainty_evaluate(naive_forecast1)
naive_eval2 = uncertainty_evaluate(naive_forecast2)

# Aggregate the naive performance metrics for m1 and m2
naive_evals = [naive_eval1, naive_eval2]
naive_eval_df = pd.concat(naive_evals).reset_index(drop=True)

[22]:

# Extract the cqr performance metrics from their respective forecast datasets
cqr_eval1 = uncertainty_evaluate(cqr_forecast1)
cqr_eval2 = uncertainty_evaluate(cqr_forecast2)

# Aggregate the cqr performance metrics for m1 and m2
cqr_evals = [cqr_eval1, cqr_eval2]
cqr_eval_df = pd.concat(cqr_evals).reset_index(drop=True)

Lastly, we concatenate the naive and cqr evaluation dataframes and then compare how the models are performing with the naive and cqr prediction.

[23]:

# Concatenate the naive and cqr evaluation dataframes
eval_df = pd.concat([naive_eval_df, cqr_eval_df], axis=1, keys=["naive", "cqr"])
eval_df

[23]:

	naive		cqr
	yhat1		yhat1
	interval_width	miscoverage_rate	interval_width	miscoverage_rate
0	321.346050	0.107861	309.400519	0.107861
1	40.717526	0.065263	41.712111	0.063158

Performance of the trained model above:

This notebook is only using single forecast timestep models, hence we would only have yhat1 as the point estimator.
Across all three uncertainty prediction methods, the complex autoregression model (m2) has a smaller interval_width and miscoverage_rate than the simpler model (m1). In this example, the more complex model fits the data better and produces more accurate predictions.
Both the Naive and CQR conformal prediction methods significantly outperform vanilla QR in terms of miscoverage_rate. This shows that vanilla QR is overconfident in its quantile range: its interval_width would need to be broadened further for its actual miscoverage_rate (on the out-of-sample test set) to converge to the specified alpha of 0.1.
For the Naive conformal predictions, the interval_width is twice the quantified uncertainty (qhat1), so the prediction intervals are symmetrical around yhat1.
Looking only at the simple quantile regression model (m1), CQR is preferable: it has the same miscoverage_rate as Naive but a narrower interval_width.
For the complex model (m2), Naive has a slightly smaller interval_width, while CQR has a slightly lower miscoverage_rate. You may want to feed the model more data to determine which method is preferable.
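
The relationship between the naive interval_width and qhat1 noted in the list above can be checked by simple arithmetic on the numbers already reported in this notebook. The snippet below only re-uses those printed values; the dictionary names are made up for the check.

```python
# Values copied from the outputs reported earlier in this notebook
naive_qhat = {"m1": 160.67302487812503, "m2": 20.35876320312491}
naive_interval_width = {"m1": 321.346050, "m2": 40.717526}

checks = {}
for name, qhat in naive_qhat.items():
    # A symmetric interval [yhat1 - qhat, yhat1 + qhat] has width 2 * qhat
    checks[name] = abs(2 * qhat - naive_interval_width[name]) < 1e-5

print(checks)  # {'m1': True, 'm2': True}
```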