Uncertainty Quantification#

There are always uncertainties in computational predictions; they may stem from the sparsity of observed instances or from the choice of hyperparameters and models. While it is impossible to make perfectly accurate point predictions, we can reduce the prediction error by accounting for these uncertainties. Uncertainty quantification is a statistical technique that incorporates uncertainty into the model training process so as to provide more accurate and reliable output for users.

With the default NeuralProphet configuration, a single value is predicted for each instance. The output is just a point estimate, with no consideration of uncertainty. Prediction intervals instead quantify the uncertainty and provide a range of possible values for each instance.

In this section, we introduce the two statistical techniques available in NeuralProphet: (1) quantile regression and (2) conformal prediction. These two modules are not mutually exclusive; you may apply both on top of any model.

The quantile regression module lets the algorithm learn specified quantiles of the output variable for each instance. The conformal prediction module adds a calibration process on top of the model to quantify the uncertainty in the data, for both point estimates and prediction intervals. You can find more information about the concepts by following the quantile regression and conformal prediction links.

We will illustrate and further elaborate on both quantification modules using the hospital electric load dataset, which records the hourly electricity consumption of a hospital in San Francisco in 2015.

[1]:
# much faster using the following code, but may not have the latest upgrades/bugfixes
# pip install neuralprophet

if "google.colab" in str(get_ipython()):
    # uninstall preinstalled packages from Colab to avoid conflicts
    !pip uninstall -y torch notebook notebook_shim tensorflow tensorflow-datasets prophet torchaudio torchdata torchtext torchvision
    !pip install git+https://github.com/ourownstory/neural_prophet.git  # may take a while
[2]:
import numpy as np
import pandas as pd
from neuralprophet import NeuralProphet, uncertainty_evaluate, set_log_level, set_random_seed

data_location = "https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df = pd.read_csv(data_location + "energy/SF_hospital_load.csv")

Data splitting#

NeuralProphet provides a data splitting function which divides an input dataset into two subsets. You can configure the function by specifying the time-series frequency and the splitting ratio. A list of frequency aliases can be found here.

For our hospital electric load dataset, we split the original data into training and testing sets, reserving \(1/16\) of the data for testing.

[3]:
# Create NeuralProphet object
m = NeuralProphet()

# The data splitting function splits one time-series dataframe into two
# Configure the hourly frequency by assigning 'H' to parameter freq
# Configure the splitting ratio with a value between 0 and 1 for valid_p
train_df, test_df = m.split_df(df, freq="H", valid_p=1.0 / 16)
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.989% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

After this splitting, we have 8213 and 547 instances extracted in sequence as training and testing sets, respectively.

[4]:
train_df.shape, test_df.shape
[4]:
((8213, 2), (547, 2))

1. Quantile Regression#

With NeuralProphet's default forecasting settings, you only get a single output, the point estimate, for each instance. The point estimate corresponds to a regression on the 50th percentile. To generate a prediction interval, a NeuralProphet object needs a list containing at least one upper and one lower quantile as a parameter. You may, however, specify as many quantiles as you wish in a NeuralProphet model.

Returning to forecasting our hospital electric load dataset: assuming we want the true value to fall within the prediction interval 90% of the time (i.e., a 90% confidence level), we create a three-quantile regression model that outputs the 5th, 50th and 95th percentile values.

[5]:
# NeuralProphet only accepts quantile values between 0 and 1
# Parameter for quantile regression
confidence_lv = 0.9
quantile_list = [round(((1 - confidence_lv) / 2), 2), round((confidence_lv + (1 - confidence_lv) / 2), 2)]

# Create NeuralProphet object with list of quantile as parameter
qr_model = NeuralProphet(quantiles=quantile_list)
qr_model.set_plotting_backend("plotly-static")

Once the quantile regression module is added to a model, NeuralProphet uses the pinball loss (quantile loss) function to assess the goodness-of-fit of the trained model, similar to how the log-likelihood loss function is used for Gaussian linear regression.

Rather than weighting the absolute error equally, the pinball loss function applies a different weighting to the error for each quantile. We usually take an upper quantile above the 50th percentile as the upper bound of the prediction interval and a lower quantile below the 50th percentile as the lower bound. When the actual value falls outside the prediction interval, the loss function assigns a heavier weight to the absolute error, and vice versa. We minimise the loss by iteratively adjusting the parameters of each quantile line.

Let’s see how the weighting differs between two lower quantiles (the 10th vs. the 25th percentile). A percentile indicates the expected probability of the true value falling below the estimated value. Compared to the 25th percentile, the 10th percentile has a smaller expected probability of the true value falling below the line. When the actual value lies below the prediction interval (i.e., the actual value is smaller than the predicted one), such an error is more problematic for the 10th percentile, which expects 90% of the true values to lie above the line, than for the 25th percentile, which expects only 75% above the line.
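To make the weighting concrete, here is a minimal NumPy sketch of the pinball loss, for illustration only (NeuralProphet computes this loss internally). With a true value of 100 and a prediction of 110, the same 10-unit error is penalised more heavily for the 10th percentile than for the 25th:

import numpy as np

def pinball_loss(y_true, y_pred, quantile):
    # under-prediction (y_true > y_pred) is weighted by `quantile`,
    # over-prediction (y_true < y_pred) by `1 - quantile`
    error = y_true - y_pred
    return np.mean(np.maximum(quantile * error, (quantile - 1) * error))

y_true = np.array([100.0])
y_pred = np.array([110.0])  # prediction lies above the true value
print(pinball_loss(y_true, y_pred, 0.10))  # 9.0 -- heavier penalty for the 10th percentile
print(pinball_loss(y_true, y_pred, 0.25))  # 7.5 -- lighter penalty for the 25th percentile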

[6]:
# Fit the model with the observations
metrics = qr_model.fit(df, freq="H")

# Create a new dataframe for the results
# Including 100 historical values and 30 value points for the future
future = qr_model.make_future_dataframe(df, periods=30, n_historic_predictions=100)

# Perform prediction with the trained models
forecast = qr_model.predict(df=future)
WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale.
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.989% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling yearly seasonality. Run NeuralProphet with yearly_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 106
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.989% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.231% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.231% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

The prediction made for the hospital electric load is visualised below. yhat1 shows the prediction at the 50th percentile, while the prediction interval is shaded in light blue. Its lower bound is the prediction at the 5th percentile and its upper bound the prediction at the 95th.

[7]:
qr_model.plot(forecast)
../../_images/how-to-guides_feature-guides_uncertainty_quantification_14_0.svg

2. Conformal Prediction#

While there are different ways to carry out conformal prediction, NeuralProphet adopts split conformal prediction, which requires a holdout or calibration set. To carry out split conformal prediction, the dataset has to be split into three distinct sets for training, calibration and testing. An initial prediction interval is created by the base model trained on the training dataset. Uncertainty is quantified by comparing the target values in the calibration set against the predicted values. The final conformal prediction interval is then formed by adding the quantified uncertainty to both tails of the predicted value.

You can select either Naive (absolute residual) or Conformalized Quantile Regression (CQR) for conformal prediction in NeuralProphet. We discuss these two options in detail in the following subsections.

Calibration and validation set#

At least three subsets (i.e., training, calibration and testing) are needed for the conformal prediction feature in NeuralProphet. You may optionally add a validation subset. If you do add a validation subset for training the base model, make sure its period lies between the training and calibration subsets. We will not cover the validation procedure in detail here; refer to the Train, Validate and Test procedure tutorial to learn how to build a NeuralProphet model with a validation set.

Here, we further split a calibration set off from the training set, reserving \(1/11\) of the training data for calibration.

[8]:
# Add calibration set using the data splitting function
train_df, cal_df = m.split_df(train_df, freq="H", valid_p=1.0 / 11)
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.988% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

After this splitting, we have 7467, 746 and 547 instances extracted in sequence as training, calibration and testing sets, respectively.

[9]:
train_df.shape, cal_df.shape, test_df.shape
[9]:
((7467, 2), (746, 2), (547, 2))

Base model training#

You can use any NeuralProphet model you deem fit as the base model. The calibration process of conformal prediction is later added on top of the base model to quantify the uncertainty in the final estimate.

We are interested in how conformal prediction affects different models. Returning to our example, we will compare the conformal prediction results of a simple quantile regression model and a more complex 4-layer autoregression model. You can refer to the quantile regression section above and the Autoregression standalone tutorial for the logic and application of these features.

[10]:
# Parameter for autoregression
# Predict the next value based on the last three days of hourly data (72 one-hour steps)
n_lags = 3 * 24
[11]:
# Create a simple quantile regression model
cp_model1 = NeuralProphet(quantiles=quantile_list)
cp_model1.set_plotting_backend("plotly-static")

# Create a 4-layer autoregression model as the base
cp_model2 = NeuralProphet(
    yearly_seasonality=False,
    weekly_seasonality=False,
    daily_seasonality=False,
    n_lags=n_lags,
    ar_layers=[32, 32, 32, 32],
    learning_rate=0.003,
    quantiles=quantile_list,
)
cp_model2.set_plotting_backend("plotly-static")

After configuring the models, we fit them with the training set. If you have further split the training data into training and validation sets, you can either (i) concatenate the two into one dataset for training, or (ii) pass the training and validation datasets as two separate parameters.
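As a minimal sketch (assuming a hypothetical val_df had been split off between train_df and cal_df; it is not part of this notebook's workflow), the two options could look like this:

# Sketch only, not executed in this notebook: val_df is a hypothetical validation split
# (i) concatenate training and validation data into a single training set
# metrics1 = cp_model1.fit(pd.concat([train_df, val_df]), freq="H")
# (ii) pass the validation data separately via the validation_df argument
# metrics1 = cp_model1.fit(train_df, freq="H", validation_df=val_df)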

[12]:
# Feed the training subset in the configured NeuralProphet models
# Configure the hourly frequency by assigning 'H' to parameter freq
set_random_seed(0)
metrics1 = cp_model1.fit(train_df, freq="H")
set_random_seed(0)
metrics2 = cp_model2.fit(train_df, freq="H")
WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale.
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.987% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.utils.set_auto_seasonalities) - Disabling yearly seasonality. Run NeuralProphet with yearly_seasonality=True to override this.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 111
WARNING - (NP.config.set_lr_finder_args) - Learning rate finder: The number of batches (234) is too small than the required number                     for the learning rate finder (246). The results might not be optimal.
WARNING - (NP.forecaster.fit) - When Global modeling with local normalization, metrics are displayed in normalized scale.
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.987% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.config.init_data_params) - Setting normalization to global as only one dataframe provided for training.
INFO - (NP.config.set_auto_batch_epoch) - Auto-set batch_size to 32
INFO - (NP.config.set_auto_batch_epoch) - Auto-set epochs to 111

We use the fitted base models to forecast both the point predictions and the quantile regression prediction intervals for the testing dataset.

[13]:
# Perform estimation for the testing data with the trained model
forecast1 = cp_model1.predict(test_df)[n_lags:]
forecast2 = cp_model2.predict(test_df)[n_lags:]
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.818% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

Option 1: Naive Conformal Prediction#

After training the base model, we carry out the calibration process using the naive module. The steps are as follows: i. predict the output value for each instance in the calibration set; ii. calculate the absolute residual by comparing the actual and predicted value for each observation in the calibration set; iii. sort all absolute residuals in ascending order; iv. find the quantified uncertainty (\(\hat{q}\)) at the desired confidence level; and v. use the quantified uncertainty (\(\hat{q}\)) to form the final prediction intervals.
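For intuition, these steps can be sketched in a few lines of NumPy (illustration only; conformal_predict below performs the calibration for you, and y_cal/yhat_cal here stand for the observed and predicted values on the calibration set):

import numpy as np

def naive_qhat(y_cal, yhat_cal, alpha):
    scores = np.sort(np.abs(y_cal - yhat_cal))  # steps i-iii: sorted absolute residuals
    n = len(scores)
    # step iv: (1 - alpha) quantile with a finite-sample correction
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level)

# step v: the final interval on the test set is yhat1 - qhat to yhat1 + qhat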

Going back to our example, we need to specify the calibration set and the significance level (alpha) for conformal prediction on top of the pre-trained models above.

[14]:
# Parameter for naive conformal prediction
method = "naive"
alpha = 1 - confidence_lv

# Enable conformal predict on the pre-trained models
# The evaluate parameter is optional; refer to the "Evaluate Performance" section below
naive_forecast1 = cp_model1.conformal_predict(
    test_df,
    calibration_df=cal_df,
    alpha=alpha,
    method=method,
    plotting_backend="plotly-static",
    show_all_PI=True,
)
naive_forecast2 = cp_model2.conformal_predict(
    test_df,
    calibration_df=cal_df,
    alpha=alpha,
    method=method,
    plotting_backend="plotly-static",
    show_all_PI=True,
)
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

(Interactive Plotly figure output; not rendered in this static view.)

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.818% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

(Interactive Plotly figure output; not rendered in this static view.)

The plots above demonstrate how the quantified uncertainty (\(\hat{q}\)) changes with different confidence levels (1 - alpha).

Below are the \(\hat{q}\) values for each model. As they are both positive, the naive intervals extend beyond those of the simple QR. The weaker model shows a large gap between the two intervals since it has a larger \(\hat{q}\) value, while the bounds shift much less for the complex model.

[15]:
naive_forecast1
[15]:
ds y yhat1 yhat1 - qhat1 yhat1 + qhat1 trend season_weekly season_daily yhat1 5.0% yhat1 95.0%
0 2015-12-09 06:00:00 803.410865 988.585815 827.912791 1149.258840 1031.427246 -35.461987 -7.379447 909.404419 1087.498291
1 2015-12-09 07:00:00 868.264194 1089.076294 928.403269 1249.749319 1031.444336 -43.485428 101.117432 994.053711 1189.946045
2 2015-12-09 08:00:00 975.752982 1146.444092 985.771067 1307.117117 1031.461304 -51.675148 166.657928 1057.043823 1236.407227
3 2015-12-09 09:00:00 983.268943 1155.003296 994.330271 1315.676321 1031.478271 -59.986122 183.511185 1078.018799 1243.008301
4 2015-12-09 10:00:00 1095.825986 1145.466553 984.793528 1306.139578 1031.495239 -68.370621 182.341919 1073.590576 1236.537354
... ... ... ... ... ... ... ... ... ... ...
542 2015-12-31 20:00:00 845.563081 833.224609 672.551584 993.897634 1040.634766 -103.066475 -104.343666 757.201721 912.330200
543 2015-12-31 21:00:00 827.530521 795.515076 634.842051 956.188101 1040.651733 -94.571983 -150.564713 711.605408 872.624634
544 2015-12-31 22:00:00 829.256300 771.299255 610.626230 931.972280 1040.668701 -85.883865 -183.485596 677.626831 873.617920
545 2015-12-31 23:00:00 813.937205 762.211975 601.538950 922.885000 1040.685791 -77.059326 -201.414413 669.592285 869.896973
546 2016-01-01 00:00:00 815.588584 765.104858 604.431834 925.777883 1040.702759 -68.155449 -207.442413 675.509155 861.297852

547 rows × 10 columns

[16]:
naive_qhat1 = naive_forecast1.iloc[-1]["yhat1"] - naive_forecast1.iloc[-1]["yhat1 - qhat1"]  # cp_model1
naive_qhat2 = naive_forecast2.iloc[-1]["yhat1"] - naive_forecast2.iloc[-1]["yhat1 - qhat1"]  # cp_model2
naive_qhat1, naive_qhat2
[16]:
(160.67302487812503, 20.35876320312491)

We can then plot the prediction intervals (5th, 50th and 95th percentile values) to compare the performance of the models. The quantile regression prediction intervals are annotated in blue, while the conformal prediction intervals, which include the quantified uncertainty, are denoted in red.

With the same quantile parameter, the simple model has a much wider quantile regression prediction interval (in blue) compared to the complex model. The same holds for the conformal prediction intervals (in red): the weaker model has wider intervals, which therefore capture more actual values than its quantile regression prediction intervals.

[17]:
# Date range shown in the plots (optional)
cutoff = 7 * 24

fig1 = cp_model1.highlight_nth_step_ahead_of_each_forecast(1).plot(
    naive_forecast1[-cutoff:], plotting_backend="plotly-static"
)
fig2 = cp_model2.highlight_nth_step_ahead_of_each_forecast(1).plot(
    naive_forecast2[-cutoff:], plotting_backend="plotly-static"
)
WARNING - (NP.forecaster.plot) - highlight_forecast_step_n is ignored since auto-regression not enabled.
../../_images/how-to-guides_feature-guides_uncertainty_quantification_34_1.svg
../../_images/how-to-guides_feature-guides_uncertainty_quantification_34_2.svg

Option 2: Conformalized Quantile Regression#

In Conformalized Quantile Regression, or the cqr module, the method runs as follows: i. calculate non-conformity scores as the differences between data points in the calibration dataset and their nearest prediction quantile, which measures how well the data fit the current quantile regression model; the non-conformity scores are negative for data points inside the quantile regression interval and positive for those outside it; ii. sort the list of non-conformity scores; iii. find the value of \(\hat{q}\) such that the proportion of scores greater than \(\hat{q}\) equals the error rate; and iv. adjust the quantiles from the regression model by that amount (\(\hat{q}\)).

There are two scenarios, depending on the sign of \(\hat{q}\). If the one-sided prediction interval width adjustment is positive, CQR extends beyond the QR intervals, as it deems the QR interval too confident. Conversely, if the adjustment is negative, CQR contracts the QR intervals, as it deems the QR interval too conservative.
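A rough NumPy sketch of the CQR calibration step (illustration only; conformal_predict handles this internally, and y_cal/lo_cal/hi_cal here stand for the observed calibration values and the base model's lower/upper quantile predictions on them):

import numpy as np

def cqr_qhat(y_cal, lo_cal, hi_cal, alpha):
    # step i: non-conformity scores, negative inside the QR interval, positive outside
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    # steps ii-iii: sort and take the finite-sample corrected (1 - alpha) quantile
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(np.sort(scores), q_level)

# step iv: widen (qhat > 0) or narrow (qhat < 0) the QR interval on the test set:
# lower = lo_test - qhat, upper = hi_test + qhat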

[18]:
# Parameter for Conformalized Quantile Regression
method = "cqr"

# Enable conformal predict on the pre-trained models
# The evaluate parameter is optional; refer to the "Evaluate Performance" section below
cqr_forecast1 = cp_model1.conformal_predict(
    test_df, calibration_df=cal_df, alpha=alpha, method=method, plotting_backend="plotly-static"
)
cqr_forecast2 = cp_model2.conformal_predict(
    test_df, calibration_df=cal_df, alpha=alpha, method=method, plotting_backend="plotly-static"
)
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

(Interactive Plotly figure output; not rendered in this static view.)

INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.866% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.817% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils._infer_frequency) - Major frequency H corresponds to 99.818% of the data.
INFO - (NP.df_utils._infer_frequency) - Defined frequency is equal to major frequency - H
INFO - (NP.df_utils.return_df_in_original_format) - Returning df with no ID column

(Interactive Plotly figure output; not rendered in this static view.)

The \(\hat{q}\) adjustments are again positive for both models, so the CQR intervals extend beyond those of the simple QR. cp_model1 shows a large gap between the two intervals since it has a larger \(\hat{q}\) value, while the bounds shift much less for cp_model2.

We again plot the prediction intervals to examine how this CQR method affects the result.

[19]:
fig1 = cp_model1.highlight_nth_step_ahead_of_each_forecast(1).plot(
    cqr_forecast1[-cutoff:], plotting_backend="plotly-static"
)
fig2 = cp_model2.highlight_nth_step_ahead_of_each_forecast(1).plot(
    cqr_forecast2[-cutoff:], plotting_backend="plotly-static"
)
WARNING - (NP.forecaster.plot) - highlight_forecast_step_n is ignored since auto-regression not enabled.
../../_images/how-to-guides_feature-guides_uncertainty_quantification_39_1.svg
../../_images/how-to-guides_feature-guides_uncertainty_quantification_39_2.svg

Evaluate Performance#

We use interval width and miscoverage rate as the performance metrics.

  • interval_width: the average width of the prediction interval (equal to q_hat multiplied by 2 when the method is static or non-adaptive); also known as the efficiency metric.

  • miscoverage_rate: the actual miscoverage error rate on the out-of-sample (OOS) test set; also known as the validity metric.

The smaller the metrics are, the better the models are performing.
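For reference, both metrics can be computed by hand from the interval bounds (illustration only; uncertainty_evaluate below does this for you, and y/lower/upper here stand for the observed test values and the interval bounds of a given method):

import numpy as np

def interval_width(lower, upper):
    # efficiency: average width of the prediction interval
    return np.mean(upper - lower)

def miscoverage_rate(y, lower, upper):
    # validity: share of actual values falling outside the interval
    return np.mean((y < lower) | (y > upper))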

Let’s evaluate the models we trained above. We first build a dataframe with two rows, one for each model, and then concatenate the Naive and CQR performance metrics into it for evaluation.

[20]:
# Create evaluation dataframe skeleton with the 2 models (denoted as m1 and m2)
models = ["m1", "m2"]
eval_df = pd.DataFrame(models, columns=pd.MultiIndex.from_tuples([("model", "", "")]))

Here we aggregate the naive performance metrics and place them in the evaluation dataframe skeleton. The performance metrics are calculated when the evaluate parameter is turned on in the conformal prediction feature.

[21]:
# Extract the naive performance metrics from their respective forecast datasets
naive_eval1 = uncertainty_evaluate(naive_forecast1)
naive_eval2 = uncertainty_evaluate(naive_forecast2)

# Aggregate the naive performance metrics for m1 and m2
naive_evals = [naive_eval1, naive_eval2]
naive_eval_df = pd.concat(naive_evals).reset_index(drop=True)
[22]:
# Extract the cqr performance metrics from their respective forecast datasets
cqr_eval1 = uncertainty_evaluate(cqr_forecast1)
cqr_eval2 = uncertainty_evaluate(cqr_forecast2)

# Aggregate the cqr performance metrics for m1 and m2
cqr_evals = [cqr_eval1, cqr_eval2]
cqr_eval_df = pd.concat(cqr_evals).reset_index(drop=True)

Lastly, we concatenate the naive and cqr evaluation dataframes and compare how the models perform under the naive and cqr prediction methods.

[23]:
# Concatenate the naive and cqr evaluation dataframes
eval_df = pd.concat([naive_eval_df, cqr_eval_df], axis=1, keys=["naive", "cqr"])
eval_df
[23]:
naive cqr
yhat1 yhat1
interval_width miscoverage_rate interval_width miscoverage_rate
0 321.346050 0.107861 309.400519 0.107861
1 40.717526 0.065263 41.712111 0.063158

Performance of the trained models above:

  • This notebook only uses single-forecast-timestep models, hence yhat1 is the only point estimator.

  • Across all three uncertainty prediction methods, the complex autoregression model (m2) has a smaller interval_width and miscoverage_rate than the simpler model (m1). In this example, the more complex model fits the data better and produces more accurate predictions.

  • Both the Naive and CQR conformal prediction methods significantly outperform vanilla QR in terms of miscoverage_rate. This shows that vanilla QR is overconfident in its quantile range; its interval_width would need to be broadened further for its actual miscoverage_rate (on the out-of-sample test set) to converge to the specified alpha of 0.1.

  • For the Naive conformal predictions, the interval_width is twice the quantified uncertainty (qhat1), so the prediction intervals are symmetrical.

  • Looking only at the simple quantile regression model (m1), CQR is preferable: it achieves the same miscoverage_rate as Naive with a narrower interval_width.

  • As for the complex model (m2), Naive has a slightly better interval_width, but CQR has a slightly better miscoverage_rate. You may want to feed the model more data to determine which method is preferable.