Tutorial 2.b: Learning Calibrated Probabilities: The Basics
============================================================

Calibration is a basic requirement for probabilistic predictions. However, many classification models are not intrinsically well-calibrated. Thus, there exist several methods that apply a post-processing adjustment to the classifier to improve the model’s calibration. In this tutorial we will cover some basic recalibration techniques.

.. code:: python

    # We must first import the dependencies, and make sure that the torchuq package is in PYTHONPATH
    # If you are running this notebook from its original directory in the repo, then the following statement should work
    import sys
    sys.path.append('../..')   # Include the directory that contains the torchuq package

    import torch
    from matplotlib import pyplot as plt

.. code:: python

    reader = torch.load('pretrained/resnet18-cifar10.pt')
    predictions = reader['categorical']
    labels = reader['labels']

Most calibration algorithms are in the ``torchuq.transform.calibrate`` module, with the exception of decision calibration, which is in the ``torchuq.transform.decision`` module.

.. code:: python

    from torchuq.evaluate import categorical
    from torchuq.transform.calibrate import *

Torchuq implements many recalibration algorithms. Below is a list of currently supported algorithms:

- Standard calibration methods: temperature scaling, histogram binning
- Classwise calibration methods: Dirichlet recalibration
- Advanced calibration methods: decision calibration

Standard Calibration Methods
----------------------------

Temperature Scaling
~~~~~~~~~~~~~~~~~~~

Temperature scaling is a recalibration method for neural networks that learns a single temperature parameter and rescales the confidence scores for all samples simultaneously. For each input :math:`x`, a neural network typically first computes a logit score :math:`l_1(x), \dots, l_n(x)` for each of the :math:`n` possible classes. It then applies a softmax function to compute the confidence score:

.. math:: \hat{p}(x) = \max_i \frac{e^{l_i(x)}}{\sum_j e^{l_j(x)}}

Temperature scaling simply adds a temperature parameter :math:`T` to the softmax function:

.. math:: \hat{p}(x, T) = \max_i \frac{e^{l_i(x)/T}}{\sum_j e^{l_j(x)/T}}

A higher temperature (:math:`T>1`) will reduce the confidence, and a lower temperature (:math:`T<1`) will increase the confidence. :math:`T` is trained to minimize the standard cross entropy loss on a validation dataset. Note that the *accuracy* of the classifier will not change: the relative ordering of the :math:`n` possible classes remains the same, so the predicted class remains the same. Only the *confidence* of the predicted class changes. Despite its simplicity, temperature scaling performs well empirically in classification calibration for deep neural networks [1].

In torchuq, all calibrator classes work in essentially the same way, so this may sound familiar if you have seen the tutorial for regression. In particular, to use ``TemperatureScaling`` there are only three functions that you need to know:

1. Constructor: ``calibrator = TemperatureScaling()`` creates a new calibrator instance.
2. Train: ``calibrator.train(val_preds, val_labels)`` trains the calibrator based on validation predictions and validation labels.
3. Test: ``predictions_test = calibrator(test_preds)`` outputs the recalibrated predictions.

We first plot the reliability diagram of the original (uncalibrated) predictions.

.. code:: python

    categorical.plot_reliability_diagram(predictions, labels);

.. image:: output_8_0.png
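To build intuition for the formulas above, here is a minimal sketch (the logit values are made up for illustration and are not part of the tutorial data) showing how dividing the logits by a temperature :math:`T > 1` lowers the maximum softmax confidence while leaving the predicted class unchanged.

.. code:: python

    # Toy illustration of temperature scaling on a single made-up logit vector
    toy_logits = torch.tensor([3.0, 1.0, 0.5, -1.0, -2.0])

    for T in [0.5, 1.0, 2.0, 5.0]:
        probs = torch.softmax(toy_logits / T, dim=0)  # temperature-scaled softmax
        # The argmax (predicted class) never changes; only the confidence does
        print(f'T={T:.1f}  predicted class={probs.argmax().item()}  confidence={probs.max().item():.3f}')

Of course, ``TemperatureScaling`` below does not choose :math:`T` by hand; it learns the temperature by minimizing the cross entropy loss on the validation predictions.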
.. code:: python

    calibrator = TemperatureScaling(verbose=True)
    calibrator.train(predictions, labels)
    predictions_ts = calibrator(predictions)

.. parsed-literal::

    _change_device is deprecated
    Iteration 0, lr=0.00100, NLL=0.178
    Iteration 100, lr=0.00100, NLL=0.170
    Iteration 200, lr=0.00100, NLL=0.165
    Iteration 300, lr=0.00100, NLL=0.163
    Iteration 400, lr=0.00100, NLL=0.162
    Iteration 500, lr=0.00100, NLL=0.162
    Iteration 600, lr=0.00100, NLL=0.162
    Iteration 700, lr=0.00050, NLL=0.162
    _change_device is deprecated

.. code:: python

    categorical.plot_reliability_diagram(predictions_ts, labels);

.. image:: output_10_0.png

Histogram Binning
~~~~~~~~~~~~~~~~~

Histogram binning is another popular recalibration method for classification models. Histogram binning partitions the confidence scores into bins :math:`\{[0, \epsilon), [\epsilon, 2\epsilon), \cdots, [1-\epsilon, 1]\}`, and sorts each validation sample into a bin based on its predicted confidence. It then resets the confidence score for each element in a bin to match the average classification accuracy of elements in that bin.

With torchuq, it is extremely easy to use a different recalibration method: simply replace the ``TemperatureScaling`` class with the ``HistogramBinning`` class. In this case, histogram binning appears to achieve better calibration than temperature scaling, as shown in the reliability diagram below.

.. code:: python

    calibrator = HistogramBinning(verbose=True)
    calibrator.train(predictions, labels)
    predictions_hb = calibrator(predictions)
    categorical.plot_reliability_diagram(predictions_hb, labels);

.. parsed-literal::

    Top-1 accuracy of predictor is 0.952
    Number of histogram binning bins is 20
    tensor([-1.0000,  0.8616,  0.9733,  0.9918,  0.9962,  0.9977,  0.9983,  0.9986,
             0.9988,  0.9990,  0.9991,  0.9992,  0.9992,  0.9993,  0.9994,  0.9994,
             0.9995,  0.9996,  0.9996,  0.9997,  2.0000])

.. image:: output_12_1.png

Classwise Calibration Methods
-----------------------------

Sometimes, overall confidence calibration is insufficient. In many applications, we may instead want to achieve the stronger notion of *classwise* calibration. Classwise calibration groups samples by their true class label and considers the ECE within each class; this may be useful (for instance) for ensuring that different groups are similarly well calibrated.

Dirichlet Calibration
~~~~~~~~~~~~~~~~~~~~~

One method for achieving classwise calibration is Dirichlet calibration [2]. This method learns a calibration map that transforms the predicted probabilities based on a Dirichlet probability distribution. In neural networks, it can be implemented as additional layers: a logarithmic transformation followed by a fully connected layer with softmax activation.
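As a rough sketch of that description (not torchuq’s actual implementation: the module name and the clamping constant are our own choices, and we assume ``predictions`` is an ``(N, 10)`` tensor of class probabilities as loaded above), such a calibration map could look like this in PyTorch:

.. code:: python

    import torch.nn as nn

    class ToyDirichletMap(nn.Module):
        """Illustrative calibration map: logarithm -> fully connected layer -> softmax."""
        def __init__(self, num_classes):
            super().__init__()
            self.linear = nn.Linear(num_classes, num_classes)  # learnable calibration parameters

        def forward(self, probs):
            log_probs = torch.log(probs.clamp(min=1e-12))      # logarithmic transformation
            return torch.softmax(self.linear(log_probs), dim=-1)

    toy_map = ToyDirichletMap(num_classes=10)
    recalibrated = toy_map(predictions)  # untrained here; in practice the parameters are fit on validation data

In torchuq we do not need to write or train this map ourselves: the ``DirichletCalibrator`` below follows the same three-function interface as before and fits the calibration map for us (the ``nll`` and ``reg`` values in its training log correspond to the fitting objective and a regularization term).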
.. code:: python

    calibrator = DirichletCalibrator(verbose=True)
    calibrator.train(predictions, labels)
    predictions_dc = calibrator(predictions)
    categorical.plot_reliability_diagram(predictions_dc, labels);

.. parsed-literal::

    _change_device is deprecated
    Finished training 0 epochs, lr=0.00100, nll = 4.563, reg = 3.135
    Finished training 10 epochs, lr=0.00100, nll = 0.457, reg = 2.039
    Finished training 20 epochs, lr=0.00100, nll = 0.231, reg = 1.139
    Finished training 30 epochs, lr=0.00100, nll = 0.187, reg = 0.574
    Finished training 40 epochs, lr=0.00100, nll = 0.170, reg = 0.287
    Finished training 50 epochs, lr=0.00100, nll = 0.162, reg = 0.145
    Finished training 60 epochs, lr=0.00100, nll = 0.158, reg = 0.083
    Finished training 70 epochs, lr=0.00100, nll = 0.155, reg = 0.051
    Finished training 80 epochs, lr=0.00100, nll = 0.154, reg = 0.043
    Finished training 90 epochs, lr=0.00100, nll = 0.152, reg = 0.039
    Finished training 100 epochs, lr=0.00100, nll = 0.151, reg = 0.039
    Finished training 110 epochs, lr=0.00050, nll = 0.150, reg = 0.042
    Finished training 120 epochs, lr=0.00013, nll = 0.150, reg = 0.043
    Finished training 130 epochs, lr=0.00006, nll = 0.150, reg = 0.042
    Finished training 140 epochs, lr=0.00003, nll = 0.150, reg = 0.042
    _change_device is deprecated

.. image:: output_14_1.png

Advanced Calibration Methods
----------------------------

Decision Calibration
~~~~~~~~~~~~~~~~~~~~

[TBD]

References
----------

[1] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Weinberger. “On Calibration of Modern Neural Networks.” International Conference on Machine Learning (2017).

[2] Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. “Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration.” NeurIPS (2019).