Tutorial 2.a: Representing and Evaluating Uncertainty for Classification
=========================================================================

The structure of this tutorial mirrors that of Tutorial 1.a. Tutorial 1.a focuses on regression problems, while the current tutorial focuses on classification problems.

Before we start to work with any predictions, we must first think about how to represent them. For example, when predicting image classes, we can represent the prediction as a categorical distribution over all possible labels, or as a set of likely labels. Each representation has its pros and cons, and depending on the requirements during training or deployment, we may even want to convert between representations. This notebook introduces some popular representations, as well as metrics that measure the quality of the predictions.

We first list the types of predictions currently supported by torchuq for classification. You can skip this part and come back later as a reference.

+-------------+-----------------------------------------------+-------------------------------------------------------------+------------------------------------+
| Name        | Variable type/shape                           | Special requirement                                         | torchuq sub-module for evaluation  |
+=============+===============================================+=============================================================+====================================+
| Topk        | ``int array [batch_size] or [batch_size, k]`` | Each element takes values in ``{0, 1, ..., num_classes-1}`` | ``torchuq.evaluate.topk``          |
+-------------+-----------------------------------------------+-------------------------------------------------------------+------------------------------------+
| Categorical | ``float32 array [batch_size, num_classes]``   | Elements should be in :math:`[0, 1]` and sum to :math:`1`   | ``torchuq.evaluate.categorical``   |
+-------------+-----------------------------------------------+-------------------------------------------------------------+------------------------------------+
| USet        | ``int array [batch_size, num_classes]``       | Elements are 0 or 1                                         | ``torchuq.evaluate.uset``          |
+-------------+-----------------------------------------------+-------------------------------------------------------------+------------------------------------+
| Ensemble    | ``dict: name -> prediction``                  | Each name must start with the prediction type followed by   | Unavailable                        |
|             |                                               | a string (with no special characters), such as              |                                    |
|             |                                               | 'categorical_1'                                             |                                    |
+-------------+-----------------------------------------------+-------------------------------------------------------------+------------------------------------+

.. code:: python

    # We must first import the dependencies, and make sure that the torchuq package is in PYTHONPATH
    # If you are running this notebook in its original directory in the repo, then the following statement should work
    import sys
    sys.path.append('../..')   # Include the directory that contains the torchuq package

    import torch
    from matplotlib import pyplot as plt

As a running example, we will use existing predictions for CIFAR-10. We first load these predictions.

.. code:: python

    reader = torch.load('pretrained/resnet18-cifar10.pt')

    # These functions transform categorical predictions into different types of predictions
    # We will discuss transformations later; for now we simply use them to generate our example predictions
    from torchuq.transform.direct import *

    predictions_categorical = reader['categorical']
    predictions_uset = categorical_to_uset(reader['categorical'])
    predictions_top1 = categorical_to_topk(reader['categorical'], 1)
    predictions_top3 = categorical_to_topk(reader['categorical'], 3)
    labels = reader['labels']
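Before moving on, the next cell builds a tiny toy example of each basic representation using plain PyTorch, to make the shape conventions in the table above concrete. This is only an illustrative sketch, not part of torchuq: the ``toy_*`` variable names, the probabilities, and the 0.2 threshold are made up for illustration, and for real predictions we rely on torchuq's conversion functions instead.

.. code:: python

    # A toy illustration of the three basic representations and their shapes.
    # The probabilities and the 0.2 threshold are arbitrary, chosen only for illustration.
    toy_categorical = torch.tensor([[0.7, 0.1, 0.1, 0.1],     # Categorical: [batch_size, num_classes], rows sum to 1
                                    [0.2, 0.5, 0.2, 0.1],
                                    [0.3, 0.3, 0.2, 0.2]])
    toy_top1 = toy_categorical.argmax(dim=1)                   # Topk with k=1: int array [batch_size]
    toy_top2 = toy_categorical.topk(2, dim=1).indices          # Topk with k=2: int array [batch_size, 2]
    toy_uset = (toy_categorical >= 0.2).long()                 # USet: 0/1 int array [batch_size, num_classes]
    print(toy_top1.shape, toy_top2.shape, toy_uset.shape)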
1. Top-k Prediction
~~~~~~~~~~~~~~~~~~~

The simplest type of prediction specifies the top-k labels (i.e. the k most likely predicted labels). The labels are represented as integers :math:`\lbrace 0, 1, \cdots, \text{n classes}-1 \rbrace`. A batch of top-k predictions is represented by an integer array of shape ``[batch_size, k]``, where ``predictions[i, :]`` is a sequence of labels. A top-1 prediction can be represented either as an array of shape ``[batch_size, 1]`` or, more conveniently, as an array of shape ``[batch_size]``.

Here, we first verify that the loaded top-3 and top-1 predictions have the correct shapes.

.. code:: python

    print(predictions_top1.shape)
    print(predictions_top3.shape)


.. parsed-literal::

    torch.Size([10000])
    torch.Size([10000, 3])


A very natural way to visualize the quality of a top-1 prediction is the confusion matrix: among the samples that are predicted as class :math:`i`, how many actually belong to class :math:`j`? To plot a confusion matrix in torchuq, use ``torchuq.evaluate.topk.plot_confusion_matrix``.

.. code:: python

    from torchuq.evaluate import topk

    topk.plot_confusion_matrix(predictions_top1, labels);


.. image:: output_8_0.png


We can also evaluate metrics for these predictions, such as the accuracy.

.. code:: python

    print(topk.compute_accuracy(predictions_top1, labels))
    print(topk.compute_accuracy(predictions_top3, labels))


.. parsed-literal::

    tensor(0.9524)
    tensor(0.9951)


2. Categorical Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~

The categorical prediction is perhaps the most useful prediction type for classification. For each possible label, this type of prediction returns the probability that it is the correct label. In torchuq a categorical prediction is represented as a float array of shape ``[batch_size, n_classes]``, where ``predictions[i, j]`` is the probability that the :math:`i`-th sample takes the :math:`j`-th label.

.. code:: python

    print(predictions_categorical.shape)


.. parsed-literal::

    torch.Size([10000, 10])


**Confidence Calibration**. Given a categorical prediction :math:`p \in [0, 1]^{\text{n classes}}`, the confidence of the prediction is the largest probability in the array: :math:`\max_i p_i`. If this largest probability is close to 1, then the prediction is highly confident. A simple but important requirement for this type of prediction is confidence calibration: among the samples with confidence :math:`c`, the top-1 accuracy should also be :math:`c`. For instance, if a model is 90% confident in each of 100 predictions, it should predict the correct label for 90 of the samples. If this property does not hold, the confidence estimates are not meaningful.

We can visualize confidence calibration with a reliability diagram, which plots the (actual) accuracy :math:`a` among samples with predicted confidence :math:`c` against the predicted confidence :math:`c`. Ideally the predicted confidence :math:`c` equals the actual accuracy :math:`a`, so a perfectly calibrated model yields the diagonal :math:`a = c` line; deviations from this line indicate miscalibration. As an example, we plot the reliability diagram for our example predictions below, and it is clear that the predictions are not well-calibrated. For example, among all samples with a confidence of about 0.9, the accuracy is only about 0.8. Hence the accuracy is lower than the confidence, and the predictions are over-confident.

We can also compute the expected calibration error (ECE), a single number that measures miscalibration. The ECE measures the average deviation from the ideal :math:`a = c` line. In practice, the ECE is approximated by binning: the predicted confidences are partitioned into bins, and we take a weighted average of the difference between the accuracy and the average confidence within each bin. Pictorially, it is the average distance between the blue bars and the diagonal in the reliability diagram below.
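To see what this binning amounts to in code, the next cell sketches a hand-rolled version of the binned confidence/accuracy computation with uniform-width bins. This is only an illustration of the idea and introduces its own variable names; it is not how ``torchuq.evaluate.categorical.compute_ece`` is implemented, so the exact number may differ slightly from the one torchuq reports below.

.. code:: python

    # A hand-rolled sketch of binned calibration statistics with uniform-width bins.
    # This mirrors the verbal description above; torchuq's implementation may differ in details.
    num_bins = 15
    confidence, top1 = predictions_categorical.max(dim=1)   # confidence = max_i p_i
    correct = (top1 == labels).float()                       # 1 if the top-1 prediction is correct

    bin_ids = torch.clamp((confidence * num_bins).long(), max=num_bins - 1)
    ece_manual = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        acc = correct[mask].mean()          # accuracy a within the bin
        conf = confidence[mask].mean()      # average confidence c within the bin
        ece_manual += (mask.float().mean() * (acc - conf).abs()).item()   # weight by bin frequency
    print('Manual ECE estimate: %.4f' % ece_manual)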
.. code:: python

    from torchuq.evaluate import categorical

    categorical.plot_reliability_diagram(predictions_categorical, labels, binning='uniform');
    print('ECE-error is %.4f' % categorical.compute_ece(predictions_categorical, labels, num_bins=15))


.. parsed-literal::

    ECE-error is 0.0277


.. image:: output_14_1.png


3. Uncertainty Set Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The next type of representation is the (uncertainty) set prediction. Uncertainty sets are almost the same as top-k predictions; the main difference is that for top-k predictions, k must be specified a priori, while for uncertainty sets the set size can differ from sample to sample. In torchuq, uncertainty set predictions are represented by an integer array of shape ``[batch_size, n_classes]``, where ``predictions[i, j] = 1`` indicates that the :math:`i`-th sample includes the :math:`j`-th label in its uncertainty set, and ``predictions[i, j] = 0`` indicates that it does not.

For set predictions, there are two important properties to consider:

- The coverage: the frequency with which the true label belongs to the predicted set. High coverage means that the true label almost always belongs to the predicted set.
- The set size: the number of elements in the predicted set.

Ideally, we would like high coverage with a small set size. We compute the coverage and the set size of the example predictions below.

.. code:: python

    from torchuq.evaluate import uset

    coverage = uset.compute_coverage(predictions_uset, labels)
    size = uset.compute_size(predictions_uset)
    print("The coverage is %.3f, average set size is %.3f" % (coverage, size))


.. parsed-literal::

    The coverage is 0.987, average set size is 1.268
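As a sanity check, both quantities can also be recomputed directly from their definitions in a few lines of plain PyTorch. The variable names below are introduced only for this sketch; in practice, use ``torchuq.evaluate.uset`` as above.

.. code:: python

    # Recompute coverage and average set size directly from their definitions
    # (an illustrative sketch, not torchuq's implementation).
    in_set = predictions_uset[torch.arange(len(labels)), labels.long()]   # 1 if the true label is in the predicted set
    manual_coverage = in_set.float().mean()
    manual_size = predictions_uset.sum(dim=1).float().mean()
    print('Manual check: coverage %.3f, average set size %.3f' % (manual_coverage, manual_size))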