The most common strategy for making AI smarter is to have the Machine Learning model tell humans when it is uncertain about a task, and then ask those humans for the correct feedback. In general, the unlabeled data that confuses a Machine Learning algorithm will be the most valuable when it is labeled and added to the training data: if the algorithm can already label an item with high confidence, it is probably correct already.
Interpreting Uncertainty in a Machine Learning Model
Uncertainty Sampling is a strategy for identifying unlabeled items that are near a decision boundary in your current Machine Learning model. While it is easy to identify when a model is confident — there is one result with very high confidence — there are many different ways to calculate uncertainty and your choice will depend on your use case and what is the most effective for your particular data.
We will cover four approaches to Uncertainty Sampling:
> Least Confidence Sampling: the difference between the most confident prediction and 100% confidence.
> Margin of Confidence Sampling: the difference between the top two most confident predictions.
> Ratio of Confidence Sampling: the ratio between the top two most confident predictions.
> Entropy-based Sampling: the difference between all predictions, as defined by information theory. In our example, Entropy-based Sampling captures how much every confidence differs from every other.
We’ll also look at how to determine uncertainty from different types of Machine Learning algorithms, and at how to combine predictions from different models.
Interpreting the scores from your model
Almost all Machine Learning models will give you two things:
> A predicted label (or set of predictions)
> A number (or set of numbers) associated with each predicted label.
The numbers are generally interpreted as confidences in the prediction, although this can be more or less true depending on how the numbers are generated. The general principle of Uncertainty Sampling is that if there are mutually exclusive categories with similar confidence, then this is good evidence that the model is confused in the prediction and that a human judgment would be valuable. Therefore, the model will benefit most when it learns to correctly predict the label of an item with an uncertain prediction. For example:
You can see in the example that “Cyclist” is predicted with a 0.919 score. The scores for the other possible labels, “Pedestrian”, “Sign” and “Animal”, are 0.014, 0.050 and 0.0168. The four scores total 1.0, which makes the set of scores look like a probability or confidence distribution: you could interpret 0.919 as 91.9% confidence that the object is a “Cyclist”. This interpretation is a simplification, and it is unlikely that the model is correct exactly 91.9% of the time that it sees objects like this one: the true accuracy can be wildly different. This won’t always matter. If you are only ranking the predictions to find the “most uncertain” items for human review, the exact uncertainty scores won’t matter. However, even the rank order can change depending on how you interpret the outputs from your model, so it is important to know exactly what statistics are generating these confidences.
“Score”, “Confidence”, and “Probability”: Do not trust the name!
Machine Learning libraries will often use the terms “Score”, “Confidence” and “Probability” interchangeably. This is true of open-source libraries and commercial ones, and you might not even find consistency within the same library. Even when the term “probability distribution” is used, it can mean only that the numbers across the predicted labels add up to 100%. It does not necessarily mean that each number reflects the actual confidence that the prediction is correct. For neural networks, logistic regression, and other related Discriminative Supervised Learning algorithms, it is not the job of the algorithm to know how confident its predictions are: the job of the algorithm is to “discriminate” between the labels based on the features, hence the name “Discriminative Supervised Learning”. The raw scores from the last layer of a neural network are the network trying to discriminate between the predictions it is making. Depending on the parameters of the model, those raw scores in the final layer can be any real number. So, the scores that come out of these algorithms often need to be converted into something more closely approximating a confidence.
To complicate things further, you can extend a Discriminative Supervised Learning algorithm with Generative Supervised Learning methods in order to get a truer statistical “probability” straight from the model. However, these are advanced features that I will cover in another series of blogs. You are overwhelmingly more likely to get a probability distribution generated by the softmax algorithm, so we will start there.
Softmax: converting the model output into confidences
The most common models are Neural Networks, and Neural Network predictions are almost always converted to a 0–1 range of scores using softmax.
Softmax is often used as the activation function on the final layer of the model to produce a probability distribution as the set of scores associated with the predicted labels. Softmax can also be used to create a probability distribution from the outputs of a linear activation function.
It is common to use softmax in the final layer, or to only look at the result of softmax applied to the logits. However, in Active Learning we generally prefer an architecture that lets us see the logits from a linear activation function, as they carry more information: softmax is lossy, and it loses the distinction between uncertainty due to strongly competing information and uncertainty due to a lack of information. To get an intuition for what the softmax transformation is doing, let’s break down the pieces. Imagine you are predicting the object in an image and the model gives you raw scores of 1, 4, 2 and 3. The highest number, “4”, will become the most confident prediction:
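Here is a minimal softmax sketch in Python using numpy. The mapping of the four labels to the raw scores 1, 4, 2 and 3 is an assumption for illustration:

import numpy as np

def softmax(scores):
    # Exponentiate each raw score (base e) and normalize so the results sum to 1
    exps = np.exp(np.array(scores, dtype=float))
    return exps / exps.sum()

# Raw scores from the final layer; the label order here is assumed for illustration
labels = ["Cyclist", "Pedestrian", "Sign", "Animal"]
raw_scores = [1.0, 4.0, 2.0, 3.0]

for label, score, prob in zip(labels, raw_scores, softmax(raw_scores)):
    print(f"{label:<12} raw={score:.1f}  softmax={prob:.4f}")
# Pedestrian, with the highest raw score (4), becomes the most confident prediction (~0.6439)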
As these numbers show, “Pedestrian” is the most confident prediction, and the confidences are stretched out relative to the raw scores: 4.0 out of 10.0 in the raw scores becomes about 64% under softmax. The benefits for interpretability should be clear: by converting the numbers into exponentials and normalizing them, we convert an unbounded range of positive and negative numbers into probability estimates that fall in a 0–1 range and add up to 1. The exponentials may also map more closely to real probabilities than simply normalizing the raw scores would. If your model is trained using Maximum Likelihood Estimation (MLE), the most popular way to train a neural model, then it is optimizing the log-likelihood, and taking the exponential of a log-likelihood gives us back an actual likelihood. However, Maximum Likelihood Estimation puts more emphasis on the misclassified items in order to classify them better, so your scores do not represent the log-probabilities of your entire training data in a way that you can accurately convert with exponentials.
From the same input, compare the result of using e (2.71828) as the exponential base with the result of using 10 as the base.
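A small sketch of that comparison, assuming the same raw scores of 1, 4, 2 and 3 (the base parameter here is only for illustration):

import numpy as np

def softmax(scores, base=np.e):
    # Exponentiate each raw score with the chosen base, then normalize to sum to 1
    exps = np.power(base, np.array(scores, dtype=float))
    return exps / exps.sum()

raw_scores = [1.0, 4.0, 2.0, 3.0]

print(np.round(softmax(raw_scores, base=np.e), 4))   # [0.0321 0.6439 0.0871 0.2369]
print(np.round(softmax(raw_scores, base=10.0), 4))   # [0.0009 0.9001 0.009  0.09  ]
# A larger base stretches the same raw scores into a much sharper distribution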
The choice of base won’t change which prediction is the most confident, so it is often overlooked in Machine Learning tasks where people only care about predictive accuracy over the labels. Ideally, though, you want your model to know how confident it is. This is true not just for Active Learning but for a range of other tasks: if your model is 80% confident in a prediction, you want it to be correct 80% of the time. And if you are relying on the output of your model for confidences that accurately indicate how accurate it is, you can see that the choice of how to produce those confidences is very important. As you saw above, softmax is built from exponentials of the raw scores, and when softmax normalizes those exponentials by dividing each one by the sum of all of them, the division of exponentials amounts to subtraction of the raw scores. In other words, it is only the relative differences between the scores from your model that count with softmax, not their actual values.
You can see this when we multiply all of the raw scores by 10: the dramatic change in the differences between the scores is what causes the change in the softmax values, as the sketch below shows.
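A quick sketch of both effects, again assuming the raw scores 1, 4, 2 and 3: adding a constant to every score leaves softmax unchanged, while multiplying every score by 10 changes it dramatically.

import numpy as np

def softmax(scores):
    exps = np.exp(np.array(scores, dtype=float))
    return exps / exps.sum()

print(np.round(softmax([1, 4, 2, 3]), 4))          # [0.0321 0.6439 0.0871 0.2369]
print(np.round(softmax([101, 104, 102, 103]), 4))  # identical: only the differences matter
print(np.round(softmax([10, 40, 20, 30]), 4))      # [0. 1. 0. 0.]: near-certain prediction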
Algorithms for Uncertainty Sampling
Now that you understand where the confidences in model predictions come from, you can think about how to interpret the probability distributions in order to find out where your Machine Learning models are most “uncertain”: that is, which unlabeled items fall closest to a decision boundary in your current model.
There are many algorithms for calculating uncertainty, all following the same general process (sketched in code after this list):
> Apply the uncertainty sampling algorithm to a large pool of predictions in order to generate a single uncertainty score for each item.
> Rank the predictions by the uncertainty score.
> Select the top N most uncertain items for human review.
> Obtain human labels for the top N items, retrain the model with those items, and iterate the process.
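Here is a minimal sketch of that loop in Python. The least_confidence scorer and the example pool values are placeholders for whichever scoring method and data you use, and the human labeling and retraining steps sit outside the code:

import numpy as np

def least_confidence(probs):
    # Simple placeholder scorer (covered in detail below): higher score = more uncertain
    probs = np.asarray(probs, dtype=float)
    n = len(probs)
    return (1.0 - probs.max()) * (n / (n - 1))

def select_for_review(pool_probs, n_items=100):
    """Rank a pool of per-item probability distributions and return the indices
    of the n_items most uncertain items for human review."""
    scores = np.array([least_confidence(p) for p in pool_probs])
    ranked = np.argsort(-scores)  # most uncertain first
    return ranked[:n_items]

# Example: softmax outputs for three unlabeled items (hypothetical values)
pool = [[0.919, 0.014, 0.050, 0.017],
        [0.320, 0.310, 0.200, 0.170],
        [0.643, 0.030, 0.090, 0.237]]
print(select_for_review(pool, n_items=2))  # [1 2]: the two most confused items are selected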
Least Confidence sampling
The simplest and most common method for uncertainty sampling is to take the difference between 100% confidence and the confidence of the most confidently predicted label for each item. Let’s refer to the softmax result σ(z) as the probability of the label given the input, Pθ(y | x). We know that softmax isn’t strictly giving us probabilities, but these are general equations that apply to probability distributions from any source, not just from softmax.
While you can rank order by confidence alone, it can be useful to convert the uncertainty scores into a 0–1 range, where 1 is the most uncertain score. In that case, we have to normalize the score: we subtract the value from 1, multiply the result by the number of labels, and divide by the number of labels minus 1. We do this because the confidence of the most confident label can never be less than one divided by the number of labels, which happens when all labels have the same predicted confidence. So, least confidence sampling with a 0–1 range is:
The confidence for “Pedestrian” is all that counts here. Using our example, this uncertainty score would be (1 − 0.6439) × (4 / 3) = 0.4748.
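As a sketch, assuming the softmax output [0.0321, 0.6439, 0.0871, 0.2369] from the earlier example:

import numpy as np

def least_confidence(probs, normalize=True):
    """Uncertainty = 1 - P(most confident label), optionally rescaled to a 0-1 range."""
    probs = np.asarray(probs, dtype=float)
    uncertainty = 1.0 - probs.max()
    if normalize:
        n = len(probs)
        uncertainty *= n / (n - 1)
    return uncertainty

probs = [0.0321, 0.6439, 0.0871, 0.2369]
print(round(least_confidence(probs), 4))  # 0.4748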
Margin of Confidence sampling
The most intuitive form of uncertainty sampling is the difference between the two most confident predictions. That is, for the label that the model predicted, how much more confident was it than for the next most confident label? This is defined as:
Again, we can convert this to a 0–1 range. We have to subtract from 1.0 again, but the maximum possible score is already 1, so there is no need to multiply by any factor:
“Pedestrian” and “Animal” are the most confident and second most confident predictions. Using our example, this uncertainty score would be 1.0 − (0.6439 − 0.2369) = 0.5930. This method is not sensitive to the uncertainty of any but the two most confident predictions: with the same difference in confidence between the 1st and 2nd most confident, the 3rd to nth confidences can take any values without changing the uncertainty score. So, if you only care about the uncertainty between the predicted label and the next most confident prediction for your particular use case, then this method is a good starting point.
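A sketch of Margin of Confidence on the same assumed softmax output:

import numpy as np

def margin_of_confidence(probs):
    """Uncertainty = 1 - (difference between the two most confident predictions)."""
    probs = np.sort(np.asarray(probs, dtype=float))[::-1]  # sort descending
    return 1.0 - (probs[0] - probs[1])

probs = [0.0321, 0.6439, 0.0871, 0.2369]
print(round(margin_of_confidence(probs), 4))  # 0.593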
Ratio of Confidence sampling
Ratio of Confidence is a slight variation on Margin of Confidence, looking at the ratio between the top two scores instead of the difference. It is the best Uncertainty Sampling method for improving your understanding of the relationship between confidence and softmax. To make it a little more intuitive, you can think of the ratio as capturing how many times more likely the most confident label is than the second most confident:
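A sketch on the same assumed softmax output. Note that with a base-e softmax, the ratio of the top two confidences works out to e raised to the difference of the top two raw scores, here e^(4 − 3) ≈ 2.718, which is why this method is so revealing about softmax:

import numpy as np

def ratio_of_confidence(probs):
    """Ratio between the two most confident predictions; values near 1.0 are most uncertain."""
    probs = np.sort(np.asarray(probs, dtype=float))[::-1]  # sort descending
    return probs[0] / probs[1]

probs = [0.0321, 0.6439, 0.0871, 0.2369]
print(round(ratio_of_confidence(probs), 3))  # ~2.718, i.e. e^(4 - 3) for a base-e softmax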
Entropy (classification entropy)
One way to look at uncertainty in a set of predictions is by whether you expect to be surprised by the outcome. This is the concept behind entropy: how surprised would you be by each of the possible outcomes, relative to their probability? Entropy applied to a probability distribution means multiplying each probability by its own log and taking the negative sum of those:
We can convert the entropy into a 0–1 range by dividing by the log of the number of predictions (labels):
Let’s calculate the entropy on our example data:
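A sketch of the calculation, assuming the softmax output [0.0321, 0.6439, 0.0871, 0.2369] for Cyclist, Pedestrian, Sign and Animal, and using log base 2:

import numpy as np

labels = ["Cyclist", "Pedestrian", "Sign", "Animal"]
probs = np.array([0.0321, 0.6439, 0.0871, 0.2369])

# Each label's contribution to the entropy: -P(y) * log2(P(y))
contributions = -probs * np.log2(probs)
for label, c in zip(labels, contributions):
    print(f"{label:<12} {c:.4f}")

entropy = contributions.sum()
print(round(entropy, 4))                        # ~1.367 bits
print(round(entropy / np.log2(len(probs)), 4))  # ~0.684 on a 0-1 scale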
So “Animal” contributes the most to the final entropy score, even though it is neither the most confident nor the least confident prediction.
Active Learning in “Flow”
> Aron Culotta and Andrew McCallum. 2005. Reducing Labeling Effort for Structured Prediction Tasks. AAAI. https://people.cs.umass.edu/~mccallum/papers/multichoice-aaai05.pdf
> Ido Dagan and Sean P. Engelson. 1995. Committee-based Sampling for Training Probabilistic Classifiers. ICML’95. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.6148
> David D. Lewis and William A. Gale. 1994. A Sequential Algorithm for Training Text Classifiers. SIGIR’94. https://arxiv.org/pdf/cmp-lg/9407020.pdf
> For more recent work focused on Neural Models, including dropouts and Bayesian approaches to better uncertainty estimates, a good entry point is this short paper: Zachary C. Lipton and Aditya Siddhant. 2018. Deep Bayesian Active Learning for Natural Language Processing: Results of a Large-Scale Empirical Study. EMNLP’18 https://www.aclweb.org/anthology/D18-1318