You may not realize it, but you’ve probably used Active Learning before. Filtering your data by keyword or some other pre-processing step is a form of Active Learning, although not a very principled one. Because you are probably using filtered data by the time you build a Machine Learning model, it can be helpful to think of most Machine Learning problems as already being in the middle of the iteration process for Active Learning.
Interpreting model predictions and data to support Active Learning
Almost all Supervised Machine Learning models will give you two things:
> A predicted label (or set of predictions)
> A number (or set of numbers) associated with each predicted label
The numbers are generally interpreted as confidences in the prediction, although how trustworthy they are depends on how the numbers are generated. If two mutually exclusive categories have similar confidence, that is good evidence that the model is confused about its prediction and that a human judgment would be valuable. Therefore, the model will benefit most when it learns to correctly predict the label of an item with an uncertain prediction.
In the rest of Supervised Machine Learning, the label is what people care about most: was the label prediction correct, and what is the overall accuracy of the model when predicting across a large held-out data set? In Active Learning, however, it is the numbers associated with the prediction that we typically care about most. You can see in the example that “Not Disaster-Related” is predicted with a 0.524 score, meaning that the system is 52.4% confident that the prediction is correct. From the perspective of the task here, you can see why you might want a human to review this item anyway: there is still a relatively high chance that it is disaster-related. If it is disaster-related, then your model is getting this example wrong for some reason, so you will likely want to add it to your training data so that you don’t miss other similar examples.
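To make the two outputs concrete, here is a minimal sketch of how raw model scores might be turned into a predicted label and a confidence. The softmax function, the label names, and the score values here are illustrative assumptions, not output from any specific model in this chapter:

```python
import math

def softmax(scores):
    # Convert raw model scores into probabilities that sum to 1.0
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one headline over two mutually exclusive labels
labels = ["Disaster-Related", "Not Disaster-Related"]
raw_scores = [1.200, 1.296]  # illustrative values only

probabilities = softmax(raw_scores)
best = max(range(len(labels)), key=lambda i: probabilities[i])

print("Predicted label:", labels[best])
print("Confidence: %0.3f" % probabilities[best])  # ~0.524: near 0.5, so worth a human review
```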
Let’s say we had another message with this prediction: “Not Disaster-Related” with a 0.984 score.
This item is also predicted as “Not Disaster-Related,” but with 98.4% confidence, compared to 52.4% confidence for the first item. This pattern will generally hold for almost all Machine Learning algorithms and almost all ways of calculating confidence: you can rank-order the items by predicted confidence and sample the lowest-confidence items for human review. We will also look at ways to combine your Machine Learning strategy with your annotation strategy. If you have worked in Machine Learning for a while but never in annotation or Active Learning, then you have probably only optimized models for accuracy. For a complete architecture, you might want a more holistic approach in which your Annotation, Active Learning, and Machine Learning strategies all inform each other. For example, you could decide to implement Machine Learning algorithms that give more accurate estimates of their confidence at the expense of accuracy in label prediction.
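As a minimal sketch of that rank-ordering, assuming you already have a confidence score for every unlabeled item (the headlines and scores below are made up):

```python
def lowest_confidence_sample(predictions, sample_size):
    # Sort unlabeled items by ascending confidence and return the least
    # confident ones for human annotation.
    ranked = sorted(predictions, key=lambda pair: pair[1])
    return ranked[:sample_size]

# Hypothetical (headline, confidence) pairs from the current model
predictions = [
    ("Storm warning issued for the coast", 0.524),
    ("Local team wins championship final", 0.984),
    ("Bridge closed after overnight flooding", 0.611),
]

for headline, confidence in lowest_confidence_sample(predictions, 2):
    print("%0.3f  %s" % (confidence, headline))
```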
The process of Active Learning is iterative: sample items for annotation, annotate them, retrain the model, and use the updated model to select the next sample.
What to expect as you iterate
Here are some things you might notice as you iterate through the Active Learning process:
First Iteration: You are annotating mostly “not disaster-related” headlines, and it can feel tedious. It will improve when Active Learning kicks in, but for now it is necessary to get the randomly sampled evaluation data. The model usually makes mistakes at the beginning of the training process.
Second Iteration: You have created your first model! Your F-Score is probably terrible: maybe only 0.20. However, your AUC might be around 0.75 (a sketch for tracking both metrics follows these iteration notes). You could fix the F-Score by playing with the model parameters and architecture, but more data is more important than model architecture right now. The evidence of this will be clear when you start annotating: you will see many more disaster-related headlines than in the first iteration. In fact, it might be most of them. Early on, your model will still try to predict most things as “not disaster-related,” so anything close to 50% confidence is at the most disaster-related end of the scale. This is one way that Active Learning can be self-correcting: it is over-sampling a lower-frequency label without explicitly implementing a targeted strategy for sampling important labels.
Third and Fourth Iterations: You should start to see model accuracy improve.
Fifth-to-Tenth Iterations: Your models will start to reach reasonable levels of accuracy, and you should see more diversity in the headlines. So long as either the F-Score or the AUC is going up by a few percent for every 100 annotations, you are getting good gains in accuracy. You are also probably wishing that you had annotated more evaluation data so that you could calculate accuracy on a bigger variety of held-out data. Unfortunately, you can’t add it now: it’s almost impossible to go back to truly random sampling unless you are prepared to give up a lot of your existing labels.
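One way to check whether each iteration is still paying off is to compute the F-Score and AUC on the held-out evaluation data after every retraining. Here is a minimal sketch using scikit-learn; the labels and confidences passed in are made-up placeholders:

```python
from sklearn.metrics import f1_score, roc_auc_score

def evaluate_iteration(true_labels, predicted_labels, positive_confidences):
    # F-Score on the predicted labels, AUC on the confidence scores for the
    # positive ("disaster-related") label, both over the held-out evaluation data.
    return (f1_score(true_labels, predicted_labels),
            roc_auc_score(true_labels, positive_confidences))

# Hypothetical evaluation data: 1 = disaster-related, 0 = not disaster-related
fscore, auc = evaluate_iteration(
    true_labels=[1, 0, 0, 1, 0],
    predicted_labels=[0, 0, 0, 1, 0],
    positive_confidences=[0.45, 0.10, 0.30, 0.80, 0.05],
)
print("F-Score: %0.2f, AUC: %0.2f" % (fscore, auc))
```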
This process, in fact, reflects the process used by AWS’s SageMaker Ground Truth.
Managing Machine Learning data
For a deployed system, it is best to store your annotations in a database that takes care of backups, availability, and scalability. However, you cannot always browse a database as easily as you can browse files on a local machine. So, in addition to adding training items to your database, or if you are only building a simple system, it can help to have locally stored data and annotations that you can quickly spot-check.
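For example, you might periodically export a snapshot of your annotations to a local CSV file so you can open and spot-check them directly. This is a minimal sketch; the field names and file path are illustrative choices, not a required schema:

```python
import csv

def export_annotation_snapshot(annotations, path="annotations_snapshot.csv"):
    # Write a local, human-browsable copy of the annotations, even if the
    # primary copy lives in a database.
    fieldnames = ["text", "label", "sampling_strategy"]
    with open(path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(annotations)

annotations = [
    {"text": "Bridge closed after overnight flooding",
     "label": "disaster_related", "sampling_strategy": "low_confidence"},
    {"text": "Local team wins championship final",
     "label": "not_disaster_related", "sampling_strategy": "random"},
]
export_annotation_snapshot(annotations)
```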
Always get your evaluation data first!
Evaluation data is often called a “test set” or “held-out data,” and for this task it should be a random sample of headlines that we annotate. We will always hold out these headlines from our training data, so that we can track the accuracy of our model after each iteration of Active Learning.
It is important to get the evaluation data first, as there are many ways to inadvertently bias your evaluation data after you have started other sampling techniques. Here are just some of the things that can go wrong if you don’t pull out your evaluation data first (a sketch of doing the split up front follows the list):
> If you forget to sample evaluation data from your unlabeled items until after you have sampled by low confidence, then your evaluation data will be biased towards the remaining high-confidence items and your model will appear more accurate than it really is.
> If you forget to sample evaluation data and so you pull evaluation data from your training data after you have sampled by confidence, then your evaluation data will be biased towards low-confidence items, and your model will appear less accurate than it really is.
> If you have implemented outlier detection and then later try to pull out evaluation data, it is almost impossible to avoid bias as the items you have pulled out have already contributed to the sampling of additional outliers.
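Here is a minimal sketch of pulling the evaluation data out first, before any confidence-based or outlier-based sampling touches the pool; the function name and the sizes used are illustrative:

```python
import random

def split_evaluation_first(unlabeled_items, evaluation_size, seed=42):
    # Take a truly random evaluation sample from the unlabeled pool before any
    # other sampling strategy runs, and return the rest as the working pool.
    rng = random.Random(seed)
    shuffled = list(unlabeled_items)
    rng.shuffle(shuffled)
    return shuffled[:evaluation_size], shuffled[evaluation_size:]

# The evaluation items get annotated and held out forever; only the remaining
# pool is ever passed to low-confidence or outlier sampling.
unlabeled_items = ["headline_%d" % i for i in range(1000)]
evaluation_items, working_pool = split_evaluation_first(unlabeled_items, 100)
```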
Select the right strategies for your data
The strategies for selecting data for Active Learning will depend on your situation. In this example, we know that disaster-related headlines are rare in our data, so selecting outliers is not likely to surface many related items. Therefore, the example code focuses on selecting by confidence, sampling data for each iteration according to the following strategy (a sketch combining the three follows the list):
> 10% randomly selected from unlabeled items
> 80% selected from the lowest confidence items
> 10% selected as outliers.
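Here is a minimal sketch of combining those three strategies for one iteration. It assumes you already have a confidence score for every item in the remaining pool and a separately computed list of outliers; the helper names and the batch size are illustrative:

```python
import random

def sample_next_iteration(pool_with_confidences, outlier_items, total=100, seed=0):
    # 10% random, 80% lowest confidence, 10% outliers, as described above.
    rng = random.Random(seed)
    n_random = int(total * 0.10)
    n_low_conf = int(total * 0.80)
    n_outlier = total - n_random - n_low_conf

    all_items = [item for item, _ in pool_with_confidences]
    random_sample = rng.sample(all_items, n_random)

    by_confidence = sorted(pool_with_confidences, key=lambda pair: pair[1])
    low_conf_sample = [item for item, _ in by_confidence[:n_low_conf]]

    outlier_sample = outlier_items[:n_outlier]

    # De-duplicate while preserving order, in case the strategies overlap
    return list(dict.fromkeys(random_sample + low_conf_sample + outlier_sample))
```

Because the strategies can select the same item, the combined batch may come out slightly smaller than the target size; in practice you would top it up with the next-lowest-confidence items.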
Retrain the model and iterate
Now that you have your newly annotated items, you can add them to your training data and see the change in accuracy of your model. While Active Learning can be self-correcting, can you see any evidence where it did not self-correct some bias? Common examples would be over-sampling sentences that are extra long or extra short. The Computer Vision equivalent would be over-sampling images that are extra large or small, or high or low resolution. Your choice of outlier strategy and Machine Learning model might over-sample based on features like these, which are not core to your goal. In that case, you might consider applying the methods in this chapter to different buckets of data: the lowest-confidence short sentences, the lowest-confidence medium-length sentences, and the lowest-confidence long sentences.

If you come from a Machine Learning background, then your first instinct might be to keep the data constant and start experimenting with more sophisticated neural architectures. That can be the best next step, but it is rarely the most important next step early on. You should generally get your data right first; tuning the Machine Learning architecture becomes more important later in the iterations.
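A minimal sketch of the bucketed approach, assuming each unlabeled item comes with a confidence score; the word-count thresholds and per-bucket sample size are arbitrary illustrative choices:

```python
def lowest_confidence_by_length(predictions, per_bucket=30):
    # Apply low-confidence sampling separately to short, medium, and long
    # sentences so that sentence length alone doesn't dominate the sample.
    buckets = {"short": [], "medium": [], "long": []}
    for text, confidence in predictions:
        word_count = len(text.split())
        if word_count < 5:
            buckets["short"].append((text, confidence))
        elif word_count < 15:
            buckets["medium"].append((text, confidence))
        else:
            buckets["long"].append((text, confidence))

    sampled = []
    for items in buckets.values():
        items.sort(key=lambda pair: pair[1])  # ascending confidence
        sampled.extend(text for text, _ in items[:per_bucket])
    return sampled
```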