Active learning — human in the loop Machine Learning (P2)

You may not realize it, but you’ve probably used Active Learning before. Filtering your data by keyword or some other pre-processing step is a form of Active Learning, although not a very principled one. Because you are probably using filtered data by the time you build a Machine Learning model, it can be helpful to think of most Machine Learning problems as already being in the middle of the iteration process for Active Learning.

Interpreting model predictions and data to support Active Learning

Almost all Supervised Machine Learning models will give you two things:
> A predicted label (or set of predictions)
> A number (or set of numbers) associated with each predicted label

prediction + score
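As a rough illustration of the "prediction + score" pairing, the sketch below turns raw model outputs into a label and a confidence number. It assumes the model emits one raw score (logit) per label and that a softmax is used to normalize them; the function name `predict_with_score` and the label names are hypothetical, and real models may produce their scores differently.

```python
import math

def predict_with_score(logits, labels):
    """Return (label, score) for the highest-scoring label via softmax."""
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable label.
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical logits for a two-label task.
label, score = predict_with_score([2.0, 0.5], ["related", "not_related"])
print(label, round(score, 3))
```

The score is what Active Learning strategies such as "lowest confidence" sort on, so it is worth checking how your own model produces it before relying on it.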

What to expect as you iterate

Here are some things you might notice as you iterate through the Active Learning process:


Managing Machine Learning data

For a deployed system it is best to store your annotations in a database that takes care of backups, availability, and scalability. However, you cannot always browse a database as easily as files that are on a local machine. In addition to adding training items to your database, or if you are only building a simple system, it can help to have locally stored data and annotations that you can quickly spot-check.
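One possible way to keep a local, easily inspected copy of annotations is a plain CSV file. The sketch below is only an assumption about how that might look: the column layout (`item_id`, `text`, `label`), the file name, and the helper names `append_annotations` and `spot_check` are all hypothetical.

```python
import csv
import os
import random
import tempfile

def append_annotations(path, rows):
    """Append (item_id, text, label) rows to a local CSV file."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def spot_check(path, n=5, seed=None):
    """Return up to n randomly chosen rows for a quick manual review."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

# Usage with a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "annotations.csv")
    append_annotations(path, [
        ("1", "Flood warning issued for coastal towns", "related"),
        ("2", "Local team wins regional final", "not_related"),
        ("3", "Storm damages homes in the north", "related"),
    ])
    sample = spot_check(path, n=2, seed=0)
    for row in sample:
        print(row)
```

A flat file like this is not a substitute for the database in a deployed system; it is just a convenient view for spot-checking.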

Always get your evaluation data first!

Evaluation data is often called a “test set” or “held-out data,” and for this task it should be a random sample of headlines that we annotate. We will always hold out these headlines from our training data, so that we can track the accuracy of our model after each iteration of Active Learning.
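Holding out a random evaluation sample before any Active Learning begins might look like the sketch below. The function name `split_evaluation`, the fixed seed, and the placeholder headlines are assumptions made for illustration only.

```python
import random

def split_evaluation(items, eval_size, seed=0):
    """Randomly hold out eval_size items; return (evaluation, pool)."""
    shuffled = list(items)
    # A fixed seed keeps the held-out set identical across iterations.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:eval_size], shuffled[eval_size:]

# Placeholder data standing in for real headlines.
headlines = [f"headline {i}" for i in range(100)]
evaluation, pool = split_evaluation(headlines, eval_size=20)
print(len(evaluation), len(pool))
```

Only items in `pool` should ever be sampled for training annotations, so the evaluation set stays a fair measure of accuracy from one iteration to the next.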

Select the right strategies for your data

The strategies for selecting data for Active Learning depend on your situation. For example, if you know that the event you care about is rare in your data, then selecting outliers is not likely to surface many related items. Therefore, the example code focuses on selecting by confidence and samples data for each iteration according to the following strategy:
> 10% randomly selected from unlabeled items
> 80% selected from the lowest confidence items
> 10% selected as outliers.
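The mix above can be sketched roughly as follows. This is not the example code the text refers to; it is a minimal illustration that assumes each unlabeled item already has a model confidence and an outlier score. The function name, the dictionaries, and the order in which the three groups are filled are all assumptions.

```python
import random

def select_for_annotation(items, confidence, outlier_score, batch_size, seed=0):
    """Pick a batch: 80% lowest confidence, 10% outliers, 10% random."""
    n_conf = batch_size * 8 // 10
    n_out = batch_size // 10
    n_rand = batch_size - n_conf - n_out

    remaining = list(items)
    selected = []

    def take(candidates, n):
        # Move the first n candidates from the remaining pool to the batch.
        picked = candidates[:n]
        selected.extend(picked)
        for item in picked:
            remaining.remove(item)

    # Lowest-confidence items first, then the highest outlier scores,
    # then a random sample of whatever is still unselected.
    take(sorted(remaining, key=lambda i: confidence[i]), n_conf)
    take(sorted(remaining, key=lambda i: outlier_score[i], reverse=True), n_out)
    rng = random.Random(seed)
    take(rng.sample(remaining, min(n_rand, len(remaining))), n_rand)
    return selected

# Hypothetical item ids with made-up scores.
items = list(range(100))
confidence = {i: 0.5 + i / 200 for i in items}
outlier_score = {i: (i * 37) % 101 for i in items}
batch = select_for_annotation(items, confidence, outlier_score, batch_size=10)
print(batch)
```

Removing each group from the pool before drawing the next avoids selecting the same item twice, at the cost of making the result depend on the order of the three steps.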

Retrain the model and iterate

Now that you have your newly annotated items, you can add them to your training data and see the change in accuracy from your model. While Active Learning can be self-correcting, can you see any evidence where it didn’t self-correct some bias? Common examples would be oversampling extra-long or extra-short sentences. The Computer Vision equivalent would be oversampling images that are extra large or small, or high or low resolution. Your choice of outlier strategy and Machine Learning model might oversample based on features like these, which are not core to your goal. In that case, you might consider applying the methods in this chapter to different buckets of data: lowest-confidence short sentences, lowest-confidence medium sentences, and lowest-confidence long sentences.

If you come from a Machine Learning background, then your first instinct might be to keep the data constant and start experimenting with more sophisticated neural architectures. That can be a worthwhile step, but it’s rarely the most important one early on. You should generally get your data right first; tuning the Machine Learning architecture becomes a more important task later on in the iterations.
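Sampling the lowest-confidence items separately for short, medium, and long sentences could be sketched as below. The word-count thresholds (`short_max`, `long_min`), the function name, and the example sentences with their confidences are assumptions chosen only for illustration.

```python
def lowest_confidence_by_length(items, per_bucket, short_max=5, long_min=12):
    """Group (text, confidence) pairs by word count and return the
    per_bucket lowest-confidence pairs from each length bucket."""
    buckets = {"short": [], "medium": [], "long": []}
    for text, conf in items:
        n_words = len(text.split())
        if n_words <= short_max:
            name = "short"
        elif n_words >= long_min:
            name = "long"
        else:
            name = "medium"
        buckets[name].append((text, conf))
    # Sort each bucket by confidence, lowest first, and keep the top few.
    return {
        name: sorted(group, key=lambda pair: pair[1])[:per_bucket]
        for name, group in buckets.items()
    }

# Made-up sentences and confidences.
items = [
    ("Flood warning issued", 0.90),
    ("Storm hits coast", 0.55),
    ("Local team wins the regional final after extra time", 0.60),
    ("City council approves new budget for road repairs", 0.95),
    ("Officials say the river is expected to rise further over the next two days", 0.70),
]
picked = lowest_confidence_by_length(items, per_bucket=1)
for name, group in picked.items():
    print(name, group)
```

Sampling per bucket keeps any one sentence length from dominating the batch, which is one way to counter the length bias described above.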

AI Researcher - NLP Practitioner