Unlike robots in the movies, most of today’s Artificial Intelligence (AI) cannot learn by itself: it relies on intensive human feedback. Probably 90% of Machine Learning applications today are powered by Supervised Machine Learning. This raises one of the most important questions in technology today: what are the right ways for humans and Machine Learning algorithms to interact to solve problems? This series of blogs will help you answer that question.
The Basic Principles of Human-in-the-Loop Machine Learning
Human-in-the-Loop Machine Learning is when humans and Machine Learning processes interact to solve one or more of the following:
> Making Machine Learning more accurate
> Getting Machine Learning to the desired accuracy faster
> Making humans more accurate
> Making humans more efficient
What is Annotation?
Annotation is the process of labeling raw data so that it becomes training data for Machine Learning. I myself spend much more time curating and annotating data sets than I spend actually building the Machine Learning models. Algorithms and annotation are equally important and intertwined components of good Machine Learning.
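To make this concrete, here is a minimal sketch of what annotation produces: raw items paired with human-assigned labels become supervised training data. The texts and label names below are hypothetical examples, not from any real data set.

```python
# Raw, unlabeled data as it might arrive from a customer-support inbox.
raw_data = [
    "my package never arrived",
    "thanks, the issue is resolved",
    "how do I reset my password?",
]

# A human annotator assigns one label to each raw item.
human_labels = ["complaint", "praise", "question"]

# The annotated (item, label) pairs are now supervised training data.
training_data = list(zip(raw_data, human_labels))
```

Every Supervised Machine Learning model is ultimately trained on pairs like these, which is why the quality of the annotation process matters as much as the choice of algorithm.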
In contrast to academic Machine Learning, it is more common in industry to improve model performance by annotating more training data. Especially when the nature of the data is changing over time (which is also common), just a handful of new annotations can be far more effective than trying to adapt an existing Machine Learning model to a new domain of data. But far more academic papers have focused on how to adapt algorithms to new domains without new training data than on how to efficiently annotate the right new training data. Because of this imbalance in academia, I’ve often seen people in industry make the same mistake. They will hire a dozen smart PhDs in Machine Learning who will know how to build state-of-the-art algorithms, but who won’t have experience creating training data or thinking about the right interfaces for annotation.
What is Active Learning?
Supervised learning models almost always get more accurate with more labeled data. Active Learning is the process of selecting which data needs to get a human label.
There are many Active Learning strategies and many algorithms for implementing them. But there are three basic approaches that work well in most contexts and should almost always be the starting point: uncertainty sampling, diversity sampling, and random sampling.
Uncertainty Sampling is a strategy for identifying unlabeled items that are near a decision boundary in your current Machine Learning model. If you have a binary classification task, these will be items that are predicted close to 50% probability of belonging to either label, and therefore the model is “uncertain” or “confused”. Diversity Sampling is a strategy for identifying unlabeled items that are unknown to the Machine Learning model in its current state. Some types of Diversity Sampling, like Representative Sampling, explicitly try to find the unlabeled items that look most like the unlabeled data as a whole, compared to the training data. For example, Representative Sampling might find unlabeled text documents that have words that are really common in the unlabeled data but aren’t yet in the training data.
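The two strategies can be sketched in a few lines each. This is a simplified illustration under stated assumptions: for uncertainty sampling, that we already have a predicted probability for each unlabeled item from a binary classifier; for representative sampling, that a crude word-frequency ratio is a good-enough stand-in for “looks like the unlabeled data”. Real implementations use more robust scoring.

```python
from collections import Counter

def uncertainty_sample(predictions, k):
    """Return indices of the k unlabeled items whose predicted
    probability is closest to 0.5, i.e. the most 'uncertain' ones."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: abs(predictions[i] - 0.5))
    return ranked[:k]

def representative_sample(unlabeled_texts, training_texts, k):
    """Return the k unlabeled texts whose words are most over-represented
    in the unlabeled data relative to the training data."""
    unl = Counter(w for t in unlabeled_texts for w in t.split())
    trn = Counter(w for t in training_texts for w in t.split())
    def score(text):
        words = text.split()
        # Average ratio of unlabeled count to (smoothed) training count.
        return sum(unl[w] / (trn[w] + 1) for w in words) / max(len(words), 1)
    return sorted(unlabeled_texts, key=score, reverse=True)[:k]

probs = [0.02, 0.48, 0.97, 0.55, 0.80]
print(uncertainty_sample(probs, 2))  # → [1, 3], the items nearest 0.5
```

Items 1 and 3 are selected because 0.48 and 0.55 are the probabilities closest to the 50% decision boundary; the confident predictions (0.02, 0.97) are skipped.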
It is important to note that the Active Learning process is iterative. In each iteration of Active Learning, a selection of items is identified and receives a new human-generated label. The model is then re-trained with the new items and the process is repeated.
Machine Learning-Assisted Humans vs Human-Assisted Machine Learning
There can be two distinct goals in Human-in-the-Loop Machine Learning: making a Machine Learning application more accurate with human input, and improving a human task with the aid of Machine Learning.