Machine Learning Basics
What is Machine Learning
In all but the most trivial cases, insight or knowledge you’re trying to get out of the raw data won’t be obvious from looking at the data. For example, in detecting spam email, looking for the occurrence of a single word may not be very helpful. But looking at the occurrence of certain words used together, combined with the length of the email and other factors, you could get a much clearer picture of whether the email is spam or not. Machine learning is turning data into information. Machine learning lies at the intersection of computer science, engineering, and statistics and often appears in other disciplines. It can be applied to many fields from politics to stock market. It’s a tool that can be applied to many problems. Any field that needs to interpret and act on data can benefit from machine learning techniques
Before we jump into the machine learning algorithms, it would be best to explain some terminology. The best way to do so is through an example of a system someone may want to make. We’ll go through an example of building a bird classification system. This sort of system is an interesting topic often associated with machine learning called expert systems. By creating a computer program to recognize birds, we’ve replaced an ornithologist with a computer. The ornithologist is a bird expert, so we’ve created an expert system.
In table 1.1 are some values for four parts of various birds that we decided to measure. We chose to measure weight, wingspan, whether it has webbed feet, and the color of its back. In reality, you’d want to measure more than this. It’s common practice to measure just about anything you can measure and sort out the important parts later. The four things we’ve measured are called features; these are also called attributes. Each of the rows in table 1.1 is an instance made up of features.
The first two features in table 1.1 are numeric and can take on decimal values. The third feature (webbed feet) is binary: it can only be 1 or 0. The fourth feature (back color) is an enumeration over the color palette we’re using, and I just chose some very common colors. One task in machine learning is classification; I’ll illustrate this using table 1.1 and the fact that information about an Ivory-billed Woodpecker could get us $50,000. We want to identify this bird out of a bunch of other birds, and we want to profit from this. We could set up a bird feeder and then hire an ornithologist (bird expert) to watch it and when they see an Ivory-billed Woodpecker give us a call. This would be expensive, and the person could only be in one place at a time. We could also automate this process: set up many bird feeders with cameras and computers attached to them to identify the birds that come in. We could put a scale on the bird feeder to get the bird’s weight and write some computer vision code to extract the bird’s wingspan, feet type, and back color. For the moment, assume we have all that information. How do we then decide if a bird at our feeder is an Ivory-billed Woodpecker or something else? This task is called classification, and there are many machine learning algorithms that are good at classification. The class in this example is the bird species; more specifically, we can reduce our classes to Ivory-billed Woodpecker or everything else.
Say we’ve decided on a machine learning algorithm to use for classification. What we need to do next is train the algorithm, or allow it to learn. To train the algorithm we feed it quality data known as a training set. A training set is the set of training examples we’ll use to train our machine learning algorithms. In table 1.1 our training set has six training examples. Each training example has four features and one target variable; this is depicted in figure 1.2. The target variable is what we’ll be trying to predict with our machine learning algorithms. In classification the target variable takes on a nominal value, and in the task of regression its value could be continuous. In a training set the target variable is known. The machine learns by finding some relationship between the features and the target variable. The target variable is the species, and as I mentioned earlier, we can reduce this to take nominal values. In the classification problem the target variables are called classes, and there is assumed to be a finite number of classes
To test machine learning algorithms what’s usually done is to have a training set of data and a separate data set, called a test set. Initially the program is fed the training examples; this is when the machine learning takes place. Next, the test set is fed to the program. The target variable for each example from the test set isn’t given to the program, and the program decides which class each example should belong to. The target variable or class that the training example belongs to is then compared to the predicted value, and we can get a sense for how accurate the algorithm is.
In our bird classification example, assume we’ve tested the program and it meets our desired level of accuracy. Can we see what the machine has learned? This is called knowledge representation. The answer is it depends. Some algorithms have knowledge representation that’s more readable by humans than others. The knowledge representation may be in the form of a set of rules; it may be a probability distribution or an example from the training set. In some cases we may not be interested in building an expert system but interested only in the knowledge representation that’s acquired from training a machine learning algorithm
Key tasks of machine learning
In this section we’ll outline the key jobs of machine learning and set a framework that allows us to easily turn a machine learning algorithm into a solid working application. The example covered previously was for the task of classification. In classification, our job is to predict what class an instance of data should fall into. Another task in machine learning is regression. Regression is the prediction of a numeric value. Most people have probably seen an example of regression with a best-fit line drawn through some data points to generalize the data points. Classification and regression are examples of supervised learning. This set of problems is known as supervised because we’re telling the algorithm what to predict. The opposite of supervised learning is a set of tasks known as unsupervised learning. In unsupervised learning, there’s no label or target value given for the data. A task where we group similar items together is known as clustering. In unsupervised learning, we may also want to find statistical values that describe the data. This is known as density estimation. Another task of unsupervised learning may be reducing the data from many features to a small number so that we can properly visualize it in two or three dimensions
Steps in developing a machine learning application
Collect data. You could collect the samples by scraping a website and extracting data, or you could get information from an RSS feed or an API. You could have a device collect wind speed measurements and send them to you, or blood glucose levels, or anything you can measure. The number of options is endless. To save some time and effort, you could use publicly available data.
Prepare the input data. Once you have this data, you need to make sure it’s in a useable format.The benefit of having this standard format is that you can mix and match algorithms and data sources.
Analyze the input data. This is looking at the data from the previous task. This could be as simple as looking at the data you’ve parsed in a text editor to make sure steps 1 and 2 are actually working and you don’t have a bunch of empty values. You can also look at the data to see if you can recognize any patterns or if there’s anything obvious, such as a few data points that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions can also help. But most of the time you’ll have more than three features, and you can’t easily plot the data across all features at one time. You could, however, use some advanced methods we’ll talk about later to distill multiple dimensions down to two or three so you can visualize the data
Train the algorithm. This is where the machine learning takes place. This step and the next step are where the “core” algorithms lie, depending on the algorithm. You feed the algorithm good clean data from the first two steps and extract knowledge or information
Test the algorithm. This is where the information learned in the previous step is put to use. When you’re evaluating an algorithm, you’ll test it to see how well it does
Use it. Here you make a real program to do some task, and once again you see if all the previous steps worked as you expected
Machine learning is already being used in your daily lives even though you may not be aware of it. The amount of data coming at you isn’t going to decrease, and being able to make sense of all this data will be an essential skill for people working in a data-driven industry. In machine learning, you look at instances of data. Each instance of data is composed of a number of features. Classification, one the popular and essential tasks of machine learning, is used to place an unknown piece of data into a known group. In order to build or train a classifier, you feed it data for which you know the class. This data is called your training set
I don’t claim that our expert system used to recognize birds will be perfect or as a good as a human. But building a machine with accuracy close to that of a human expert could greatly increase the quality of life. When we build software that can match the accuracy of a human doctor, people can more rapidly get treatment. Better prediction of weather could lead to fewer water shortages and a greater supply of food. The examples where machine learning could be useful are endless.