“Organizing is what you do before you do something, so that when you do it, it is not all mixed up”
In this section we will look at one of the most popular tasks in NLP: text classification. It is concerned with assigning one or more groups to a given piece of text, from a larger set of possible groups. It has a wide range of applications across diverse domains such as social media, e-commerce, healthcare, law, and marketing, to name a few. A common example of text classification in our daily lives is classifying emails as spam or non-spam. Even though the purpose and application of text classification may vary from domain to domain, the underlying abstract problem remains the same. This invariance of the core problem and its applications in a myriad of domains make text classification by far the most widely used NLP task in industry and the most researched in academia. In this chapter, we will discuss the usefulness of text classification and how to build text classifiers for your use cases, along with some practical tips for real-world scenarios.
A taxonomy of text classification
Any supervised classification approach, including text classification, can be further distinguished into three types based on the number of categories involved: binary, multiclass and multilabel classification. If the number of classes is two, it is called binary classification. If the number of classes is more than two, it is referred to as multiclass classification. Thus, classifying an email as spam or not spam is an example of a binary classification setting. Classifying the sentiment of a customer review as negative, neutral or positive is an example of multiclass classification. In both binary and multiclass settings, each document belongs to exactly one class from C, where C is the set of all possible classes. In multilabel classification, a document can have one or more labels/classes attached to it. For example, a news article on a soccer match may simultaneously belong to more than one category, such as “sports” and “soccer”, while another news article, say on US elections, may have the labels “politics”, “USA”, and “elections”. Thus, each document has a set of labels that is a subset of C. Each article can be in no class, exactly one class, multiple classes, or all the classes. Sometimes the number of labels in the set C can be very large (known as “extreme classification”).
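The three settings differ only in how many labels from C a document may carry. The following is a minimal sketch of that idea; the label set and the helper name `to_indicator` are hypothetical and chosen purely for illustration, reusing the example labels from the text.

```python
# Hypothetical label set C, reusing the example labels from the text.
C = ["elections", "politics", "soccer", "sports", "USA"]

def to_indicator(labels):
    """Encode a set of labels as a 0/1 vector with one position per class in C."""
    unknown = set(labels) - set(C)
    if unknown:
        raise ValueError(f"labels not in C: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in C]

# Binary and multiclass settings: exactly one label per document.
single = to_indicator({"sports"})

# Multilabel setting: any subset of C may be attached to a document.
soccer_article = to_indicator({"sports", "soccer"})
election_article = to_indicator({"politics", "USA", "elections"})

print(single)
print(soccer_article)
print(election_article)
```

In the binary and multiclass cases the vector has a single 1; in the multilabel case it can have any number of them.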
Content classification and organization: This refers to the task of classifying/tagging large amounts of textual data. This, in turn, is used to power use cases like content organization, search, and recommendation, to name a few. Examples of such data include news websites, blogs, online bookshelves, product reviews, tweets, etc. Tagging product descriptions on an e-commerce website, routing customer service requests in a company to the appropriate support team, and organizing emails into personal/social/promotions categories in Gmail are all examples of using text classification for content classification and organization.
Customer Support: Customers commonly use social media to express their opinions/experience of a product or service.
E-Commerce: Customers leave reviews for a range of products on e-commerce websites such as Amazon, eBay, etc. An example use of text classification in such scenarios is to understand and analyze customers’ perception of a product or service based on their comments.
A Pipeline for Building Text Classification Systems
Building a text classification system typically involves the following steps:
- Collect or create a labeled dataset suitable for the task
- Split the dataset into two parts (train and test) or three parts (train, validation, also known as development, and test), and decide on the evaluation metric(s)
- Transform raw text into feature vectors
- Train a classifier using the feature vectors and the corresponding labels from the training set
- Using the evaluation metric(s) from step 2, benchmark the model performance on the test set
- Deploy the model to serve the real-world use case and monitor its performance.
One Pipeline, Many Classifiers
Let us now look at building text classifiers by altering steps 3–5 in the pipeline and keeping the remaining steps constant. A good dataset is a prerequisite for using the pipeline. By a “good” dataset, we mean one that is a true representation of the data we are likely to see in production. No single approach is known to work universally well on all kinds of data and all classification problems. In the real world, we experiment with multiple approaches, evaluate them, and choose one final approach to deploy in practice.
Naive Bayes Classifier
Naive Bayes is a probabilistic classifier that uses Bayes’ theorem to classify texts based on the evidence seen in the training data. It estimates the conditional probability of each feature of a given text for each class, based on the occurrence of that feature in that class, and multiplies the probabilities of all the features of the text to compute the final probability for each class. Finally, it chooses the class with the maximum probability. Let us walk through the key steps of an implementation of the pipeline described earlier for our dataset. For this, we use the Naive Bayes implementation in sklearn. Once the dataset is loaded, we split the data into train and test sets, as shown in the code snippet below:
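A minimal sketch of the split, assuming the loaded dataset boils down to a list of texts and a parallel list of labels (1 for relevant, 0 for non-relevant). The toy headlines, the variable names, the 25% test size and the fixed random seed are all illustrative assumptions, not the values used in the original notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the loaded dataset: 12 short texts with labels.
texts = [
    "central bank raises interest rates",
    "inflation slows as fuel prices drop",
    "unemployment falls to a ten year low",
    "stock markets rally on growth data",
    "local team wins the cup final",
    "new art exhibition opens downtown",
    "heavy rain expected over the weekend",
    "film festival announces its lineup",
    "singer cancels the summer tour",
    "city marathon draws record crowd",
    "museum reopens after renovation",
    "chef shares a winter soup recipe",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hold out 25% of the data for testing. stratify keeps the class ratio
# similar in both parts; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=1
)

print(len(X_train), len(X_test))
```

Stratifying matters here because the classes are imbalanced; without it, a small test set could end up with very few relevant examples.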
The next step is to pre-process the texts and then convert them into feature vectors. While there are many different ways to do the pre-processing, let us say we want to do the following: lowercasing and removal of punctuation, digits, any custom strings, and stopwords. The code snippet below shows this pre-processing and the conversion of the train and test data into feature vectors using CountVectorizer in sklearn.
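One way this could look is sketched below. The `clean` function, the toy texts, and the choice of "</br>" as the custom string to remove are assumptions made for illustration; the stopword list is the English list that ships with sklearn.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

def clean(doc):
    """Lowercase, drop a custom string, punctuation and digits, then stopwords."""
    # "</br>" is a hypothetical custom string; remove it before the
    # punctuation pass, which would otherwise break it apart.
    doc = doc.lower().replace("</br>", " ")
    doc = "".join(
        ch for ch in doc if ch not in string.punctuation and not ch.isdigit()
    )
    return " ".join(tok for tok in doc.split() if tok not in ENGLISH_STOP_WORDS)

# Toy train and test texts (illustrative only).
train_texts = [
    "Stocks rallied 3% today, the index hit a record!</br>",
    "The local team won the match in extra time.",
]
test_texts = ["Stocks fell; the team lost 2 matches."]

vect = CountVectorizer(preprocessor=clean)
X_train_dtm = vect.fit_transform(train_texts)  # learn the vocabulary on train only
X_test_dtm = vect.transform(test_texts)        # reuse that vocabulary for test

print(X_train_dtm.shape, X_test_dtm.shape)
```

Fitting the vectorizer on the training data only, and merely transforming the test data, keeps information from the test set out of the feature space.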
Once you run this in the notebook, you will see that we end up with feature vectors of over 45K features! We now have the data in the format we want, i.e., feature vectors. So, the next step is to train and evaluate a classifier. The code snippet below shows how to train and evaluate a Naive Bayes classifier with the features we extracted above.
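A minimal sketch of training and evaluation follows. It assumes the multinomial variant of Naive Bayes (`MultinomialNB`), which is a common choice for word-count features but is not confirmed by the text, and it uses a tiny toy corpus in place of the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Toy data (illustrative only): 1 = relevant, 0 = non-relevant.
train_texts = [
    "central bank raises interest rates",
    "inflation slows as fuel prices drop",
    "stock markets rally on growth data",
    "local team wins the cup final",
    "new art exhibition opens downtown",
    "film festival announces its lineup",
    "singer cancels the summer tour",
    "city marathon draws record crowd",
]
y_train = [1, 1, 1, 0, 0, 0, 0, 0]
test_texts = [
    "bank cuts interest rates as growth slows",
    "team loses the final",
    "markets drop on inflation data",
    "festival crowd fills downtown",
]
y_test = [1, 0, 1, 0]

# Step 3: text to count-based feature vectors.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(train_texts)
X_test_dtm = vect.transform(test_texts)

# Step 4: train the classifier on the training features and labels.
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = nb.predict(X_test_dtm)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print("Accuracy:", accuracy_score(y_test, y_pred))
print(cm)  # rows are true classes, columns are predicted classes
```

The confusion matrix is more informative than accuracy alone here, since it shows how each class fares separately.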
The confusion matrix of this classifier on the test data:
The classifier is doing fairly well at identifying the non-relevant articles correctly, making errors only 14% of the time. However, it does not perform nearly as well on the second category, i.e., relevant articles: this category is identified correctly only 42% of the time. An obvious thought may be to collect more data. This is correct and often the most rewarding approach. But in the interest of covering other approaches, we assume that we cannot change the dataset or collect additional data. This is not a far-fetched assumption: in industry, one often does not have the luxury of collecting more data and has to work with what is available. We can think of a few possible reasons for this performance and ways to improve this classifier:
- Since we extracted all possible features, we ended up with a large, sparse feature vector, where most features are too rare and end up being noise. A sparse feature set also makes training hard.
- There are very few examples of relevant articles (~20%) compared to the non-relevant articles (~80%) in the dataset. This class imbalance skews the learning process towards the non-relevant category.
- Perhaps we need a better learning algorithm.
- Perhaps we need a better pre-processing and feature extraction mechanism.
- Perhaps we should tune the classifier’s parameters and hyper-parameters.
Let us see how to improve our classification performance by addressing some of these possible reasons. One way to approach Reason 1 is to reduce noise in the feature vectors. The approach in the earlier code example produced tens of thousands of features. A large number of features introduces sparsity, i.e., most of the values in a feature vector are zero and only a few are non-zero. This, in turn, affects the ability of the text classification algorithm to learn. Let us see what happens if we restrict the number of features to 5,000 and re-run the training and evaluation process. This requires us to change the CountVectorizer instantiation, as shown in the code snippet below, and repeat all the steps.
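A minimal sketch of the change, using the `max_features` parameter of CountVectorizer, which keeps only the most frequent terms in the training corpus. The toy corpus is an assumption, and the cap is set to 5 here only because the toy vocabulary is tiny; on the real data it would be `max_features=5000`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only).
texts = [
    "central bank raises interest rates",
    "bank cuts rates as inflation slows",
    "stock markets rally on bank earnings",
    "local team wins the cup final",
]

# Unrestricted vocabulary, as in the first experiment.
vect_all = CountVectorizer()
X_all = vect_all.fit_transform(texts)

# Restricted vocabulary: keep only the most frequent terms.
MAX_FEATURES = 5  # 5000 in the setting described in the text
vect_top = CountVectorizer(max_features=MAX_FEATURES)
X_top = vect_top.fit_transform(texts)

print(X_all.shape, X_top.shape)
```

Everything downstream (transforming the test data, training, evaluation) stays the same; only the vectorizer changes.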
The new confusion matrix with this setting:
Now, clearly, while the average performance seems lower than before, the correct identification of relevant articles increased by over 20%. At this point, one may wonder whether this is what we want. The answer depends on the problem we are trying to solve. If we care about doing reasonably well on non-relevant article identification and as well as possible on relevant article identification, or about doing equally well on both, we could conclude that reducing the feature vector size was useful for this dataset with the Naive Bayes classifier.
Reason 2 in our list was the problem of skew in the data towards the majority class. There are several ways to address this. Two typical approaches are oversampling the instances belonging to minority classes and undersampling the majority class to create a balanced dataset. Imbalanced-Learn is a Python library that incorporates some of the sampling methods to address this issue. We will not delve into the details of this library here. Classifiers also have a built-in mechanism to address such imbalanced datasets, and we will see how to use it in the next subsection with another classifier, logistic regression.
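To make the idea of oversampling concrete, here is a minimal sketch of random oversampling written with the Python standard library only. It is not the Imbalanced-Learn API; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels

# Toy training data with a 1:4 imbalance (illustrative only).
train_texts = ["rates rise", "team wins", "film opens", "rain falls", "tour ends"]
train_labels = [1, 0, 0, 0, 0]

bal_texts, bal_labels = random_oversample(train_texts, train_labels)
print(Counter(bal_labels))
```

Resampling of this kind should be applied to the training split only, so that the test set continues to reflect the real class distribution.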
Class imbalance is one of the most common reasons for a classifier not doing well. One must always check whether this is the case for the task at hand and address it.
To address Reason 3, we now move to other algorithms. We begin with logistic regression.
Unlike Naive Bayes, which estimates probabilities based on feature occurrence in classes, logistic regression “learns” the weights for individual features based on how important they are for making a classification decision. The goal of logistic regression is to learn a linear separator between classes in the training data with the aim of maximizing the probability of the data. This “learning” of feature weights and of a probability distribution over all classes is done through a function called the “logistic” function, hence the name logistic regression.
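The role of the logistic function can be illustrated for the two-class case with a few lines of plain Python. The feature names, weights and bias below are made-up values standing in for what training would learn; they are not taken from any fitted model.

```python
import math

def logistic(z):
    """The logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights, one per feature, plus a bias term.
weights = {"growth": 1.2, "inflation": 0.8, "football": -1.5}
bias = -0.5

# Word counts for one document (features not listed have count 0).
counts = {"growth": 2, "football": 1}

# Linear score: bias plus the weighted sum of the feature values.
z = bias + sum(weights[word] * count for word, count in counts.items())

# Probability of the positive ("relevant") class under the model.
p_relevant = logistic(z)
print(round(z, 3), round(p_relevant, 3))
```

A positive weight pushes the probability of the positive class up, a negative weight pushes it down, and a score of zero corresponds to a probability of 0.5, which is where the linear separator lies.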
Let us take the 5,000-dimensional feature vectors from the last step of the Naive Bayes example and train a logistic regression classifier instead of Naive Bayes. The code snippet below shows how to use logistic regression for this task.
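A minimal sketch of this step is shown below. It assumes that the built-in imbalance mechanism mentioned earlier is sklearn's `class_weight="balanced"` option, which weights classes inversely to their frequency; the toy data and the `max_iter` value are also illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy imbalanced data (illustrative only): 1 = relevant, 0 = non-relevant.
train_texts = [
    "central bank raises interest rates",
    "inflation slows as fuel prices drop",
    "local team wins the cup final",
    "new art exhibition opens downtown",
    "film festival announces its lineup",
    "singer cancels the summer tour",
    "city marathon draws record crowd",
    "museum reopens after renovation",
]
y_train = [1, 1, 0, 0, 0, 0, 0, 0]
test_texts = [
    "bank cuts interest rates",
    "team loses the final",
    "fuel prices push inflation up",
    "festival crowd fills downtown",
]
y_test = [1, 0, 1, 0]

# Same capped vocabulary as in the reduced-feature Naive Bayes run.
vect = CountVectorizer(max_features=5000)
X_train_dtm = vect.fit_transform(train_texts)
X_test_dtm = vect.transform(test_texts)

# class_weight="balanced" makes errors on the rarer class count for more.
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
logreg.fit(X_train_dtm, y_train)

y_pred = logreg.predict(X_test_dtm)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print("Accuracy:", accuracy_score(y_test, y_pred))
print(cm)
```

Because only the classifier object changes, the result can be compared directly with the Naive Bayes run on the same features.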
The confusion matrix with this approach:
All the examples in this section demonstrate how changes in different steps affected the classification performance, and how to interpret the results. Clearly, we excluded many other possibilities such as: exploring other text classification algorithms, changing different parameters of various classifiers, coming up with better pre-processing methods etc. We leave them as further exercises for the reader, using the notebook as a playground.
The notebook can be found at:
The data can be found at: