NLP Fundamentals — Text Classifier (P2)

Organizing is what you do before you do something, so that when you do it, it is not all mixed up

In this section we will look at one of the most popular tasks in NLP: text classification. It is the task of assigning one or more categories, from a larger set of possible categories, to a given piece of text. It has a wide range of applications across diverse domains such as social media, e-commerce, healthcare, law, and marketing. A common example from our daily lives is classifying emails as spam or non-spam. Even though the purpose and application of text classification vary from domain to domain, the underlying abstract problem remains the same. This invariance of the core problem, together with its applications in a myriad of domains, makes text classification by far the most widely used NLP task in industry and one of the most researched in academia. In this chapter, we will discuss the usefulness of text classification and how to build text classifiers for your use cases, along with some practical tips for real-world scenarios.

A taxonomy of text classification


A Pipeline for Building Text Classification Systems

Text classification pipeline

One typically follows these steps when building a text classification system:

  1. Collect or create a labeled dataset suitable for the task
  2. Split the dataset into two parts (train and test) or three parts (train, validation, a.k.a. development, and test), and decide on the evaluation metric(s)
  3. Transform raw text into feature vectors
  4. Train a classifier using the feature vectors and the corresponding labels from the training set
  5. Using the evaluation metric(s) from step 2, benchmark the model performance on the test set
  6. Deploy the model to serve the real-world use case and monitor its performance.
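The steps above (minus deployment) can be sketched end to end with scikit-learn. This is a minimal illustration, not the chapter's actual dataset: the toy texts and labels below are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Step 1: a tiny placeholder labeled dataset (substitute your own)
texts = ["free money now", "meeting at noon", "win a prize", "project update",
         "cheap pills online", "lunch tomorrow maybe", "claim your reward",
         "quarterly report attached"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = non-spam

# Step 2: split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Step 3: transform raw text into feature vectors
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 4: train a classifier on the training vectors and labels
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

# Step 5: benchmark on the held-out test set with the chosen metric
print("accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))
```

The same skeleton holds whichever vectorizer and classifier you swap in later.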

One Pipeline, Many Classifiers

Naive Bayes Classifier

The next step is to pre-process the texts and then convert them into feature vectors. While there are many different ways to do the pre-processing, let us say we want to do the following: lowercasing, removal of punctuation, digits and any custom strings, and stopwords. The below code snippet shows this pre-processing and converting the train and test data into feature vectors using CountVectorizer in sklearn

Once you run this in the notebook, you will see that we ended up with a feature vector of over 45,000 features! We now have the data in the format we want, i.e., feature vectors. The next step is to train and evaluate a classifier. The code snippet below shows how to train and evaluate a Naive Bayes classifier with the features we extracted above.
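A minimal sketch of that training and evaluation step with scikit-learn's MultinomialNB follows. The texts and labels are placeholders, not the chapter's relevant/non-relevant articles dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder data standing in for the chapter's dataset
train_texts = ["great product works well", "terrible waste of money",
               "love it highly recommend", "broke after one day",
               "excellent quality", "awful do not buy"]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["works great", "terrible waste"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

nb = MultinomialNB()            # Naive Bayes with default Laplace smoothing (alpha=1.0)
nb.fit(X_train, train_labels)
pred = nb.predict(X_test)

print(confusion_matrix(test_labels, pred))
print(classification_report(test_labels, pred))
```

The confusion matrix and per-class precision/recall from `classification_report` are what the following discussion reads its error percentages from.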

The confusion matrix of this classifier on the test data:

The classifier does fairly well at identifying non-relevant articles, making errors only 14% of the time. However, it performs poorly on the second category, relevant articles, which are identified correctly only 42% of the time. An obvious thought may be to collect more data. This is correct and often the most rewarding approach, but in the interest of covering other approaches, let us assume that we cannot change the dataset or collect additional data. This is not a far-fetched assumption: in industry, one often does not have the luxury of collecting more data and has to work with what is available. We can think of a few possible reasons for this performance and ways to improve the classifier:

  1. Since we extracted all possible features, we ended up with a large, sparse feature vector in which most features are too rare and amount to noise. A sparse feature set also makes training harder.
  2. There are very few examples of relevant articles (~20%) compared to non-relevant articles (~80%) in the dataset. This class imbalance skews the learning process toward the non-relevant category, since there are so few examples of relevant articles.
  3. Perhaps we need a better learning algorithm.
  4. Perhaps we need a better pre-processing and feature extraction mechanism.
  5. Perhaps we should tune the classifier's parameters and hyperparameters.

Let us see how to improve classification performance by addressing some of these possible reasons. One way to approach Reason 1 is to reduce noise in the feature vectors. The approach in the earlier code example produced over 45,000 features. A large number of features introduces sparsity, i.e., most of the values in a feature vector are zero and only a few are non-zero. This in turn affects the ability of the text classification algorithm to learn. Let us see what happens if we restrict the vocabulary to 5,000 features and re-run the training and evaluation process. This requires changing the CountVectorizer instantiation as shown in the code snippet below, then repeating all the steps.

The new confusion matrix with this setting:

Now, while the average performance seems lower than before, correct identification of relevant articles increased by over 20%. At this point, one may wonder whether this trade-off is what we want. The answer depends on the problem we are trying to solve. If the goal is to do reasonably well on non-relevant articles while doing as well as possible on relevant ones, or to do equally well on both, then reducing the feature vector size was useful for this dataset with the Naive Bayes classifier.

Reason 2 in our list was the skew in the data toward the majority class. There are several ways to address this. Two typical approaches are oversampling the instances of the minority class or under-sampling the majority class to create a balanced dataset. Imbalanced-Learn is a Python library that implements several such sampling methods. While we will not delve into the details of this library here, many classifiers also have a built-in mechanism for handling imbalanced datasets. We will see how to use that in the next subsection, with another classifier: logistic regression.
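Without pulling in Imbalanced-Learn, the core idea of random oversampling can be sketched with plain scikit-learn utilities: duplicate minority-class rows (sampling with replacement) until the classes are balanced. The feature matrix below is a toy placeholder:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(10).reshape(-1, 1)          # 10 toy feature rows
y = np.array([0] * 8 + [1] * 2)           # imbalanced: 80% class 0, 20% class 1

# Upsample the minority class to match the majority count
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))                 # balanced class counts
```

Oversampling is applied to the training split only; the test set must keep its natural class distribution so that evaluation reflects the real task.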

Class imbalance is one of the most common reasons for a classifier not doing well. One must always check whether this is the case for their task and address it.

To address Reason 3, we now move to other algorithms, beginning with logistic regression.

Logistic Regression

Let us take the 5,000-dimensional feature vectors from the last step of the Naive Bayes example and train a logistic regression classifier instead. The code snippet below shows how to use logistic regression for this task.
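A sketch of that step, again over placeholder texts rather than the chapter's dataset. It also shows the built-in imbalance remedy mentioned above: `class_weight="balanced"` reweights classes inversely to their frequency during training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Placeholder data standing in for the chapter's 5,000-feature vectors
train_texts = ["great product", "love it", "excellent quality",
               "terrible waste", "broke quickly", "awful experience"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["excellent product", "terrible experience"]
test_labels = [1, 0]

vectorizer = CountVectorizer(max_features=5000)   # same capped vocabulary as before
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# class_weight="balanced" compensates for class imbalance without resampling
lr = LogisticRegression(class_weight="balanced")
lr.fit(X_train, train_labels)
pred = lr.predict(X_test)
print(confusion_matrix(test_labels, pred))
```

Because the feature extraction is unchanged, any difference in the confusion matrix relative to Naive Bayes is attributable to the learning algorithm and the class weighting.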

The confusion matrix with this approach:

All the examples in this section demonstrate how changes at different steps affect classification performance and how to interpret the results. Clearly, we excluded many other possibilities, such as exploring other text classification algorithms, changing the parameters of the various classifiers, and coming up with better pre-processing methods. We leave these as further exercises for the reader, using the notebook as a playground.

The notebook can be found at:

The data can be found at:

AI Researcher - NLP Practitioner