In this section we discussed feature engineering techniques using neural networks, such as word embeddings and character embeddings. The advantage of embedding-based features is that they give a dense, low-dimensional feature representation instead of the sparse, high-dimensional structure of bag of words, TF-IDF, and similar features. There are different ways of designing and using features based on neural embeddings.
Words and n-grams have long been the primary features in text classification. Different ways of vectorizing words have been proposed, and we used one such representation in the previous section, using CountVectorizer. In the past few years, neural network based architectures became popular for “learning” word representations, which are known as “word embeddings”. Loading and pre-processing the text data remains a common step. However, instead of vectorizing the texts using bag of words based features, we will now rely on neural embedding models. As mentioned earlier, we will use a pre-trained embedding model. Word2vec is a popular algorithm, and there are several pre-trained word2vec models trained on large corpora that one can download from the internet. Here we will use the one from Google. The following code snippet shows how to load this model into Python using gensim.
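The original snippet is not reproduced in this text. A minimal sketch of loading the Google News vectors with gensim might look as follows. The file path is an assumption (the model has to be downloaded and unzipped separately), and the import sits inside the function only so that the sketch can be read without gensim installed.

```python
def load_word2vec(path="GoogleNews-vectors-negative300.bin"):
    """Load pre-trained word2vec vectors stored in the binary word2vec format.

    `path` is a hypothetical local path to the downloaded model file.
    """
    # Imported lazily: gensim is a third-party dependency.
    from gensim.models import KeyedVectors

    # binary=True because the Google News vectors ship as a binary file.
    return KeyedVectors.load_word2vec_format(path, binary=True)


# Usage (needs the model file on disk, so it is left commented out):
# w2v_model = load_word2vec()
# vector = w2v_model["beautiful"]   # the embedding vector for one word
```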
This is a large model that can be seen as a dictionary where the keys are words in the vocabulary and the values are their learned embedding representations. Given a query word, if the word’s embedding is present in the dictionary, the lookup returns that embedding. How do we use these pre-learned embeddings to represent features? There are multiple ways of doing this. A simple approach is just to average the embeddings of the individual words in the text. The code snippet below shows a simple function to do this.
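As an illustration of the averaging idea, here is a small dependency-free sketch. The function name, the toy embeddings, and the use of plain lists instead of NumPy arrays are assumptions made to keep the example self-contained; any mapping from word to vector can be passed in.

```python
def embedding_feats(tokenized_texts, embeddings, dimension):
    """Average the word vectors of each text into a single feature vector.

    tokenized_texts: list of token lists
    embeddings: any mapping from word to a list of `dimension` floats
    """
    feats = []
    for tokens in tokenized_texts:
        total = [0.0] * dimension
        count = 0
        for token in tokens:
            if token in embeddings:          # skip out-of-vocabulary words
                vec = embeddings[token]
                total = [t + v for t, v in zip(total, vec)]
                count += 1
        if count:
            feats.append([t / count for t in total])
        else:
            feats.append(total)              # no known words: all-zero vector
    return feats


# Toy 3-dimensional "model" standing in for the 300-dimensional one.
toy_embeddings = {"good": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}
features = embedding_feats([["good", "movie", "zzz"], ["zzz"]], toy_embeddings, 3)
print(features)   # [[2.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
```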
Note that the function uses embeddings only for the words that are present in the dictionary and ignores the words for which embeddings are absent. Also note that the above code gives a single vector with DIMENSION (=300) components. We treat the resulting embedding vector as DIMENSION feature values that represent the entire text.
Once the feature engineering is done, the final step is similar to what we did in the previous section: use this feature set to train a classifier.
“A good way to deal with embeddings in production systems is to load them into an in-memory database such as Redis and build a cache on top for faster access.”
The notebook for this section can be found at: https://colab.research.google.com/github/ngduyanhece/practicalnlp/blob/master/Ch4/Word2Vec_Example.ipynb
Subword Embeddings and fastText
Word embeddings, as the name indicates, are about word representations. Even off-the-shelf embeddings seem to work well on a classification task, as we saw earlier. However, if a word in our dataset was not present in the pre-trained model’s vocabulary, how do we get a representation for this word? This problem is popularly known as the out-of-vocabulary (OOV) problem. In our previous example, we simply left such words out of feature extraction. Is there a better way?
fastText embeddings are based on the idea of enriching word embeddings with subword-level information. The embedding for each word is represented as the sum of the representations of its character n-grams. While this may seem like a longer process compared to just estimating word-level embeddings, it has two advantages:
- This approach can handle words that did not appear in training data (OOV).
- The implementation facilitates extremely fast learning on even very large corpora.
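The subword idea can be sketched in a few lines. This is a simplified illustration, not fastText's implementation: the library also keeps a vector for the whole word and hashes n-grams into buckets, both of which are omitted here. The n-gram range of 3 to 6 and the helper names are assumptions.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, with < and > marking word boundaries."""
    marked = "<" + word + ">"
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


def subword_vector(word, ngram_vectors, dimension):
    """Sum the vectors of all known n-grams of `word` (zeros if none are known)."""
    total = [0.0] * dimension
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


trigrams = char_ngrams("where", 3, 3)
print(trigrams)   # ['<wh', 'whe', 'her', 'ere', 're>']

# Even an unseen word gets a vector as long as some of its n-grams are known.
toy_ngram_vectors = {"<wh": [1.0, 1.0], "re>": [0.5, 0.0]}
oov_vector = subword_vector("where", toy_ngram_vectors, 2)
print(oov_vector)   # [1.5, 1.0]
```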
While fastText is a general-purpose library for learning embeddings, it also supports off-the-shelf text classification by providing end-to-end classifier training and testing; that is, we don’t have to handle feature extraction separately.
The rest of this subsection covers using the fastText classifier for text classification. We will work with the DBpedia dataset. It is a balanced dataset consisting of 14 classes, with 40,000 training and 5,000 testing examples per class. Thus, the total size of the dataset is 560,000 training and 70,000 testing data points.
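A rough sketch of the workflow, assuming the official `fasttext` Python package: training data is written one example per line with the label marked by a `__label__` prefix, after which training and testing are single calls. The helper names and file paths are hypothetical, and the import is kept inside the function so the formatting part runs without the package installed.

```python
def to_fasttext_line(label, text):
    """Format one example the way fastText's supervised mode expects it."""
    # Labels carry the "__label__" prefix; the text follows on the same line.
    cleaned = " ".join(text.split())   # collapse newlines and repeated spaces
    return "__label__" + str(label) + " " + cleaned


def train_and_evaluate(train_path, test_path):
    """Train a fastText classifier and evaluate it (requires the fasttext package)."""
    import fasttext  # imported lazily: third-party dependency

    model = fasttext.train_supervised(input=train_path)
    # test() returns (number of examples, precision at 1, recall at 1)
    return model.test(test_path)


line = to_fasttext_line(3, "Apple Inc. is  a technology\ncompany")
print(line)   # __label__3 Apple Inc. is a technology company
```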
The notebook for this section can be found at:
As of today, fastText is a silver bullet for text classification. It is extremely fast to train and very useful for setting up very strong baselines. The flip side is the model size.
Deep Learning for Text Classification
Deep learning is a family of machine learning algorithms where the learning happens through different kinds of multi-layered neural network architectures. Over the past few years, it has shown remarkable improvements on standard machine learning tasks such as image classification, speech recognition, and machine translation. This has resulted in widespread interest in using deep learning for various tasks, including text classification. So far, we have seen how to train different machine learning classifiers using bag of words and different kinds of embedding representations. Let us now look at how to use deep learning architectures for text classification. The two most commonly used neural network architectures for text classification are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Long short-term memory (LSTM) networks are a popular form of RNNs. In this section, we will learn how to train CNNs and LSTMs for text classification, using the IMDB sentiment classification dataset.
The first step towards training any machine learning or deep learning model is to define a feature representation. This step was relatively straightforward in the approaches we saw so far, with bag of words or embedding vectors. Let us quickly recap the steps involved in converting training and test data into a format suitable for the neural network input layers:
1. Tokenize the texts and convert them into word index vectors.
2. Pad the text sequences so that all text vectors are of the same length.
3. Map every word index to an embedding vector. Conceptually, this amounts to multiplying one-hot word index vectors with the embedding matrix, which in practice is a row lookup. The embedding matrix can either be populated using pre-trained embeddings or be trained on this corpus.
4. Use the output from Step 3 as the input to a neural network architecture.
The code snippet below illustrates Steps 1–2 above:
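The sketch below spells out Steps 1 and 2 in plain Python so the mechanics are visible. It is an illustration under assumptions, not the original snippet: tokenization is a simple whitespace split, the helper names are made up, and padding is added at the end of each sequence. In Keras, the `Tokenizer` and `pad_sequences` utilities cover the same ground, with `pad_sequences` padding and truncating at the start by default.

```python
from collections import Counter


def build_word_index(texts, max_words=None):
    """Map each word to an integer index (1 = most frequent); 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}


def texts_to_sequences(texts, word_index):
    """Step 1: replace every known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_to_length(sequences, max_len):
    """Step 2: truncate or right-pad with zeros so every sequence has length max_len."""
    return [seq[:max_len] + [0] * (max_len - len(seq)) for seq in sequences]


texts = ["the movie was great", "the plot was thin and the acting was worse"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_to_length(sequences, 6)
print(padded)   # [[1, 3, 2, 4, 0, 0], [1, 5, 2, 6, 7, 1]]
```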
Step 3: If we want to use pre-trained embeddings to convert the train and test data into an embedding matrix, like we did in the earlier examples with word2vec and fastText, we have to download them, and use them to convert our data into the input format for the neural networks. The following code snippet shows an example of how to do this using GloVe embeddings. GloVe embeddings come with multiple dimensionalities, and we chose 100 as our dimension here. The value of dimensionality is a hyper-parameter and one can experiment with other dimensions as well.
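One way this step might be sketched: GloVe vectors are distributed as text files in which each line holds a word followed by its vector values, so the file can be parsed into a dictionary and then copied into a matrix whose row i belongs to the word with index i. The function names are hypothetical, the in-memory stand-in replaces a real file such as glove.6B.100d.txt, and plain lists are used instead of NumPy arrays to keep the sketch dependency-free.

```python
import io


def load_glove(lines):
    """Parse GloVe's text format: each line is a word followed by its vector values."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dimension):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dimension for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix


# A two-word, 3-dimensional stand-in for a real GloVe file, which would be
# opened with open(path, encoding="utf8") instead.
fake_glove_file = io.StringIO("the 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\n")
glove = load_glove(fake_glove_file)
embedding_matrix = build_embedding_matrix({"the": 1, "movie": 2, "zzz": 3}, glove, 3)
print(embedding_matrix)
```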
Step 4: Now, we are ready to train deep learning models for text classification! Deep learning architectures consist of an input layer, an output layer, and several hidden layers in between. Depending on the architecture, different hidden layers are used. The input layer for textual input is typically an embedding layer. The output layer, especially in the context of text classification, is a softmax layer with categorical output. If we want to train the input layer instead of using pre-trained embeddings, the easiest way is to call the Embedding layer class in Keras, specifying the input and output dimensions. However, since we want to use pre-trained embeddings, we should create a custom embedding layer that uses the embedding matrix we just built. The following code snippet shows you how to do that.
This will serve as the input layer for any neural network we want to use (CNN or LSTM). Now that we know how to pre-process the input and define an input layer, let us move on to specifying the rest of the neural network architecture, using CNNs and LSTMs.
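A minimal sketch of such a layer, assuming TensorFlow's bundled Keras. Passing the matrix through a constant initializer and freezing the layer is one common way to do this and is not necessarily how the original notebook does it. The function name is hypothetical, and the imports sit inside the function so the sketch can be read without TensorFlow installed.

```python
def make_pretrained_embedding_layer(embedding_matrix):
    """Build a frozen Keras Embedding layer initialised from `embedding_matrix`."""
    # Imported lazily: NumPy and TensorFlow/Keras are third-party dependencies.
    import numpy as np
    from tensorflow.keras.initializers import Constant
    from tensorflow.keras.layers import Embedding

    matrix = np.asarray(embedding_matrix, dtype="float32")
    vocab_size, dimension = matrix.shape
    return Embedding(
        input_dim=vocab_size,                     # number of rows (word indices)
        output_dim=dimension,                     # embedding size, e.g. 100 for GloVe
        embeddings_initializer=Constant(matrix),  # start from the pre-trained vectors
        trainable=False,                          # keep them fixed during training
    )


# Usage (requires TensorFlow, so it is left commented out):
# embedding_layer = make_pretrained_embedding_layer(embedding_matrix)
```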
CNNs for Text Classification
Let us now look at how to define, train, and evaluate a CNN model for text classification. CNNs typically consist of a series of convolution and pooling layers as the hidden layers. In the context of text classification, CNNs can be thought of as learning the most useful bag-of-words/n-gram features, instead of taking the entire collection of words/n-grams as features as we did earlier in this chapter. Since our dataset has only two classes, positive and negative, the output layer has two outputs with the softmax activation function. We will define a CNN with three convolution-pooling layers using the Sequential model class in Keras, which allows us to specify deep learning models as a sequential stack of layers, one after another. Once the layers and their activation functions are specified, the next task is to define other important parameters such as the optimizer, the loss function, and the evaluation metric used to tune the hyperparameters of the model. Once all this is done, the next step is to train and evaluate the model. The following code snippet shows one way of specifying a CNN architecture for this task using the Python library Keras, and the results with the IMDB dataset for this model.
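The original snippet and its results are not reproduced in this text. The sketch below shows what a three-block CNN of this kind might look like in Keras. The filter counts, kernel and pool sizes, optimizer, and training settings are assumptions chosen for illustration, and the stack expects long padded sequences (for example, 1,000 tokens) and one-hot encoded labels.

```python
def build_cnn(embedding_layer, num_classes=2):
    """Stack three convolution-pooling blocks on top of an embedding layer."""
    # Imported lazily: TensorFlow/Keras is a third-party dependency.
    from tensorflow.keras.layers import Conv1D, Dense, GlobalMaxPooling1D, MaxPooling1D
    from tensorflow.keras.models import Sequential

    cnnmodel = Sequential()
    cnnmodel.add(embedding_layer)
    cnnmodel.add(Conv1D(128, 5, activation="relu"))   # 128 filters over 5-word windows
    cnnmodel.add(MaxPooling1D(5))
    cnnmodel.add(Conv1D(128, 5, activation="relu"))
    cnnmodel.add(MaxPooling1D(5))
    cnnmodel.add(Conv1D(128, 5, activation="relu"))
    cnnmodel.add(GlobalMaxPooling1D())                # one value per filter
    cnnmodel.add(Dense(128, activation="relu"))
    cnnmodel.add(Dense(num_classes, activation="softmax"))
    cnnmodel.compile(loss="categorical_crossentropy",
                     optimizer="rmsprop",
                     metrics=["accuracy"])
    return cnnmodel


# Usage sketch (x_train, y_train, x_val, y_val, x_test, y_test are hypothetical arrays):
# cnnmodel = build_cnn(embedding_layer)
# cnnmodel.fit(x_train, y_train, batch_size=128, epochs=1,
#              validation_data=(x_val, y_val))
# loss, accuracy = cnnmodel.evaluate(x_test, y_test)
```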
As you can see in this code snippet, we made a lot of choices in specifying the model, such as activation functions, hidden layers, layer sizes, loss function, optimizer, metrics, epochs, and batch size. While there are some commonly recommended options for these, there is no consensus on one combination that works best for all datasets and problems. A good approach while building your models is to experiment with different settings (called hyperparameters). Keep in mind that all these decisions come with some associated cost. For example, in practice, the number of epochs is usually set to 10 or above, but this also increases the time it takes to train the model. Another thing to note is that if you want to train an embedding layer instead of using pre-trained embeddings in this model, the only thing that changes is the line cnnmodel.add(embedding_layer). Instead, we can specify a new embedding layer, for example, as cnnmodel.add(Embedding(Param1, Param2)). The figure below shows the code snippet and model performance for the same.
If you run this code in the notebook, you will notice that, in this case, training the embedding layer on our own dataset seems to result in better classification on test data. However, if the training data were substantially smaller, sticking to the pre-trained embeddings or using the domain adaptation techniques we will discuss later in this chapter would be a better choice.
LSTMs for Text Classification
LSTMs, and other variants of RNNs in general, have become the go-to way of doing neural language modeling in the past few years. This is primarily because language is sequential in nature and RNNs are specialized in working with sequential data. The current word in a sentence depends on its context, that is, the words before and after it. However, when we model text using CNNs, this crucial fact is not taken into account. RNNs work on the principle of using this context while learning the language representation or a model of language. Hence, they are known to work well for NLP tasks. There are also CNN variants that can take such context into account, and CNNs versus RNNs is still an open area of debate. In this section, we will see an example of using RNNs for text classification. Now that we have seen one neural network in action, it is relatively easy to train another: just replace the convolutional and pooling parts with an LSTM in the above two code examples. The following code snippet shows how to train an LSTM model using the same IMDB dataset for text classification.
You will notice that this code took much longer to run than the CNN example. Note that while LSTMs are more powerful in utilizing the sequential nature of text, they are much more data hungry than CNNs. Thus, the relatively lower performance of the LSTM on a dataset need not be interpreted as a shortcoming of the model itself. It is possible that the amount of data we have is not sufficient to utilize the full potential of an LSTM. As with the CNN, several parameters and hyperparameters play a very important role in model performance, and it is always good practice to explore multiple options and compare different models before settling on one.
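As a hedged illustration of that replacement, the sketch below swaps the convolution and pooling blocks for a single LSTM layer in Keras. The layer size, dropout rates, and optimizer are assumptions, and labels are again expected to be one-hot encoded.

```python
def build_lstm(embedding_layer, num_classes=2):
    """Embedding layer, one LSTM layer, and a softmax output."""
    # Imported lazily: TensorFlow/Keras is a third-party dependency.
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.models import Sequential

    rnnmodel = Sequential()
    rnnmodel.add(embedding_layer)
    # The LSTM reads the sequence token by token and returns its final state.
    rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    rnnmodel.add(Dense(num_classes, activation="softmax"))
    rnnmodel.compile(loss="categorical_crossentropy",
                     optimizer="adam",
                     metrics=["accuracy"])
    return rnnmodel


# Usage sketch (the arrays are hypothetical):
# rnnmodel = build_lstm(embedding_layer)
# rnnmodel.fit(x_train, y_train, batch_size=32, epochs=1,
#              validation_data=(x_val, y_val))
```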
Text Classification with large pre-trained language models
In the past two years, there have been great improvements in using neural network based text representations for NLP tasks. We have discussed these under the section “Universal Text Representations”. These representations have been successfully used for text classification in the recent past by fine-tuning the pre-trained models to the given task and dataset. BERT is a popular model used in this way for text classification. Let us take a look at how to use BERT for text classification, using the IMDB dataset we used earlier in this section. The full code can be accessed in the relevant notebook.
Often, deep learning based text classifiers are nothing but a condensed representation of the data they were trained on. These models are often only as good as the training dataset. Selecting the right dataset becomes all the more important in such cases.
Learning with No or Less Data, and Adapting to New Domains
So far, we have seen examples of training different text classifiers with different text representations. In all these examples, we had a relatively large training dataset available for the task. However, in most real-world scenarios, such datasets are not readily available. In other cases, you may have an annotated dataset, but it might not be large enough to train a good classifier. There can also be cases where you have a large dataset of, say, customer complaints and requests for one product suite, but you are asked to customize your classifier to another product suite for which you have a very small amount of data; that is, you have to adapt an existing model to a new domain. In this section, let us discuss how to build good classification systems for such scenarios, where one has little or no data or has to adapt to new domain training data.
Less Training Data: Active Learning and Domain Adaptation
In the scenario described earlier when you collected small amounts of data using human annotations or bootstrapping, it may sometimes turn out that the amount of data was too small to build a good classification model. It is also possible that most of the requests we collected belonged to billing, and very few belonged to the other categories — which will result in a highly imbalanced dataset. Asking the agents to spend many hours doing manual annotation is not always feasible. What should we do in such scenarios?
One approach to address such problems is “active learning”, which is primarily about identifying which data points are more crucial to be used as training data. It helps to answer the following question — if you had 1000 data points but could get only 100 of them labelled, which 100 will you choose? What this means is that when it comes to training data, not all data points are equal. Some data points are more important as compared to others in determining the quality of the classifier trained. Active learning converts this into a continuous process.
The first step in active learning involves training the classifier with the available amount of data and using it to make predictions on new data. The data points where the classifier is very unsure of its predictions are sent to human annotators for correct classification. These data points are then added to the existing training data and the model is re-trained. This process is repeated until a satisfactory model performance is reached. Tools such as Prodi.gy have active learning solutions implemented for text classification, and support the efficient usage of active learning to create annotated data and text classification models quickly. The basic intuition behind active learning is as follows: the data points where the model is less confident are the ones that contribute most to improving the quality of the model, so only those data points need to be labeled.
Now, imagine a scenario for your customer complaint classifier where you have a lot of historical data for a range of products. However, you are now asked to tune it to work on a set of newer products. What is potentially challenging in this situation? Typical text classification approaches rely on the vocabulary of the training data. Hence, they are inherently biased towards the kind of language seen in the training data. So, if the new products are of a very different nature (e.g., the model is trained on a suite of electronic products, and we are using it with complaints about cosmetic products), classifiers trained on some other source data are unlikely to perform well. However, it is also not realistic to train a new model from scratch on each product or product suite, as we will again run into the problem of insufficient training data. Domain adaptation, also called transfer learning, is a method to address such scenarios. Here, we “transfer” what we learned from one domain (source) with large amounts of data to another domain (target) with a smaller amount of labeled data but large amounts of unlabeled data.
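The selection step of this loop can be sketched as follows. The sketch uses least-confidence sampling (pick the examples whose highest predicted class probability is lowest), which is one of several possible uncertainty measures and an assumption here; the function name and the probabilities are made up for illustration.

```python
def least_confident(probabilities, k):
    """Return indices of the k examples whose top predicted probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical class probabilities predicted for five unlabeled complaints.
predicted = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
]
to_label = least_confident(predicted, 2)
print(to_label)   # [3, 1]
```

The selected examples would go to the annotators, be added to the training set, and the classifier would be re-trained before the next round of selection.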
A typical pipeline for domain adaptation in text classification looks as follows:
1. Start with a large, pre-trained language model trained on a large dataset of the source domain (e.g., Wikipedia data).
2. Fine-tune this model using the target domain’s unlabeled data.
3. Train a classifier on the labeled target domain data by extracting feature representations from the fine-tuned language model from Step 2.
ULMFiT is a popular domain adaptation approach for text classification. In research experiments, it was shown that, with only 100 labeled examples, this approach matches the performance of training from scratch with 10–20 times more training examples on text classification tasks. When unlabeled data was used to fine-tune the pre-trained language model, it matched the performance of training from scratch with 50–100 times more labeled examples on the same tasks. Transfer learning methods are currently an active area of research in NLP. Their use for text classification has not yet shown dramatic improvements on standard datasets, nor are they commonly used in industry settings yet. But we can expect this approach to yield better results in the near future.
Establish strong baselines: A common fallacy is to start with a state-of-the-art algorithm. This is especially true in the present era of deep learning, where every day new approaches/algorithms keep coming up. However, it is always good to start with simpler approaches and try to establish strong baselines first. This is useful for three main reasons:
a) Helps you get a better understanding of the problem statement and key challenges.
b) Building a quick MVP helps us get initial feedback from end-users and stakeholders.
c) A state of the art research model may give us only a minor improvement compared to the baseline, but might come with a huge amount of technical debt.
Balanced Training Data: While working with classification, it is very important to have a balanced dataset where all categories have equal representation. An imbalanced dataset can adversely impact the learning of the algorithm and result in a biased classifier. While we cannot always control this aspect of the training data, there are various techniques to fix class imbalance. Some of them are collecting more data, resampling (undersampling the majority classes or oversampling the minority classes), and weight balancing.
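As a small illustration of weight balancing, the sketch below gives each class a weight inversely proportional to its frequency, using the common n_samples / (n_classes * class_count) heuristic. The function name and the "delivery" category are assumptions; many classifiers accept such weights through a class-weight parameter.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical imbalanced label set: 8 billing requests, 2 delivery requests.
labels = ["billing"] * 8 + ["delivery"] * 2
weights = balanced_class_weights(labels)
print(weights)   # {'billing': 0.625, 'delivery': 2.5}
```

Errors on the rare class then count four times as much as errors on the frequent one, which offsets the 8:2 imbalance.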
Combining models and humans in the loop: In practical scenarios, it makes sense to combine the outputs of multiple classification models, and hand-crafted rules from domain experts to achieve the best performance for the business. In other cases, it is practical to defer the decision to a human evaluator, if the machine is not sure of its classification decision. Finally, there could also be scenarios where the learnt model has to change with time and newer data. We will discuss some solutions for such scenarios in the last part of the book which focuses on end to end systems.
Make it work, make it better: Building a classification system is not just about building a model. In most industrial settings, building a model is often just 5–10% of the total project. The rest consists of gathering data, building data pipelines, deployment, testing, monitoring, etc. It is always good to quickly build a model, use it to build a system, and then start improvement iterations. This helps you quickly identify major roadblocks and the parts that need the most work, and often it is not the modeling part.