NLP Fundamentals: Embedding Words (P4)

Representing discrete types (e.g., words) as dense vectors is at the core of deep learning’s successes in NLP. The terms “representation learning” and “embedding” refer to learning this mapping from one discrete type to a point in the vector space. When the discrete types are words, the dense vector representation is called a word embedding.

Why Learn Embeddings?

Low-dimensional learned dense representations have several benefits over one-hot and count-based vectors. First, reducing the dimensionality is computationally efficient. Second, count-based representations result in high-dimensional vectors that redundantly encode similar information along many dimensions and do not share statistical strength. Third, very high input dimensionality can cause real problems in machine learning and optimization, a phenomenon often called the curse of dimensionality. Traditionally, dimensionality reduction approaches like singular value decomposition (SVD) and principal component analysis (PCA) are employed to deal with this problem, but somewhat ironically, these approaches do not scale well when dimensionality is on the order of millions (the typical case in NLP). Fourth, representations learned (or fine-tuned) from task-specific data are better suited to the task at hand. With heuristics like TF-IDF or low-dimensional approaches like SVD, it is not clear whether the optimization objective of the embedding approach is relevant to the task.

Efficiency of Embeddings

To understand how embeddings work, let’s take a look at an example of a one-hot vector multiplying the weight matrix in a Linear layer, as demonstrated in the accompanying figure.

By definition, the weight matrix of a Linear layer that accepts this one-hot vector as input must have the same number of rows as the size of the one-hot vector. When you perform the matrix multiplication, the result is simply the row indicated by the nonzero entry. Based on this observation, we can skip the multiplication step and instead use an integer as an index to retrieve the selected row directly.
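The row-selection observation can be checked numerically. The following is a minimal sketch using NumPy rather than an actual Linear layer; the sizes and the contents of `weights` are made up purely for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3

# Hypothetical weight matrix with one row per vocabulary entry.
weights = np.arange(vocab_size * embedding_dim, dtype=float).reshape(
    vocab_size, embedding_dim
)

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ weights    # full multiplication over every row
via_lookup = weights[word_index]  # direct lookup of a single row

print(np.array_equal(via_matmul, via_lookup))  # True
print(via_lookup)                              # [6. 7. 8.]
```

The lookup touches only one row, whereas the multiplication visits every row of the matrix, which is why an integer index is the cheaper route for large vocabularies.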

Approaches to Learning Word Embeddings

All word embedding methods train with just words (i.e., unlabeled data), but in a supervised fashion. This is possible by constructing auxiliary supervised tasks in which the data is implicitly labeled, with the intuition that a representation that is optimized to solve the auxiliary task will capture many statistical and linguistic properties of the text corpus in order to be generally useful. Here are some examples of such auxiliary tasks:

  • Given a sequence of words, predict the next word. This is also called the language modeling task.
  • Given a sequence of words before and after, predict the missing word.
  • Given a word, predict words that occur within a window, independent of the position.

Of course, this list is not complete, and the choice of the auxiliary task depends on the intuition of the algorithm designer and the computational expense.
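To make "implicitly labeled" concrete, here is a small sketch of how each of the three auxiliary tasks listed above could turn a plain token list into (input, target) pairs. The function names and the `window` parameter are illustrative assumptions, not part of any particular library.

```python
def next_word_pairs(tokens):
    """Language modeling: predict each token from all tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def missing_word_pairs(tokens, window=2):
    """Predict a word from the words just before and after it."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def window_word_pairs(tokens, window=2):
    """Given a word, predict each word in its window, ignoring position."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "i pitied frankenstein my pity amounted to horror".split()
print(next_word_pairs(tokens)[1])     # (['i', 'pitied'], 'frankenstein')
print(missing_word_pairs(tokens)[2])  # (['i', 'pitied', 'my', 'pity'], 'frankenstein')
print(window_word_pairs(tokens)[:2])  # [('i', 'pitied'), ('i', 'frankenstein')]
```

In every case the labels come from the text itself, so no manual annotation is needed.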

Example: Learning the Continuous Bag of Words Embeddings

In this example, we walk through one of the most famous models intended to construct and learn general-purpose word embeddings, the Word2Vec Continuous Bag-of-Words (CBOW) model. In this section, when we refer to “the CBOW task” or “the CBOW classification task,” it is implicit that we are constructing a classification task for the purpose of learning CBOW embeddings. The CBOW model is a multiclass classification task represented by scanning over texts of words, creating a context window of words, removing the center word from the context window, and classifying the context window to the missing word. Intuitively, you can think of it like a fill-in-the-blank task. There is a sentence with a missing word, and the model’s job is to figure out what that word should be.

The Frankenstein Dataset

For this example, we will build a text dataset from a digitized version of Mary Shelley’s novel Frankenstein, available via Project Gutenberg. This section walks through the preprocessing; building a PyTorch Dataset class for this text dataset; and finally splitting the dataset into training, validation, and test sets.

Starting with the raw text file that Project Gutenberg distributes, the preprocessing is minimal: we use NLTK’s Punkt tokenizer to split the text into separate sentences, then convert each sentence to lowercase and remove the punctuation completely. This preprocessing allows us to later split the strings on whitespace in order to retrieve a list of tokens. The preprocessing function is reused from “Example: Classifying Sentiment of Restaurant Reviews”.
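The per-sentence cleanup might look like the sketch below. It is not the function reused from the earlier example; `preprocess_sentence` is a hypothetical stand-in, and the sentence-splitting step with NLTK is left out so the snippet runs without extra downloads.

```python
import re


def preprocess_sentence(sentence):
    """Lowercase a sentence and replace every non-letter run with one space."""
    lowered = sentence.lower()
    # Assumption: only the letters a-z are kept, so punctuation, digits,
    # and accented characters are all dropped.
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return letters_only.strip()


sentences = ["I pitied Frankenstein; my pity amounted to horror."]
cleaned = [preprocess_sentence(s) for s in sentences]
tokens = cleaned[0].split()
print(tokens)
# ['i', 'pitied', 'frankenstein', 'my', 'pity', 'amounted', 'to', 'horror']
```

Because punctuation is replaced by spaces, a plain `split()` on whitespace is enough to recover the tokens afterward.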

The next step is to enumerate the dataset as a sequence of windows so that the CBOW model can be optimized. To do this, we iterate over the list of tokens in each sentence and group them into windows of a specified window size.
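One way to sketch this enumeration is shown below. The row layout (a `context` string and a `target` word) and the function name `build_cbow_rows` are assumptions made for illustration; here `window_size` counts the words taken on each side of the center word.

```python
def build_cbow_rows(sentences, window_size=2):
    """Turn cleaned sentences into rows of context words and a target word."""
    rows = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - window_size):i]
            right = tokens[i + 1:i + 1 + window_size]
            context = left + right
            if not context:
                continue  # a one-word sentence has no context to learn from
            rows.append({"context": " ".join(context), "target": target})
    return rows


rows = build_cbow_rows(["i pitied frankenstein my pity"], window_size=2)
for row in rows[:3]:
    print(row)
# {'context': 'pitied frankenstein', 'target': 'i'}
# {'context': 'i frankenstein my', 'target': 'pitied'}
# {'context': 'i pitied my pity', 'target': 'frankenstein'}
```

Windows near the start or end of a sentence are shorter than the rest, which is why the vectorizer later needs a padding strategy.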

The final step in constructing the dataset is to split the data into three sets: the training, validation, and test sets. Recall that the training and validation sets are used during model training: the training set is used to update the parameters, and the validation set is used to measure the model’s performance. The test set is used at most once to provide a less biased measurement. In this example (and in most examples in this book), we use a split of 70% for the training set, 15% for the validation set, and 15% for the test set.
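A 70/15/15 split can be sketched with the standard library alone. The function name, the seed value, and the decision to shuffle individual rows are assumptions for illustration; the original notebook may split differently.

```python
import random


def split_rows(rows, train=0.70, val=0.15, seed=1337):
    """Shuffle rows reproducibly and cut them into train/val/test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train * len(shuffled))
    n_val = round(val * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever remains
    }


splits = split_rows(range(100))
print({name: len(part) for name, part in splits.items()})
```

Fixing the seed keeps the three sets identical across runs, so validation and test numbers stay comparable between experiments.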

Vocabulary, Vectorizer, and DataLoader

In the CBOW classification task, the pipeline from text to vectorized minibatch is mostly standard. However, the Vectorizer in this case does not construct one-hot vectors. Instead, it constructs and returns a vector of integers representing the indices of the context. Note that if the number of tokens in the context is less than the max length, the remaining entries are filled with zeros. In practice, this is referred to as padding with zeros.
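The index-plus-padding behavior can be sketched as follows. The `vectorize` function, the toy `token_to_idx` mapping, and the choice of index 1 for unknown words are illustrative assumptions; only the use of zero for padding comes from the text.

```python
def vectorize(context, token_to_idx, vector_length, mask_index=0, unk_index=1):
    """Map context words to integer indices and pad with zeros to a fixed length."""
    indices = [token_to_idx.get(token, unk_index) for token in context.split()]
    indices = indices[:vector_length]  # guard against overly long contexts
    return indices + [mask_index] * (vector_length - len(indices))


token_to_idx = {"<MASK>": 0, "<UNK>": 1, "i": 2, "pitied": 3, "frankenstein": 4}

print(vectorize("i frankenstein", token_to_idx, vector_length=4))  # [2, 4, 0, 0]
print(vectorize("i abhorred", token_to_idx, vector_length=4))      # [2, 1, 0, 0]
```

Every vector has the same length, so a batch of them stacks cleanly into a single integer tensor.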

The CBOWClassifier Model

The CBOWClassifier has three essential steps. First, indices representing the words of the context are used with an Embedding layer to create vectors for each word in the context. Second, the vectors are combined in a way that captures the overall context. In this example, we sum over the vectors. However, other options include taking the max, the average, or even using a Multilayer Perceptron on top. Third, the context vector is used with a Linear layer to compute a prediction vector. After a softmax, this prediction vector is a probability distribution over the entire vocabulary. The largest (most probable) value in the prediction vector indicates the likely prediction for the target word, that is, the center word missing from the context.

The Embedding layer that is used here is parameterized primarily by two numbers: the number of embeddings (size of the vocabulary) and the size of the embeddings (embedding dimension). A third argument is padding_idx. This argument is used as a sentinel value to the Embedding layer for situations like ours where the data points might not all be the same length.
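The three steps and the Embedding arguments can be put together in a short PyTorch sketch. This is a simplified illustration rather than the notebook's exact model: the layer names and the `apply_softmax` flag are assumptions, and extras such as dropout are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBOWClassifier(nn.Module):
    def __init__(self, vocabulary_size, embedding_size, padding_idx=0):
        super().__init__()
        # The row at padding_idx starts as zeros and receives no gradient,
        # so padded positions add nothing to the summed context vector.
        self.embedding = nn.Embedding(num_embeddings=vocabulary_size,
                                      embedding_dim=embedding_size,
                                      padding_idx=padding_idx)
        self.fc1 = nn.Linear(in_features=embedding_size,
                             out_features=vocabulary_size)

    def forward(self, x_in, apply_softmax=False):
        x_embedded = self.embedding(x_in)  # (batch, context_len, embedding_size)
        x_context = x_embedded.sum(dim=1)  # (batch, embedding_size)
        y_out = self.fc1(x_context)        # (batch, vocabulary_size) raw scores
        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)
        return y_out


model = CBOWClassifier(vocabulary_size=10, embedding_size=4)
batch = torch.tensor([[2, 4, 0, 0], [3, 5, 6, 7]])  # first row is zero-padded
probabilities = model(batch, apply_softmax=True)
print(probabilities.shape)  # torch.Size([2, 10])
```

Returning raw scores by default is a common design choice because PyTorch's cross-entropy loss applies the softmax internally; the softmax is then only needed when probabilities are inspected directly.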

The Training Routine

In this example, the training routine follows the standard we’ve used throughout the book. First, initialize the dataset, vectorizer, model, loss function, and optimizer. Then iterate through the training and validation portions of the dataset for a certain number of epochs, optimizing for loss minimization on the training portion and measuring progress on the validation portion.
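The shape of such a routine is sketched below on random stand-in data, so it runs without the Frankenstein files. Everything specific here is an assumption for illustration: the `TinyCBOW` model, the sizes, the Adam optimizer, and the learning rate are not taken from the original notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
vocabulary_size, embedding_size, context_length = 20, 8, 4


class TinyCBOW(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(vocabulary_size, embedding_size, padding_idx=0)
        self.fc1 = nn.Linear(embedding_size, vocabulary_size)

    def forward(self, x_in):
        return self.fc1(self.embedding(x_in).sum(dim=1))


def make_random_dataset(n_examples):
    # Stand-in for vectorized (context, target) pairs.
    x = torch.randint(1, vocabulary_size, (n_examples, context_length))
    y = torch.randint(1, vocabulary_size, (n_examples,))
    return TensorDataset(x, y)


train_loader = DataLoader(make_random_dataset(64), batch_size=16, shuffle=True)
val_loader = DataLoader(make_random_dataset(32), batch_size=16)

model = TinyCBOW()
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

history = {"train_loss": [], "val_loss": []}
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = loss_func(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    history["train_loss"].append(running_loss / len(train_loader))

    model.eval()  # validation only measures progress; no parameter updates
    running_loss = 0.0
    with torch.no_grad():
        for x_batch, y_batch in val_loader:
            running_loss += loss_func(model(x_batch), y_batch).item()
    history["val_loss"].append(running_loss / len(val_loader))

print(history["train_loss"][0], history["train_loss"][-1])
```

On random targets only the training loss is expected to fall, since there is nothing generalizable to learn; with real text the validation loss is the number to watch.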

Model Evaluation and Prediction

The evaluation in this example is based on predicting a target word from a provided word context for each target and context pair in the test set. A correctly classified word means that the model is learning to predict words from contexts. In this example, the model achieves 15% target word classification accuracy on the test set. There are a few reasons why the result is not super high. First, the construction of the CBOW in this example was meant to be illustrative of how one might construct general-purpose embeddings. As such, there are many properties of the original implementation that have been left out because they add complexity unnecessary for learning (but necessary for optimal performance). Second, the dataset we are using is minuscule: a single book with roughly 70,000 words is not enough data to identify many regularities when training from scratch. In contrast, state-of-the-art embeddings are typically trained on datasets with terabytes of text.
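Target word classification accuracy can be computed as in the sketch below. The helper name `compute_accuracy` and the toy scores are assumptions for illustration; the numbers are not the model's real outputs.

```python
import torch


def compute_accuracy(y_pred, y_target):
    """Percentage of rows whose highest-scoring class matches the target."""
    predicted = y_pred.argmax(dim=1)
    n_correct = (predicted == y_target).sum().item()
    return 100.0 * n_correct / len(y_target)


# Toy scores over a three-word vocabulary for four test contexts.
scores = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.0, 0.1, 3.0],
                       [0.9, 0.8, 0.1]])
targets = torch.tensor([1, 0, 2, 1])

print(compute_accuracy(scores, targets))  # 75.0
```

Since the argmax is the same before and after a softmax, the function works on either raw scores or probabilities.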

Notebook for Practice

Data could be found at:
