NLP Fundamentals — Embedding Words (P4)

Representing discrete types (e.g., words) as dense vectors is at the core of deep learning’s successes in NLP. The terms “representation learning” and “embedding” refer to learning this mapping from a discrete type to a point in a vector space. When the discrete types are words, the dense vector representation is called a word embedding.

Why Learn Embeddings?

Efficiency of Embeddings

Consider a Linear layer that accepts a one-hot vector as input. By definition, the layer’s weight matrix must have as many rows as the one-hot vector has entries. When you perform the matrix multiplication, the result is simply the row of the weight matrix indicated by the nonzero entry. Based on this observation, we can skip the multiplication step and instead use an integer as an index to retrieve the selected row directly.
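The equivalence described above can be checked with a small sketch. The weight values and the helper names `one_hot` and `vector_matrix_product` are made up for illustration; plain Python lists stand in for tensors.

```python
# Toy weight matrix: 4 vocabulary entries (rows), 3 embedding dimensions (columns).
# The values are arbitrary and only serve the illustration.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_matrix_product(vector, matrix):
    """Multiply a row vector by a matrix, one output column at a time."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(n_rows))
        for c in range(n_cols)
    ]


word_index = 2

# Route 1: build the one-hot vector and multiply (cost grows with vocabulary size).
via_matmul = vector_matrix_product(one_hot(word_index, len(weights)), weights)

# Route 2: use the integer directly as a row index.
via_lookup = weights[word_index]

print(via_matmul == via_lookup)
```

Both routes return the same row, but the lookup never touches the other rows, which is why an embedding layer stores integer indices rather than one-hot vectors.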

Approaches to Learning Word Embeddings

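Word embeddings are usually learned from unlabeled text by constructing an auxiliary supervised task. Some examples of such tasks:

  • Given a sequence of words, predict the next word. This is also called the language modeling task.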
  • Given a sequence of words before and after, predict the missing word.
  • Given a word, predict words that occur within a window, independent of the position.

Of course, this list is not complete, and the choice of the auxiliary task depends on the intuition of the algorithm designer and the computational expense.
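To make the three tasks in the list concrete, here is a rough sketch of how each one could turn a token list into (input, target) training pairs. The function names, the example sentence, and the window size of 2 are assumptions chosen for illustration, not a prescribed recipe.

```python
# Illustrative sentence; any tokenized text would do.
tokens = "the monster fled into the night".split()


def next_word_pairs(tokens):
    """Language modeling: the words so far are the input, the next word is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def missing_word_pairs(tokens, window=2):
    """Words before and after a position are the input, the word at that position is the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def window_word_pairs(tokens, window=2):
    """A single word is the input; each word within the window is a target, position ignored."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


print(next_word_pairs(tokens)[0])
print(missing_word_pairs(tokens)[2])
print(window_word_pairs(tokens)[:2])
```

The second function corresponds to the setup used by the CBOW example later in this post; the third produces one pair per neighboring word instead of one pair per position.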

Example: Learning the Continuous Bag of Words Embeddings

The Frankenstein Dataset

Starting with the raw text file that Project Gutenberg distributes, the preprocessing is minimal: we use NLTK’s Punkt tokenizer to split the text into separate sentences, then each sentence is converted to lowercase and the punctuation is completely removed. This preprocessing allows us to later split the strings on whitespace to retrieve a list of tokens. This preprocessing function is reused from “Example: Classifying Sentiment of Restaurant Reviews”.
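A minimal sketch of the per-sentence cleaning step follows. It assumes the sentences have already been split (for instance by NLTK’s Punkt tokenizer), so only the lowercasing, punctuation removal, and whitespace split are shown. `clean_sentence` is a hypothetical helper, not the function from the earlier example, and the sample sentences are invented.

```python
import re
import string


def clean_sentence(sentence):
    """Lowercase a sentence, drop ASCII punctuation, and collapse whitespace."""
    lowered = sentence.lower()
    # string.punctuation covers ASCII punctuation only; typographic quotes
    # or dashes in the raw file would need extra handling.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


# Invented sentences standing in for the sentence-split output.
sentences = ["The creature spoke, at last.", "Who would listen?"]

cleaned = [clean_sentence(s) for s in sentences]

# After cleaning, a plain whitespace split yields the token lists.
tokenized = [s.split(" ") for s in cleaned]
print(tokenized)
```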

The next step is to enumerate the dataset as a sequence of windows so that the CBOW model can be optimized. To do this, we iterate over the list of tokens in each sentence and group them into windows of a specified window size.
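The windowing step could look roughly like the sketch below. It assumes the window size counts the tokens on each side of the target and that sentence edges are padded with a placeholder token; `MASK_TOKEN` and `cbow_windows` are names chosen here for illustration.

```python
MASK_TOKEN = "<MASK>"  # hypothetical padding token for sentence edges


def cbow_windows(tokens, window_size=2):
    """Return (context, target) pairs with `window_size` tokens on each side of the target."""
    padding = [MASK_TOKEN] * window_size
    padded = padding + list(tokens) + padding
    examples = []
    # Visit only the positions that hold real tokens, skipping the padding.
    for i in range(window_size, len(padded) - window_size):
        left = padded[i - window_size:i]
        right = padded[i + 1:i + 1 + window_size]
        examples.append((left + right, padded[i]))
    return examples


examples = cbow_windows("i pitied frankenstein".split(), window_size=2)
for context, target in examples:
    print(context, "->", target)
```

Padding keeps every context the same length, so words near the start or end of a sentence still produce a training example.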

The final step in constructing the dataset is to split the data into three sets: the training, validation, and test sets. Recall that the training and validation sets are used during model training: the training set is used to update the parameters, and the validation set is used to measure the model’s performance. The test set is used at most once to provide a less biased measurement. In this example (and in most examples in this book), we use a split of 70% for the training set, 15% for the validation set, and 15% for the test set.
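One simple way to produce the 70/15/15 split is sketched below. Shuffling before splitting and the fixed seed are assumptions made here for reproducibility; `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(examples, train_frac=0.70, val_frac=0.15, seed=1337):
    """Shuffle the examples and cut them into train, val, and test partitions."""
    shuffled = list(examples)
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        # Whatever remains after train and val becomes the test set.
        "test": shuffled[n_train + n_val:],
    }


# Integers stand in for the (context, target) windows.
splits = split_dataset(range(100))
print({name: len(part) for name, part in splits.items()})
```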

Vocabulary, Vectorizer, and DataLoader

The CBOWClassifier Model

The Embedding layer that is used here is parameterized primarily by two numbers: the number of embeddings (size of the vocabulary) and the size of the embeddings (embedding dimension). A third argument is padding_idx. This argument is used as a sentinel value to the Embedding layer for situations like ours where the data points might not all be the same length.
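The effect of padding_idx can be seen in a short PyTorch sketch. The vocabulary size, embedding dimension, and index values are toy numbers, and summing the context vectors is one common way to pool them, assumed here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes: 6 vocabulary entries, 4-dimensional embeddings, index 0 reserved for padding.
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4, padding_idx=0)

# Two contexts of length 4; the second has only two real tokens and is padded with 0.
context = torch.tensor([[1, 2, 3, 4],
                        [5, 2, 0, 0]])

vectors = embedding(context)   # shape: (batch=2, context_length=4, embedding_dim=4)
summed = vectors.sum(dim=1)    # shape: (2, 4), one pooled vector per context

# The row at padding_idx is initialized to zeros, so padded positions add nothing.
print(embedding.weight[0])
print(torch.allclose(summed[1], embedding.weight[5] + embedding.weight[2]))
```

The row at padding_idx also receives no gradient updates, so it stays at zero throughout training.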

The Training Routine

Model Evaluation and Prediction

Notebook for Practice

The data can be found at:

AI Researcher - NLP Practitioner