NLP Fundamentals - Text Classifier (P3)

Word Embeddings

Subword Embeddings and fastText

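  • Subword embeddings can handle words that did not appear in the training data (out-of-vocabulary, or OOV, words), because a word's vector is built from the vectors of its character n-grams.
  • The fastText implementation trains extremely fast, even on very large corpora.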
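A minimal sketch of the subword idea, assuming character n-grams of length 3 to 6 with "<" and ">" as word-boundary markers, and assuming the word vector is the average of its known n-gram vectors. The function names and the toy n-gram table are hypothetical and the values are made up; the real fastText library also hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Average the vectors of the known n-grams; return zeros if none is known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


# Toy 2-dimensional n-gram table (made-up values, not trained embeddings).
toy_ngram_vectors = {
    "<pl": [1.0, 0.0],
    "ing": [0.0, 1.0],
    "ng>": [0.0, 3.0],
}

# "playing" has no entry of its own, so its vector comes only from its n-grams.
oov_vector = word_vector("playing", toy_ngram_vectors, dim=2)

print(char_ngrams("cat", 3, 3))  # ['<ca', 'cat', 'at>']
print(oov_vector)
```

Because "playing" shares the n-grams "<pl", "ing" and "ng>" with words seen in training, it still receives a non-zero vector even though the word itself was never observed.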

Deep Learning for Text Classification

  1. Tokenize the texts and convert them into word index vectors.
  2. Pad the text sequences so that all text vectors are of the same length.
  3. Map every word index to an embedding vector. We do so by multiplying one-hot word index vectors with the embedding matrix, which in practice amounts to a row lookup. The embedding matrix can either be populated with pre-trained embeddings or be trained from scratch on this corpus.
  4. Use the output from Step 3 as the input to a neural network architecture.
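Steps 1 to 3 can be sketched in plain Python as follows. This is an illustrative sketch, not a production pipeline: whitespace tokenization, the reserved indices for padding and unknown words, the helper names, and the randomly initialized embedding matrix are all assumptions made for the example. In practice a deep learning framework provides the tokenizer, the padding utility, and a trainable embedding layer.

```python
import random

PAD, UNK = 0, 1  # reserved indices (an assumption of this sketch)


def build_vocab(texts):
    """Step 1a: assign an integer index to every distinct token."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Step 1b: convert a text into a list of word indices."""
    return [vocab.get(token, UNK) for token in text.lower().split()]


def pad(seq, max_len):
    """Step 2: right-pad with PAD, or truncate, to exactly max_len indices."""
    return (seq + [PAD] * max_len)[:max_len]


def embed(seq, matrix):
    """Step 3: look up one embedding row per word index."""
    return [matrix[idx] for idx in seq]


texts = ["the movie was great", "terrible plot"]
vocab = build_vocab(texts)
max_len = 5
padded = [pad(encode(t, vocab), max_len) for t in texts]

# Randomly initialized embedding matrix: one row per vocabulary entry.
# Pre-trained vectors could be copied into these rows instead.
rng = random.Random(0)
emb_dim = 4
matrix = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in vocab]
matrix[PAD] = [0.0] * emb_dim  # keep padding positions at zero

# Shape: (number of texts, max_len, emb_dim); this is the input for Step 4.
embedded = [embed(seq, matrix) for seq in padded]

print(padded)  # [[2, 3, 4, 5, 0], [6, 7, 0, 0, 0]]
```

The lookup in `embed` gives the same result as multiplying a one-hot vector by the embedding matrix, without materializing the one-hot vector.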

CNNs for Text Classification

LSTMs for Text Classification

Text Classification with Large Pre-trained Language Models

Learning with No or Less Data, and Adapting to New Domains

Less Training Data: Active Learning and Domain Adaptation

  1. Start with a large, pre-trained language model trained on a large dataset of the source domain (e.g., Wikipedia data).
  2. Fine-tune this model using the target domain's unlabeled data.
  3. Train a classifier on the labeled target domain data, by extracting feature representations from the fine-tuned language model from Step 2.
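The three steps above might look roughly like this with the Hugging Face `transformers` and `datasets` libraries, PyTorch, and scikit-learn. This is a hedged sketch, not the author's setup: the base model `distilbert-base-uncased`, the masked-language-modeling objective, the output directory name, the hyperparameters, the use of the first token's hidden state as the feature vector, and the logistic regression classifier are all assumptions chosen for illustration.

```python
def adapt_language_model(unlabeled_texts, base_model="distilbert-base-uncased",
                         out_dir="adapted-lm"):
    """Steps 1-2: load a pre-trained LM and continue masked-LM training
    on unlabeled target-domain text. Returns the directory of the saved model."""
    from datasets import Dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForMaskedLM.from_pretrained(base_model)

    dataset = Dataset.from_dict({"text": list(unlabeled_texts)})
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )

    # Randomly masks tokens so the model learns to predict them in context.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                             per_device_train_batch_size=8)
    trainer = Trainer(model=model, args=args, train_dataset=dataset,
                      data_collator=collator)
    trainer.train()
    trainer.save_model(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir


def extract_features(texts, model_dir):
    """Step 3a: encode texts with the adapted model; use the first token's
    final hidden state as a fixed-size feature vector."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    encoder = AutoModel.from_pretrained(model_dir)
    encoder.eval()
    inputs = tokenizer(list(texts), padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0, :].numpy()


def train_classifier(labeled_texts, labels, model_dir):
    """Step 3b: fit a simple classifier on the extracted features."""
    from sklearn.linear_model import LogisticRegression

    features = extract_features(labeled_texts, model_dir)
    return LogisticRegression(max_iter=1000).fit(features, labels)
```

A typical call sequence would be `model_dir = adapt_language_model(unlabeled_texts)` followed by `clf = train_classifier(labeled_texts, labels, model_dir)`. The libraries are imported inside the functions so the definitions load even where those packages are not installed; running them requires the packages and a download of the base model.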

Practical Advice




AI Researcher - NLP Practitioner

Duy Anh Nguyen


