NLP Fundamentals: Text Classifier (P2)

A taxonomy of text classification


A Pipeline for Building Text Classification Systems

Text classification pipeline
  1. Collect or create a labeled dataset suitable for the task
  2. Split the dataset into two parts (train and test) or three parts: train, validation (a.k.a. development), and test. Decide on the evaluation metric(s)
  3. Transform raw text into feature vectors
  4. Train a classifier using the feature vectors and the corresponding labels from the training set
  5. Using the evaluation metric(s) from step 2, benchmark the model performance on the test set
  6. Deploy the model to serve the real-world use case and monitor its performance.
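
Steps 1 to 5 above can be sketched end to end in a few dozen lines. The following is a minimal illustration using only the Python standard library, assuming a tiny hand-made dataset (`DATA`), whitespace tokenization, bag-of-words counts as features, and a multinomial Naive Bayes classifier with add-one smoothing. All function names here are hypothetical helpers written for this sketch, not part of any library.

```python
import math
import random
from collections import Counter

# Step 1: a tiny labeled dataset (hypothetical, for illustration only).
DATA = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the team", "sports"),
    ("fans cheered the goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stocks fell as rates rose", "finance"),
    ("investors sold bank stocks", "finance"),
    ("the market rallied on earnings", "finance"),
]

def tokenize(text):
    return text.lower().split()

def split_dataset(examples, test_ratio=0.25, seed=0):
    """Step 2: shuffle, then split into (train, test)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def build_vocabulary(examples):
    tokens = sorted({tok for text, _ in examples for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Step 3: bag-of-words counts; tokens outside the vocabulary are dropped."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

def train_naive_bayes(vectors, labels):
    """Step 4: multinomial Naive Bayes with add-one (Laplace) smoothing."""
    n_features = len(vectors[0])
    class_counts = Counter(labels)
    feature_counts = {c: [0] * n_features for c in class_counts}
    for vec, label in zip(vectors, labels):
        for i, count in enumerate(vec):
            feature_counts[label][i] += count
    model = {}
    for c, n_docs in class_counts.items():
        total = sum(feature_counts[c]) + n_features
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = [math.log((fc + 1) / total) for fc in feature_counts[c]]
        model[c] = (log_prior, log_likelihood)
    return model

def predict(model, vec):
    def score(c):
        log_prior, log_likelihood = model[c]
        return log_prior + sum(n * ll for n, ll in zip(vec, log_likelihood))
    return max(model, key=score)

def accuracy(model, vectors, labels):
    """Step 5: the evaluation metric chosen in step 2 (here, plain accuracy)."""
    correct = sum(predict(model, v) == y for v, y in zip(vectors, labels))
    return correct / len(labels)

train_set, test_set = split_dataset(DATA)
vocab = build_vocabulary(train_set)  # vocabulary comes from the training set only
X_train = [vectorize(text, vocab) for text, _ in train_set]
y_train = [label for _, label in train_set]
X_test = [vectorize(text, vocab) for text, _ in test_set]
y_test = [label for _, label in test_set]

model = train_naive_bayes(X_train, y_train)
print(f"test accuracy: {accuracy(model, X_test, y_test):.2f}")
```

Step 6 (deployment and monitoring) is outside the scope of this sketch. In practice a library implementation would likely replace the hand-written vectorizer and classifier; the point here is only to show how the steps connect, and that the vocabulary is built from the training split alone so no information leaks from the test set.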

One Pipeline, Many Classifiers

Naive Bayes Classifier

Possible reasons for weak classifier performance:

  1. Since we extracted all possible features, we ended up with a large, sparse feature vector in which most features are too rare and act as noise. A sparse feature set also makes training harder.
  2. Relevant articles make up only about 20% of the dataset, compared with about 80% for non-relevant articles. This class imbalance skews the learning process towards the non-relevant category.
  3. Perhaps we need a better learning algorithm
  4. Perhaps we need a better pre-processing and feature extraction mechanism
  5. Perhaps we should tune the classifier’s parameters and hyper-parameters
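
The first two points in the list above have common, simple remedies, sketched below with the standard library only. This is an illustration under stated assumptions, not necessarily what was done for the dataset discussed here: `frequent_vocabulary` drops tokens that appear in fewer than `min_doc_freq` documents (document-frequency thresholding), and `oversample_minority` randomly duplicates examples of the smaller classes until all classes are the same size (random oversampling). Both function names, the threshold, and the `EXAMPLES` data are hypothetical.

```python
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def frequent_vocabulary(texts, min_doc_freq=2):
    """Keep only tokens that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))  # count each token once per document
    kept = sorted(tok for tok, df in doc_freq.items() if df >= min_doc_freq)
    return {tok: i for i, tok in enumerate(kept)}

def oversample_minority(examples, seed=0):
    """Randomly duplicate examples of smaller classes until every class
    is as large as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical training data with the roughly 20% / 80% split described above.
EXAMPLES = [
    ("economy grows this quarter", "relevant"),
    ("economy shrinks amid inflation", "relevant"),
] + [(f"sports report number {i}", "non-relevant") for i in range(8)]

balanced = oversample_minority(EXAMPLES)
print(Counter(label for _, label in balanced))

vocab = frequent_vocabulary([text for text, _ in EXAMPLES])
print(sorted(vocab))
```

Oversampling should be applied to the training split only; the test set should keep its original class distribution so that the evaluation reflects real-world conditions.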

Logistic Regression




