loop#1 Hyper-parameters, The Lottery Ticket Hypothesis, and the Weights & Biases platform

Duy Anh Nguyen
4 min read · Sep 5, 2020

ML Concept: What are Hyper-parameters?

The goal of ML applications is to create models that can master a task based on a dataset. But how do we know that our models are learning at an optimal rate? To find out, we need to regularly tune different aspects of the model and evaluate its performance. Think about the process of learning to play the guitar. With every new melody, we are constantly tuning different aspects of the guitar to make it sound better.

In ML, those aspects that you can fine-tune are known as hyper-parameters.

Conceptually, a hyper-parameter is a parameter that helps control the learning process. Hyper-parameters differ from other parameters in that they can’t be directly inferred from the training process. In that sense, hyper-parameters are set before the algorithm is trained, and they govern the entire training process. Some of the most common hyper-parameters include:

  • Learning Rate
  • Number of Epochs
  • Hidden Layers
  • Hidden Units
  • Activation Functions
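
To make these concrete, here is a minimal sketch (in Keras, with made-up values and dummy data) of how hyper-parameters are fixed before training ever starts and then shape both the architecture and the training loop:

```python
import numpy as np
import tensorflow as tf

# Hyper-parameters are chosen up front; illustrative values only,
# not tuned for any particular dataset.
LEARNING_RATE = 1e-3
EPOCHS = 20
HIDDEN_LAYERS = 2
HIDDEN_UNITS = 64
ACTIVATION = "relu"

# Dummy data just to keep the example self-contained.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

# The hyper-parameters above determine the architecture...
hidden = [tf.keras.layers.Dense(HIDDEN_UNITS, activation=ACTIVATION)
          for _ in range(HIDDEN_LAYERS)]
model = tf.keras.Sequential(hidden + [tf.keras.layers.Dense(1, activation="sigmoid")])

# ...and the training procedure.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=EPOCHS, batch_size=32)
```

Notice that none of these values are learned from the data; they have to be chosen by the practitioner or by a search procedure such as the W&B Sweeps discussed later in this post.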

Choosing and optimizing the right hyper-parameters is as boring as it sounds, but it plays a key role in the performance of an ML model. As a practitioner, be prepared to spend an annoying amount of time fine-tuning and optimizing hyper-parameters.

ML Research that you might need to know: The Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis was published in 2019 by MIT computer scientists Jonathan Frankle and Michael Carbin. It quickly became one of the most influential ML research papers of recent years.

The objective: The paper demonstrates that large neural networks contain smaller subnetworks that can be trained to match the performance of the parent network with a much smaller set of parameters.

Why is it so important: The Lottery Ticket Hypothesis challenged the traditional principles of ML optimization. The traditional approach suggests that we should train a large neural network and then use pruning techniques to reduce its size without sacrificing performance. However, the Lottery Ticket Hypothesis suggests that an optimally pruned configuration already exists as a subnetwork within the original neural network, reducing the time and resources spent on pruning and optimization.

Diving deeper: The core principle of the Lottery Ticket Hypothesis is summarized in the research paper as follows:

“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.”

The title of the paper is almost as good as its thesis, as it compares training a large neural network to buying a million tickets to win the lottery. That process is incredibly expensive and might not produce an optimal result. However, if we know that one of the tickets in the pool is the winner, we could develop a process to find it, which is the equivalent of finding the optimal configuration by inspecting pruned subnetworks contained in our original neural network. You might be questioning whether the authors ever play the lottery but, hopefully, you get the point.

If the Lottery Ticket Hypothesis is true, then the next obvious question is how to identify the winning ticket. Doing so involves an iterative routine of training and pruning, which can be summarized in the following five steps:

  1. Randomly initialize a neural network.
  2. Train the network until it converges.
  3. Prune a fraction of the network.
  4. To extract the winning ticket, reset the weights of the remaining portion of the network to their values from (1) — the initializations they received before training began.
  5. To evaluate whether the resulting network at step (4) is indeed a winning ticket, train the pruned, untrained network and examine its convergence behavior and accuracy.

The paper suggests that running this process iteratively produces the smallest and best-performing networks.
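
To make the five steps concrete, here is a rough Python/PyTorch sketch of that loop. The `train` function, the per-layer masks, and the magnitude-based pruning criterion are assumptions made for illustration; the paper’s exact training schedule and layer-wise pruning rates differ:

```python
import copy
import torch
import torch.nn as nn

def find_winning_ticket(model: nn.Module, train, prune_fraction=0.2, rounds=5):
    """Iterative train -> prune -> rewind loop (illustrative sketch).

    `train` is assumed to be a function that trains `model` to convergence
    while keeping masked-out weights at zero.
    """
    # (1) Remember the random initialization.
    initial_state = copy.deepcopy(model.state_dict())
    # Masks start as all-ones: every weight is still part of the network.
    masks = {name: torch.ones_like(p)
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        # (2) Train the (masked) network until it converges.
        train(model, masks)

        # (3) Prune a fraction of the smallest-magnitude surviving weights.
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.detach().abs()[masks[name].bool()]
            threshold = alive.quantile(prune_fraction)
            masks[name] = (p.detach().abs() > threshold).float() * masks[name]

        # (4) Reset the surviving weights to their original initialization.
        model.load_state_dict(initial_state)
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

    # (5) The masked, re-initialized network is the candidate winning ticket;
    #     train it once more from here and compare its accuracy to the dense net.
    return model, masks
```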

ML Technology that you might need to know: Weights and Biases

Weights and Biases (W&B) is one of the top platforms on the market for hyper-parameter optimization of ML models.

Why should I know about this: As a data scientist, be prepared to spend many hours optimizing hyper-parameter configurations. The toolsets for this task are still at a very early stage, and W&B is one of the leaders in this nascent area of the ML market. In just a few months, W&B captured the attention of AI powerhouses such as OpenAI, which adopted the stack as a key component of their experiments.

What is it: W&B is designed to keep track of ML experiments, evaluate results, and optimize hyper-parameter configurations. The platform provides a toolset that works consistently across different ML frameworks. Functionally, W&B can be divided into four fundamental tools:

  • Dashboard: Track experiments, visualize results
  • Reports: Save and share reproducible findings
  • Sweeps: Optimize models with hyper-parameter tuning
  • Artifacts: Dataset and model versioning, pipeline tracking

Data scientists can integrate W&B into their models by adding a few lines of code to their training scripts and then use the toolset to run different experiments and optimize the model.
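
For instance, a minimal integration might look like the sketch below; the project name, hyper-parameter values, and logged metric are placeholders rather than recommendations:

```python
import wandb

# Start a run and register the hyper-parameters for this experiment.
wandb.init(
    project="hyperparameter-demo",  # placeholder project name
    config={"learning_rate": 1e-3, "epochs": 10, "hidden_units": 64},
)
config = wandb.config  # the hyper-parameters are now tracked with the run

for epoch in range(config.epochs):
    # ... real training code would go here ...
    train_loss = 1.0 / (epoch + 1)  # dummy metric so the sketch runs end to end
    wandb.log({"epoch": epoch, "loss": train_loss})

wandb.finish()
```

Every logged metric then shows up in the Dashboard, and the run can be compared against other experiments in the same project.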

How can I use it: Simply create an account at https://www.wandb.com/, install the wandb Python package, and get going.
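
Once the wandb Python package is installed, a hyper-parameter sweep can even be defined and launched directly from Python. The sketch below is illustrative only; the search method, parameter ranges, metric name, and project name are placeholders:

```python
import wandb

# Define the search space and strategy for the sweep.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 0.0001, "max": 0.1},
        "epochs": {"values": [10, 20, 40]},
        "hidden_units": {"values": [32, 64, 128]},
    },
}

def train():
    run = wandb.init()            # each agent run receives a sampled config
    cfg = run.config
    # ... build and train a model using cfg.learning_rate, cfg.epochs, ...
    wandb.log({"val_loss": 0.5})  # dummy value so the sketch is runnable
    wandb.finish()

sweep_id = wandb.sweep(sweep_config, project="hyperparameter-demo")  # placeholder
wandb.agent(sweep_id, function=train, count=10)
```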
