Deep Learning meets PyTorch (part-1)

What is PyTorch?

PyTorch is a library for Python programs that facilitates building deep learning projects. It allows deep learning models to be expressed in Python. This approachability and ease of use found early adopters in the research community, and in the years since the library’s release it has grown into one of the most prominent deep learning tools across a broad range of applications.

PyTorch provides a core data structure, the Tensor, that shares similarities with NumPy arrays. From that foundation, a laundry list of features has been built that makes it easy to get a project up and running, or to get an investigation into a new neural network architecture designed and trained. Tensors provide acceleration of mathematical operations (assuming the appropriate hardware is present), and PyTorch has packages for distributed training, worker processes for efficient data loading, and an extensive library of common deep learning functions.

I believe that PyTorch should be the first deep learning library you learn; if it should be the last is a decision I’ll leave to you.

Why PyTorch?

PyTorch is easy to recommend because of its simplicity. Many researchers and practitioners find it easy to learn, use, extend, and debug. It’s Pythonic, and while, like any complicated domain, it has its caveats and best practices, using the library generally feels familiar to developers who have used Python previously.

For users familiar with NumPy arrays, the PyTorch Tensor class will be immediately familiar. PyTorch feels like NumPy, but with GPU acceleration and automatic computation of gradients, which makes it suitable for calculating backward pass data automatically starting from a forward expression.
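For instance, a minimal sketch (assuming PyTorch and NumPy are installed) of how tensors mirror NumPy arrays while adding GPU support and automatic gradients:

```python
import numpy as np
import torch

# A NumPy array and a PyTorch tensor built from it hold the same values.
a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a)          # zero-copy bridge from NumPy
print(t * 2)                     # tensor([2., 4., 6.], dtype=torch.float64)

# Tensors can be moved to the GPU with a single call, if one is available.
if torch.cuda.is_available():
    t = t.to("cuda")

# requires_grad=True asks PyTorch to track operations for the backward pass.
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x
y.backward()
print(x.grad)                    # dy/dx = 2*x + 3 = 7 at x = 2
```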

A design driver for PyTorch is expressivity, allowing a developer to implement complicated models without undue complexity being imposed by the library (it’s not a framework!). PyTorch arguably offers one of the most seamless translations of ideas into Python code in the deep learning landscape. For this reason, PyTorch has seen widespread adoption in research, as witnessed by the high citation counts at international conferences.

PyTorch also has a compelling story for the transition from research and development into production. While it was initially focused on research workflows, PyTorch has been equipped with a high-performance C++ runtime that can be leveraged to deploy models for inference without relying on Python (and hence without paying the overhead of Python calls).
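As a hedged illustration, one route to that runtime is TorchScript; a sketch of exporting a model that way might look like this (the SmallNet model and the file name are made up for the example):

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """A made-up two-layer network, just to have something to export."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = SmallNet().eval()
example = torch.randn(1, 4)

# Trace the model into a TorchScript program that the C++ runtime can load
# and run without a Python interpreter.
scripted = torch.jit.trace(model, example)
scripted.save("small_net.pt")    # later: torch::jit::load("small_net.pt") in C++
```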

Immediate vs. deferred execution

One key differentiator for deep learning libraries is immediate vs. deferred execution. Much of PyTorch’s ease of use is due to how it implements immediate execution, so we’re going to briefly cover how it works here.

Consider the Python expression (a**2 + b**2) ** 0.5, which implements the Pythagorean theorem. If we want to execute this expression, we need to have an a and a b handy.
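For example, a minimal sketch with made-up values:

```python
a = 3
b = 4

# The expression is evaluated the moment the interpreter reaches it.
c = (a**2 + b**2) ** 0.5
print(c)   # 5.0
```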

Immediate execution like this consumes inputs and produces an output. PyTorch, like Python in general, defaults to immediate execution (referred to as “eager mode” in the PyTorch documentation). This is useful because if there are problems executing the expression, the Python interpreter, debugger, and similar tools have direct access to the Python objects involved, and exceptions can be raised directly at the point where the issue occurred.

Alternatively, we could define the Pythagorean expression even before knowing what the inputs are, and then use that definition to produce the output once the inputs are available. The callable that results from the definition can be used later, repeatedly, with varied inputs.
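For example, a sketch in plain Python:

```python
def hypotenuse(a, b):
    # Nothing is computed when this definition is read; the expression
    # only runs when the function is called with concrete inputs.
    return (a**2 + b**2) ** 0.5

c = hypotenuse(3, 4)   # 5.0
d = hypotenuse(6, 8)   # 10.0 -- same definition, different inputs
```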

In the second case, we defined a series of operations to perform, which resulted in an output function. We didn’t actually execute anything until later, when we passed in the inputs. That’s deferred execution. That means that most exceptions will be raised when the function is called, not when it’s defined. For normal Python, like we see here, that’s fine, since the interpreter and debuggers have full access to the Python state at the time the error occurred.

Where things get tricky is when specialized classes are used that have heavy operator overloading, allowing what looks like immediate execution to actually be deferred under the hood. These often look like the following:
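Here is a sketch of the pattern, using TensorFlow 1.x-style graph mode as one concrete illustration (any library that builds a graph through overloaded operators looks broadly similar):

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API

a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)

# This looks like immediate arithmetic, but the overloaded operators only
# record nodes in a computation graph; nothing is evaluated yet.
c = (a**2 + b**2) ** 0.5

# The graph is executed later, once real values are bound to the placeholders.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))   # 5.0
```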

Often in libraries that use this form of function definition, the operations of squaring a and b, adding them, and then taking the square root are not recorded as high-level Python bytecode. Instead, the point is usually to compile the expression into a static computation graph (a graph of basic operations) that has some advantage over pure Python (say, compiling the math directly to CUDA for execution on a GPU for performance reasons).

The fact that the computation graph is built in one place and used in another makes debugging more difficult, as exceptions often lack specificity about what went wrong, and Python debugging tools don’t have any visibility into the intermediate states of the data. Static graphs also don’t usually mix well with standard Python flow control: they are de facto domain-specific languages implemented on top of a host language (Python, in our case).

Theano and TensorFlow in “graph mode” use deferred execution similar to this. Graph mode is the default mode of operation in TensorFlow 1.0. In contrast, PyTorch, like Chainer or DyNet, sports a “define by run” dynamic graph engine, by which the computation graph is built node by node, as the code is eagerly evaluated.
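As a small sketch of what define-by-run means in practice (the function below is made up for the example), ordinary Python control flow decides which operations end up in the graph, and the graph is built while the code runs:

```python
import torch

def scaled_norm(a, b, clamp=False):
    c = (a**2 + b**2) ** 0.5
    # Plain Python control flow: whether the clamp node exists in the graph
    # depends on the argument passed at run time.
    if clamp:
        c = torch.clamp(c, max=1.0)
    return c

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
c = scaled_norm(x, y)        # the graph is built node by node as this line runs
c.backward()
print(x.grad, y.grad)        # tensor(0.6000) tensor(0.8000)
```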

PyTorch has the batteries included

First off, PyTorch has the “Py” as in Python, but there’s a lot of non-Python code in it. Actually, for performance reasons, most of PyTorch is written in C++ and CUDA, a C++-like language that can be compiled to run with massive parallelism on GPUs. There are ways to run PyTorch directly from C++, and we’ll look into those in a later post, since they provide a reliable strategy for deploying models in production.

However, most of the time we’ll interact with PyTorch from Python, building models, training them, and using the trained models to solve actual problems (note that it’s perfectly fine to put a model into production using the normal Python API behind a Flask web server, for instance; it all depends on the requirements on performance and scale).

At the core, PyTorch is a library that provides multidimensional arrays, called tensors in PyTorch parlance, and an extensive library of operations on them, provided by the torch module. Both tensors and the related operations can run on the CPU or on the GPU. Running on the GPU results in massive speedups compared to the CPU (especially if we’re willing to pay for a top-end GPU), and with PyTorch doing so doesn’t require more than an additional function call or two.

The second core thing that PyTorch provides is the ability of tensors to keep track of the operations performed on them and to compute derivatives of an output with respect to any of its inputs analytically via back-propagation. This is provided natively by tensors, and further refined in torch.autograd. We could argue that by having tensors and the autograd-enabled tensor standard library, PyTorch could be used for more than “just” neural networks. Well, we would be correct: PyTorch can be used for physics, rendering, optimization, simulation, and modeling; we’re very likely to see PyTorch used in creative ways throughout the spectrum of scientific applications.
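As a hedged illustration of that broader use, here is a sketch of plain numerical optimization with autograd-enabled tensors and no neural network involved (the function being minimized is made up for the example):

```python
import torch

# Minimize f(x) = (x - 3)^2 by gradient descent, using only tensors and autograd.
x = torch.tensor(0.0, requires_grad=True)

for _ in range(100):
    loss = (x - 3.0) ** 2
    loss.backward()              # d(loss)/dx is accumulated in x.grad
    with torch.no_grad():
        x -= 0.1 * x.grad        # plain gradient-descent update
        x.grad.zero_()

print(x.item())                  # approximately 3.0
```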

But PyTorch is first and foremost a deep learning library, and as such it provides all the building blocks needed to build neural networks and train them.

The core PyTorch modules for building neural networks are located in torch.nn, which provides common neural network layers and other architectural components. Fully connected layers, convolutional layers, activation functions, and loss functions can all be found here.
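For instance, a minimal sketch (with made-up layer sizes and data) of composing those pieces:

```python
import torch
import torch.nn as nn

# A tiny fully connected network assembled from torch.nn building blocks.
model = nn.Sequential(
    nn.Linear(16, 32),   # fully connected layer
    nn.ReLU(),           # activation function
    nn.Linear(32, 2),
)
loss_fn = nn.CrossEntropyLoss()      # a loss function from the same module

x = torch.randn(8, 16)               # a batch of 8 made-up inputs
target = torch.randint(0, 2, (8,))   # made-up class labels
loss = loss_fn(model(x), target)
```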

Utilities for data loading and handling can be found in torch.utils.data. The two main classes we will work with are Dataset, which acts as the bridge between your custom data and a standardized PyTorch tensor, and DataLoader. The latter is the one we’ll see a lot of: it can spawn child processes to load data from a Dataset in the background so that it’s ready and waiting for the training loop as soon as the loop can use it.
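A short sketch of how the two classes fit together, using a toy in-memory dataset invented for the example:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A made-up dataset of (input, label) pairs held in memory."""
    def __init__(self, n=64):
        self.inputs = torch.randn(n, 16)
        self.labels = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]

# num_workers=2 spawns child processes that load batches in the background.
loader = DataLoader(ToyDataset(), batch_size=8, shuffle=True, num_workers=2)

for batch_inputs, batch_labels in loader:
    pass   # each iteration yields a ready-to-use batch for the training loop
```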

In the simplest case, the model will be running the required calculations on the local CPU or a single GPU, and so once the training loop has the data, computation can start immediately. It’s more common, however, to want to use specialized hardware like multiple GPUs, or to have multiple machines contribute their resources to training the model. In those cases, torch.nn.DataParallel and torch.distributed can be employed to leverage the additional hardware available.
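As a rough sketch (assuming more than one GPU is visible, and using a stand-in model), wrapping a model for single-machine multi-GPU data parallelism can be as simple as this; the multi-machine case goes through torch.distributed instead:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)           # stand-in model for the example

if torch.cuda.device_count() > 1:
    # Replicates the model across the visible GPUs and splits each input
    # batch among them.
    model = nn.DataParallel(model)

if torch.cuda.is_available():
    model = model.cuda()
```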

Once we have results from running our model on the training data, torch.optim provides standard ways of updating the model so that the output starts to more closely resemble the answers specified in the training data.
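A sketch of the usual pattern, with a made-up model, loss, and batch:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(16, 2)                           # stand-in model
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 16)       # made-up batch of inputs
target = torch.randn(8, 2)   # made-up targets

optimizer.zero_grad()        # clear gradients from the previous step
loss = loss_fn(model(x), target)
loss.backward()              # compute gradients of the loss w.r.t. parameters
optimizer.step()             # update the parameters using those gradients
```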

As mentioned earlier, PyTorch defaults to an immediate execution model (“eager mode”). Whenever an instruction involving PyTorch is executed by the Python interpreter, the corresponding operation is immediately carried out by the underlying C++ or CUDA implementation. More instructions mean more operations to be executed. This is typically as fast as it can be on the C++ side, but it incurs the cost of calling that implementation through Python. It’s a minute cost, but it adds up.

Wrapping up

In this post, I introduced where the world stands with deep learning and what tools one can use to be part of the revolution. We took a peek at what PyTorch has to offer and why it is worth investing time and energy in it, and I described what PyTorch looks like from a bird’s-eye view. As with any good story, wouldn’t it be great to take a peek at the amazing things PyTorch will enable us to do once we’ve completed our journey? Hold tight, the next post is aimed at exactly that.
