From David's Wiki

PyTorch is a popular machine learning library developed by Facebook.


See PyTorch Getting Started

# If using conda, python 3.5+, and CUDA 10.0 (+ compatible cudnn)
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Getting Started

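import torch
import torch.nn as nn

# Training
# net, criterion, optimizer, trainloader, and epochs are assumed to be defined elsewhere.
for epoch in range(epochs):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # accumulate the loss as a plain Python float
        running_loss += loss.item()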
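
The loop above relies on objects that are not defined on this page. As a minimal sketch, the following fills them in with a hypothetical toy regression problem (the data, the nn.Linear model, the MSE loss, and the SGD settings are all assumptions chosen only to make the loop runnable):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: y = x @ w_true + 0.5
x = torch.randn(64, 3)
w_true = torch.tensor([[1.0], [-2.0], [0.5]])
y = x @ w_true + 0.5

# Assumed stand-ins for trainloader, net, criterion, optimizer, and epochs
trainloader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)
net = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
epochs = 20

epoch_losses = []
for epoch in range(epochs):
    running_loss = 0.0
    for inputs, labels in trainloader:
        optimizer.zero_grad()              # zero the parameter gradients
        outputs = net(inputs)              # forward
        loss = criterion(outputs, labels)
        loss.backward()                    # backward
        optimizer.step()                   # optimize
        running_loss += loss.item()
    # average loss over the batches of this epoch
    epoch_losses.append(running_loss / len(trainloader))
```

Tracking the per-epoch average in epoch_losses is just one convenient way to check that training is making progress.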

Importing Data

See Data Loading Tutorial



Note that torch.meshgrid uses "ij" indexing by default, so its output is transposed compared to np.meshgrid, which defaults to "xy" indexing.
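
Assuming the note refers to torch.meshgrid, the difference can be sketched as follows. The input values are arbitrary, and the indexing keyword requires a reasonably recent PyTorch release:

```python
import numpy as np
import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([10, 20])

# "ij" indexing: outputs have shape (len(x), len(y))
gx_ij, gy_ij = torch.meshgrid(x, y, indexing="ij")

# np.meshgrid defaults to "xy" indexing: outputs have shape (len(y), len(x))
nx, ny = np.meshgrid(x.numpy(), y.numpy())

print(gx_ij.shape)  # torch.Size([3, 2])
print(nx.shape)     # (2, 3)

# Transposing the torch result recovers the NumPy layout
matches_numpy = np.array_equal(gx_ij.numpy().T, nx)
```

Passing indexing="xy" to torch.meshgrid should give the NumPy layout directly.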


PyTorch Documentation


This function allows you to perform interpolation on your input tensor.
It is very useful for resizing images or warping images.
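
The name of the function is not preserved on this page. As a hedged sketch, the example below assumes it is one of the interpolation functions in torch.nn.functional: F.interpolate for resizing, and F.grid_sample (with a grid from F.affine_grid) for warping. The 4x4 image and the identity and flip transforms are purely illustrative.

```python
import torch
import torch.nn.functional as F

# A tiny single-channel "image" with layout (N, C, H, W)
img = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Resizing: upsample to 8x8 with bilinear interpolation
up = F.interpolate(img, size=(8, 8), mode="bilinear", align_corners=False)

# Warping: build a sampling grid in normalized [-1, 1] coordinates.
# An identity affine transform samples every pixel at its own center.
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])  # shape (N, 2, 3)
grid = F.affine_grid(theta, size=img.shape, align_corners=False)  # (N, H, W, 2)
same = F.grid_sample(img, grid, mode="bilinear", align_corners=False)

# Negating the x coordinate of the grid mirrors the image horizontally
flip_grid = grid * torch.tensor([-1.0, 1.0])
flipped = F.grid_sample(img, flip_grid, mode="bilinear", align_corners=False)
```

Using the same align_corners value for affine_grid and grid_sample keeps the two coordinate conventions consistent.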

Building a Model

To build a model, do the following:

  • Create a class extending nn.Module.
  • In your class include all other modules you need during init.
    • If you have a list of modules, make sure to wrap them in nn.ModuleList or nn.Sequential so they are properly recognized.
  • Write a forward pass for your model.
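
The steps above can be sketched as follows. The class name MLP, the layer sizes, and the ReLU activations are assumptions chosen for illustration:

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """A small multilayer perceptron (hypothetical example model)."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_hidden=2):
        super().__init__()
        self.input_layer = nn.Linear(in_dim, hidden_dim)
        # nn.ModuleList registers each layer so its parameters are tracked;
        # a plain Python list would hide them from model.parameters().
        self.hidden_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden)]
        )
        self.output_layer = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        for layer in self.hidden_layers:
            x = torch.relu(layer(x))
        return self.output_layer(x)


model = MLP(in_dim=8, hidden_dim=16, out_dim=3)
out = model(torch.randn(4, 8))  # batch of 4 samples

# 4 linear layers, each with a weight and a bias tensor
num_param_tensors = len(list(model.parameters()))
```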

Multi-GPU Training

See Multi-GPU Examples.


The basic idea is to wrap blocks in nn.DataParallel.
This will automatically duplicate the module across multiple GPUs and split the batch across GPUs during training.

However, doing so causes you to lose direct access to custom methods and attributes. The original model remains reachable as model.module.

To save and load the model, use torch.save(model.module.state_dict(), path) and model.module.load_state_dict(torch.load(path)).
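
A minimal sketch of wrapping a model and saving its weights through the underlying module. The Net class and its describe method are hypothetical, and an in-memory buffer stands in for a checkpoint file:

```python
import io

import torch
import torch.nn as nn


class Net(nn.Module):
    """Hypothetical model with one custom method."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

    def describe(self):
        return "Net with one linear layer"


model = nn.DataParallel(Net())
if torch.cuda.is_available():
    model = model.cuda()

# Custom methods live on the wrapped module, not on the wrapper
description = model.module.describe()

# Save the underlying module's weights so the keys carry no "module." prefix
buffer = io.BytesIO()
torch.save(model.module.state_dict(), buffer)
buffer.seek(0)

# Load the weights into a fresh, unwrapped model
state = torch.load(buffer, map_location="cpu")
restored = Net()
restored.load_state_dict(state)
```

Saving through model.module lets the same checkpoint load into either a wrapped or an unwrapped model.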


See DistributedDataParallel vs DataParallel and the DDP tutorial.

The PyTorch documentation suggests using nn.parallel.DistributedDataParallel instead of nn.DataParallel. The main difference is that it uses multiple processes instead of multithreading, which works around the Python Global Interpreter Lock (GIL).
It also supports training on GPUs across multiple nodes (machines).

Using this is quite a bit more work than nn.DataParallel.
You may want to consider using PyTorch Lightning which abstracts this away.


Reducing GPU memory usage

  • Save the loss using .item(), which returns a standard Python number.
  • For non-scalar items, use my_var.detach().cpu().numpy():
    • detach() detaches the tensor from the autograd graph.
    • cpu() moves the tensor to the CPU.
    • numpy() returns a NumPy view of the tensor.
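
The points above can be sketched as follows. The model, data, and loss function are placeholders, and the code falls back to the CPU when no GPU is available:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and data
net = nn.Linear(4, 1).to(device)
inputs = torch.randn(8, 4, device=device)
targets = torch.randn(8, 1, device=device)

outputs = net(inputs)
loss = nn.functional.mse_loss(outputs, targets)

# A plain Python float: holds no reference to the autograd graph
loss_value = loss.item()

# Non-scalar results: detach from the graph, move to the CPU, convert to NumPy
preds = outputs.detach().cpu().numpy()
```

Keeping loss itself in a list across iterations would keep each iteration's graph alive, which is the situation .item() avoids.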

When possible, use functions that return views of existing tensors (for example permute, expand, or view) rather than functions that copy the data (for example repeat).
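
As a small sketch of the difference, the following compares two operations that return views (permute and expand) with one that copies (repeat). The tensor values are arbitrary:

```python
import torch

x = torch.arange(6).reshape(2, 3)

p = x.permute(1, 0)       # view: same storage, different strides
e = x[:1].expand(4, 3)    # view: the broadcast dimension has stride 0
r = x[:1].repeat(4, 1)    # copy: allocates new memory

# Views start at the same memory address as x; the copy does not
shares_permute = p.data_ptr() == x.data_ptr()
shares_expand = e.data_ptr() == x.data_ptr()
shares_repeat = r.data_ptr() == x.data_ptr()

print(e.stride())  # (0, 1)
```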

Note that permute only changes the strides; it does not move the underlying data.
Mixing tensors with different memory formats, for example a default contiguous tensor with a channels-last tensor, can cause a minor performance hit, which PyTorch may warn you about.
To address this, call contiguous on the tensor with the new memory format, e.g. x.contiguous(memory_format=torch.channels_last).
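
A short sketch of how strides change, assuming a 4D tensor in the default NCHW layout. The shape is arbitrary:

```python
import torch

x = torch.randn(2, 3, 4, 5)   # N, C, H, W with default strides
print(x.stride())             # (60, 20, 5, 1)

# permute returns a view: the shape changes but no data moves
nhwc_view = x.permute(0, 2, 3, 1)
print(nhwc_view.stride())     # (60, 5, 1, 20)

# contiguous with a memory format physically reorders the data.
# The logical shape stays (N, C, H, W); only the strides differ.
cl = x.contiguous(memory_format=torch.channels_last)
print(cl.stride())            # (60, 1, 15, 3)

is_channels_last = cl.is_contiguous(memory_format=torch.channels_last)
```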


Float16 uses half the memory of float32.
Recent NVIDIA GPUs also have dedicated hardware called tensor cores to speed up float16 matrix multiplication.
Typically it's best to train using float32, though, for stability purposes.
You can then truncate trained models to float16 and run inference in float16.
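
As a rough sketch, a float32 model can be truncated with .half(). The layer size is arbitrary, and the forward pass is only run when a GPU is available, since float16 support on the CPU varies between PyTorch versions:

```python
import copy

import torch
import torch.nn as nn


def param_bytes(module):
    """Total bytes used by a module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


model = nn.Linear(256, 256)              # stands in for a trained float32 model
model_fp16 = copy.deepcopy(model).half()  # truncate the weights to float16

bytes_fp32 = param_bytes(model)
bytes_fp16 = param_bytes(model_fp16)

if torch.cuda.is_available():
    model_fp16 = model_fp16.cuda()
    with torch.no_grad():
        # inputs must also be float16
        y = model_fp16(torch.randn(1, 256, device="cuda").half())
```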

Note that bfloat16 is different from IEEE float16. bfloat16 has fewer mantissa bits (8 exp, 7 mantissa) and is used by Google's TPUs. In contrast, float16 has 5 exp and 10 mantissa bits.
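
The trade-off between the two formats can be inspected with torch.finfo, as in this small sketch:

```python
import torch

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)

# eps is the gap between 1.0 and the next representable value,
# so it reflects the number of mantissa bits.
print(fp16.eps)  # 2**-10
print(bf16.eps)  # 2**-7

# max reflects the number of exponent bits.
print(fp16.max)  # 65504.0
print(bf16.max)  # roughly 3.39e38, the same range as float32
```

In short, bfloat16 trades precision for range, while float16 does the opposite.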


In classification, your model outputs a vector of logits.
These are relative scores for each potential output class.
To compute the loss, pass the logits into a cross-entropy loss.

To compute the accuracy, you can use torch.argmax to get the top prediction or torch.topk to get the top-k prediction.
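
A minimal sketch of both steps, using hand-written logits for 4 samples and 3 classes. The values are made up for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.3],
    [0.2, 0.4, 3.0],
    [1.5, 1.6, 0.1],
])
labels = torch.tensor([0, 1, 2, 0])

# Cross-entropy takes raw logits; it applies log-softmax internally
loss = F.cross_entropy(logits, labels)

# Top-1 accuracy: the highest-scoring class must match the label
top1_pred = torch.argmax(logits, dim=1)
top1_acc = (top1_pred == labels).float().mean().item()

# Top-k accuracy: the label must be among the k highest-scoring classes
top2_pred = torch.topk(logits, k=2, dim=1).indices           # shape (4, 2)
top2_acc = (top2_pred == labels.unsqueeze(1)).any(dim=1).float().mean().item()

print(top1_acc)  # 0.75
print(top2_acc)  # 1.0
```

The last sample is misclassified at top-1 (class 1 narrowly beats class 0) but counted as correct at top-2.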


If you get a CUDA kernel error, the reported line in the stack trace may be wrong because CUDA kernels are launched asynchronously. Rerun with the environment variable CUDA_LAUNCH_BLOCKING=1 to get the correct line in the stack trace.
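
The variable is normally set in the shell when launching the script. As an alternative sketch, it can also be set from Python, assuming this happens before CUDA is initialized (here, before torch is imported):

```python
import os

# Must be set before CUDA is initialized, so do it at the very top of the script
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the variable on purpose)

blocking_enabled = os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"
```

Synchronous launches slow execution down, so this is best left off outside of debugging.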



See PyTorch Docs: Tensorboard

from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir="./runs")

# Calculate loss. Increment the step.
# (loss and step are assumed to come from your training loop.)

writer.add_scalar("train_loss", loss.item(), step)

# Optionally flush e.g. at checkpoints
writer.flush()

# Close the writer (will flush)
writer.close()


PyTorch3D is a library by Facebook AI Research which contains differentiable renderers for meshes and point clouds.
It is built using custom CUDA kernels and is primarily supported on Linux.