PyTorch: Difference between revisions

(30 intermediate revisions by the same user not shown)

Line 2:

==Installation==

See [https://pytorch.org/get-started/locally/ PyTorch Getting Started]

See [https://pytorch.org/get-started/locally/ PyTorch Getting Started] and [https://pytorch.org/get-started/previous-versions/ PyTorch Previous Versions]

~~<syntaxhighlight lang="bash">~~

~~# If~~ using conda~~, python 3~~.~~5+, and CUDA 10.0 (+ compatible cudnn)~~

I recommend using the conda installation method since it installs a compatible version of the CUDA toolkit alongside PyTorch.

~~conda install pytorch torchvision cudatoolkit=10.0 -c pytorch~~

~~</syntaxhighlight>~~

==Getting Started==

* [https://pytorch.org/tutorials/ PyTorch Tutorials]

{{hidden | Example |

<syntaxhighlight lang="python">

import torch

import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 5),nn.ReLU(),nn.Linear(5, 1))

criterion = nn.MSELoss()

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training

for epoch in range(epochs):

~~running_loss = 0.0~~

for i, data in enumerate(trainloader):

for i, data in enumerate(trainloader~~, 0~~):

# get the inputs; e.g. data is a list of [inputs, labels]

# get the inputs; data is a list of [inputs, labels]

inputs, labels = data

Line 26:

Line 27:

optimizer.zero_grad()

# forward ~~+ backward + optimize~~

# forward

outputs = ~~net~~(inputs)

outputs = model(inputs)

loss = criterion(outputs, labels)

# backward

loss.backward()

optimizer.step()

</syntaxhighlight>

}}

==Importing Data==

Line 38:

Line 42:

==Usage==

===torch.nn.~~functional~~===

Note that there are several useful functions under <code>torch.nn.functional</code> which is typically imported as <code>F</code>.

[https://pytorch.org/docs/stable/nn.~~functional~~.html ~~PyTorch Documentation~~]

Most neural network layers are implemented as functions in <code>torch.nn.functional</code>.

====F.grid_sample====

===torch.meshgrid===

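Note that the output is transposed compared to <code>np.meshgrid</code>: <code>torch.meshgrid</code> defaults to <code>indexing='ij'</code>, while <code>np.meshgrid</code> defaults to <code>indexing='xy'</code>.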
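The difference in indexing conventions can be checked with a small sketch. It assumes a PyTorch version where <code>torch.meshgrid</code> accepts the <code>indexing</code> argument; the tensors <code>x</code> and <code>y</code> are made up for illustration.

```python
import numpy as np
import torch

x = torch.arange(3)  # 3 values along the first axis
y = torch.arange(4)  # 4 values along the second axis

# "ij" indexing: each output has shape (len(x), len(y)).
gx_ij, gy_ij = torch.meshgrid(x, y, indexing="ij")

# NumPy defaults to "xy" indexing: each output has shape (len(y), len(x)).
gx_np, gy_np = np.meshgrid(x.numpy(), y.numpy())

print(gx_ij.shape)  # torch.Size([3, 4])
print(gx_np.shape)  # (4, 3)

# Transposing the torch result recovers the NumPy layout.
same = np.array_equal(gx_ij.numpy().T, gx_np)
print(same)  # True
```

Passing <code>indexing="xy"</code> to <code>torch.meshgrid</code> should give the NumPy layout directly on versions that support the argument.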

===torch.multinomial===

[https://pytorch.org/docs/stable/generated/torch.multinomial.html torch.multinomial]<br>

If you need to sample with replacement from a large number of categories, it may be faster to build a CDF with <code>torch.cumsum</code> and sample from it with <code>torch.searchsorted</code>.

{{hidden | torch.searchsorted example |

<syntaxhighlight lang="python">

# Create your weights variable.

weights_cdf = torch.cumsum(weights, dim=0)

# The last CDF entry is the total weight.
weights_cdf_max = weights_cdf[-1]

sample = torch.searchsorted(weights_cdf,

weights_cdf_max * torch.rand(num_samples))

</syntaxhighlight>

}}

===F.grid_sample===

[https://pytorch.org/docs/stable/nn.functional.html#grid-sample Doc]<br>

This function allows you to perform interpolation on your input tensor.<br>

It is very useful for resizing images or warping images.
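As a rough illustration of resizing, the sketch below builds an identity sampling grid with <code>F.affine_grid</code> and passes it to <code>F.grid_sample</code>. The 4x4 input, the 8x8 output size, and the choice of <code>align_corners=False</code> are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

# A single-channel 4x4 "image" in (N, C, H, W) layout.
image = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Identity affine transform: one 2x3 matrix per batch element.
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])

# Sampling grid for an 8x8 output; coordinates are normalized to [-1, 1].
grid = F.affine_grid(theta, torch.Size([1, 1, 8, 8]), align_corners=False)

# Bilinear interpolation of the input at the grid locations.
resized = F.grid_sample(image, grid, mode="bilinear",
                        padding_mode="border", align_corners=False)
print(resized.shape)  # torch.Size([1, 1, 8, 8])
```

Replacing <code>theta</code> with a rotation or scaling matrix would warp the image instead of only resizing it.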

==~~Memory Usage~~==

==Building a Model==

Reducing memory usage

To build a model, do the following:

* Create a class extending <code>nn.Module</code>.

* In your class include all other modules you need during init.

** If you have a list of modules, make sure to wrap them in <code>nn.ModuleList</code> or <code>nn.Sequential</code> so they are properly recognized.

* Wrap any parameters for your model in <code>nn.Parameter(weight, requires_grad=True)</code>.

* Write a forward pass for your model.
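The steps above can be sketched as follows. The class name <code>SmallNet</code>, the layer sizes, and the learnable <code>scale</code> parameter are hypothetical and only serve to show where each piece goes.

```python
import torch
import torch.nn as nn


class SmallNet(nn.Module):  # hypothetical example model
    def __init__(self, in_features=5, hidden=8, num_layers=2):
        super().__init__()
        self.input = nn.Linear(in_features, hidden)
        # nn.ModuleList registers each layer so its parameters are tracked.
        self.layers = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_layers)]
        )
        # A standalone learnable tensor, registered through nn.Parameter.
        self.scale = nn.Parameter(torch.ones(1), requires_grad=True)
        self.output = nn.Linear(hidden, 1)

    def forward(self, x):
        x = torch.relu(self.input(x))
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.output(x) * self.scale


model = SmallNet()
out = model(torch.randn(4, 5))
num_params = sum(p.numel() for p in model.parameters())
print(out.shape)   # torch.Size([4, 1])
print(num_params)  # 202
```

If the layers were kept in a plain Python list instead of <code>nn.ModuleList</code>, their weights would not show up in <code>model.parameters()</code> and the optimizer would not update them.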

==Multi-GPU Training==

See [https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html Multi-GPU Examples].

===nn.DataParallel===

The basic idea is to wrap blocks in [https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel <code>nn.DataParallel</code>].

This will automatically duplicate the module across multiple GPUs and split the batch across GPUs during training.

However, doing so causes you to lose direct access to the custom methods and attributes of the wrapped module; reach them through <code>model.module</code> instead.

To save and load the model, use <code>model.module.state_dict()</code> and <code>model.module.load_state_dict()</code>.
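A minimal sketch of saving and restoring weights through the wrapped module's <code>state_dict</code>. The small model and the in-memory buffer (standing in for a checkpoint file) are assumptions for illustration.

```python
import io

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 5), nn.ReLU(), nn.Linear(5, 1))
parallel_model = nn.DataParallel(model)

# Save the wrapped module's weights so the keys carry no "module." prefix.
buffer = io.BytesIO()  # stands in for a checkpoint file path
torch.save(parallel_model.module.state_dict(), buffer)

# Load the weights back into the wrapped module.
buffer.seek(0)
state = torch.load(buffer)
parallel_model.module.load_state_dict(state)

print(list(state.keys()))  # ['0.weight', '0.bias', '2.weight', '2.bias']
```

Saving <code>parallel_model.state_dict()</code> directly would prefix every key with <code>module.</code>, which makes the checkpoint awkward to load into an unwrapped model.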

===nn.parallel.DistributedDataParallel===

[https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel nn.parallel.DistributedDataParallel]

[https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead DistributedDataParallel vs DataParallel]

[https://pytorch.org/tutorials/intermediate/ddp_tutorial.html ddp tutorial]

The PyTorch documentation suggests using this instead of <code>nn.DataParallel</code>.

The main difference is that this uses multiple processes instead of multithreading to work around the Python interpreter's global interpreter lock (GIL).

It also supports training on GPUs across multiple ''nodes'', or computers.

Using this is quite a bit more work than nn.DataParallel.

You may want to consider using PyTorch Lightning which abstracts this away.

==Optimizations==

===Reducing GPU memory usage===

* Save loss using [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.item <code>.item()</code>] which returns a standard Python number

* For non-scalar items, use <code>my_var.detach().cpu().numpy()</code>

* <code>detach()</code> ~~deletes~~ the item from the autograd edge

* [https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach <code>detach()</code>] detaches the tensor from the autograd graph.

* <code>cpu()</code> ~~copies~~ the tensor to the CPU

* [https://pytorch.org/docs/stable/tensors.html?highlight=cpu#torch.Tensor.cpu <code>cpu()</code>] moves the tensor to the CPU.

* <code>numpy()</code> returns a numpy view of the tensor

* [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.numpy <code>numpy()</code>] returns a numpy view of the tensor.
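The bullets above can be illustrated with a short sketch. The model, inputs, and loss are placeholders; the point is that neither stored value keeps the autograd graph alive.

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 1)
inputs = torch.randn(4, 5)
outputs = model(inputs)
loss = outputs.pow(2).mean()

# Scalar: .item() returns a plain Python float with no graph attached.
loss_value = loss.item()

# Non-scalar: detach from the graph, move to CPU, then view as NumPy.
outputs_np = outputs.detach().cpu().numpy()

print(type(loss_value))  # <class 'float'>
print(outputs_np.shape)  # (4, 1)
```

Accumulating <code>loss</code> itself in a running total, rather than <code>loss.item()</code>, would keep every iteration's graph referenced and memory usage would grow over the epoch.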

When possible, use functions which return new views of existing tensors rather than making duplicates of tensors:

* [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute <code>permute</code>]

* [https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html#torch.Tensor.expand <code>expand</code>] instead of [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.repeat <code>repeat</code>]

* [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view <code>view</code>]

Note that <code>permute</code> only changes the strides, not the underlying memory layout, so the result is usually non-contiguous.

This can result in a minor performance hit, which PyTorch will warn you about if you repeatedly mix a contiguous tensor with a channels-last tensor.

To address this, call [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.contiguous <code>contiguous</code>] on the tensor with the new memory format.
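A small sketch of the view-versus-copy distinction, using <code>data_ptr()</code> to check whether two tensors share storage. The tensor shapes are arbitrary.

```python
import torch

x = torch.randn(3, 1)
expanded = x.expand(3, 4)  # view: no new memory is allocated
repeated = x.repeat(1, 4)  # copy: allocates a full 3x4 tensor

shares_expand = expanded.data_ptr() == x.data_ptr()
shares_repeat = repeated.data_ptr() == x.data_ptr()
print(shares_expand)  # True
print(shares_repeat)  # False

y = torch.randn(2, 3, 4)
permuted = y.permute(2, 0, 1)  # view with reordered strides
print(permuted.is_contiguous())  # False

# contiguous() copies the data into the new layout.
compact = permuted.contiguous()
print(compact.is_contiguous())  # True
```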

;During inference

* Use <code>model.eval()</code>

* Use <code>with torch.no_grad():</code>
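Both inference settings can be combined as in the sketch below. The model is a placeholder with a dropout layer so that <code>model.eval()</code> has a visible effect.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 5), nn.Dropout(p=0.5), nn.Linear(5, 1))

model.eval()  # dropout and batch norm layers switch to inference behavior
with torch.no_grad():  # no autograd graph is recorded inside this block
    predictions = model(torch.randn(4, 5))

print(predictions.requires_grad)  # False
```

The two settings are independent: <code>model.eval()</code> changes layer behavior, while <code>torch.no_grad()</code> is what avoids storing activations for the backward pass.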

===Float16===

Float16 uses half the memory of float32.

Recent NVIDIA GPUs also have dedicated hardware units called tensor cores which speed up float16 matrix multiplication.

For stability, though, it is typically best to train using float32.

You can then truncate the trained model to float16 and run inference in float16.

Note that [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format <code>bfloat16</code>] is different from IEEE float16. bfloat16 has fewer mantissa bits (8 exp, 7 mantissa) and is used by Google's TPUs. In contrast, float16 has 5 exp and 10 mantissa bits.
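The memory and precision claims above can be checked with <code>element_size()</code> and <code>torch.finfo</code>. This sketch assumes the default dtype is float32; the linear layer stands in for a trained model.

```python
import torch
import torch.nn as nn

x32 = torch.randn(1024, 1024)  # float32 by default
x16 = x32.half()               # same values truncated to float16

bytes32 = x32.element_size() * x32.nelement()
bytes16 = x16.element_size() * x16.nelement()
print(bytes32, bytes16)  # 4194304 2097152

# Truncate a (placeholder) trained model to float16 for inference.
model = nn.Linear(5, 1).half()
print(model.weight.dtype)  # torch.float16

# Machine epsilon reflects the mantissa width.
eps16 = torch.finfo(torch.float16).eps      # 2**-10
eps_bf16 = torch.finfo(torch.bfloat16).eps  # 2**-7

# The wider exponent gives bfloat16 a far larger range.
max16 = torch.finfo(torch.float16).max
max_bf16 = torch.finfo(torch.bfloat16).max
print(max16)  # 65504.0
```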

==Classification==

In classification, your model outputs a vector of ''logits''.

These are relative scores for each potential output class.

To compute the loss, pass the logits into a [https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html cross-entropy loss].

To compute the accuracy, you can use [https://pytorch.org/docs/stable/generated/torch.argmax.html <code>torch.argmax</code>] to get the top prediction or [https://pytorch.org/docs/stable/generated/torch.topk.html <code>torch.topk</code>] to get the top-k predictions.
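A worked sketch of the loss and accuracy computations on hand-written logits. The values and labels are made up so that the results can be checked by eye.

```python
import torch
import torch.nn as nn

# Logits for 4 samples over 3 classes, plus the true class indices.
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0],
                       [1.0, 2.0, 0.0],
                       [0.1, 0.2, 0.3]])
labels = torch.tensor([0, 2, 0, 1])

# CrossEntropyLoss takes raw logits; it applies log-softmax internally.
loss = nn.CrossEntropyLoss()(logits, labels)

# Top-1 accuracy: the highest-scoring class must match the label.
top1 = torch.argmax(logits, dim=1)
top1_acc = (top1 == labels).float().mean().item()
print(top1_acc)  # 0.5

# Top-2 accuracy: the label must appear among the 2 highest scores.
top2 = torch.topk(logits, k=2, dim=1).indices
top2_acc = (top2 == labels.unsqueeze(1)).any(dim=1).float().mean().item()
print(top2_acc)  # 1.0
```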

==Debugging==

If you get a CUDA kernel error, you can rerun with the environment variable <code>CUDA_LAUNCH_BLOCKING=1</code> to get the correct line in the stack trace.

<pre>

CUDA_LAUNCH_BLOCKING=1 python app.py

</pre>

For the following error:

<pre>

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx(...)`

</pre>

First check all your tensor types and shapes.<br>

If you've checked all your tensor shapes and types, you can try running with the environment variable:

<pre>

CUBLAS_WORKSPACE_CONFIG=:0:0

</pre>

References:

* [https://github.com/pytorch/pytorch/issues/54975 https://github.com/pytorch/pytorch/issues/54975]

==TensorBoard==

Line 73:

Line 174:

</syntaxhighlight>

==PyTorch3D==

==Libraries==

A list of useful libraries

===torchvision===

https://pytorch.org/vision/stable/index.html

Official tools for image manipulation, such as blurring and bounding box utilities.

===torchmetrics===

https://torchmetrics.readthedocs.io/en/stable/

Various metrics such as PSNR, SSIM, and LPIPS.

===PyTorch3D===

[https://github.com/facebookresearch/pytorch3d PyTorch3D] ~~is a~~ library ~~by Facebook AI Research which contains~~ differentiable renderers for meshes and point clouds.

[https://github.com/facebookresearch/pytorch3d PyTorch3D]

~~It is build using custom CUDA kernels~~.

Facebook library with differentiable renderers for meshes and point clouds.

@@ Line 2: / Line 2: @@
 ==Installation==
-See [https://pytorch.org/get-started/locally/ PyTorch Getting Started]
+See [https://pytorch.org/get-started/locally/ PyTorch Getting Started] and [https://pytorch.org/get-started/previous-versions/ PyTorch Previous Versions]
-<syntaxhighlight lang="bash">
-# If using conda, python 3.5+, and CUDA 10.0 (+ compatible cudnn)
+I recommend using the conda installation method since it installs a compatible version of the CUDA toolkit alongside PyTorch.
-conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
-</syntaxhighlight>
 ==Getting Started==
 * [https://pytorch.org/tutorials/ PyTorch Tutorials]
+{{hidden | Example |
 <syntaxhighlight lang="python">
 import torch
 import torch.nn as nn
+model = nn.Sequential(nn.Linear(5, 5),nn.ReLU(),nn.Linear(5, 1))
+criterion = nn.MSELoss()
+optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
 # Training
 for epoch in range(epochs):
-  running_loss = 0.0
+     for i, data in enumerate(trainloader):
-     for i, data in enumerate(trainloader, 0):
+         # get the inputs; e.g. data is a list of [inputs, labels]
-         # get the inputs; data is a list of [inputs, labels]
          inputs, labels = data
@@ Line 26: / Line 27: @@
          optimizer.zero_grad()
-         # forward + backward + optimize
+         # forward
-         outputs = net(inputs)
+         outputs = model(inputs)
          loss = criterion(outputs, labels)
+        # backward
          loss.backward()
          optimizer.step()
 </syntaxhighlight>
+}}
 ==Importing Data==
@@ Line 38: / Line 42: @@
 ==Usage==
-===torch.nn.functional===
+Note that there are several useful functions under <code>torch.nn.functional</code> which is typically imported as <code>F</code>.
-[https://pytorch.org/docs/stable/nn.functional.html PyTorch Documentation]
+Most neural network layers are implemented as functions in <code>torch.nn.functional</code>.
-====F.grid_sample====
+===torch.meshgrid===
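+Note that the output is transposed compared to <code>np.meshgrid</code>: <code>torch.meshgrid</code> defaults to <code>indexing='ij'</code>, while <code>np.meshgrid</code> defaults to <code>indexing='xy'</code>.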
+===torch.multinomial===
+[https://pytorch.org/docs/stable/generated/torch.multinomial.html torch.multinomial]<br>
+If you need to sample with replacement from a large number of categories, it may be faster to build a CDF with <code>torch.cumsum</code> and sample from it with <code>torch.searchsorted</code>.
+{{hidden | torch.searchsorted example |
+<syntaxhighlight lang="python">
+# Create your weights variable.
+weights_cdf = torch.cumsum(weights, dim=0)
+# The last CDF entry is the total weight.
+weights_cdf_max = weights_cdf[-1]
+sample = torch.searchsorted(weights_cdf,
+                            weights_cdf_max * torch.rand(num_samples))
+</syntaxhighlight>
+}}
+===F.grid_sample===
 [https://pytorch.org/docs/stable/nn.functional.html#grid-sample Doc]<br>
 This function allows you to perform interpolation on your input tensor.<br>
 It is very useful for resizing images or warping images.
-==Memory Usage==
+==Building a Model==
-Reducing memory usage
+To build a model, do the following:
+* Create a class extending <code>nn.Module</code>.
+* In your class include all other modules you need during init.
+** If you have a list of modules, make sure to wrap them in <code>nn.ModuleList</code> or <code>nn.Sequential</code> so they are properly recognized.
+* Wrap any parameters for your model in <code>nn.Parameter(weight, requires_grad=True)</code>.
+* Write a forward pass for your model.
+==Multi-GPU Training==
+See [https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html Multi-GPU Examples].
+===nn.DataParallel===
+The basic idea is to wrap blocks in [https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel <code>nn.DataParallel</code>].
+This will automatically duplicate the module across multiple GPUs and split the batch across GPUs during training.
+However, doing so causes you to lose direct access to the custom methods and attributes of the wrapped module; reach them through <code>model.module</code> instead.
+To save and load the model, use <code>model.module.state_dict()</code> and <code>model.module.load_state_dict()</code>.
+===nn.parallel.DistributedDataParallel===
+[https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel nn.parallel.DistributedDataParallel]
+[https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead DistributedDataParallel vs DataParallel]
+[https://pytorch.org/tutorials/intermediate/ddp_tutorial.html ddp tutorial]
+The PyTorch documentation suggests using this instead of <code>nn.DataParallel</code>.
+The main difference is that this uses multiple processes instead of multithreading to work around the Python interpreter's global interpreter lock (GIL).
+It also supports training on GPUs across multiple ''nodes'', or computers.
+Using this is quite a bit more work than nn.DataParallel.
+You may want to consider using PyTorch Lightning which abstracts this away.
+==Optimizations==
+===Reducing GPU memory usage===
 * Save loss using [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.item <code>.item()</code>] which returns a standard Python number
 * For non-scalar items, use <code>my_var.detach().cpu().numpy()</code>
-* <code>detach()</code> deletes the item from the autograd edge
+* [https://pytorch.org/docs/stable/autograd.html#torch.Tensor.detach <code>detach()</code>] detaches the tensor from the autograd graph.
-* <code>cpu()</code> copies the tensor to the CPU
+* [https://pytorch.org/docs/stable/tensors.html?highlight=cpu#torch.Tensor.cpu <code>cpu()</code>] moves the tensor to the CPU.
-* <code>numpy()</code> returns a numpy view of the tensor
+* [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.numpy <code>numpy()</code>] returns a numpy view of the tensor.
+When possible, use functions which return new views of existing tensors rather than making duplicates of tensors:
+* [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute <code>permute</code>]
+* [https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html#torch.Tensor.expand <code>expand</code>] instead of [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.repeat <code>repeat</code>]
+* [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view <code>view</code>]
+Note that <code>permute</code> only changes the strides, not the underlying memory layout, so the result is usually non-contiguous.
+This can result in a minor performance hit, which PyTorch will warn you about if you repeatedly mix a contiguous tensor with a channels-last tensor.
+To address this, call [https://pytorch.org/docs/stable/tensors.html#torch.Tensor.contiguous <code>contiguous</code>] on the tensor with the new memory format.
+;During inference
+* Use <code>model.eval()</code>
+* Use <code>with torch.no_grad():</code>
+===Float16===
+Float16 uses half the memory of float32.
+Recent NVIDIA GPUs also have dedicated hardware units called tensor cores which speed up float16 matrix multiplication.
+For stability, though, it is typically best to train using float32.
+You can then truncate the trained model to float16 and run inference in float16.
+Note that [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format <code>bfloat16</code>] is different from IEEE float16. bfloat16 has fewer mantissa bits (8 exp, 7 mantissa) and is used by Google's TPUs. In contrast, float16 has 5 exp and 10 mantissa bits.
+==Classification==
+In classification, your model outputs a vector of ''logits''.
+These are relative scores for each potential output class.
+To compute the loss, pass the logits into a [https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html cross-entropy loss].
+To compute the accuracy, you can use [https://pytorch.org/docs/stable/generated/torch.argmax.html <code>torch.argmax</code>] to get the top prediction or [https://pytorch.org/docs/stable/generated/torch.topk.html <code>torch.topk</code>] to get the top-k predictions.
+==Debugging==
+{{see also | Debugging ML Models}}
+If you get a CUDA kernel error, you can rerun with the environment variable <code>CUDA_LAUNCH_BLOCKING=1</code> to get the correct line in the stack trace.
+<pre>
+CUDA_LAUNCH_BLOCKING=1 python app.py
+</pre>
+For the following error:
+<pre>
+CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx(...)`
+</pre>
+First check all your tensor types and shapes.<br>
+If you've checked all your tensor shapes and types, you can try running with the environment variable:
+<pre>
+CUBLAS_WORKSPACE_CONFIG=:0:0
+</pre>
+References:
+* [https://github.com/pytorch/pytorch/issues/54975 https://github.com/pytorch/pytorch/issues/54975]
 ==TensorBoard==
@@ Line 73: / Line 174: @@
 </syntaxhighlight>
-==PyTorch3D==
+==Libraries==
+A list of useful libraries
+===torchvision===
+https://pytorch.org/vision/stable/index.html
+Official tools for image manipulation, such as blurring and bounding box utilities.
+===torchmetrics===
+https://torchmetrics.readthedocs.io/en/stable/
+Various metrics such as PSNR, SSIM, and LPIPS.
+===PyTorch3D===
 {{main | PyTorch3D}}
-[https://github.com/facebookresearch/pytorch3d PyTorch3D] is a library by Facebook AI Research which contains differentiable renderers for meshes and point clouds.
+[https://github.com/facebookresearch/pytorch3d PyTorch3D]
-It is build using custom CUDA kernels.
+Facebook library with differentiable renderers for meshes and point clouds.