Debugging ML Models: Difference between revisions
| (8 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
Notes on debugging ML models, primarily CNNs.
Most of this is advice I've found online or gotten through mentors or experience.
==Debugging==
| Line 12: | Line 12: | ||
* Make sure there is no activation on the final layer.
* If the loss is unstable or increasing, drop the learning rate to <code>O(1e-3)</code> or <code>O(1e-4)</code>.
* Try taking the loss closer to the output of the network.
** If you apply some transformation \(f\) after the output, do \(loss = loss\_fn(f^{-1}(gt), output)\) instead of \(loss = loss\_fn(gt, f(output))\).
** This shortens the paths the gradients need to flow through.
** Note that this may change the per-pixel weights of the loss function.
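To make the last two points concrete, here is a minimal sketch in plain Python. It assumes a hypothetical setup that is not from the notes above: the network outputs a log-value, \(f = \exp\) is applied afterwards, and the loss is a squared error. The gradients are written out by hand so the per-pixel weighting is visible.

```python
import math

# Hypothetical setup (an assumption for illustration): f = exp, f^{-1} = log.
f = math.exp
f_inv = math.log

def grad_loss_after_f(output, gt):
    """d/d(output) of (f(output) - gt)^2, i.e. loss_fn(gt, f(output))."""
    return 2.0 * (f(output) - gt) * f(output)

def grad_loss_before_f(output, gt):
    """d/d(output) of (output - f_inv(gt))^2, i.e. loss_fn(f_inv(gt), output)."""
    return 2.0 * (output - f_inv(gt))

# Same error of 0.1 in the network's output space, two different ground truths.
for gt in (1.0, 100.0):
    output = f_inv(gt) + 0.1
    print(gt, grad_loss_after_f(output, gt), grad_loss_before_f(output, gt))
```

With the loss taken after \(f\), the gradient grows roughly with the square of the ground-truth value, so large targets dominate. With the loss taken directly on the output, both pixels get the same gradient. Which weighting is preferable depends on the task.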
| Line 24: | Line 26: | ||
==Overfitting==
Overfitting occurs when your model begins learning attributes specific to your training data, causing your validation loss to increase.
Historically this was a big concern for ML models, and people relied heavily on regularization to address it.
Recently though, overfitting has become less of a concern with larger ML models.
| Line 51: | Line 53: | ||
assert all_finite(my_tensor), "my_tensor has NaNs or Infs"
# Or
tf.debugging.assert_all_finite(my_tensor, "my_tensor has NaNs or Infs")
</syntaxhighlight>
Typically, you get Infs and NaNs when there is a division by ~0 in the forward or backward pass.
However, it is also possible that the learning rate is too high or your model is broken.
I typically debug by:
| Line 60: | Line 65: | ||
* Checking that the training data has no NaNs or Infs.
* Checking that there are no divides anywhere in the code or that all divides are safe.
** See [https://www.tensorflow.org/api_docs/python/tf/math/divide_no_nan <code>tf.math.divide_no_nan</code>].
* Checking the gradients of trig functions in the code.
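As an illustration of the trig-function check, here is a small sketch in plain Python using <code>acos</code>. The choice of <code>acos</code> and the helper <code>clamped_acos</code> are my own assumptions for the example, not something the notes prescribe. The derivative of <code>acos</code> is unbounded as the input approaches ±1, which is one way a hidden division by ~0 can enter the backward pass.

```python
import math

def acos_grad(x):
    """Analytic derivative of acos: -1 / sqrt(1 - x^2). Unbounded as |x| -> 1."""
    return -1.0 / math.sqrt(1.0 - x * x)

def clamped_acos(x, eps=1e-6):
    """Hypothetical helper: keep the input strictly inside (-1, 1) before acos."""
    x = max(-1.0 + eps, min(1.0 - eps, x))
    return math.acos(x)

for x in (0.0, 0.9, 0.999999, 1.0):
    try:
        print(x, acos_grad(x))
    except ZeroDivisionError:
        # Plain Python raises here; a tensor library would typically return Inf.
        print(x, "division by zero in the gradient")
```

Clamping the input keeps the forward value finite, but like any clamp it zeroes the gradient wherever it is active.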
| Line 80: | Line 86: | ||
** For TensorFlow see [https://www.tensorflow.org/api_docs/python/tf/clip_by_norm tf.clip_by_norm] and [https://www.tensorflow.org/api_docs/python/tf/clip_by_value tf.clip_by_value].
* Using a safe divide which forces the denominator to have values with abs > EPS.
** Note that this can cut off gradients.
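A scalar sketch of such a safe divide, in plain Python, is below. The names <code>safe_divide</code> and <code>safe_divide_grad_den</code> and the value of <code>EPS</code> are assumptions for illustration. A tensor version would apply the same logic elementwise.

```python
import math

EPS = 1e-8  # assumed threshold; tune for your data and dtype

def safe_divide(num, den, eps=EPS):
    """Divide after pushing |den| up to at least eps, keeping its sign."""
    if abs(den) < eps:
        den = math.copysign(eps, den)
    return num / den

def safe_divide_grad_den(num, den, eps=EPS):
    """Gradient of safe_divide w.r.t. den, written out by hand.

    Where the clamp is active the output no longer depends on den,
    so the gradient is zero. This is the gradient cutoff noted above.
    """
    if abs(den) < eps:
        return 0.0
    return -num / (den * den)

print(safe_divide(1.0, 0.0), safe_divide_grad_den(1.0, 0.0))
print(safe_divide(1.0, 2.0), safe_divide_grad_den(1.0, 2.0))
```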
==Soft Operations==
The idea of soft operations is to make sure that gradients flow through the entire network rather than one specific path.
One example of this is softmax, which allows you to apply gradients using a one-hot encoding.
* Rather than regressing a real value <math>x</math> directly, output a probability distribution.
** Output scores for <math>P(x=j)</math> for some fixed set of <math>j</math>, do softmax, and take the expected value.
** Or output <math>\mu, \sigma</math> and normalize the loss based on <math>\sigma</math>.
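Both variants can be sketched in plain Python. The function names and the bin values are hypothetical. For the second variant I assume a Gaussian negative log-likelihood (up to a constant) as one common way to normalize the loss by <math>\sigma</math>; the notes do not specify the exact form.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(scores, bins):
    """Soft regression: E[x] = sum_j P(x=j) * j over a fixed set of bins."""
    probs = softmax(scores)
    return sum(p * j for p, j in zip(probs, bins))

def gaussian_nll(x, mu, sigma):
    """Assumed form: squared error scaled by sigma, plus log(sigma) so that
    the model cannot shrink the loss just by predicting a huge sigma."""
    return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma)

bins = [0.0, 1.0, 2.0, 3.0]  # hypothetical fixed set of values j
print(expected_value([0.0, 0.0, 0.0, 0.0], bins))
print(expected_value([0.0, 5.0, 5.0, 0.0], bins))
print(gaussian_nll(2.0, 0.0, 1.0), gaussian_nll(2.0, 0.0, 2.0))
```

Because every bin contributes to the expected value, every score receives a gradient, unlike a hard argmax, which passes gradient through a single path at most.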