Training a CNN

15 important questions on Training a CNN

What does the loss describe about a prediction of a label in the training procedure of a CNN?

It describes how much the prediction of a label differs from the real label.

What does the Mean Square Error (MSE) do in the context of a loss function in the training process of a neural network?


The Mean Square Error loss function measures the squared distance between the predicted label and the real label, i.e. how far off the network's prediction is, and therefore how much the network should adapt its weights.

Why do we square the Mean Square Error (MSE)?

To penalize predictions that are further away from the true value more heavily: because the error is squared, the penalty grows quadratically with the distance.
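As a minimal sketch in Python (the helper name `mse` is illustrative, not from the course material), note how doubling the distance quadruples the contribution to the loss:

```python
# Minimal sketch of the Mean Square Error: squaring penalizes
# predictions that are further from the true value disproportionately.

def mse(predictions, targets):
    """Mean of squared differences between predictions and true labels."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# An error of 2 contributes four times as much as an error of 1:
small_miss = mse([1.0], [2.0])   # (1)^2 = 1.0
large_miss = mse([0.0], [2.0])   # (2)^2 = 4.0
```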

What does the sign of the cost of a neuron describe in consideration of the prediction value during the iterative process of training a convolutional neural network?

A positive cost indicates that the prediction value should decrease, and thus that the sum of the input values times their weights should decrease; a negative cost indicates the opposite.

How does gradient descent work in the context of backpropagation when training a neural network? What is the mathematical formula for doing so?


Gradient descent computes the partial derivatives of the error with respect to the weights and updates the weights in the direction that reduces the error. The relevant terms of the gradient are:
    * Error: C = (y^* - y)^2
    * Prediction: y^* = \sigma(z^L)
    * Activation: z^L = \sum_i w_i x_i
Applying the chain rule gives \partial C / \partial w_i = 2(y^* - y) \, \sigma'(z^L) \, x_i.
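These three terms can be combined into a single update step via the chain rule. A minimal sketch in Python for one sigmoid neuron (the function names are illustrative assumptions, not from the course material):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, x, y, eta):
    """One gradient-descent step for a single sigmoid neuron.

    C     = (y_hat - y)^2         (error)
    y_hat = sigmoid(z)            (prediction)
    z     = sum(w_i * x_i)        (activation)

    Chain rule: dC/dw_i = 2 * (y_hat - y) * sigmoid'(z) * x_i
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = sigmoid(z)
    dsig = y_hat * (1.0 - y_hat)                      # sigmoid'(z)
    grad = [2.0 * (y_hat - y) * dsig * xi for xi in x]
    return [wi - eta * gi for wi, gi in zip(w, grad)]
```

Repeating the step drives the prediction toward the true label, i.e. the error shrinks with each iteration.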

What is the default algorithm we have learned to use for backpropagation?

Gradient descent

What can we say about the gradient we apply to the input values of a max-pooling layer during backpropagation? What is the semantic outcome of this gradient resolution?

The gradient for the maximum input is 1; for all other inputs it is zero. Semantically, this means we propagate the loss (cost) back to that single neuron only.
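A minimal sketch of this gradient routing for one pooling window (the helper name is illustrative, not from the course material):

```python
def maxpool_backward(window, upstream_grad):
    """Backprop through max-pooling: the gradient is routed entirely
    to the position of the maximum; all other inputs receive zero."""
    idx = window.index(max(window))    # first maximum wins ties
    grads = [0.0] * len(window)
    grads[idx] = upstream_grad         # gradient of max w.r.t. the max input is 1
    return grads

# Only the neuron that produced the maximum (5.0) receives the loss:
maxpool_backward([1.0, 5.0, 3.0, 2.0], 0.7)   # → [0.0, 0.7, 0.0, 0.0]
```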

How are one-hot vectors applied in the training of a convolutional neural network?

The one-hot vector encodes the true label; the loss for mispredicting a label in a classification task is computed against it.

In what classification task should we use one-hot vectors?

When exactly one class (label) applies for any arbitrary input of the data set.

What is a one-hot vector?

A vector that represents the correct label in a classification task.
The vector has exactly one element equal to one and all others zero.
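A minimal sketch of such an encoding (the helper name is illustrative, not from the course material):

```python
def one_hot(label, num_classes):
    """Encode a class index as a one-hot vector: exactly one element
    is 1.0 (the correct label), all others are 0.0."""
    v = [0.0] * num_classes
    v[label] = 1.0
    return v

# Class 2 out of 4 classes:
one_hot(2, 4)   # → [0.0, 0.0, 1.0, 0.0]
```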

What does the learning rate \eta describe in gradient descent?

The learning rate is the step size of the gradient in a given iteration.
It describes the amount of movement in a given iteration.

What are the trade-offs for the height of the learning rate in a gradient descent?


A higher learning rate:
  • is better at exploring the function space
  • has a higher tendency to escape local minima
  • moves more quickly to good regions
However, a lower learning rate is required to:
  • move more stably
  • converge to a solution

Name three weight-initialization methods.

* Xavier initialization: initialize each weight uniformly sampled from [-1/sqrt(n), 1/sqrt(n)], where n is the number of inputs
* Normalized Xavier initialization: more suitable for deeper networks; the uniform sampling also takes the outputs into account: [-sqrt(6)/sqrt(ni+no), sqrt(6)/sqrt(ni+no)], where ni is the number of inputs and no the number of outputs
* Kaiming initialization: developed for ReLU; zero bias, weights drawn from a Gaussian distribution with mean 0 and stddev sqrt(2/n), where n is the number of inputs
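The three schemes can be sketched in Python as follows (function names are illustrative assumptions, not from the course material):

```python
import math
import random

def xavier(n_in):
    """Xavier: uniform in [-1/sqrt(n), 1/sqrt(n)], n = number of inputs."""
    limit = 1.0 / math.sqrt(n_in)
    return random.uniform(-limit, limit)

def normalized_xavier(n_in, n_out):
    """Normalized Xavier: also accounts for the number of outputs,
    uniform in [-sqrt(6)/sqrt(n_in + n_out), sqrt(6)/sqrt(n_in + n_out)]."""
    limit = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    return random.uniform(-limit, limit)

def kaiming(n_in):
    """Kaiming (He): Gaussian with mean 0 and stddev sqrt(2/n);
    the bias is initialized to zero."""
    return random.gauss(0.0, math.sqrt(2.0 / n_in))
```

Each function returns one sampled weight; in practice it would be called once per weight in a layer.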

How does Kaiming initialization work?

Initialize each weight from a Gaussian distribution with mean 0 and a standard deviation of sqrt(2/n), where n is the number of inputs.
The bias is initialized to zero.

What is the advantage of Kaiming initialization?

It is suitable for ReLU-based neural network layers.
