Training a CNN
15 important questions on Training a CNN
What does the loss describe about a prediction of a label in the training procedure of a CNN?
What does the Mean Square Error (MSE) do in the context of a loss function in the training process of a neural network?
The Mean Square Error loss describes the distance between the prediction of a label and the real label, i.e. how far off the prediction is. It therefore expresses how correct the neural network was at predicting the data, and thus how much the network should adapt its weights.
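A minimal NumPy sketch (the `mse` helper and values are illustrative, not from the study material): the loss is the squared distance between prediction and label, so predictions further from the label produce a larger loss.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error: average squared distance between prediction and label."""
    return np.mean((y_true - y_pred) ** 2)

# A prediction close to the label gives a small loss, a far-off prediction a large one.
print(mse(np.array([1.0, 0.0]), np.array([0.9, 0.1])))  # small loss
print(mse(np.array([1.0, 0.0]), np.array([0.2, 0.8])))  # large loss
```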
Why do we square the Mean Square Error (MSE)?
What does the sign of the cost of a neuron describe with respect to the prediction value during the iterative training of a convolutional neural network?
How does gradient descent work in the context of backpropagation when training a neural network? What are the mathematical formulae for doing so?
Gradient descent computes the partial derivatives of the error with respect to the weights and moves the weights a step in the direction that reduces the error. The relevant parts of the gradient formula are:
* Error: C = (y^* - y)^2
* Prediction: y^* = \sigma(Z^L)
* Weighted input (pre-activation): Z^L = \sum(w \cdot x)
* Chain rule: \partial C/\partial w = \partial C/\partial y^* \cdot \partial y^*/\partial Z^L \cdot \partial Z^L/\partial w
* Weight update: w \leftarrow w - \eta \cdot \partial C/\partial w
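A minimal sketch of gradient descent for a single sigmoid neuron, assuming the formulas above (variable names and values are illustrative, not from the material):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-neuron example.
x = np.array([0.5, -1.2, 0.3])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
y = 1.0                          # true label
eta = 0.1                        # learning rate

for _ in range(100):
    z = np.dot(w, x)             # Z^L = sum(w * x)
    y_hat = sigmoid(z)           # y* = sigma(Z^L)
    cost = (y_hat - y) ** 2      # C = (y* - y)^2

    # Chain rule: dC/dw = dC/dy* * dy*/dZ^L * dZ^L/dw
    dC_dy = 2 * (y_hat - y)
    dy_dz = y_hat * (1 - y_hat)  # derivative of the sigmoid
    dz_dw = x
    grad = dC_dy * dy_dz * dz_dw

    w = w - eta * grad           # gradient descent step
```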
What is the default algorithm we have learned to use for backpropagation?
What can we say about the gradient we apply to the input values of a max-pooling layer during backpropagation? What is the semantic outcome of this gradient resolution?
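The material gives no written answer here; as a hedged illustration of the commonly used rule (assumption: the incoming gradient is routed entirely to the input that was the maximum in the forward pass, so all other inputs receive zero gradient and only the "winning" input is updated), a NumPy sketch:

```python
import numpy as np

# One 2x2 max-pooling window and the gradient arriving from the next layer.
window = np.array([[1.0, 3.0],
                   [2.0, 0.5]])
grad_out = 0.7

# Route the whole gradient to the position that held the maximum; zeros elsewhere.
grad_in = np.zeros_like(window)
max_idx = np.unravel_index(np.argmax(window), window.shape)
grad_in[max_idx] = grad_out

print(grad_in)
# [[0.  0.7]
#  [0.  0. ]]
```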
How are one-hot vectors applied in the training of a convolutional neural network?
In what classification task should we use one-hot vectors?
What is a one-hot vector?
The vector has exactly one element equal to one and all others zero.
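A minimal sketch (the `one_hot` helper and class count are illustrative, not from the material):

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a vector with a 1 at the label index and 0 everywhere else."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```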
What does the learning rate \eta describe in gradient descent?
It describes the step size, i.e. how far the weights are moved along the gradient in a given iteration.
What are the trade-offs for the size of the learning rate in gradient descent?
A higher learning rate:
- explores the function space better
- has a higher tendency to escape local minima
- moves more quickly to good positions
A lower learning rate:
- moves more stably
- is more likely to converge to a solution
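To illustrate these trade-offs numerically, a toy sketch on the one-dimensional loss C(w) = w^2 (the function, step counts, and rates are assumptions, not from the material):

```python
def run(eta, steps=20, w=5.0):
    """Gradient descent on C(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(run(eta=0.1))   # small step: moves slowly but stably toward the minimum at 0
print(run(eta=0.9))   # large step: overshoots and oscillates around 0, yet still shrinks
print(run(eta=1.1))   # too large: every step overshoots further, so the weight diverges
```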
Name three weight-initialization methods.
* Normalized Xavier initialization: more suitable for deeper networks; weights are uniformly sampled from [-sqrt(6)/sqrt(ni+no), sqrt(6)/sqrt(ni+no)], so the number of outputs is taken into consideration as well; ni is the number of inputs, no the number of outputs
* Kaiming initialization: developed for ReLU; zero bias; weights drawn from a Gaussian distribution with mean 0 and stddev sqrt(2/n), n the number of inputs
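A minimal NumPy sketch of normalized Xavier initialization using the range above (layer sizes are illustrative assumptions):

```python
import numpy as np

n_in, n_out = 256, 128                        # ni inputs and no outputs of the layer
limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)  # bound of the uniform sampling range
W = np.random.uniform(-limit, limit, size=(n_out, n_in))
```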
How does Kaiming initialization work?
The weights are drawn from a Gaussian distribution with mean 0 and standard deviation sqrt(2/n), with n the number of inputs. The bias is initialized to zero.
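A minimal NumPy sketch of Kaiming initialization as described above (layer sizes are illustrative assumptions):

```python
import numpy as np

n_in, n_out = 256, 128
W = np.random.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_out, n_in))
b = np.zeros(n_out)  # zero bias
```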
What is the advantage of Kaiming initialization?