
Deep Learning Notes 04 | Weight Initialization and Gradient Check for DNN

1. Weight Initialization

A well-chosen initialization can:

  • Speed up the convergence of gradient descent
  • Increase the odds of gradient descent converging to a lower training (and generalization) error

Exploding and Vanishing Gradients

Training a neural network requires specifying initial values for the weights, and a well-chosen initialization method helps learning.

  • if the weights W are initialized larger than the identity matrix (W > I), the activations of a deep network grow exponentially with depth and explode
  • if W < I, the activations shrink exponentially with depth and eventually vanish; the gradients behave the same way (see the sketch below)
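
A minimal sketch of this effect (a purely linear toy network; the layer count and weight scales are arbitrary choices, not from the notes):

import numpy as np

x = np.ones((4, 1))                  # toy input vector
W_large = 1.5 * np.eye(4)            # weights "larger" than the identity
W_small = 0.5 * np.eye(4)            # weights "smaller" than the identity

a_large, a_small = x, x
for _ in range(50):                  # apply the same linear layer 50 times
    a_large = W_large @ a_large
    a_small = W_small @ a_small

print(np.linalg.norm(a_large))       # ~1.3e9   -> activations explode
print(np.linalg.norm(a_small))       # ~1.8e-15 -> activations vanish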

Zero Initialization:

  • initializing all weights to zero fails to break symmetry: every unit in a layer computes the same output and receives the same gradient, so the network never does better than a constant prediction (a short check below illustrates this)
import numpy as np

def initialize_parameters_zeros(layers_dims):

    parameters = {}
    L = len(layers_dims)            # number of layers in the network, including the input layer

    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))   # zero weight matrix
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))                  # zero bias vector
    return parameters

Output:

  • Train Accuracy: 50%
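
A hypothetical one-step check (toy layer sizes, not part of the original notes) of why this happens: with W = 0 every hidden unit produces the same activation, so every unit also receives the same gradient and symmetry is never broken.

import numpy as np

params = initialize_parameters_zeros([3, 4, 1])
X = np.random.randn(3, 5)                 # 5 toy examples with 3 features each

Z1 = params['W1'] @ X + params['b1']      # all zeros, since W1 and b1 are zero
A1 = np.maximum(0, Z1)                    # ReLU: still all zeros
print(np.unique(A1))                      # -> [0.]  every hidden unit is identical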

Random Initialization:

Random initialization breaks the symmetry problem above, but poorly scaled random values (here the weights are multiplied by 10) still lead to vanishing/exploding gradients and slow down the optimization algorithm; the sketch after the output below shows the effect.

  • parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1])*10
  • parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
def initialize_parameters_random(layers_dims):

    np.random.seed(3)               # this seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims)            # number of layers in the network, including the input layer

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1])*10   # large random weights
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))                          # zero biases

    return parameters

Output:

  • Train Accuracy: 83%
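
A short sketch (toy shapes assumed, not part of the original notes) of why the *10 scaling hurts: the very large pre-activations push a sigmoid output to values extremely close to 0 or 1, so the initial cost is large and the first updates are slow.

import numpy as np

np.random.seed(3)
params = initialize_parameters_random([3, 2, 1])
X = np.random.randn(3, 5)

A1 = np.maximum(0, params['W1'] @ X + params['b1'])   # ReLU hidden layer with large activations
Z2 = params['W2'] @ A1 + params['b2']                 # very large pre-activations
A2 = 1 / (1 + np.exp(-Z2))                            # sigmoid output, pinned near 0 or 1
print(A2)                                             # an overflow warning here is itself a symptom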

He Initialization

This initializes the weights to random values scaled by sqrt(2 / layers_dims[l-1]), the size of the previous layer, following He et al. (2015); this scaling works well with ReLU activations (a short usage sketch follows the output below).

  • parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1])*np.sqrt(2./layers_dims[l-1])
  • parameters['b' + str(l)] = np.zeros((layers_dims[l], 1)) (scaling a zero bias vector is a no-op, so the biases are simply zeros)
def initialize_parameters_he(layers_dims):

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1        # number of layers, excluding the input layer

    for l in range(1, L + 1):
        # scale the weights by sqrt(2 / size of the previous layer) -- He et al., 2015
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1])*np.sqrt(2./layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters

Output:

  • Train Accuracy: 99%
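
A quick usage sketch (the layer sizes below are just an example): He-initialized weights have standard deviation close to sqrt(2 / layers_dims[l-1]), which keeps the variance of ReLU activations roughly stable from layer to layer.

import numpy as np

params = initialize_parameters_he([100, 64, 32, 1])
print(params['W1'].shape)           # (64, 100)
print(params['W1'].std())           # ~0.14, i.e. close to sqrt(2/100)
print(params['b1'].ravel()[:3])     # biases start at zero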

2. Gradient Check

  • Don’t use gradient checking in training; use it only to debug
  • If the algorithm fails the check, look at the individual gradient components to locate the bug
  • Remember to include the regularization term when computing the gradients
  • Gradient checking doesn’t work with dropout; turn dropout off while checking
  • Run it at random initialization, and perhaps again after some training (a minimal sketch of the procedure follows)
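
A minimal sketch of the procedure (the helper below is illustrative, not the original assignment's code): compare the gradient from backprop with a two-sided numerical estimate, one parameter at a time, and look at their relative difference.

import numpy as np

def gradient_check(cost_fn, theta, analytic_grad, epsilon=1e-7):
    # cost_fn maps a flat parameter vector to a scalar cost;
    # analytic_grad is the gradient of cost_fn at theta computed by backprop
    numeric_grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        numeric_grad[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)

    # relative difference: around 1e-7 is great, above 1e-3 almost certainly signals a bug
    return (np.linalg.norm(analytic_grad - numeric_grad)
            / (np.linalg.norm(analytic_grad) + np.linalg.norm(numeric_grad)))

# sanity check on a cost with a known gradient: J(theta) = sum(theta**2), dJ/dtheta = 2*theta
theta = np.random.randn(5)
print(gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta))   # tiny, far below the 1e-3 bug threshold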