Hyperparameter tuning, Regularization and Optimization: Week 1 Notes

Practical aspects of Deep Learning

Posted by baiyf on November 17, 2017

Setting up your Machine Learning Application

Bias/Variance trade-off

Bias shows up as the training-set error; variance shows up as the gap between the training-set and dev-set errors. For example:

                    high variance   high bias   high bias & high variance   low bias & low variance
  Train set error        1%            15%                15%                       0.5%
  Dev set error         11%            16%                30%                        1%
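
As a rough sketch of this diagnosis in code (the thresholds and the helper name are arbitrary assumptions for illustration, assuming errors are given as fractions and Bayes error is near 0%):

def diagnose(train_err, dev_err, bias_threshold=0.05, variance_threshold=0.05):
    high_bias = train_err > bias_threshold                        # poor fit even on the training set
    high_variance = (dev_err - train_err) > variance_threshold    # large train-to-dev gap
    return high_bias, high_variance

print(diagnose(0.01, 0.11))   # (False, True)  -> high variance
print(diagnose(0.15, 0.16))   # (True, False)  -> high bias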

Regularization

L2 Regularization

L2 regularization adds a regularization term to the cost function $J$ and optimizes this regularized objective as a whole. It is also known as "weight decay", because each gradient-descent step ends up multiplying the weights by a factor slightly smaller than 1.

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\lVert w\rVert_2^2$$

where $\lVert w\rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^T w$; for the weight matrix of layer $l$, the corresponding (Frobenius) norm is $\lVert W^{[l]}\rVert_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\left(W^{[l]}_{ij}\right)^2$.

The L1 regularization term is $\frac{\lambda}{2m}\sum_{j=1}^{n_x}\lvert w_j\rvert$ instead.
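
A minimal numpy sketch of adding the L2 term to the cost (the names parameters, with per-layer matrices W1, W2, ..., and cross_entropy_cost are assumptions for illustration, not from the original notes):

import numpy as np

def compute_cost_with_l2(cross_entropy_cost, parameters, lambd, m):
    # Sum of squared Frobenius norms over all layer weight matrices W1, W2, ...
    l2_sum = sum(np.sum(np.square(parameters[key]))
                 for key in parameters if key.startswith("W"))
    # Add lambda/(2m) * sum of ||W^[l]||_F^2 to the unregularized cost
    return cross_entropy_cost + (lambd / (2 * m)) * l2_sum

During backprop, each dW^[l] then gets an extra (λ/m)·W^[l] term, which is exactly the "weight decay" effect mentioned above.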

Dropout Regularization

Dropout is used a lot in computer vision to prevent overfitting. Its main drawback is that the cost function $J$ is no longer well defined, so dropout is usually switched off (keep_prob = 1) while debugging, e.g. when checking that $J$ decreases monotonically.

"Inverted dropout", illustrated for layer $l = 3$ with keep_prob = 0.8:

import numpy as np

# Inverted dropout, forward pass for layer 3
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean mask: keep each unit with probability keep_prob
a3 = np.multiply(a3, d3)  # shut down the dropped units
a3 /= keep_prob           # scale up so the expected value of a3 is unchanged
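
The same mask is reused during backprop; a minimal sketch, assuming dA3 (a name not in the original notes) holds the gradient flowing back into layer 3:

dA3 = np.multiply(dA3, d3)  # drop the same units as in the forward pass
dA3 /= keep_prob            # rescale, mirroring the inverted-dropout forward pass

At test time no units are dropped and no scaling is applied; keeping the test-time forward pass unchanged is the point of inverted dropout.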

Early stopping

Stop training as soon as the dev-set error starts to increase, even though the training error is still decreasing.
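
A minimal sketch of the loop (params, max_epochs, and the helpers train_one_epoch and dev_error are hypothetical names, not from the original notes):

best_dev, best_params, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(max_epochs):
    params = train_one_epoch(params)   # one pass of gradient descent over the training set
    err = dev_error(params)            # error on the held-out dev set
    if err < best_dev:
        best_dev, best_params, bad_epochs = err, params, 0  # new best: remember these parameters
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # dev error has not improved for `patience` epochs in a row
            break                      # stop early and keep best_params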

Normalizing inputs

Preprocess the inputs so that every dimension has mean 0 and variance 1. This standardization makes the cost function $J$ easier to optimize, because its contours become more symmetric and gradient descent can take larger steps.

Zero out the mean:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$$
$$x := x - \mu$$

Scale the variance to 1:

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}\right)^2 \quad \text{(element-wise, computed after subtracting the mean)}$$
$$x := x / \sigma$$
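
A minimal numpy sketch, assuming X holds one training example per column (shape (n_x, m)):

import numpy as np

mu = np.mean(X, axis=1, keepdims=True)    # per-feature mean, shape (n_x, 1)
X = X - mu                                # zero-center
sigma = np.std(X, axis=1, keepdims=True)  # per-feature standard deviation
X = X / sigma                             # scale each feature to unit variance

The same mu and sigma computed on the training set should also be used to normalize the dev and test sets.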

Vanishing and exploding gradients

In a very deep network, weights $W^{[l]} > I$ (slightly larger than the identity) make the activations, and hence the gradients, grow exponentially with depth, causing exploding gradients; weights $W^{[l]} < I$ make them shrink exponentially, causing vanishing gradients.
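
A toy numerical illustration (not from the original notes): a 50-layer linear network whose weight matrices are all $1.5I$ or all $0.5I$ scales its input by roughly $1.5^{50}$ or $0.5^{50}$:

import numpy as np

L, n = 50, 4
x = np.ones((n, 1))
for scale in (1.5, 0.5):
    W = scale * np.eye(n)
    a = x
    for _ in range(L):                 # linear activations, so a = W^L x
        a = W @ a
    print(scale, np.linalg.norm(a))    # ~1.5^50 (huge) vs ~0.5^50 (tiny)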

Weight initialization for deep networks

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

The more inputs $n$ there are, the more terms get summed, so to keep $z$ from blowing up the individual $w_i$ should be smaller; a sensible choice is $\mathrm{Var}(w_i) = \frac{1}{n}$.

# n_l = n^[l] (units in layer l), n_prev = n^[l-1] (units in the previous layer)
W_l = np.random.randn(n_l, n_prev) * np.sqrt(2 / n_prev)  # He initialization (for ReLU), variance 2/n^[l-1]
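
A minimal sketch of initializing every layer this way (layer_dims is an assumed list [n_x, n^[1], ..., n^[L]], not a name from the original notes):

import numpy as np

def initialize_parameters_he(layer_dims):
    parameters = {}
    for l in range(1, len(layer_dims)):
        # He initialization: Gaussian weights scaled by sqrt(2 / n^[l-1]); biases start at zero
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters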

Numerical gradient estimation and gradient checking

When estimating a gradient numerically, the two-sided difference is more accurate than the one-sided difference: its error is $O(\varepsilon^2)$ rather than $O(\varepsilon)$.
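
For reference, the two standard approximations for a scalar parameter $\theta$ are

$$\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon} \quad \text{(two-sided, error } O(\varepsilon^2)\text{)}, \qquad \frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta)}{\varepsilon} \quad \text{(one-sided, error } O(\varepsilon)\text{)}.$$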

Gradient checking

Instructions:

  • First compute “gradapprox” using the formula above (1) and a small value of $\varepsilon$. Here are the Steps to follow:
    1. $\theta^{+} = \theta + \varepsilon$
    2. $\theta^{-} = \theta - \varepsilon$
    3. $J^{+} = J(\theta^{+})$
    4. $J^{-} = J(\theta^{-})$
    5. $gradapprox = \dfrac{J^{+} - J^{-}}{2\varepsilon}$
  • Then compute the gradient using backward propagation, and store the result in a variable “grad”
  • Finally, compute the relative difference between “gradapprox” and the “grad” using the following formula: $difference = \dfrac{\lVert grad - gradapprox \rVert_2}{\lVert grad \rVert_2 + \lVert gradapprox \rVert_2}$ You will need 3 Steps to compute this formula (a full numpy sketch follows after this list):
    • 1’. compute the numerator using np.linalg.norm(…)
    • 2’. compute the denominator. You will need to call np.linalg.norm(…) twice.
    • 3’. divide them.
  • If this difference is small (say less than $10^{-7}$), you can be quite confident that you have computed your gradient correctly. Otherwise, there may be a mistake in the gradient computation.
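
Putting the steps above together, a minimal sketch for a scalar parameter (the toy cost J(theta) = theta * x and the forward/backward helpers are illustrative assumptions, not from the original notes):

import numpy as np

def forward(x, theta):
    return theta * x                          # toy cost J(theta) = theta * x

def backward(x, theta):
    return x                                  # analytic gradient dJ/dtheta from "backprop"

def gradient_check(x, theta, epsilon=1e-7):
    # Two-sided numerical approximation of the gradient
    theta_plus = theta + epsilon
    theta_minus = theta - epsilon
    J_plus = forward(x, theta_plus)
    J_minus = forward(x, theta_minus)
    gradapprox = (J_plus - J_minus) / (2 * epsilon)

    grad = backward(x, theta)                 # gradient computed by backward propagation

    numerator = np.linalg.norm(grad - gradapprox)                    # step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)  # step 2'
    difference = numerator / denominator                             # step 3'
    return difference                         # should be smaller than ~1e-7 if backprop is correct

print(gradient_check(x=2.0, theta=4.0))       # prints a value very close to 0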