《神经网络与机器学习》第8章泛化与正则化

神经网络与机器学习

第8章泛化与正则化

神经网络设计中一个重要的问题是如何确定神经元数目，当神经元数目很大时，虽然在测试数据集合上表现非常好，但是在新的数据集合进行测试时却表现很差，我们称为泛化（generalization）能力差，泛化能力就是指在测试数据集合和新的测试数据集合都表现好。原因就是，当神经元数目太多，那么权系数参数维数过大，会造成过拟合现象，从而泛化能力差。

§8.1 奥卡姆剃刀原则

Ockham's razor奥卡姆剃刀原则：英国逻辑学家奥卡姆提出的一个原则是模型越复杂则出错的可能性越大。推广到神经网络设计中，那么具有良好泛化能力的神经网络应该具有最简模型，最少的自由参数。

为了获得最简神经网络模型，一般有生长法、剪枝法、全局搜索、正则化（regularization）、以及提前终止法（early stopping）。这里我们主要讨论后面两种。

问题描述：

训练集合

\[\left \{ (p_1,t_1),(p_2,t_2),\cdots,(p_Q,t_Q) \right \}\]

$p_q$表示的是网络输出，$t_q$是正确的目标$g(p_q)$加上观测误差噪声$\xi_q$

\[t_q=g(p_q)+\xi_q\]

$g$表示未知的函数。我们将神经网络进行训练，设其输出为$a_q$，目的是逼近函数$g(p_q)$而不受到噪声$\xi_q$的影响，使得训练集上的误差平方和最小

\[F(x)=\sum_{q=1}^Q(t_q-a_q)^{\mathrm{T}}(t_q-a_q)\]

下图表示了过拟合（overfitting）一个例子，蓝色的为真实函数曲线，而数据点是函数值加上了干扰，区间为[-3,0]，拟合曲线虽然经过了各个数据观测点，但是与真实的函数值却没有匹配，就是过拟合！

图8.1 过拟合现象以及外推预测错误

这个神经网络还要第二类错误，就是对于没有数据进行训练的区域[0，3]外推预测错误。这类错误是难以避免的，即使我们如图8.2能够泛化较好，避免过拟合现象，依然难以消除外推预测错误。我们这里只是讨论如何提高泛化能力。

图8.2 泛化能力较好但是还是有外推预测错误

提前终止法易于理解，如下图，通常将数据集合70%的数据（o）进行训练，15%（⊕）进行验证，15%进行测试，当然不是一个固定的比例。随着训练的进行，训练次数的增加，网络复杂度也在增加，如果训练达到极小误差之前提前终止，那么网络有效地使用较少的参数，不至于产生过拟合，比如图中在a阶段拟合比较光滑，泛化能力也好，b阶段虽然拟合数据点很好，但是泛化差，验证数据点错误较多。

图8.3 交叉验证性能曲线

§8.2 Tikhonov正则化

当训练样本所包含的信息不能够由自身充分地唯一重构出输入输出映射，产生了机器学习的过拟合可能，那么为了克服这个严重的病态问题，我们在经验代价函数上加上正则化参数控制的正则化代价函数，使得代价函数最小化限制在压缩子集上的超平面重构问题，使得代价函数在训练样本提供的精度和解的光滑程度上进行了折中。正则化理论有经典的Tikhonov正则化和流形正则化。

上述这段说的非常学术化，我们看下面经典的Tikhonov正则化理论与线性回归，能够很好地理解正则化，也能很容易推广到神经网络训练中，主要的改变在于对目标函数的改造。

图8.4 线性回归模型

我们认为响应$d$是期望信号，输入数据$x$是称为回归量，并且不知道二者符合什么确切的函数关系，我们这里假设为线性关系

\[d=w^{\mathrm{T}}x+\epsilon\]

贝叶斯解释

假设$x$和参数向量$w$相互独立，参数向量$w$的信息是期望信号$d$包含的，那么在观测数据$x$条件下，我们将$w$,$d$的联合概率密度写成

\[f(w,d|x)=f(w|d,x)f(d)=f(d|w,x)f(w)\]

$f(w)$假定参数分布，称为先验概率分布（已知）；

$f(d|w,x)$给定x和参数向量w条件下对于期望信号的观测概率密度；

根据贝叶斯原理，联合概率密度还可以表示为下面乘积

$f(w|d,x)$后验概率密度（未知）

$f(d)$期望响应的概率密度

最大似然估计：是求使得对于期望信号d的观测概率密度$f(d|w,x)$最大的参数向量$w$

\[\widehat{w}_{ml}=argmaxf(d|w,x)\]

最大后验估计：是求后验概率密度$f(w|d,x)$已知，对于$w$，因此权向量的后验概率密度函数为

\[f(w|d,x)=\frac{f(d|w,x)f(w)}{f(d)}=\frac{f(d|w,x)f(w)}{\int f(d|w,x)f(w)dw}\propto f(d|w,x)f(w)\]

利用上式求得最大后验估计为

\[\widehat{w}_{map}=argmaxf(d|w,x)f(w)\]

最大后验估计的好处：1充分利用了有关参数向量$w$的所有信息；2是避免了解的非唯一性，整合了先验概率的估计器更加稳定，肯定有解。

例：设各个权系数相互独立，权向量$w$先验分布是高斯分布

\[f(w)=\prod_{n=1}^N \frac{1}{\sqrt{2\pi}\sigma_w}e^{-\frac{w_n^2}{2\sigma_w^2}}=(\frac{1}{\sqrt{2\pi}\sigma_w})^Ne^{-\frac{\left \| w \right \|^2}{2\sigma_w^2}}\]

这里欧式范数L2-范数

\[\left \| w \right \|=\sqrt{\sum w_n^2}\]

设进行了$M$次实验，得到$M$组观测数据$x_m$，主要$x_m$都是$N\times 1$的向量。观测误差为

\[\epsilon_m=d_m-w^{\mathrm{T}}x_m\]

也是高斯分布的，均值为0，方差为$\sigma^2$，那么观测值$d_m$

\[d_m=w^{\mathrm{T}}x_m+\epsilon_m\]

的概率密度函数就是以$w^{\mathrm{T}}x_i$为均值，以$\sigma^2$为方差的联合高斯分布

\[f(d|w,x)=\left ( \frac{1}{\sqrt{2\pi}\sigma} \right )^Me^{- \frac{\sum_{m=1}^M(d_m-w^{\mathrm{T}}x_m)^2}{2\sigma^2}}\]

那么我们可以得到后验概率密度

\[f(w|d,x)\propto f(d|w,x)f(w)=(\frac{1}{\sqrt{2\pi}\sigma_w})^N(\frac{1}{\sqrt{2\pi}\sigma})^Mexp\left [-\frac{\sum_{m=1}^M(d_m-w^{\mathrm{T}}x_m)^2}{2\sigma^2}-\frac{\left \| w \right \|^2}{2\sigma_w^2} \right ]\]

那么求上式概率密度最大，无关的项去掉，并且取对数得到

\[\widehat{w}_{map}=argmax f(d|w,x)f(w)\]

\[=\underset{w}{argmax}\left \{ -\frac{\sum_{m=1}^M(d_m-w^{\mathrm{T}}x_m)^2}{2\sigma^2}-\frac{\left \| w \right \|^2}{2\sigma_w^2} \right \}\]

引入参数

\[\frac{\sigma^2}{\sigma_w^2}=\beta\]

定义二次代价函数$\xi (w)$

\[\xi(w)=\frac{1}{2}\sum_{m=1}^M(d_m-w^{\mathrm{T}}x_m)^2+\frac{\beta}{2}\left \| w \right \|^2\]

最小二乘统计量？

那么$\xi (w)$是参数向量的二次函数，最大后验估计求解可以写成梯度为0，即

\[\triangledown \xi(w)=-\sum_{j=1}^N\sum_{m=1}^M(d_m-w^{\mathrm{T}}x_m)x_j^{\mathrm{T}}+\beta w^{\mathrm{T}}=0\]

这里设时间平均相关矩阵

\[\overline{R}_{xx}=\sum_{j=1}^N\sum_{m=1}^Mx_mx_j^{\mathrm{T}}\]

时间平均相关向量

\[\overline{p}_{dx}=\sum_{j=1}^N\sum_{m=1}^Md_mx_j^{\mathrm{T}}\]

那么得到

\[\widehat{w}_{map}=[\overline{R}_{xx}+\beta I]^{-1}\overline{p}_{dx}\]

*如果$\frac{\sigma^2}{\sigma_w^2}=\beta$，$\sigma_w^2$非常大，$\beta \approx 0$，意味着$w$先验分布在取值范围内一致，几乎没有什么区别，那么最大后验估计就退化为最大似然估计。

工具变量法：实际输入数据再加入适量随机干扰

\[z_i=x_i+\eta_i\]

$\eta_i$与$x_i$无关，相互独立。那么最大似然估计就变成了

\[\widehat{w}_{ml}=[R_{zz}]^{-1}p_{dz}=[R_{xx}+\sigma_{\eta}^2I]^{-1}p_{dx}\]

\[R_{zz}=\sum_{j=1}^N\sum_{i=1}^Nz_iz_j^{\mathrm{T}}=R_{xx}+\sigma_\eta^2I\]

\[p_{dz}=\sum_{i=1}^Nd_iz_i\]

可以看出噪声的加入实质就是起到了稳定器，即正则器的作用。这就是我们主动加入噪声有益的一个证据！到底加入多少噪声比较好啊？优化$\sigma_\eta^2$，太小等于0不好，太大也不会好，存在一个最优范围！就是噪声有益性，或者noise-boosted提升。

正则化

https://blog.csdn.net/zouxy09/article/details/24971995

机器学习中的范数规则化之（一）L0、L1与L2范数

https://blog.csdn.net/zouxy09?type=blog

https://blog.csdn.net/zouxy09/article/details/8537620#comments_15124136

https://blog.csdn.net/joysurf/article/details/102936577

https://www.cnblogs.com/Belter/p/8536939.html

从贝叶斯估计的角度来看，规则化项$\frac{\beta}{2}||w||^2$对应于模型的先验概率。还有个说法就是，规则化是结构风险最小化策略的实现，是在经验风险上加一个正则化项(regularizer)或惩罚项(penalty term)。

一般来说，监督学习可以看做最小化下面的目标函数：

\[w^*=argmax\sum_iL(d_i,f(x_i,w))+\beta \phi(w)\]

其中，第一项$L(d_i,f(x_i,w))$衡量对第i个样本的预测值$f(x_i,w)$和真实的标签$d_i$之间的误差，这个需要拟合训练样本，所以我们希望这一项最小。但正如上面所言，我们不仅要保证训练误差最小，我们更希望我们的模型测试误差小，所以我们需要加上第二项，也就是对参数$w$的规则化函数$\phi(w)$去约束我们的模型尽量的简单，提高泛化能力。

规则函数：范数

无论回归还是机器学习，大部分带参模型的目标函数都和这个类似。对于第一项损失函数，如果是平方函数Square loss，那就是最小二乘（高斯分布下的贝叶斯最大后验）；如果是合页损失函数Hinge Loss$(L(d_i,f(x_i,w))=max(0,1-df(x)))$，那就是著名的支持向量机（SVM）了；如果是exp-Loss，那就是提升法Boosting了；如果是log-Loss，那就是逻辑回归Logistic Regression了；还有交叉熵。不同的损失函数，具有不同的拟合特性，这个也得就具体问题具体分析的。但这里，我们关注规则化函数$\phi(w)$。

规则化函数$\phi(w)$也有很多种选择，一般是模型复杂度的单调递增函数，模型越复杂，规则化值就越大。比如，规则化项可以是模型参数向量的范数。然而，不同的规则化函数$\phi(w)$选择对参数$w$的约束不同，取得的效果也不同，但我们在论文中常见的都聚集在：L0范数、L1范数、L2范数、迹范数、Frobenius范数和核范数等等。

L0范数是指向量中非0的元素的个数。如果我们用L0范数来规则化一个参数矩阵W的话，就是希望W的大部分元素都是0。换句话说，让参数W是稀疏的。与"压缩感知"和"稀疏编码"中稀疏一个意思。L1范数是指向量中各个元素绝对值之和，也叫"稀疏规则算子"（Lasso regularization）。

为什么L1范数会使权值稀疏？它是L0范数的最优凸近似"。实际上，还存在一个更美的回答：任何的规则化算子，如果他在W=0的地方不可微，并且可以分解为一个"求和"的形式，那么这个规则化算子就可以实现稀疏。这说是这么说，W的L1范数是绝对值，|w|在w=0处是不可微，但这还是不够直观。这里因为我们需要和L2范数进行对比分析。

还有一个问题：既然L0可以实现稀疏，为什么不用L0，而要用L1呢？个人理解一是因为L0范数很难优化求解（NP难问题），二是L1范数是L0范数的最优凸近似，而且它比L0范数要容易优化求解而被广泛应用。

参数稀疏有什么好处呢？

1）特征选择(Feature Selection)：

大家对稀疏规则化趋之若鹜的一个关键原因在于它能实现特征的自动选择。一般来说，$x$的大部分元素（也就是特征）都是和最终的输出$d$没有关系或者不提供任何信息的，在最小化目标函数的时候考虑$x$这些额外的特征，虽然可以获得更小的训练误差，但在预测新的样本时，这些没用的信息反而会被考虑，从而干扰了对正确$d$的预测。稀疏规则化算子的引入就是为了完成特征自动选择的光荣使命，它会学习地去掉没有信息的特征，也就是把这些特征对应的权置0。

2）可解释性好(Interpretability)：

另外一个青睐于稀疏的理由是，模型更容易解释。例如患某种病的概率是y，然后我们收集到的数据x是1000维的，也就是我们需要寻找这1000种因素到底是怎么影响患上这种病的概率的通过学习，如果最后学习到的w*就只有很少的非零元素，例如只有5个非零的wi，那么我们就有理由相信，这些对应的特征在患病分析上面提供的信息是巨大的，决策性的。也就是说，患不患这种病只和这5个因素有关。

除了L1范数（Lasso Regression），还有一种更受宠幸的规则化范数是L2范数$||w||^2$。它也不逊于L1范数，在回归里面，有人把有它的回归叫"岭回归"（Ridge Regression），有人也叫它"权值衰减weight decay"。

图8.5 L1和L2范数的优化解释

可以看到，L1-ball 与L2-ball 的不同就在于L1在和每个坐标轴相交的地方都有"角"出现，而目标函数的测地线除非位置摆得非常好，大部分时候都会在角的地方相交。注意到在角的位置就会产生稀疏性，而更高维的时候（想象一下三维的L1-ball 是什么样的？）除了角点以外，还有很多边的轮廓也是既有很大的概率成为第一次相交的地方，又会产生稀疏性。

相比之下，L2-ball 就没有这样的性质，因为没有角，所以第一次相交的地方出现在具有稀疏性的位置的概率就变得非常小了。这就从直观上来解释了为什么L1-regularization 能产生稀疏性，而L2-regularization 不行的原因了。

因此，一句话总结就是：L1会趋向于产生少量的特征，而其他的特征都是0，而L2会选择更多的特征，这些特征都会接近于0。Lasso在特征选择时候非常有用，而Ridge就只是一种规则化而已。

理论解释：设二次风险函数+L2范数规则函数

\[J=\left \| d-w^{\mathrm{T}}X \right \|_2^2+\beta\left \| w \right \|_2^2=(d-w^{\mathrm{T}}X)^{\mathrm{T}}(d-w^{\mathrm{T}}X)+\beta w^{\mathrm{T}}w\]

\[\frac{\partial J}{\partial w}=2X^{\mathrm{T}}Xw-2X^{\mathrm{T}}d+2\beta w=0\]

得出

\[w^*=(X^{\mathrm{T}}X+\beta I)^{-1}X^{\mathrm{T}}d\]

我们可以设计梯度下降法，学习率为$\alpha$，权值的求解为

\[w_{k+1}=w_k-\alpha \frac{\partial E}{\partial w}\Bigl|_{w_k}\]

注意如果规则函数中参数$\beta =0$，那么就是经典的最小二乘解

\[w^o=(X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}d\]

我们看看$w^*$和$w^o$之间的关系是什么?

\[w^*=(X^{\mathrm{T}}X+\beta I)^{-1}X^{\mathrm{T}}d=(X^{\mathrm{T}}X+\beta I)^{-1}(X^{\mathrm{T}}X)(X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}d\]

\[w^*=(X^{\mathrm{T}}X+\beta I)^{-1}(X^{\mathrm{T}}X)w^o\]

由于矩阵$X^{\mathrm{T}}X$是实对称阵，必然有正交分解

\[X^{\mathrm{T}}X=V\wedge V^{\mathrm{T}}\]

那么

\[w^*=(V\wedge V^{\mathrm{T}}+\beta VV^{\mathrm{T}})^{-1}(V\wedge V^{\mathrm{T}})w^o\]

\[=V(\wedge +\beta I)^{-1}\wedge V^{\mathrm{T}}w^o\]

\[w^*=V\begin{bmatrix} \frac{\lambda_1}{\lambda_1 +\beta} & 0 & \cdots & 0\\ 0 & \frac{\lambda_2}{\lambda_2 +\beta} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \frac{\lambda_N}{\lambda_N +\beta} \end{bmatrix}V^{\mathrm{T}}w^o\]

\[V^{\mathrm{T}}w^*=\begin{bmatrix} \frac{\lambda_1}{\lambda_1 +\beta} & 0 & \cdots & 0\\ 0 & \frac{\lambda_2}{\lambda_2 +\beta} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \frac{\lambda_N}{\lambda_N +\beta} \end{bmatrix}V^{\mathrm{T}}w^o\]

可以看到$V^{\mathrm{T}}$为基的正交变换下，$w^o$系数被缩减$\frac{\lambda_n}{\lambda_n +\beta} $，$\lambda_n$越小则缩减的越大！也可以从凸函数的角度来讲，凸性越强的方向$\lambda_n$越大，缩减小，凸性越弱的方向$\lambda_n$越小，缩减大些，使其凸性强。

图8.5 L2规则化函数+均方损失函数

x=-10:1:10;

y=x;

[X,Y]=meshgrid(x,y);

Z=(X+Y-1).^2;

surf(X,Y,Z)

axis([-10 10 -10 10 0 20])

hold on

Z=(X+Y-1).^2+0.5*(X.^2+Y.^2);

surf(X,Y,Z)

作业：

1 python完成

https://www.cnblogs.com/Belter/p/8536939.html中例。

图8.6 岭回归曲线

2 吴恩达机器学习第5道编程题正则化logistic

https://www.coursera.org/learn/machine-learning

https://github.com/TingNie/Coursera-ML-using-matlab-python

问题是拟合水位和大坝水流量曲线，这里规定

X, y---训练集合

Xval, yval---交叉验证集合（cross validation）

Xtest, ytest---测试集合

图8.7 观测、验证和测试数据

损失函数为

如果是线性回归，那么梯度为

function [J, grad] = linearRegCostFunction(X, y, theta, lambda)

%LINEARREGCOSTFUNCTION Compute cost and gradient for regularized linear

%regression with multiple variables

% [J, grad] = LINEARREGCOSTFUNCTION(X, y, theta, lambda) computes the

% cost of using theta as the parameter for linear regression to fit the

% data points in X and y. Returns the cost in J and the gradient in grad

% Initialize some useful values

m = length(y); % number of training examples

% You need to return the following variables correctly

J = 0;

grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================

% Instructions: Compute the cost and gradient of regularized linear

% regression for a particular choice of theta.

% You should set J to the cost and grad to the gradient.

% theta.shape=[n, 1]

% X.shape = [m, n], y.shape = [m, 1]

y_pred = X * theta;

J = 1.0 / (2*m) * sum(sum((y_pred - y).^2)) + lambda / (2*m) * sum(sum(theta(2:end,:).^2));

grad = 1.0 / m * X'*(y_pred-y);

grad(2:end, :) = grad(2:end, :) + lambda / m * theta(2:end, :);

训练则是用共轭梯度法 fmincg

function [theta] = trainLinearReg(X, y, lambda)

%TRAINLINEARREG Trains linear regression given a dataset (X, y) and a

%regularization parameter lambda

% [theta] = TRAINLINEARREG (X, y, lambda) trains linear regression using

% the dataset (X, y) and regularization parameter lambda. Returns the

% trained parameters theta.

% Initialize Theta

initial_theta = zeros(size(X, 2), 1);

% Create "short hand" for the cost function to be minimized

costFunction = @(t) linearRegCostFunction(X, y, t, lambda);

% Now, costFunction is a function that takes in only one argument

options = optimset('MaxIter', 200, 'GradObj', 'on');

% Minimize using fmincg

theta = fmincg(costFunction, initial_theta, options);

end

function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)

% Minimize a continuous differentialble multivariate function. Starting point

% is given by "X" (D by 1), and the function named in the string "f", must

% return a function value and a vector of partial derivatives. The Polack-

% Ribiere flavour of conjugate gradients is used to compute search directions,

% and a line search using quadratic and cubic polynomial approximations and the

% Wolfe-Powell stopping criteria is used together with the slope ratio method

% for guessing initial step sizes. Additionally a bunch of checks are made to

% make sure that exploration is taking place and that extrapolation will not

% be unboundedly large. The "length" gives the length of the run: if it is

% positive, it gives the maximum number of line searches, if negative its

% absolute gives the maximum allowed number of function evaluations. You can

% (optionally) give "length" a second component, which will indicate the

% reduction in function value to be expected in the first line-search (defaults

% to 1.0). The function returns when either its length is up, or if no further

% progress can be made (ie, we are at a minimum, or so close that due to

% numerical problems, we cannot get any closer). If the function terminates

% within a few iterations, it could be an indication that the function value

% and derivatives are not consistent (ie, there may be a bug in the

% implementation of your "f" function). The function returns the found

% solution "X", a vector of function values "fX" indicating the progress made

% and "i" the number of iterations (line searches or function evaluations,

% depending on the sign of "length") used.

% Usage: [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)

% See also: checkgrad

% Permission is granted for anyone to copy, use, or modify these

% programs and accompanying documents for purposes of research or

% education, provided this copyright notice is retained, and note is

% made of any changes that have been made.

% These programs and documents are distributed without any warranty,

% express or implied. As the programs were written for research

% purposes only, they have not been tested to the degree that would be

% advisable in any important application. All use of these programs is

% entirely at the user's own risk.

% [ml-class] Changes Made:

% 1) Function name and argument specifications

% 2) Output display

% Read options

if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')

length = options.MaxIter;

else

length = 100;

end

RHO = 0.01; % a bunch of constants for line searches

SIG = 0.5; % RHO and SIG are the constants in the Wolfe-Powell conditions

INT = 0.1; % don't reevaluate within 0.1 of the limit of the current bracket

EXT = 3.0; % extrapolate maximum 3 times the current bracket

MAX = 20; % max 20 function evaluations per line search

RATIO = 100; % maximum allowed slope ratio

argstr = ['feval(f, X']; % compose string used to call function

for i = 1:(nargin - 3)

argstr = [argstr, ',P', int2str(i)];

end

argstr = [argstr, ')'];

if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end

S=['Iteration '];

i = 0; % zero the run length counter

ls_failed = 0; % no previous line search has failed

fX = [];

[f1 df1] = eval(argstr); % get function value and gradient

i = i + (length<0); % count epochs?!

s = -df1; % search direction is steepest

d1 = -s'*s; % this is the slope

z1 = red/(1-d1); % initial step is red/(|s|+1)

while i < abs(length) % while not finished

i = i + (length>0); % count iterations?!

X0 = X; f0 = f1; df0 = df1; % make a copy of current values

X = X + z1*s; % begin line search

[f2 df2] = eval(argstr);

i = i + (length<0); % count epochs?!

d2 = df2'*s;

f3 = f1; d3 = d1; z3 = -z1; % initialize point 3 equal to point 1

if length>0, M = MAX; else M = min(MAX, -length-i); end

success = 0; limit = -1; % initialize quanteties

while 1

while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0)

limit = z1; % tighten the bracket

if f2 > f1

z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3); % quadratic fit

else

A = 6*(f2-f3)/z3+3*(d2+d3); % cubic fit

B = 3*(f3-f2)-z3*(d3+2*d2);

z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A; % numerical error possible - ok!

end

if isnan(z2) || isinf(z2)

z2 = z3/2; % if we had a numerical problem then bisect

end

z2 = max(min(z2, INT*z3),(1-INT)*z3); % don't accept too close to limits

z1 = z1 + z2; % update the step

X = X + z2*s;

[f2 df2] = eval(argstr);

M = M - 1; i = i + (length<0); % count epochs?!

d2 = df2'*s;

z3 = z3-z2; % z3 is now relative to the location of z2

end

if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1

break; % this is a failure

elseif d2 > SIG*d1

success = 1; break; % success

elseif M == 0

break; % failure

end

A = 6*(f2-f3)/z3+3*(d2+d3); % make cubic extrapolation

B = 3*(f3-f2)-z3*(d3+2*d2);

z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3)); % num. error possible - ok!

if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0 % num prob or wrong sign?

if limit < -0.5 % if we have no upper limit

z2 = z1 * (EXT-1); % the extrapolate the maximum amount

else

z2 = (limit-z1)/2; % otherwise bisect

end

elseif (limit > -0.5) && (z2+z1 > limit) % extraplation beyond max?

z2 = (limit-z1)/2; % bisect

elseif (limit < -0.5) && (z2+z1 > z1*EXT) % extrapolation beyond limit

z2 = z1*(EXT-1.0); % set to extrapolation limit

elseif z2 < -z3*INT

z2 = -z3*INT;

elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT)) % too close to limit?

z2 = (limit-z1)*(1.0-INT);

end

f3 = f2; d3 = d2; z3 = -z2; % set point 3 equal to point 2

z1 = z1 + z2; X = X + z2*s; % update current estimates

[f2 df2] = eval(argstr);

M = M - 1; i = i + (length<0); % count epochs?!

d2 = df2'*s;

end % end of line search

if success % if line search succeeded

f1 = f2; fX = [fX' f1]';

fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);

s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2; % Polack-Ribiere direction

tmp = df1; df1 = df2; df2 = tmp; % swap derivatives

d2 = df1'*s;

if d2 > 0 % new slope must be negative

s = -df1; % otherwise use steepest direction

d2 = -s'*s;

end

z1 = z1 * min(RATIO, d1/(d2-realmin)); % slope ratio but max RATIO

d1 = d2;

ls_failed = 0; % this line search did not fail

else

X = X0; f1 = f0; df1 = df0; % restore point from before failed line search

if ls_failed || i > abs(length) % line search failed twice in a row

break; % or we ran out of time, so we give up

end

tmp = df1; df1 = df2; df2 = tmp; % swap derivatives

s = -df1; % try steepest

d1 = -s'*s;

z1 = 1/(1-d1);

ls_failed = 1; % this line search failed

end

if exist('OCTAVE_VERSION')

fflush(stdout);

end

fprintf('\n');

多项式拟合则出现过拟合线性

function [X_poly] = polyFeatures(X, p)

%POLYFEATURES Maps X (1D vector) into the p-th power

% [X_poly] = POLYFEATURES(X, p) takes a data matrix X (size m x 1) and

% maps each example into its polynomial features where

% X_poly(i, :) = [X(i) X(i).^2 X(i).^3 ... X(i).^p];

% You need to return the following variables correctly.

X_poly = zeros(numel(X), p);

% ====================== YOUR CODE HERE ======================

% Instructions: Given a vector X, return a matrix X_poly where the p-th

% column of X contains the values of X to the p-th power.

for i=1:p

X_poly(:, i) = X(:, 1).^i;

end

归一化

function [X_norm, mu, sigma] = featureNormalize(X)

%FEATURENORMALIZE Normalizes the features in X

% FEATURENORMALIZE(X) returns a normalized version of X where

% the mean value of each feature is 0 and the standard deviation

% is 1. This is often a good preprocessing step to do when

% working with learning algorithms.

mu = mean(X);

X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);

X_norm = bsxfun(@rdivide, X_norm, sigma);

end

交叉验证程序

function [lambda_vec, error_train, error_val] = validationCurve(X, y, Xval, yval)

%VALIDATIONCURVE Generate the train and validation errors needed to

%plot a validation curve that we can use to select lambda

% [lambda_vec, error_train, error_val] = ...

% VALIDATIONCURVE(X, y, Xval, yval) returns the train

% and validation errors (in error_train, error_val)

% for different values of lambda. You are given the training set (X,

% y) and validation set (Xval, yval).

% Selected values of lambda (you should not change this)

lambda_vec = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10]';

% You need to return these variables correctly.

error_train = zeros(length(lambda_vec), 1);

error_val = zeros(length(lambda_vec), 1);

% ====================== YOUR CODE HERE ======================

% Instructions: Fill in this function to return training errors in

% error_train and the validation errors in error_val. The

% vector lambda_vec contains the different lambda parameters

% to use for each calculation of the errors, i.e,

% error_train(i), and error_val(i) should give

% you the errors obtained after training with

% lambda = lambda_vec(i)

% Note: You can loop over lambda_vec with the following:

% for i = 1:length(lambda_vec)

% lambda = lambda_vec(i);

% % Compute train / val errors when training linear

% % regression with regularization parameter lambda

% % You should store the result in error_train(i)

% % and error_val(i)

% ....

% end

for i = 1:length(lambda_vec)

[theta] = trainLinearReg(X, y, lambda_vec(i));

[J_train, grad] = linearRegCostFunction(X, y, theta, 0);

[J_val, grad] = linearRegCostFunction(Xval, yval, theta, 0);

error_train(i) = J_train;

error_val(i) = J_val;

end

图8.8 交叉验证曲线选择合适的正则化参数

图8.9 正则化参数为3时拟合多项式曲线

3：隐藏层用20个sigmoid神经元，一个线性输出神经元，拟合带噪声正弦信号

W1: [20×1 double]

W2: [1×20 double]

B1: [20×1 double]

B2: -0.1126

运行101次

不如early stoping 20次