Neural Networks: Deriving the Forward-Propagation and Backpropagation Formulas
Overview
For a plain neural network, backpropagation (BP) is the heart of training and a standard entry point into machine learning. Using a logistic-regression classification problem as the running example, we start from the simplest case of a single neuron and gradually extend to a full network, covering the chain-rule derivation behind BP and the vectorization of the forward and backward passes. Finally, as an exercise, we implement a basic neural network from scratch in Python.
Prerequisites: cost functions, logistic regression, linear algebra, multivariable calculus
Reference: "ML-AndrewNg"
Neuron
A single neuron is the simplest neural network: one layer of inputs and a single output.
【Forward Propagation】
\[z={{w}_{1}}\cdot {{x}_{1}}+{{w}_{2}}\cdot {{x}_{2}}+{{w}_{3}}\cdot {{x}_{3}}+b=\left[ \begin{matrix} {{w}_{1}} & {{w}_{2}} & {{w}_{3}} \\ \end{matrix} \right]\left[ \begin{matrix} {{x}_{1}} \\ {{x}_{2}} \\ {{x}_{3}} \\ \end{matrix} \right]+b \]
If the activation function is the sigmoid, then
\[\hat{y}=a=sigmoid(z) \]
【Cost Function】
\[J(W,b)=-[y\log (a)+(1-y)\log (1-a)] \]
【Computing delta】
\[delta=a-y=\hat{y}-y \]
【Computing the Partial Derivatives】
\[\frac{\partial J}{\partial w}=\frac{\partial J}{\partial a}\cdot \frac{\partial a}{\partial z}\cdot \frac{\partial z}{\partial w}=-\left(\frac{y}{a}+\frac{y-1}{1-a}\right)\cdot a(1-a)\cdot {{x}^{T}}=(a-y){{x}^{T}}=(\hat{y}-y){{x}^{T}} \]
\[\frac{\partial J}{\partial b}=\frac{\partial J}{\partial a}\cdot \frac{\partial a}{\partial z}\cdot \frac{\partial z}{\partial b}=a-y=\hat{y}-y \]
【Updating the Weights】
\[w = w-\alpha \cdot \frac{\partial J}{\partial w} \]
\[b = b-\alpha \cdot \frac{\partial J}{\partial b} \]
Extending to a Neural Network
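Before extending to a full network, the single-neuron update rule above can be sketched in a few lines of NumPy. This is a minimal sketch: the sample `x`, label `y`, learning rate, and iteration count are all illustrative choices, not values from the text.

```python
import numpy as np

# Minimal single-neuron gradient descent, following the formulas above.
# x, y, alpha and the iteration count are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5], [1.0], [-0.3]])   # 3 features, one sample (3, 1)
y = 1.0                                # binary label
w = np.zeros((1, 3))                   # weight row vector
b = 0.0
alpha = 0.1                            # learning rate

for _ in range(1000):
    z = w.dot(x) + b                   # forward pass: z = w.x + b
    a = sigmoid(z)                     # activation a = sigmoid(z)
    delta = a - y                      # delta = y_hat - y
    dw = delta * x.T                   # dJ/dw = (a - y) x^T
    db = delta                         # dJ/db = a - y
    w = w - alpha * dw                 # gradient-descent update
    b = b - alpha * db

print(float(sigmoid(w.dot(x) + b)))    # prediction moves toward the label
```

After enough iterations the prediction is pushed close to the label, which is exactly what the update rule is designed to do for a single training example.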
Assume the network structure shown in the figure above: 2 input features, a 3-class output, and two hidden layers with 4 units each. The data set, shown in the accompanying table, contains 20 samples.
【Input/Output Matrices and Weight Initialization】
\[X={{\left[ \begin{matrix} {{x}_{11}} & {{x}_{12}} & ... & {{x}_{1,20}} \\ {{x}_{21}} & {{x}_{22}} & ... & {{x}_{2,20}} \\ \end{matrix} \right]}_{2\times 20}} \quad Y={{\left[ \begin{matrix} 1 & 0 & ... & 1 \\ 0 & 1 & ... & 0 \\ 0 & 0 & ... & 0 \\ \end{matrix} \right]}_{3\times 20}} \]
\[{{W}^{(1)}}={{\left[ \begin{matrix} w_{11}^{(1)} & w_{12}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} \\ w_{41}^{(1)} & w_{42}^{(1)} \\ \end{matrix} \right]}_{4\times 2}} \quad {{W}^{(2)}}={{\left[ \begin{matrix} w_{11}^{(2)} & w_{12}^{(2)} & w_{13}^{(2)} & w_{14}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} & w_{23}^{(2)} & w_{24}^{(2)} \\ w_{31}^{(2)} & w_{32}^{(2)} & w_{33}^{(2)} & w_{34}^{(2)} \\ w_{41}^{(2)} & w_{42}^{(2)} & w_{43}^{(2)} & w_{44}^{(2)} \\ \end{matrix} \right]}_{4\times 4}} \quad {{W}^{(3)}}={{\left[ \begin{matrix} w_{11}^{(3)} & w_{12}^{(3)} & w_{13}^{(3)} & w_{14}^{(3)} \\ w_{21}^{(3)} & w_{22}^{(3)} & w_{23}^{(3)} & w_{24}^{(3)} \\ w_{31}^{(3)} & w_{32}^{(3)} & w_{33}^{(3)} & w_{34}^{(3)} \\ \end{matrix} \right]}_{3\times 4}} \]
【Cost Function】
\[J(W,b)=-\frac{1}{m}\sum\limits_{k=1}^{K}{\left\{ {{\left[ \sum\limits_{i=1}^{m}{{{y}_{i}}\log (a_{i}^{(L)})+(1-{{y}_{i}})\log (1-a_{i}^{(L)})} \right]}_{k}} \right\}}+\frac{\lambda }{2m}(||{{W}^{(1)}}|{{|}^{2}}+||{{W}^{(2)}}|{{|}^{2}}+...+||{{W}^{(L-1)}}|{{|}^{2}}) \]
(Here L is the total number of layers and K the number of output units; an L2 regularization term has been added.)
【Forward Propagation (FP)】
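Before vectorizing the forward pass, the regularized cost above can be computed directly in NumPy. This is a minimal sketch: the `cost` helper and the small `Y`, `A_L`, `Ws` arrays below are illustrative stand-ins, not part of the main program.

```python
import numpy as np

# Cross-entropy cost with L2 regularization, as in the formula above.
# Y and A_L are (K, m) label/output matrices; Ws are the weight matrices.
def cost(Y, A_L, Ws, reg_lambda):
    m = Y.shape[1]
    # cross-entropy term, summed over classes and samples
    ce = -np.sum(Y * np.log(A_L) + (1 - Y) * np.log(1 - A_L)) / m
    # L2 penalty: sum of squared weights over all weight matrices
    reg = reg_lambda / (2 * m) * sum(np.sum(W ** 2) for W in Ws)
    return ce + reg

Y = np.array([[1, 0], [0, 1], [0, 0]])                # 3 classes, 2 samples
A_L = np.array([[0.9, 0.1], [0.1, 0.8], [0.1, 0.1]])  # network outputs
Ws = [np.ones((4, 2)), np.ones((3, 4))]               # dummy weights
print(cost(Y, A_L, Ws, reg_lambda=0.0))
```

Note that increasing `reg_lambda` can only increase the cost, since the penalty term is a sum of squares.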
\[\begin{align} & A_{2\times 20}^{(1)}={{X}_{2\times 20}} \\ & Z_{4\times 20}^{(1)}=W_{4\times 2}^{(1)}A_{2\times 20}^{(1)}+{{b}_{1}} \\ & A_{4\times 20}^{(2)}=\sigma (Z_{4\times 20}^{(1)}) \\ & Z_{4\times 20}^{(2)}=W_{4\times 4}^{(2)}A_{4\times 20}^{(2)}+{{b}_{2}} \\ & A_{4\times 20}^{(3)}=\sigma (Z_{4\times 20}^{(2)}) \\ & Z_{3\times 20}^{(3)}=W_{3\times 4}^{(3)}A_{4\times 20}^{(3)}+{{b}_{3}} \\ & A_{3\times 20}^{(4)}=\sigma (Z_{3\times 20}^{(3)}) \\ \end{align} \]
【Chain-Rule Derivation】
\[\begin{align} & \frac{\partial J}{\partial {{W}^{(3)}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{W}^{(3)}}}=({{A}^{(4)}}-Y)\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{W}^{(3)}}}={{\delta }_{3}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{W}^{(3)}}} \\ & \frac{\partial J}{\partial {{b}_{3}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{b}_{3}}}=({{A}^{(4)}}-Y)\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{b}_{3}}}={{\delta }_{3}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{b}_{3}}} \\ & \\ & \frac{\partial J}{\partial {{W}^{(2)}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{W}^{(2)}}}=\frac{\partial J}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{W}^{(2)}}}={{\delta }_{2}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{W}^{(2)}}} \\ & \frac{\partial J}{\partial {{b}_{2}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{b}_{2}}}=\frac{\partial J}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{b}_{2}}}={{\delta }_{2}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{b}_{2}}} \\ & \\ & \frac{\partial J}{\partial {{W}^{(1)}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{A}^{(2)}}}\cdot \frac{\partial {{A}^{(2)}}}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{W}^{(1)}}}=\frac{\partial J}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{W}^{(1)}}}={{\delta }_{1}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{W}^{(1)}}} \\ & \frac{\partial J}{\partial {{b}_{1}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{A}^{(2)}}}\cdot \frac{\partial {{A}^{(2)}}}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{b}_{1}}}=\frac{\partial J}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{b}_{1}}}={{\delta }_{1}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{b}_{1}}} \\ \end{align} \]
【Backpropagation (BP)】
\[\begin{align} & {{\delta }_{3}}=\frac{\partial J}{\partial {{Z}^{(3)}}}={{A}^{(4)}}-Y \\ & {{\delta }_{2}}=\frac{\partial J}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{Z}^{(2)}}}={{W}^{(3)T}}{{\delta }_{3}}\odot \sigma '({{Z}^{(2)}})={{W}^{(3)T}}{{\delta }_{3}}\odot [{{A}^{(3)}}\odot (1-{{A}^{(3)}})] \\ & {{\delta }_{1}}=\frac{\partial J}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{Z}^{(1)}}}={{W}^{(2)T}}{{\delta }_{2}}\odot \sigma '({{Z}^{(1)}})={{W}^{(2)T}}{{\delta }_{2}}\odot [{{A}^{(2)}}\odot (1-{{A}^{(2)}})] \\ & ... \\ & {{\delta }_{l}}={{W}^{(l+1)T}}{{\delta }_{l+1}}\odot \sigma '({{Z}^{(l)}})={{W}^{(l+1)T}}{{\delta }_{l+1}}\odot [{{A}^{(l+1)}}\odot (1-{{A}^{(l+1)}})] \\ & \\ & \frac{\partial J}{\partial {{W}^{(l)}}}={{\delta }_{l}}\frac{\partial {{Z}^{(l)}}}{\partial {{W}^{(l)}}}={{\delta }_{l}}{{A}^{(l)T}} \\ & \frac{\partial J}{\partial {{b}_{l}}}={{\delta }_{l}} \\ \end{align} \]
Code Example
Using the 20 samples as an example, we add regularization and iterate to optimize the weights. The program below simulates the full training loop and its results, covering the basic building blocks: activation function, feature scaling, forward propagation, and backpropagation.
【Code (Python)】
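Before the full program, the backpropagation formulas can be sanity-checked against finite differences. This is a minimal single-layer sketch: the shapes, random seed, and data are illustrative, and the analytic gradient `dW = (1/m)(A - Y)X^T` follows the derivation above.

```python
import numpy as np

# Gradient check for one sigmoid layer with cross-entropy cost:
# compare the analytic gradient against centered finite differences.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))                # 2 features, 5 samples
Y = (rng.random((1, 5)) > 0.5).astype(float)   # binary labels
W = rng.standard_normal((1, 2))
b = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(W):
    A = sigmoid(W.dot(X) + b)
    m = X.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# analytic gradient from the BP formulas: dW = (1/m) * delta @ A_prev^T
A = sigmoid(W.dot(X) + b)
dW = (A - Y).dot(X.T) / X.shape[1]

# numerical gradient by centered differences
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (cost(Wp) - cost(Wm)) / (2 * eps)

print(np.max(np.abs(dW - num)))  # should be tiny if the derivation is right
```

A maximum discrepancy on the order of machine precision confirms that the analytic gradient matches the cost function.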
# BP
# 2021.03.10
# import
import numpy as np

# load data
def loaddata(filename):
    x = []
    y = []
    with open(filename) as f:
        lines = f.readlines()
        for line in lines:
            line = line.strip().split()  # split the row into fields
            x.append([float(line[0]), float(line[1])])
            y.append([float(line[2]), float(line[3]), float(line[4])])
    return np.array(x).T, np.array(y).T

# feature scaling (min-max, per feature row)
def scale(data):
    data_min = np.min(data, 1).reshape(-1, 1)
    data_max = np.max(data, 1).reshape(-1, 1)
    scaled = (data - data_min) / (data_max - data_min)
    return scaled

# sigmoid activation function
def sigmoid(x):
    y = 1 / (1 + np.exp(-x))
    return y

# train: update W, b by gradient descent
def calc(X, hidden_unit, output_unit, Y):
    '''args: input matrix, hidden units, output units, labels Y'''
    # hyperparameters
    alpha = 0.3         # learning rate
    reg_lambda = 0.08   # regularization parameter
    epoch = 10000       # number of iterations
    X_dim = X.shape[0]  # input feature dimension
    m = X.shape[1]      # number of samples
    # initialize W, b
    W1 = np.random.randn(hidden_unit, X_dim)
    W2 = np.random.randn(hidden_unit, hidden_unit)
    W3 = np.random.randn(output_unit, hidden_unit)
    b1 = np.random.randn(hidden_unit, 1)
    b2 = np.random.randn(hidden_unit, 1)
    b3 = np.random.randn(output_unit, 1)
    for i in range(0, epoch):
        # forward propagation
        A1 = X
        Z1 = W1.dot(A1) + b1
        A2 = sigmoid(Z1)
        Z2 = W2.dot(A2) + b2
        A3 = sigmoid(Z2)
        Z3 = W3.dot(A3) + b3
        A4 = sigmoid(Z3)
        # compute the deltas
        delta3 = A4 - Y
        delta2 = W3.T.dot(delta3) * (A3 * (1 - A3))
        delta1 = W2.T.dot(delta2) * (A2 * (1 - A2))
        # gradients (bias gradients are summed over the samples so the
        # b vectors keep their (units, 1) shape)
        dW3 = (1/m) * delta3.dot(A3.T) + (reg_lambda/m) * W3
        db3 = (1/m) * np.sum(delta3, axis=1, keepdims=True)
        dW2 = (1/m) * delta2.dot(A2.T) + (reg_lambda/m) * W2
        db2 = (1/m) * np.sum(delta2, axis=1, keepdims=True)
        dW1 = (1/m) * delta1.dot(A1.T) + (reg_lambda/m) * W1
        db1 = (1/m) * np.sum(delta1, axis=1, keepdims=True)
        # gradient-descent update of W, b
        W3 = W3 - alpha * dW3
        b3 = b3 - alpha * db3
        W2 = W2 - alpha * dW2
        b2 = b2 - alpha * db2
        W1 = W1 - alpha * dW1
        b1 = b1 - alpha * db1
    return W1, W2, W3, b1, b2, b3, A4

# main
if __name__ == "__main__":
    filename = "data.txt"
    X, Y = loaddata(filename)  # (2, m), (3, m)
    X = scale(X)
    W1, W2, W3, b1, b2, b3, Y_hat = calc(X, 4, 3, Y)
    print('============== predictions ==============')
    print(Y_hat)
    print('==============   labels    ==============')
    print(Y)
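The loader above expects a whitespace-separated data.txt with five columns per row: two features followed by a three-element one-hot label. The snippet below is a hypothetical generator for such a file; the cluster centers, noise level, and seed are illustrative assumptions, not values from the text.

```python
import numpy as np

# Write a synthetic, whitespace-separated data file in the format
# loaddata() expects: x1 x2 y1 y2 y3 (one-hot label), 20 rows.
rng = np.random.default_rng(1)
centers = [(0, 0), (3, 3), (0, 3)]  # one illustrative cluster per class
with open("data.txt", "w") as f:
    for i in range(20):
        k = i % 3                   # cycle through the 3 classes
        cx, cy = centers[k]
        x1 = rng.normal(cx, 0.5)    # feature 1 near the class center
        x2 = rng.normal(cy, 0.5)    # feature 2 near the class center
        label = [0, 0, 0]
        label[k] = 1                # one-hot encode the class
        f.write(f"{x1:.4f} {x2:.4f} {label[0]} {label[1]} {label[2]}\n")
```

With such a file in place, running the main program trains the network on the 20 generated samples.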
【Output】
As the output shows, the predicted values are very close to the labels: after the iterations, the predictions match their targets.