Neural Networks: Deriving Forward Propagation and Backpropagation


Overview

For a basic neural network, backpropagation (BP) is the heart of training and a standard entry-point algorithm in machine learning. Taking logistic-regression classification as the running example, we start from the simplest case of a single neuron and then extend step by step to a full network, covering the chain-rule derivation behind BP and the vectorization of forward and backward propagation. Finally, as an exercise, we implement a bare-bones neural network in Python.

Prerequisites: cost functions, logistic regression, linear algebra, multivariate calculus
Reference: "ML-AndrewNg"

The Neuron

A single neuron is the simplest neural network: one layer of inputs and a single output.

【Forward Propagation】

\[z={{w}_{1}}\cdot {{x}_{1}}+{{w}_{2}}\cdot {{x}_{2}}+{{w}_{3}}\cdot {{x}_{3}}+b=\left[ \begin{matrix} {{w}_{1}} & {{w}_{2}} & {{w}_{3}} \\ \end{matrix} \right]\left[ \begin{matrix} {{x}_{1}} \\ {{x}_{2}} \\ {{x}_{3}} \\ \end{matrix} \right]+b \]

If the activation function is the sigmoid function, then

\[\hat{y}=a =sigmoid(z) \]
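The forward pass above can be sketched in a few lines of NumPy. The weights, bias, and input values here are hypothetical, chosen only to illustrate the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# hypothetical weights, bias, and a single 3-feature input
w = np.array([0.5, -0.3, 0.8])   # shape (3,)
b = 0.1
x = np.array([1.0, 2.0, 0.5])    # shape (3,)

z = w.dot(x) + b                 # z = w1*x1 + w2*x2 + w3*x3 + b
a = sigmoid(z)                   # predicted probability y_hat, in (0, 1)
```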

【Cost Function】

\[J(W,b)=-[y\log (a)+(1-y)\log (1-a)] \]

【Computing the Error Term δ】

\[\delta =a-y=\hat{y}-y \]

【Computing the Partial Derivatives】

\[\frac{\partial J}{\partial w}=\frac{\partial J}{\partial a}\cdot \frac{\partial a}{\partial z}\cdot \frac{\partial z}{\partial w}=-\left( \frac{y}{a}+\frac{y-1}{1-a} \right)a(1-a){{x}^{T}}=(a-y){{x}^{T}}=(\hat{y}-y){{x}^{T}} \]

\[\frac{\partial J}{\partial b}=\frac{\partial J}{\partial a}\cdot \frac{\partial a}{\partial z}\cdot \frac{\partial z}{\partial b}=a-y=\hat{y}-y \]

【Updating the Weights】

\[w = w-\alpha*\frac{\partial J}{\partial w} \]

\[b = b-\alpha*\frac{\partial J}{\partial b} \]
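The derived gradients can be verified numerically before trusting the update rule. The sketch below (hypothetical single sample and weights, not from the article's dataset) compares the closed-form gradient (a − y)x against central differences, then takes one gradient-descent step:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b, x, y):
    # cross-entropy cost for a single sample
    a = sigmoid(w.dot(x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# hypothetical sample and parameters
x = np.array([0.2, -0.4, 0.7])
y = 1.0
w = np.array([0.1, 0.3, -0.2])
b = 0.05

# analytic gradients from the derivation: dJ/dw = (a - y) x,  dJ/db = a - y
a = sigmoid(w.dot(x) + b)
dw = (a - y) * x
db = a - y

# numerical gradient by central differences
eps = 1e-6
num_dw = np.zeros_like(w)
for i in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[i] += eps
    wm[i] -= eps
    num_dw[i] = (cost(wp, b, x, y) - cost(wm, b, x, y)) / (2 * eps)

# the analytic and numerical gradients should agree to ~1e-9

# one gradient-descent step with learning rate alpha
alpha = 0.3
w = w - alpha * dw
b = b - alpha * db
```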

Extending to a Neural Network

Suppose the network is structured as in the figure above: 2 input features, a 3-class output, and 4 units in each hidden layer; the dataset, shown in the table, contains 20 samples.

【Input/Output Matrices and Initial Weights】

\[X={{\left[ \begin{matrix} {{x}_{11}} & {{x}_{12}} & ... & {{x}_{1,20}} \\ {{x}_{21}} & {{x}_{22}} & ... & {{x}_{2,20}} \\ \end{matrix} \right]}_{2\times 20}} Y={{\left[ \begin{matrix} 1 & 0 & ... & 1 \\ 0 & 1 & ... & 0 \\ 0 & 0 & ... & 0 \\ \end{matrix} \right]}_{3\times 20}} \]

\[{{W}^{(1)}}={{\left[ \begin{matrix} w_{11}^{(1)} & w_{12}^{(1)} \\ w_{21}^{(1)} & w_{22}^{(1)} \\ w_{31}^{(1)} & w_{32}^{(1)} \\ w_{41}^{(1)} & w_{42}^{(1)} \\ \end{matrix} \right]}_{4\times 2}} {{W}^{(2)}}={{\left[ \begin{matrix} w_{11}^{(2)} & w_{12}^{(2)} & w_{13}^{(2)} & w_{14}^{(2)} \\ w_{21}^{(2)} & w_{22}^{(2)} & w_{23}^{(2)} & w_{24}^{(2)} \\ w_{31}^{(2)} & w_{32}^{(2)} & w_{33}^{(2)} & w_{34}^{(2)} \\ w_{41}^{(2)} & w_{42}^{(2)} & w_{43}^{(2)} & w_{44}^{(2)} \\ \end{matrix} \right]}_{4\times 4}} {{W}^{(3)}}={{\left[ \begin{matrix} w_{11}^{(3)} & w_{12}^{(3)} & w_{13}^{(3)} & w_{14}^{(3)} \\ w_{21}^{(3)} & w_{22}^{(3)} & w_{23}^{(3)} & w_{24}^{(3)} \\ w_{31}^{(3)} & w_{32}^{(3)} & w_{33}^{(3)} & w_{34}^{(3)} \\ \end{matrix} \right]}_{3\times 4}} \]
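The original data file is not reproduced here. For experimentation, a synthetic dataset with the same shapes (entirely hypothetical, randomly generated) can stand in for it:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20                                   # number of samples

X = rng.standard_normal((2, m))          # 2 features x 20 samples
labels = rng.integers(0, 3, size=m)      # a random class in {0, 1, 2} per sample
Y = np.zeros((3, m))
Y[labels, np.arange(m)] = 1.0            # one-hot encoding, 3 x 20
```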

【Cost Function】

\[\begin{align} & J(W,b)=-\frac{1}{m}\sum\limits_{k=1}^{K}{\left\{ {{\left[ \sum\limits_{i=1}^{m}{{{y}_{i}}\log (a_{i}^{(L)})+(1-{{y}_{i}})\log (1-a_{i}^{(L)})} \right]}_{k}} \right\}}+\frac{\lambda }{2m}(||{{W}^{(1)}}|{{|}^{2}}+||{{W}^{(2)}}|{{|}^{2}}+...+||{{W}^{(L-1)}}|{{|}^{2}}) \\ & (\text{where }L\text{ is the total number of layers and }K\text{ the number of output units; the second term is the L2 regularization}) \\ \end{align} \]
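A sketch of this regularized cost in NumPy. The `cost` helper, the `weights` list, and the demo values are assumptions for illustration, not part of the article's code:

```python
import numpy as np

def cost(AL, Y, weights, reg_lambda):
    """Cross-entropy over all K output units plus an L2 penalty on every W."""
    m = Y.shape[1]
    eps = 1e-12  # guard against log(0)
    ce = -np.sum(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps)) / m
    l2 = reg_lambda / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    return ce + l2

# demo: perfect one-hot predictions give near-zero cross-entropy,
# so only the L2 term remains
Y_demo = np.eye(3)                 # 3 samples, one-hot labels
weights = [np.ones((4, 2))]        # hypothetical weight list
j = cost(Y_demo, Y_demo, weights, 0.1)
```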

【Forward Propagation (FP)】

\[\begin{align} & A_{2\times 20}^{(1)}={{X}_{2\times 20}} \\ & Z_{4\times 20}^{(1)}=W_{4\times 2}^{(1)}A_{2\times 20}^{(1)}+{{b}_{1}} \\ & A_{4\times 20}^{(2)}=\sigma (Z_{4\times 20}^{(1)}) \\ & Z_{4\times 20}^{(2)}=W_{4\times 4}^{(2)}A_{4\times 20}^{(2)}+{{b}_{2}} \\ & A_{4\times 20}^{(3)}=\sigma (Z_{4\times 20}^{(2)}) \\ & Z_{3\times 20}^{(3)}=W_{3\times 4}^{(3)}A_{4\times 20}^{(3)}+{{b}_{3}} \\ & A_{3\times 20}^{(4)}=\sigma (Z_{3\times 20}^{(3)}) \\ \end{align} \]

【Chain-Rule Derivation】

\[\begin{align} & \frac{\partial J}{\partial {{W}^{(3)}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{W}^{(3)}}}=({{A}^{(4)}}-Y)\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{W}^{(3)}}}={{\delta }_{3}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{W}^{(3)}}} \\ & \frac{\partial J}{\partial {{b}_{3}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{b}_{3}}}=({{A}^{(4)}}-Y)\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{b}_{3}}}={{\delta }_{3}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{b}_{3}}} \\ & \\ & \frac{\partial J}{\partial {{W}^{(2)}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{W}^{(2)}}}=\frac{\partial J}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{W}^{(2)}}}={{\delta }_{2}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{W}^{(2)}}} \\ & \frac{\partial J}{\partial {{b}_{2}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{b}_{2}}}=\frac{\partial J}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{b}_{2}}}={{\delta }_{2}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{b}_{2}}} \\ & \\ & \frac{\partial J}{\partial {{W}^{(1)}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{A}^{(2)}}}\cdot \frac{\partial {{A}^{(2)}}}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{W}^{(1)}}}=\frac{\partial J}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{W}^{(1)}}}={{\delta }_{1}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{W}^{(1)}}} \\ & \frac{\partial J}{\partial {{b}_{1}}}=\frac{\partial J}{\partial {{A}^{(4)}}}\cdot \frac{\partial {{A}^{(4)}}}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{A}^{(3)}}}\cdot \frac{\partial {{A}^{(3)}}}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{A}^{(2)}}}\cdot \frac{\partial {{A}^{(2)}}}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{b}_{1}}}=\frac{\partial J}{\partial {{Z}^{(1)}}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{b}_{1}}}={{\delta }_{1}}\cdot \frac{\partial {{Z}^{(1)}}}{\partial {{b}_{1}}} \\ \end{align} \]

【Backpropagation (BP)】

\[\begin{align} & {{\delta }_{3}}=\frac{\partial J}{\partial {{Z}^{(3)}}}={{A}^{(4)}}-Y \\ & {{\delta }_{2}}=\frac{\partial J}{\partial {{Z}^{(3)}}}\cdot \frac{\partial {{Z}^{(3)}}}{\partial {{Z}^{(2)}}}={{W}^{(3)T}}{{\delta }_{3}}\odot \sigma '({{Z}^{(2)}})={{W}^{(3)T}}{{\delta }_{3}}\odot [{{A}^{(3)}}\odot (1-{{A}^{(3)}})] \\ & {{\delta }_{1}}=\frac{\partial J}{\partial {{Z}^{(2)}}}\cdot \frac{\partial {{Z}^{(2)}}}{\partial {{Z}^{(1)}}}={{W}^{(2)T}}{{\delta }_{2}}\odot \sigma '({{Z}^{(1)}})={{W}^{(2)T}}{{\delta }_{2}}\odot [{{A}^{(2)}}\odot (1-{{A}^{(2)}})] \\ & ... \\ & {{\delta }_{l}}={{W}^{(l+1)T}}{{\delta }_{l+1}}\odot \sigma '({{Z}^{(l)}})={{W}^{(l+1)T}}{{\delta }_{l+1}}\odot [{{A}^{(l+1)}}\odot (1-{{A}^{(l+1)}})] \\ & \\ & \frac{\partial J}{\partial {{W}^{(l)}}}={{\delta }_{l}}\frac{\partial {{Z}^{(l)}}}{\partial {{W}^{(l)}}}={{\delta }_{l}}{{A}^{(l)T}} \\ & \frac{\partial J}{\partial {{b}_{l}}}={{\delta }_{l}} \\ \end{align} \]
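The δ recursion can be sanity-checked against numerical differentiation. The sketch below uses random hypothetical weights and an unregularized cost, and verifies one entry of ∂J/∂W^(3) by central differences:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2, W3, b3):
    # three-layer forward pass as in the FP section
    A2 = sigmoid(W1.dot(X) + b1)
    A3 = sigmoid(W2.dot(A2) + b2)
    A4 = sigmoid(W3.dot(A3) + b3)
    return A2, A3, A4

def cost(A4, Y):
    # unregularized cross-entropy cost
    m = Y.shape[1]
    return -np.sum(Y * np.log(A4) + (1 - Y) * np.log(1 - A4)) / m

rng = np.random.default_rng(1)
m = 5
X = rng.standard_normal((2, m))
Y = np.zeros((3, m))
Y[rng.integers(0, 3, m), np.arange(m)] = 1.0
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 1))
W3, b3 = rng.standard_normal((3, 4)), rng.standard_normal((3, 1))

# analytic gradient via the delta recursion: dJ/dW3 = delta3 A3^T / m
A2, A3, A4 = forward(X, W1, b1, W2, b2, W3, b3)
delta3 = A4 - Y
dW3 = delta3.dot(A3.T) / m

# numerical check of one entry of dW3 by central differences
eps = 1e-6
Wp, Wm = W3.copy(), W3.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num = (cost(forward(X, W1, b1, W2, b2, Wp, b3)[2], Y)
       - cost(forward(X, W1, b1, W2, b2, Wm, b3)[2], Y)) / (2 * eps)
# abs(dW3[0, 0] - num) should be near zero
```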

Code Example

Using the 20 samples and regularization, we iterate to optimize the weights. The code below simulates the training loop and its result, covering the basic steps: the activation function, feature scaling, forward propagation, and backward propagation.

【Code (Python)】

# BP
# 2021.03.10

# import
import numpy as np

# load data from a whitespace-separated text file
def loaddata(filename):
    x = []
    y = []
    with open(filename) as f:
        lines = f.readlines()
        for line in lines:
            line = line.strip().split() # split the row into fields
            x.append([float(line[0]), float(line[1])])
            y.append([float(line[2]), float(line[3]), float(line[4])])

    return np.array(x).T, np.array(y).T

# min-max feature scaling, per feature (row)
def scale(data):
    data_min = np.min(data, 1).reshape(-1, 1)
    data_max = np.max(data, 1).reshape(-1, 1)
    scaled = (data - data_min) / (data_max - data_min)

    return scaled

# sigmoid activation
def sigmoid(x):
    y = 1/(1 + np.exp(-x))

    return y

# learn W, b by gradient descent
def calc(X, hidden_unit, output_unit, Y):
    '''args: input matrix, number of hidden units, number of output units, labels Y'''
    # hyperparameters
    alpha = 0.3         # learning rate
    reg_lambda = 0.08   # regularization strength
    epoch = 10000       # number of iterations
    X_dim = X.shape[0]  # input feature dimension
    m = X.shape[1]      # number of samples

    # initialize W, b
    W1 = np.random.randn(hidden_unit, X_dim)
    W2 = np.random.randn(hidden_unit, hidden_unit)
    W3 = np.random.randn(output_unit, hidden_unit)
    b1 = np.random.randn(hidden_unit, 1)
    b2 = np.random.randn(hidden_unit, 1)
    b3 = np.random.randn(output_unit, 1)

    for i in range(0, epoch):
        # forward propagation
        A1 = X
        Z1 = W1.dot(A1) + b1
        A2 = sigmoid(Z1)
        Z2 = W2.dot(A2) + b2
        A3 = sigmoid(Z2)
        Z3 = W3.dot(A3) + b3
        A4 = sigmoid(Z3)

        # error terms (deltas)
        delta3 = A4 - Y
        delta2 = W3.T.dot(delta3) * (A3*(1-A3))
        delta1 = W2.T.dot(delta2) * (A2*(1-A2))

        # gradients via the chain rule (bias gradients summed over samples)
        dW3 = (1/m) * delta3.dot(A3.T) + (reg_lambda/m)*W3
        db3 = (1/m) * np.sum(delta3, axis=1, keepdims=True)
        dW2 = (1/m) * delta2.dot(A2.T) + (reg_lambda/m)*W2
        db2 = (1/m) * np.sum(delta2, axis=1, keepdims=True)
        dW1 = (1/m) * delta1.dot(A1.T) + (reg_lambda/m)*W1
        db1 = (1/m) * np.sum(delta1, axis=1, keepdims=True)

        # backpropagation: update W, b
        W3 = W3 - alpha * dW3
        b3 = b3 - alpha * db3
        W2 = W2 - alpha * dW2
        b2 = b2 - alpha * db2
        W1 = W1 - alpha * dW1
        b1 = b1 - alpha * db1

    return W1, W2, W3, b1, b2, b3, A4

# main
if __name__ == "__main__":
    filename = "data.txt"
    X, Y = loaddata(filename)   # (2,m), (3,m)
    X = scale(X)
    W1, W2, W3, b1, b2, b3, Y_hat = calc(X, 4, 3, Y)
    print('============ predictions ============')
    print(Y_hat)
    print('============ labels ============')
    print(Y)

【Results】

As the figure shows, the predictions and labels are very close, indicating that after iteration the network's outputs agree with the targets.
