Deep Learning Week 6 Notes


1. Benefits of depth

\(\text{Consider ReLU MLPs with a single input/output. There exists a network }f\text{ with }D^*\text{ layers and }2D^*\text{ internal units such that, for any network }g\text{ with }D\text{ layers of sizes }\{W^{(1)},...,W^{(D)}\}\text{, since the number of linear pieces of }g\text{ satisfies }k(g)\leq 2^D \prod_{d=1}^D W^{(d)}\text{:}\)

\[\begin{align} ||f-g||_1\geq 1-\frac{2^D}{2^{D^*}}\prod_{d=1}^DW^{(d)} \end{align} \]

\(\text{In particular, with }g\text{ a single hidden layer network:}\)

\[||f-g||_1\geq 1-2\frac{W^{(1)}}{2^{D^*}} \]

\(\textbf{To approximate }f\textbf{ properly, the width }W^{(1)}\textbf{ of }g\textbf{'s hidden layer has to increase exponentially with }f\textbf{'s depth }D^*.\)
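For instance (a direct consequence of the single-hidden-layer bound above, with illustrative numbers), achieving \(||f-g||_1\leq \frac{1}{2}\) requires

\[\begin{align} 1-2\frac{W^{(1)}}{2^{D^*}}\leq \frac{1}{2}\ \Longrightarrow\ W^{(1)}\geq 2^{D^*-2}, \end{align} \]

i.e. the required width doubles with every extra layer of \(f\); \(D^*=20\) already forces \(W^{(1)}\geq 2^{18}\).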

2. Rectifiers

\(\text{The derivative of }\tanh\text{ has an exponential tail on both sides and collapses to 0 very quickly, while ReLU keeps the gradient of positive activations unchanged, which often correspond to about half of the units.}\)
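
A quick numerical illustration (a minimal sketch, with arbitrary sample points):

import torch

x = torch.tensor([0.5, 2.0, 5.0], requires_grad = True)
torch.tanh(x).sum().backward()
print(x.grad)     # 1 - tanh(x)^2: roughly [0.79, 0.071, 0.00018], collapsing fast

x.grad = None     # reset the accumulated gradient
torch.relu(x).sum().backward()
print(x.grad)     # [1., 1., 1.]: positive activations pass the gradient through unchanged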

Leaky-ReLU

\[\begin{align} \max(ax,x) \end{align} \]

where \(0\leq a< 1\)
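
In PyTorch this is nn.LeakyReLU, whose negative_slope argument plays the role of \(a\) (the value below is arbitrary):

import torch
from torch import nn

leaky = nn.LeakyReLU(negative_slope = 0.1)  # a = 0.1, chosen arbitrarily
x = torch.tensor([-2.0, -0.5, 0.0, 3.0])
print(leaky(x))                 # -> [-0.2, -0.05, 0., 3.]
print(torch.max(0.1 * x, x))    # identical: max(ax, x) computed directly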

3. Dropout

Suppose each unit is dropped with probability \(p\); then, in expectation:

\[\begin{align} \mathbb{E}(X) = (1-p)X+p\cdot 0 \end{align} \]

Hence, to keep the expectation unchanged, it suffices to multiply the kept activations by \(\frac{1}{1-p}\) at training time and to leave the network unchanged at test time. This method is called \(\text{Inverted Dropout}\).

>>> import torch
>>> from torch import nn
>>> x = torch.full((3, 5), 1.0).requires_grad_()
>>> x
tensor([[ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.]])
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y
tensor([[ 0., 0., 4., 0., 4.],
        [ 0., 4., 4., 4., 0.],
        [ 0., 0., 4., 0., 0.]])
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad
tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284],
        [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000],
        [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]])
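
The surviving entries equal \(4.0\), which is exactly the \(\frac{1}{1-p}\) factor with \(p=0.75\). A hand-rolled sketch of inverted dropout (functionally what nn.Dropout does in training mode):

import torch

def inverted_dropout(x, p = 0.75):
    # keep each unit with probability 1 - p, then rescale the survivors by 1 / (1 - p)
    mask = (torch.rand_like(x) > p).float()
    return mask * x / (1 - p)

x = torch.full((3, 5), 1.0)
print(inverted_dropout(x))   # surviving entries equal 4.0, as with nn.Dropout(p = 0.75) above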

\(\text{Simply add dropout layers:}\)

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                      nn.Dropout(),
                      nn.Linear(100, 50), nn.ReLU(),
                      nn.Dropout(),
                      nn.Linear(50, 2))
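
Note that nn.Dropout uses \(p=0.5\) by default, and that dropout is only active in training mode; switching the model to evaluation mode turns the dropout layers into the identity, consistent with keeping the network unchanged at test time:

model.eval()    # dropout layers act as the identity
model.train()   # dropout is active again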

4. Batch Normalization

\(\text{Batch normalization forces the activation statistics during the forward pass by re-normalizing them.}\)
\(\\\)

Motivation:

\(\large\textbf{If the statistics of the activations are not controlled during training, a layer will have to adapt to the changes of the activations computed by the previous layers in addition to making changes to its own output to reduce the loss.}\)

\(\\\)
\(\text{During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.}\)

\(x_b\in \mathbb{R}^D,\ b=1,...,B\text{ are the samples in the batch; the empirical mean and variance are:}\)

\[\begin{align} \hat{m}&=\frac{1}{B}\sum_{b=1}^Bx_b\\ \hat{v}&=\frac{1}{B}\sum_{b=1}^B(x_b-\hat{m})^2 \end{align} \]

\(\text{Then do the normalization:}\)

\[\begin{align} z_b&=\frac{x_b-\hat{m}}{\sqrt{\hat{v}+\epsilon}}\\ y_b&=\gamma\odot z_b+\beta \end{align} \]

\(\text{where }z_b,y_b,\gamma,\beta \in \mathbb{R}^D\)

\(\\\)
\(\large\textbf{During inference: }\text{batch normalization shifts and rescales each component of the input }x\text{ independently, according to statistics estimated during training:}\)

\[\begin{align} y = \gamma\odot \frac{x-\hat{m}}{\sqrt{\hat{v}+\epsilon}}+\beta \end{align} \]

>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
...     bn.bias.copy_(torch.tensor([2., 4., 8.]))
...     bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.], requires_grad=True)
Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
>>> x = torch.randn(1000, 3)
>>> x = x * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9669, 25.0213, 2.4361])
>>> x.std(0)
tensor([1.9063, 5.0764, 9.7474])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000], grad_fn=)
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015], grad_fn=)
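
The transcript above runs bn in training mode, so the batch statistics are used. For the inference behaviour described earlier, the module must be switched to evaluation mode, where it relies on the running estimates of \(\hat{m}\) and \(\hat{v}\) accumulated during training (a minimal sketch, reusing bn and x from above):

bn.eval()                  # use the running estimates bn.running_mean / bn.running_var
with torch.no_grad():
    y_test = bn(x)         # each component shifted and rescaled with training-time statistics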

5. Layer Normalization

\(\text{Given a single sample }x\in\mathbb{R}^D,\text{ layer normalization normalizes the components of }x:\)

\[\begin{align} \mu&=\frac{1}{D}\sum_{d=1}^Dx_d\\ \sigma&=\sqrt{\frac{1}{D}\sum_{d=1}^D(x_d-\mu)^2}\\ \forall d, y_d&=\frac{x_d-\mu}{\sigma} \end{align} \]
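
A minimal sketch checking these formulas against nn.LayerNorm (affine parameters disabled so that it matches the plain normalization above; the tiny discrepancy comes from the \(\epsilon\) that nn.LayerNorm adds under the square root):

import torch
from torch import nn

x = torch.randn(4, 10)                               # 4 samples with D = 10
ln = nn.LayerNorm(10, elementwise_affine = False)

mu = x.mean(dim = 1, keepdim = True)
sigma = x.std(dim = 1, unbiased = False, keepdim = True)
y_manual = (x - mu) / sigma

print(torch.allclose(ln(x), y_manual, atol = 1e-3))  # True (tiny differences due to eps)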

6. ResNet

import torch
from torch import nn
from torch.nn import functional as F

class ResBlock(nn.Module):
    def __init__(self, nb_channels, kernel_size):
        super().__init__()
        # padding keeps the spatial size unchanged, so that y += x is valid
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn2 = nn.BatchNorm2d(nb_channels)

    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = F.relu(y)
        y = self.bn2(self.conv2(y))
        y += x          # identity pass-through (the residual connection)
        y = F.relu(y)
        return y

\(\text{Stacking ResBlocks to build a ResNet:}\)

class ResNet(nn.Module):
    def __init__(self, nb_channels, kernel_size, nb_blocks):
        super().__init__()
        self.conv0 = nn.Conv2d(1, nb_channels, kernel_size = 1)
        self.resblocks = nn.Sequential(
            # A bit of fancy Python: unpack a generator of blocks
            *(ResBlock(nb_channels, kernel_size) for _ in range(nb_blocks))
        )
        # global average pooling (assumes 28x28 inputs, e.g. MNIST)
        self.avg = nn.AvgPool2d(kernel_size = 28)
        self.fc = nn.Linear(nb_channels, 10)

    def forward(self, x):
        x = F.relu(self.conv0(x))
        x = self.resblocks(x)
        x = F.relu(self.avg(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
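
A quick shape check on MNIST-sized inputs (the hyper-parameter values below are arbitrary):

model = ResNet(nb_channels = 16, kernel_size = 3, nb_blocks = 4)
x = torch.randn(8, 1, 28, 28)        # a batch of 8 single-channel 28x28 images
print(model(x).shape)                # torch.Size([8, 10])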

\(\textbf{Veit et al. (2016) interpret a residual network as an ensemble, which partly explains its stability.}\)

\(e.g. \text{ with 3 blocks we have:}\)

\[\begin{align} x_1 &= x_0+f_1(x_0)\\ x_2 &= x_1+f_2(x_1)\\ x_3 &= x_2+f_3(x_2) \end{align} \]

\(\text{Hence there are 4 paths:}\)

\[\begin{align} x_3 &= x_2+f_3(x_2)\\ &=x_1+f_2(x_1)+f_3(x_1+f_2(x_1))\\ &=x_0+f_1(x_0)+f_2(x_0+f_1(x_0))+f_3(x_0+f_1(x_0)+f_2(x_0+f_1(x_0))) \end{align} \]

  • \(\textbf{(1) Performance reduction correlates with the number of paths removed from the ensemble, not with the number of blocks removed.}\)
  • \(\textbf{(2) Only gradients through shallow paths matter during training.}\)

\(\\\)
\(\large\textbf{To summarize:}\)

  • \(\text{ReLU to prevent the gradient from vanishing during the backward pass}\)
  • \(\text{Dropout to force a distributed representation}\)
  • \(\text{Batch Normalization to dynamically maintain the statistics of activations}\)
  • \(\text{Identity pass-through to keep a structured gradient and distribute the representation}\)
  • \(\text{Smart initialization to put the gradient in a good regime}\)
