Deep Learning Week 6 Notes
1. Benefits of depth
\(\text{Consider ReLU MLPs with a single input/output. There exists a network }f\text{ with }D^*\text{ layers and }2D^*\text{ internal units such that, for any network }g\text{ with }D\text{ layers of sizes }\{W^{(1)},\dots,W^{(D)}\}\text{, since the number of linear pieces of }g\text{ satisfies }k(g)\leq 2^D \prod_{d=1}^D W^{(d)}\text{:}\)
\[\begin{align} ||f-g||_1\geq 1-\frac{2^D}{2^{D^*}}\prod_{d=1}^DW^{(d)} \end{align} \]\(\text{In particular, with }g\text{ a single-hidden-layer network:}\)
\[||f-g||_1\geq 1-2\frac{W^{(1)}}{2^{D^*}} \]\(\textbf{To approximate }f\textbf{ properly, the width }W^{(1)}\textbf{ of }g\textbf{'s hidden layer has to increase exponentially with }f\textbf{'s depth }D^*.\)
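As a worked instance of the single-hidden-layer bound (my own numeric example, not from the notes), take \(D^*=20\):
\[\begin{align} ||f-g||_1\geq 1-\frac{2W^{(1)}}{2^{20}}\geq\frac{1}{2}\quad\text{unless}\quad W^{(1)}>2^{18}\approx 2.6\times 10^5 \end{align} \]so the shallow network \(g\) needs hundreds of thousands of hidden units before the guaranteed error even drops below \(\frac{1}{2}\).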
2. Rectifiers
\(\text{The derivative of }\tanh\text{ has exponential tails on both sides and collapses to 0 very quickly, while ReLU passes the gradient of positive activations through unchanged, and these often amount to about half of the units.}\)
Leaky-ReLU
\[\begin{align} \max(ax,x) \end{align} \]where \(0\leq a< 1\)
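A minimal comparison sketch using PyTorch's built-ins (not course code; the slope \(a\) is the negative_slope argument of nn.LeakyReLU):
import torch
from torch import nn

x = torch.tensor([-2., -1., 0., 1., 2.])
relu = nn.ReLU()
lrelu = nn.LeakyReLU(negative_slope = 0.1)   # a = 0.1
print(relu(x))     # tensor([0., 0., 0., 1., 2.])   -- negatives are zeroed
print(lrelu(x))    # tensor([-0.2000, -0.1000, 0.0000, 1.0000, 2.0000])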
3. Dropout
Assume each activation is dropped with probability \(p\); then, in expectation,
\[\begin{align} \mathbb{E}(X) = (1-p)X+p\cdot 0 = (1-p)X \end{align} \]Hence, to keep the expectation unchanged, it suffices to multiply the surviving activations by \(\frac{1}{1-p}\) at training time and to leave the network untouched at test time. This scheme is called \(\text{Inverted Dropout}\).
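PyTorch's nn.Dropout (used in the transcript below) implements exactly this inverted scheme. A minimal hand-written sketch of the same idea (my own illustration, not the library code):
import torch

def inverted_dropout(x, p, training = True):
    # At test time the layer is the identity.
    if not training:
        return x
    # Keep each unit with probability 1 - p, then rescale the survivors by
    # 1 / (1 - p) so that the expected activation stays unchanged.
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)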
>>> x = torch.full((3, 5), 1.0).requires_grad_()
>>> x
tensor([[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.]])
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y
tensor([[ 0., 0., 4., 0., 4.],
[ 0., 4., 4., 4., 0.],
[ 0., 0., 4., 0., 0.]])
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad
tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284],
[ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000],
[ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]])
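Note that with \(p = 0.75\) the surviving activations are rescaled by \(\frac{1}{1-p} = 4\), which is why the kept entries of y read 4, and why the corresponding entries of x.grad carry the same factor.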
\(\text{Simply add dropout layers:}\)
model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                      nn.Dropout(),
                      nn.Linear(100, 50), nn.ReLU(),
                      nn.Dropout(),
                      nn.Linear(50, 2))
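nn.Dropout is only active in training mode, so the module has to be switched explicitly around evaluation (standard PyTorch usage):
model.train()   # dropout active: units dropped, survivors rescaled by 1/(1-p)
model.eval()    # dropout disabled: deterministic forward pass for validation/test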
4. Batch Normalization
\(\text{Forcing the activation statistics during the forward pass by re-normalizing them.}\)
\(\\\)
Motivation:
\(\large\textbf{If the statistics of the activations are not controlled during training, a layer will have to adapt to the changes of the activations computed by the previous layers in addition to making changes to its own output to reduce the loss.}\)
\(\\\)
\(\text{During training, batch normalization shifts and rescales each activation according to the mean and variance estimated on the batch.}\)
\(\text{If }x_b\in \mathbb{R}^D,\ b=1,\dots,B\text{ are the samples in the batch, the empirical mean and variance are:}\)
\[\begin{align} \hat{m}&=\frac{1}{B}\sum_{b=1}^Bx_b\\ \hat{v}&=\frac{1}{B}\sum_{b=1}^B(x_b-\hat{m})^2 \end{align} \]\(\text{Then do the normalization:}\)
\[\begin{align} z_b&=\frac{x_b-\hat{m}}{\sqrt{\hat{v}+\epsilon}}\\ y_b&=\gamma\odot z_b+\beta \end{align} \]\(\text{where }z_b,y_b,\gamma,\beta \in \mathbb{R}^D\text{, and }\gamma,\beta\text{ are learnable parameters.}\)
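A minimal check (my own sketch, not course code) that these formulas match nn.BatchNorm1d in training mode, where \(\gamma\) (weight) and \(\beta\) (bias) are still at their default values of 1 and 0:
import torch
from torch import nn

B, D = 256, 4
x = torch.randn(B, D) * 3.0 + 1.0

bn = nn.BatchNorm1d(D)                  # modules are in training mode by default
y = bn(x)

m = x.mean(0)
v = x.var(0, unbiased = False)          # biased variance, as in the formula above
z = (x - m) / torch.sqrt(v + bn.eps)    # gamma = 1, beta = 0 at initialization
print(torch.allclose(y, z, atol = 1e-5))   # True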
\(\\\)
\(\large\textbf{During inference: }\text{batch normalization shifts and rescales independently each component of the input }x\text{ according to statistics estimated during training}\)
>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
... bn.bias.copy_(torch.tensor([2., 4., 8.]))
... bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.], requires_grad=True)
Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
>>> x = torch.randn(1000, 3)
>>> x = x * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9669, 25.0213, 2.4361])
>>> x.std(0)
tensor([1.9063, 5.0764, 9.7474])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000], grad_fn=)
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015], grad_fn=)
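The statistics used at inference are running averages accumulated during training-mode forward passes (exponential moving averages with momentum 0.1 by default), stored in bn.running_mean and bn.running_var; switching the module with bn.eval() makes it use them instead of the batch statistics:
bn.eval()          # use running_mean / running_var instead of the batch statistics
y_test = bn(x)     # deterministic, independent of the batch composition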
5. Layer Normalization
\(\text{Given a single sample }x\in\mathbb{R}^D,\text{ layer normalization normalizes the components of }x:\)
\[\begin{align} \mu&=\frac{1}{D}\sum_{d=1}^Dx_d\\ \sigma&=\sqrt{\frac{1}{D}\sum_{d=1}^D(x_d-\mu)^2}\\ \forall d, y_d&=\frac{x_d-\mu}{\sigma} \end{align} \]6. ResNet
import torch
from torch import nn
from torch.nn import functional as F

class ResBlock(nn.Module):
    def __init__(self, nb_channels, kernel_size):
        super().__init__()
        # Padding keeps the spatial size unchanged, so x and the residual
        # branch can be summed in forward().
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn2 = nn.BatchNorm2d(nb_channels)

    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = F.relu(y)
        y = self.bn2(self.conv2(y))
        y += x          # identity pass-through (skip connection)
        y = F.relu(y)
        return y
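A quick shape check using the ResBlock above (my own usage sketch): the block maps a batch of feature maps to the same shape, which is what allows the sum y += x and the chaining of blocks.
block = ResBlock(nb_channels = 16, kernel_size = 3)
x = torch.randn(8, 16, 28, 28)      # batch of 8 feature maps, 16 channels
print(block(x).size())              # torch.Size([8, 16, 28, 28]) -- shape preserved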
\(\text{Stack several ResBlocks to form a ResNet:}\)
class ResNet(nn.Module):
    def __init__(self, nb_channels, kernel_size, nb_blocks):
        super().__init__()
        # 1 input channel (grayscale images) -> nb_channels feature maps
        self.conv0 = nn.Conv2d(1, nb_channels, kernel_size = 1)
        self.resblocks = nn.Sequential(
            # A bit of fancy Python
            *(ResBlock(nb_channels, kernel_size) for _ in range(nb_blocks))
        )
        # Average over the full 28x28 map, then a linear classifier to 10 classes
        self.avg = nn.AvgPool2d(kernel_size = 28)
        self.fc = nn.Linear(nb_channels, 10)

    def forward(self, x):
        x = F.relu(self.conv0(x))
        x = self.resblocks(x)
        x = F.relu(self.avg(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
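Since conv0 takes a single input channel and the average pooling covers a 28×28 map, this network expects MNIST-sized inputs. A quick sanity check (usage sketch, not course code):
model = ResNet(nb_channels = 32, kernel_size = 3, nb_blocks = 4)
x = torch.randn(16, 1, 28, 28)      # a batch of 16 MNIST-like images
print(model(x).size())              # torch.Size([16, 10]), one logit per class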
\(\textbf{Veit et al. (2016) interpret a residual network as an ensemble of relatively shallow networks, which explains in part its stability.}\)
\(e.g. \text{ with 3 blocks we have:}\)
\[\begin{align} x_1 &= x_0+f_1(x_0)\\ x_2 &= x_1+f_2(x_1)\\ x_3 &= x_2+f_3(x_2) \end{align} \]\(\text{Hence, unrolling the recursion, the computation graph contains }2^3=8\text{ distinct paths from input to output:}\)
\[\begin{align} x_3 &= x_2+f_3(x_2)\\ &=x_1+f_2(x_1)+f_3(x_1+f_2(x_1))\\ &=x_0+f_1(x_0)+f_2(x_0+f_1(x_0))+f_3(x_0+f_1(x_0)+f_2(x_0+f_1(x_0))) \end{align} \]- \(\textbf{(1) performance reduction correlates with the number of paths removed from the ensemble, not with the number of blocks removed}\)
- \(\textbf{(2) during training, only the gradients through the relatively shallow paths matter.}\)
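More generally, with \(n\) residual blocks the unrolled computation graph contains \(2^n\) such paths (each block is either traversed or skipped), so deleting a single block removes only half of the paths, which is consistent with point (1) above.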
\(\\\)
\(\large\textbf{Summarize:}\)
- \(\text{ReLU to prevent the gradient from vanishing during the backward pass}\)
- \(\text{Dropout to force a distributed representation}\)
- \(\text{Batch Normalization to dynamically maintain the statistics of activations}\)
- \(\text{Identity pass-through (skip connections) to keep a structured gradient and distribute the representation}\)
- \(\text{Smart initialization to put the gradient in a good regime}\)