# Exploration Experiments

# Deep Learning Lab Report - CIFAR-10

## 0
Is a larger batch_size always better?

Answer: No.
With different batch sizes, the accuracy on the test set is:
| batch_size | accuracy |
| :--------: | :------: |
| 16 | 65% |
| 32 | 63% |
| 64 | 56% |
| 128 | 47% |

This shows that a larger batch_size is not necessarily better.

Possible reasons:
- A large batch makes the model more likely to settle into a local optimum rather than a better one; the gradient noise from small batches effectively acts like annealing.
- A large batch reduces the number of parameter updates per epoch, so convergence is slower for a fixed number of epochs.

In addition, a very large batch may not fit into GPU memory at all, making training impossible.
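
As a rough illustration of how the comparison was run, here is a minimal sketch: only the DataLoader's batch_size changes between runs, while the model, optimizer, and epoch count stay fixed. The `train_model` helper is hypothetical and stands in for the training loop used in this lab.

```python
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

for batch_size in [16, 32, 64, 128]:
    # Rebuild only the DataLoader; everything else is identical across runs.
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=2)
    # train_model(trainloader)  # hypothetical training loop, not shown here
```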


## 1
Does adding RandomHorizontalFlip to the training-set transform improve the results?

Answer:
We added RandomHorizontalFlip as a data augmentation. With this change the test-set accuracy is 56%, with no clear difference from before.
The average training loss in the last epoch is 1.226, slightly higher than the 1.209 of the original version. This is expected: augmentation reduces overfitting, so the training loss rises.
In practice this augmentation is fairly weak, so its effect is not noticeable.
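
The change amounts to one extra line at the front of the training transform; a minimal sketch is shown below, assuming the baseline pipeline is just ToTensor plus the usual 0.5/0.5 normalization.

```python
import torchvision.transforms as transforms

# Training transform with a horizontal flip added before the baseline pipeline.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # flip each image left-right with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```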


## 2
Switch to a different optimizer to improve the results.

Answer: We can use Adam to substantially improve training efficiency.
```python
self.optimizer = torch.optim.Adam(self.model.parameters(), lr=Config.LEARNING_RATE)
```
With Adam, convergence is much faster and the final accuracy is also clearly higher.

Per-epoch test accuracy for Adam versus the original SGD:
| epoch | SGD | Adam |
| :---: | :---: | :---: |
| 1 | 16% | 48% |
| 2 | 29% | 54% |
| 3 | 36% | 57% |
| 4 | 40% | 60% |
| 5 | 43% | 62% |
| 6 | 46% | 62% |
| 7 | 48% | 63% |
| 8 | 50% | 64% |
| 9 | 51% | 63% |
| 10 | 53% | 65% |
| 11 | 55% | 64% |
| 12 | 56% | 65% |

This result is expected: Adam has become the default optimizer for most deep learning tasks. It rescales the first-moment (momentum) update by a running estimate of the second moment of the gradients, so it keeps converging well even when the loss landscape is poorly conditioned.

## 3
Keeping the number of epochs fixed, can adding a learning-rate scheduler improve the results?

Answer:
We use the simplest scheduler, StepLR, configured to halve the learning rate every 5 epochs with an initial learning rate of 0.002, and test it with both SGD and Adam:
| Group | SGD-Accuracy | SGD-Loss | Adam-Accuracy | Adam-Loss |
| :----------: | :----------: | :------: | :-----------: | :-------: |
| No-Scheduler | 56% | 1.211 | 65% | 0.779 |
| Step-LR | 60% | 1.068 | 65% | 0.668 |

With StepLR, SGD improves noticeably while Adam's accuracy barely changes, but the final training loss drops clearly for both optimizers. This shows the learning-rate schedule works, matching the expectation that lowering the learning rate late in training helps convergence; Adam's flat accuracy despite the lower loss is probably a sign of overfitting.
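
A minimal sketch of the scheduler setup described above; `Net()` is the baseline network from this lab (assumed importable here), and the epoch count of 12 matches the table in question 2.

```python
import torch

model = Net()  # baseline network from this lab, assumed defined elsewhere
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
# Halve the learning rate every 5 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(12):
    # train_one_epoch(model, optimizer)  # training loop omitted
    scheduler.step()  # step the scheduler once per epoch
```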

## 4
Based on Net(), create Net1() with three batch-normalization layers added, and show the test results.

Answer:
After adding a BN layer after each of the two convolutional layers and after the first linear layer, the test-set accuracy is 65%, a 9% improvement over the baseline, so BatchNorm helps very noticeably.
BatchNorm helps mainly because:
- It keeps the distribution of each layer's activations stable, which makes the model easier to train and less prone to vanishing or exploding gradients.
- It keeps the hidden-layer outputs concentrated in the effective nonlinear region of common activation functions.

The code is as follows:
```python
class BnModel(nn.Module):  # Add Batch Normalization (Net1)
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.bn1 = nn.BatchNorm2d(6)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.bn1(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.bn2(x)
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```

## 5
Based on Net(), create Net2() that uses Kaiming initialization for the convolutional and fully connected layers, and show the test results.

Answer:
With Kaiming initialization, the test-set accuracy is 57%, a 1% improvement over the baseline.

In fact, `nn.Linear`, and `nn.Conv2d` through its base class `_ConvNd`, already use Kaiming initialization by default; their `reset_parameters` methods both call the following:
```python
def reset_parameters(self) -> None:
    # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
    # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
    # https://github.com/pytorch/pytorch/issues/57109
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
        init.uniform_(self.bias, -bound, bound)
```

The small improvement here probably comes from using different Kaiming parameters: our explicit call uses a=0 with nonlinearity='relu', rather than the default a=sqrt(5).

The code for this network is as follows:
```python
class KaimingInitModel(nn.Module):  # Using Kaiming Initialization (Net2)
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

        torch.nn.init.kaiming_uniform_(self.conv1.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.conv2.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc1.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc2.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc3.weight, a=0, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```

## 6
Based on Net(), create Net3() with twice the number of channels, and show the test results.

Answer:
With the channel counts doubled, the test-set accuracy is 60%, a 4% improvement over the baseline, so widening the network helps somewhat. The main reason is that more channels means more convolution kernels, so the network can extract a richer set of features.

The code is as follows:
```python
class DoubleChannelModel(nn.Module):  # Doubling the Channels (Net3)
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 12, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(12, 32, 5)

        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```


## 7
Without changing the basic structure of Net() (the same number of convolutional and fully connected layers) or the number of training epochs, what is the best result you can get?

Answer:
We made the following changes:
- Use the Adam optimizer with a weight decay of 0.0001 and a `CosineAnnealingLR` cosine-annealing scheduler, with an initial learning rate of 0.002 (a sketch of this setup follows the transform code below).
- Add three BatchNorm layers as in the BN network above, plus three dropout layers with p=0.1, each placed immediately after a BatchNorm layer.
- Increase the channel counts of the two convolutional layers to 256 and 256.
- Apply heavy data augmentation, as follows (where AutoAugment() is taken from "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty", https://arxiv.org/abs/1912.02781):
```python
def get_transform(self):
    res = []
    res.append(transforms.RandomHorizontalFlip(p=0.5))
    res.extend([transforms.Pad(2, padding_mode='constant'),
                transforms.RandomCrop([32, 32])])
    res.append(transforms.RandomApply([AutoAugment()], p=0.6))
    res.append(transforms.ToTensor())
    res += [transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    return transforms.Compose(res)
```
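
A minimal sketch of the optimizer and scheduler setup from the first bullet above. `BetterBaselineModel` (Net4) is defined at the end of this section; the epoch count of 12 is an assumption carried over from the earlier table.

```python
import torch

model = BetterBaselineModel()  # Net4, defined below
# Adam with weight decay 0.0001 and an initial learning rate of 0.002.
optimizer = torch.optim.Adam(model.parameters(), lr=0.002, weight_decay=0.0001)
# Cosine annealing over the whole run (T_max = number of epochs, assumed 12 here).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=12)

for epoch in range(12):
    # train_one_epoch(model, optimizer)  # training loop omitted
    scheduler.step()  # anneal the learning rate once per epoch
```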

The final test-set accuracy is 83%, a 27% improvement over the baseline.

The model code is as follows:
```python
class BetterBaselineModel(nn.Module):  # Better Baseline Model (Net4)
    def __init__(self):
        super().__init__()
        channel1 = 256
        channel2 = 256
        self.conv1 = nn.Conv2d(3, channel1, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.bn1 = nn.BatchNorm2d(channel1)
        self.dropout1 = nn.Dropout(0.1)
        self.conv2 = nn.Conv2d(channel1, channel2, 5)
        self.bn2 = nn.BatchNorm2d(channel2)
        self.dropout2 = nn.Dropout(0.1)

        self.fc1 = nn.Linear(channel2 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.dropout3 = nn.Dropout(0.1)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.bn1(x)
        x = self.dropout1(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.bn2(x)
        x = self.dropout2(x)
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = self.dropout3(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```


## 8
Use ResNet18() and show the test results.

Answer:
Using ResNet-18 with all other settings kept from the previous question, the test-set accuracy is 83%, the same as before. Reaching this accuracy with relatively few parameters speaks to the strength of the architecture, likely because:
- The residual connections effectively alleviate the vanishing-gradient problem, making it easy to stack many layers.
- Learning a residual is easier for the network than learning the full mapping, since it does not have to re-learn an identity mapping on top of the new features.

The code is as follows:
```python
class ResNet(nn.Module):  # ResNet-18
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18(pretrained=False)
        if Config.PRETRAINED:
            self.model.load_state_dict(torch.load(Config.RESNET_PRETRAINED_PATH))
            for param in self.model.parameters():
                param.requires_grad = False
        self.model.fc = nn.Linear(512, 10)
        self.model.fc.requires_grad_(True)

    def forward(self, x):
        x = self.model(x)
        return x
```