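[PyTorch] 4 - Model Definition (Sequential, ModuleList/ModuleDict, Assembling Model Blocks, Modifying Models, Saving and Loading Models)

PyTorch: 4 - Model Definition

Note: all material comes from and belongs to thorough-pytorch (https://datawhalechina.github.io/thorough-pytorch/); what follows is only a study record.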

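4.1: Ways to Define a Model

4.1.1: nn.Module

Module is the model-construction class provided by the torch.nn package (nn.Module), and it is the base class of all network models.

A model definition has two main parts:

1: Initialization of each component (__init__)

2: Definition of the data flow (forward)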
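To make the two parts concrete, here is a minimal sketch of a custom module. The class name TwoLayerNet and the layer sizes are made up for illustration and do not come from the source material.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):  # hypothetical example name
    def __init__(self, in_features=784, hidden=256, num_classes=10):
        super().__init__()
        # Part 1: initialize the layers
        self.fc1 = nn.Linear(in_features, hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # Part 2: define how data flows through the layers
        return self.fc2(self.act(self.fc1(x)))

net = TwoLayerNet()
x = torch.randn(4, 784)   # a batch of 4 flattened inputs
out = net(x)              # calling the module runs forward()
print(out.shape)          # torch.Size([4, 10])
```

Calling net(x) rather than net.forward(x) is the usual convention, since the call also runs any registered hooks.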

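Containers built on nn.Module that can be used to define a model:

1: Sequential

2: ModuleList

3: ModuleDict

4.1.2: Sequential

The corresponding module is nn.Sequential().

When the forward computation is simply a chain of layers, the Sequential class accepts either an ordered dictionary (OrderedDict) of submodules or a series of submodules as arguments and adds the Module instances one by one; the forward computation then runs these instances in the order they were added.

A simplified version of how Sequential is defined: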

from collections import OrderedDict
import torch.nn as nn

class MySequential(nn.Module):
    def __init__(self, *args):
        super(MySequential, self).__init__()
        if len(args) == 1 and isinstance(args[0], OrderedDict):  # a single OrderedDict was passed in
            for key, module in args[0].items():
                self.add_module(key, module)
                # add_module stores the module in self._modules (an OrderedDict)
        else:  # a series of Modules was passed in
            for idx, module in enumerate(args):
                self.add_module(str(idx), module)
    def forward(self, input):
        # self._modules is an OrderedDict, so iteration follows the order in which members were added
        for module in self._modules.values():
            input = module(input)
        return input

In Sequential, the layers can be arranged in two ways, which differ in how the layers are named:

1: Direct arrangement

import torch.nn as nn
net = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Linear(256, 10), 
        )
print(net)

"""
Sequential(
  (0): Linear(in_features=784, out_features=256, bias=True)
  (1): ReLU()
  (2): Linear(in_features=256, out_features=10, bias=True)
)
"""

2: Using an OrderedDict

import collections
import torch.nn as nn
net2 = nn.Sequential(collections.OrderedDict([
          ('fc1', nn.Linear(784, 256)),
          ('relu1', nn.ReLU()),
          ('fc2', nn.Linear(256, 10))
          ]))
print(net2)

"""
Sequential(
  (fc1): Linear(in_features=784, out_features=256, bias=True)
  (relu1): ReLU()
  (fc2): Linear(in_features=256, out_features=10, bias=True)
)
"""

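Characteristics of Sequential:

1: A model defined with Sequential needs no hand-written forward, because the order of the layers is already fixed.

2: The model definition loses flexibility; for example, Sequential is unsuitable when an external input has to be fed in partway through the model.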
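The first point can be checked directly. The sketch below rebuilds the two networks from above, runs a random batch through them without any forward having been written, and shows the two ways of addressing a layer. The batch size of 4 is an arbitrary choice.

```python
import torch
import torch.nn as nn
from collections import OrderedDict

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
net2 = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(784, 256)),
    ("relu1", nn.ReLU()),
    ("fc2", nn.Linear(256, 10)),
]))

x = torch.randn(4, 784)
y = net(x)      # no forward() was written; the layers run in order
y2 = net2(x)

first_by_index = net[0]     # unnamed layers are addressed by position
first_by_name = net2.fc1    # named layers are also available as attributes
print(y.shape, y2.shape)
print(first_by_index, first_by_name)
```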

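4.1.3: ModuleList

The corresponding module is nn.ModuleList().

ModuleList takes a list of submodules (layers) as input and supports append and extend like a Python list. The weights of the submodules are automatically registered with the network.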

net = nn.ModuleList([nn.Linear(784, 256), nn.ReLU()])
net.append(nn.Linear(256, 10))  # append works as on a Python list
print(net[-1])  # index access works as on a Python list
print(net)

"""
Linear(in_features=256, out_features=10, bias=True)

ModuleList(
  (0): Linear(in_features=784, out_features=256, bias=True)
  (1): ReLU()
  (2): Linear(in_features=256, out_features=10, bias=True)
)
"""

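Characteristics of ModuleList:

1: nn.ModuleList does not define a network by itself; it only stores the modules together.

2: The model definition is complete only once a forward function specifies the order in which the layers are applied. A for loop is a convenient way to do this.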

class model(nn.Module):
  def __init__(self, ...):
    super().__init__()
    self.modulelist = ...
    ...
    
  def forward(self, x):
    for layer in self.modulelist:
      x = layer(x)
    return x
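A concrete version of the template above might look like the following sketch. The class names MLP and BrokenMLP and the layer sizes are invented for illustration. The second class stores its layers in a plain Python list to show what "automatically registered" means in practice: the parameters of layers in a plain list are not visible to the model.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):  # hypothetical example name
    def __init__(self, sizes=(784, 256, 64, 10)):
        super().__init__()
        # One Linear layer per consecutive pair of sizes, stored in a ModuleList
        self.layers = nn.ModuleList(
            [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # forward() decides the order; ReLU follows every layer except the last
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = self.act(x)
        return x

class BrokenMLP(nn.Module):  # same idea, but with a plain Python list
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(784, 256), nn.Linear(256, 10)]

mlp = MLP()
out = mlp(torch.randn(4, 784))
n_registered = len(list(mlp.parameters()))          # 3 layers x (weight, bias) = 6
n_plain_list = len(list(BrokenMLP().parameters()))  # 0: a plain list is not registered
print(out.shape, n_registered, n_plain_list)
```

An optimizer built from BrokenMLP().parameters() would have nothing to update, which is the practical reason to prefer ModuleList over a list.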

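4.1.4: ModuleDict

The corresponding module is nn.ModuleDict().

Characteristics:

1: It plays a role similar to ModuleList.

2: It makes it easier to give names to the layers of the network.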

net = nn.ModuleDict({
    'linear': nn.Linear(784, 256),
    'act': nn.ReLU(),
})
net['output'] = nn.Linear(256, 10)  # add a layer
print(net['linear'])  # access by key
print(net.output)  # access as an attribute
print(net)

"""
Linear(in_features=784, out_features=256, bias=True)

Linear(in_features=256, out_features=10, bias=True)

ModuleDict(
  (linear): Linear(in_features=784, out_features=256, bias=True)
  (act): ReLU()
  (output): Linear(in_features=256, out_features=10, bias=True)
)
"""
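As with ModuleList, a ModuleDict only becomes a model once forward says how its entries are used. The sketch below is one possible use, under the assumption that we want to pick an activation by name at call time. The class name Classifier and the act argument are made up for this example.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):  # hypothetical example name
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)
        # Candidate activations, addressed by name
        self.acts = nn.ModuleDict({
            "relu": nn.ReLU(),
            "tanh": nn.Tanh(),
        })

    def forward(self, x, act="relu"):
        # Pick the layer by key at call time
        return self.acts[act](self.fc(x))

model = Classifier()
x = torch.randn(4, 784)
out_relu = model(x, act="relu")
out_tanh = model(x, act="tanh")
print(list(model.acts.keys()))
print(out_relu.min().item() >= 0, out_tanh.abs().max().item() <= 1)
```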

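4.1.5: Comparison of the Three Approaches and Their Use Cases

Sequential: quickly validating results

ModuleList and ModuleDict: when an identical layer needs to appear several times, or when information from earlier layers is needed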

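4.2: Building U-Net from Model Blocks

4.2.1: Analyzing the Model Blocks

The main building blocks of U-Net:

The two convolutions inside each sub-block (Double Convolution)
The downsampling connections between the blocks on the left side, i.e. max pooling (Max pooling)
The upsampling connections between the blocks on the right side (Up sampling)
The processing of the output layer

Four model blocks are defined and named after their function: DoubleConv, Down, Up, OutConv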

import torch
import torch.nn as nn
import torch.nn.functional as F

# Double Convolution
class DoubleConv(nn.Module):
    """(convolution => [BN] => ReLU) * 2"""

    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        if not mid_channels:
            mid_channels = out_channels
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            # first conv
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
            # second conv
        )

    def forward(self, x):
        return self.double_conv(x)

# Max pooling
class Down(nn.Module):
    """Downscaling with maxpool then double conv"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_channels, out_channels)
            # pool first, then apply the double convolution
        )

    def forward(self, x):
        return self.maxpool_conv(x)

# Up sampling
class Up(nn.Module):
    """Upscaling then double conv"""

    def __init__(self, in_channels, out_channels, bilinear=False):
        super().__init__()

        # if bilinear, use the normal convolutions to reduce the number of channels
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels, in_channels // 2)
        else:
            self.up = nn.ConvTranspose2d(in_channels, in_channels // 2, kernel_size=2, stride=2)
            self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # inputs are NCHW: dim 2 is height, dim 3 is width
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2,
                        diffY // 2, diffY - diffY // 2])
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)

# Output layer
class OutConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(OutConv, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.conv(x)
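The padding step in Up.forward is the least obvious part of these blocks, so here is a small standalone sketch of it. The tensor sizes are arbitrary and chosen so that the two feature maps differ in both height and width.

```python
import torch
import torch.nn.functional as F

x1 = torch.randn(1, 4, 10, 10)   # upsampled decoder feature map (N, C, H, W)
x2 = torch.randn(1, 4, 11, 13)   # encoder feature map from the skip connection

diffY = x2.size(2) - x1.size(2)  # height difference: 1
diffX = x2.size(3) - x1.size(3)  # width difference: 3

# F.pad takes the padding for the last dimension first: (left, right, top, bottom)
x1_padded = F.pad(x1, [diffX // 2, diffX - diffX // 2,
                       diffY // 2, diffY - diffY // 2])

x = torch.cat([x2, x1_padded], dim=1)  # concatenate along the channel dimension
print(x1_padded.shape)  # torch.Size([1, 4, 11, 13])
print(x.shape)          # torch.Size([1, 8, 11, 13])
```

Splitting the difference as diff // 2 and diff - diff // 2 keeps the padding as symmetric as possible while still handling odd differences.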

4.2.2: Assembling the Model Blocks

Building the network from model blocks makes the code reusable.

class UNet(nn.Module):
    def __init__(self, n_channels, n_classes, bilinear=False):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.bilinear = bilinear

        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        factor = 2 if bilinear else 1
        self.down4 = Down(512, 1024 // factor)
        self.up1 = Up(1024, 512 // factor, bilinear)
        self.up2 = Up(512, 256 // factor, bilinear)
        self.up3 = Up(256, 128 // factor, bilinear)
        self.up4 = Up(128, 64, bilinear)
        self.outc = OutConv(64, n_classes)
        
    # forward connects the layers in order
    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        logits = self.outc(x)
        return logits

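4.3: Modifying Models

Main goal: modifying an existing model

Taking ResNet50, a model predefined in torchvision (PyTorch's official vision library), as an example, this section explores how to modify one or several layers of a model.

4.3.1: Modifying Model Layers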

import torch
import torch.nn as nn
from collections import OrderedDict
import torchvision.models as models
net = models.resnet50()
print(net)


"""
ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
..............
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=2048, out_features=1000, bias=True)
)
"""

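Because the model is designed to match the ImageNet pretrained weights, the final fully connected layer (fc) has 1000 output nodes.

For a 10-class problem, the fc layer of the model should be modified so that it has 10 output nodes.

**Modification:** the last layer of the model (net), named "fc", is replaced with a structure named "classifier".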

classifier = nn.Sequential(OrderedDict([('fc1', nn.Linear(2048, 128)),
                          ('relu1', nn.ReLU()), 
                          ('dropout1',nn.Dropout(0.5)),
                          ('fc2', nn.Linear(128, 10)),
                          ('output', nn.Softmax(dim=1))
                          ]))
    
net.fc = classifier
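The same attribute-assignment trick works on any module. The sketch below uses a tiny stand-in model instead of ResNet50 so that it runs quickly; TinyBackbone is a made-up class, not part of torchvision. It also reads the input width from the existing layer via in_features instead of hard-coding 2048.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):  # hypothetical stand-in for a torchvision model
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32), nn.ReLU())
        self.fc = nn.Linear(32, 1000)   # 1000-way head, like ResNet50's fc

    def forward(self, x):
        return self.fc(self.features(x))

net = TinyBackbone()
in_features = net.fc.in_features      # read the input width instead of hard-coding it
net.fc = nn.Linear(in_features, 10)   # assigning to the attribute swaps the layer

out = net(torch.randn(4, 3, 8, 8))
print(out.shape)  # torch.Size([4, 10])
```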

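4.3.2: Adding an External Input

For example, in a CNN, besides the input image, other information associated with the image may need to be fed in at the same time.

Basic idea: treat the part of the original model before the point where the input is added as a single unit, and define in forward how the unchanged part of the original model, the added input, and the subsequent layers are connected, which completes the modification.

Modification: an extra input variable add_variable is added at the second-to-last layer to assist the prediction.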

class Model(nn.Module):
    def __init__(self, net):
        super(Model, self).__init__()
        self.net = net
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        self.fc_add = nn.Linear(1001, 10, bias=True)  # added module: 1000 features from net + 1 external input
        self.output = nn.Softmax(dim=1)
        
    def forward(self, x, add_variable):
        x = self.net(x)
        x = torch.cat((self.dropout(self.relu(x)), add_variable.unsqueeze(1)),1)
        x = self.fc_add(x) # added module
        x = self.output(x)
        return x

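torch.cat concatenates the tensors.

By modifying the forward function (and defining the layers it needs), the 1000-dimensional tensor first passes through the activation layer and the dropout layer, is then concatenated with the external input variable "add_variable", and is finally mapped to the required output dimension of 10 by the fully connected layer.

The unsqueeze operation on the external input variable "add_variable" makes its number of dimensions match that of the tensor output by net; this is commonly needed when add_variable is a single value (scalar) per sample.

In that case add_variable has shape (batch_size, ), and a dimension of size 1 must be added as the second dimension so that it can be concatenated with the tensor by torch.cat.

After that, simply instantiate the modified model structure.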

net = models.resnet50()
model = Model(net).cuda()

During training, the model must then be given two inputs:

outputs = model(inputs, add_var)
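The shape bookkeeping behind unsqueeze and torch.cat can be traced in isolation. In the sketch below, a random tensor stands in for the 1000-dimensional output of net, and the values of add_variable are arbitrary.

```python
import torch

batch_size = 4
features = torch.randn(batch_size, 1000)            # stands in for self.net(x)
add_variable = torch.tensor([0.1, 0.2, 0.3, 0.4])   # one scalar per sample, shape (4,)

# torch.cat needs the same number of dimensions, so (4,) becomes (4, 1)
extra = add_variable.unsqueeze(1)
combined = torch.cat((features, extra), dim=1)

print(add_variable.shape)  # torch.Size([4])
print(extra.shape)         # torch.Size([4, 1])
print(combined.shape)      # torch.Size([4, 1001])
```

The concatenated tensor has 1001 features per sample, which is the input width the following fully connected layer has to accept.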

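4.3.3: Adding an Extra Output

Goal: output the result of an intermediate layer of the model so that additional supervision can be applied to it.

Basic idea: modify the return values of the forward function in the model definition.

**Modification:** on top of the model structure already defined, output both the 1000-dimensional result of the second-to-last layer and the 10-dimensional result of the last layer.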

class Model(nn.Module):
    def __init__(self, net):
        super(Model, self).__init__()
        self.net = net
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        self.fc1 = nn.Linear(1000, 10, bias=True)
        self.output = nn.Softmax(dim=1)
        
    def forward(self, x, add_variable):
        x1000 = self.net(x)
        x10 = self.dropout(self.relu(x1000))
        x10 = self.fc1(x10)
        x10 = self.output(x10)
        return x10, x1000

Then simply instantiate the modified model structure.

import torchvision.models as models
net = models.resnet50()
model = Model(net).cuda()

During training, the model call then returns two outputs:

out10, out1000 = model(inputs, add_var)
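One way the second output might be used for extra supervision is to add an auxiliary loss term. The sketch below is an assumption about how that could look, not the source's training code: TwoOutputNet, feat_target and the 0.1 weight are all invented, and the model returns raw logits rather than softmax probabilities so that cross_entropy can be applied directly.

```python
import torch
import torch.nn as nn

class TwoOutputNet(nn.Module):  # hypothetical small model with two outputs
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(20, 16)
        self.head = nn.Linear(16, 10)

    def forward(self, x):
        feat = self.backbone(x)               # intermediate result
        logits = self.head(torch.relu(feat))  # final result
        return logits, feat

model = TwoOutputNet()
x = torch.randn(4, 20)
labels = torch.randint(0, 10, (4,))
feat_target = torch.zeros(4, 16)   # hypothetical target for the intermediate output

logits, feat = model(x)
main_loss = nn.functional.cross_entropy(logits, labels)
aux_loss = nn.functional.mse_loss(feat, feat_target)
loss = main_loss + 0.1 * aux_loss  # 0.1 is an arbitrary weighting
loss.backward()                    # gradients flow through both terms
print(loss.item())
```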

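4.4: Saving and Loading Models

4.4.1: Model Storage Formats

PyTorch models are mainly stored in three file formats: pkl, pt, and pth.

All three formats (pt, pth, and pkl) can store either the model weights or the whole model.

4.4.2: What Is Stored

A PyTorch model has two main parts: the model structure and the weights.

The model is a class that inherits from nn.Module, and the weights are held in a dictionary (the keys are layer names, the values are weight tensors).

There are two ways to store a model: store the whole model (structure and weights), or store only the model weights.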

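import torch
from torchvision import models
model = models.resnet152(pretrained=True)
save_dir = './resnet152.pth'

# Save the whole model
torch.save(model, save_dir)
# Save only the model weights (state_dict() is a method and must be called);
# use a different path if both files are needed
torch.save(model.state_dict(), save_dir)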
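To see what the weight dictionary actually contains, the sketch below prints the keys and shapes of a small model's state_dict and then does a save and load round trip. The small Sequential model and the temporary file path are stand-ins chosen so the example runs without downloading ResNet weights.

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

state = model.state_dict()   # note the parentheses: this is a method call
for name, tensor in state.items():
    print(name, tuple(tensor.shape))

path = os.path.join(tempfile.mkdtemp(), "weights.pth")
torch.save(state, path)                      # store the weights only

restored = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
restored.load_state_dict(torch.load(path))   # the structure must be rebuilt first
same = all(torch.equal(state[k], restored.state_dict()[k]) for k in state)
print(same)
```

The ReLU layer contributes no entries because it has no parameters, so only the two Linear layers appear in the dictionary.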

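4.4.3: Differences Between Single-GPU and Multi-GPU Model Storage

PyTorch offers two ways to move models and data to the GPU: .cuda() and .to(device)

For multi-GPU training, the model needs to be wrapped in torch.nn.DataParallel.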

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # for multiple GPUs, use something like '0,1,2'
model = model.cuda()  # single GPU
model = torch.nn.DataParallel(model).cuda()  # multiple GPUs

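Printing the layer names of the model shows the difference: in the multi-GPU model, every layer name is prefixed with "module".

This difference in how the model is represented can cause mismatches that need to be handled when saving and loading models.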
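The prefix is easy to observe on a toy model. The following sketch compares the state_dict keys before and after wrapping; the small Sequential model is a stand-in, and the wrapper is only constructed here, not used for a forward pass.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
parallel_model = nn.DataParallel(model)   # constructing the wrapper is enough to see the renamed keys

plain_keys = list(model.state_dict().keys())
parallel_keys = list(parallel_model.state_dict().keys())
print(plain_keys)
print(parallel_keys)

# The wrapped model is reachable through the .module attribute
print(parallel_model.module is model)
```

Because the original model sits under the attribute named module, every parameter name gains the prefix "module." (including the dot).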

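4.4.4: Case-by-Case Discussion of Single-GPU and Multi-GPU Setups

Because training and testing may run on different hardware, differences between single-GPU and multi-GPU environments can cause problems such as model mismatches when saving and loading.

Single-GPU save + single-GPU load

After selecting the GPU with os.environ, the model can be saved and loaded directly.

Note that it does not matter if the GPUs used for saving and loading differ.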

import os
import torch
from torchvision import models

os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # replace with the ID of the GPU you want to use
model = models.resnet152(pretrained=True)
model.cuda()

save_dir = 'resnet152.pt'   # save path

# Save + load the whole model
torch.save(model, save_dir)
# Note: on PyTorch >= 2.6, torch.load needs weights_only=False to unpickle a whole model
loaded_model = torch.load(save_dir)
loaded_model.cuda()

# Save + load the model weights
torch.save(model.state_dict(), save_dir)
loaded_model = models.resnet152()   # the model structure must be defined first
loaded_model.load_state_dict(torch.load(save_dir))
loaded_model.cuda()
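When the machine used for loading has a different device layout, or no GPU at all, the map_location argument of torch.load can place the stored tensors explicitly. The sketch below is a small illustration with a stand-in Linear model and a temporary file; it saves from the GPU if one is available and always loads onto the CPU.

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
if torch.cuda.is_available():
    model = model.cuda()   # save from the GPU when one is present

path = os.path.join(tempfile.mkdtemp(), "weights.pt")
torch.save(model.state_dict(), path)

# map_location remaps the stored tensors, here onto the CPU
state = torch.load(path, map_location="cpu")
loaded_model = nn.Linear(4, 2)
loaded_model.load_state_dict(state)
print({k: v.device.type for k, v in state.items()})
```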

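Single-GPU save + multi-GPU load

After loading the model saved on a single GPU, wrap it with nn.DataParallel to set up multi-GPU (data-parallel) training.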

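import os
import torch
import torch.nn as nn
from torchvision import models

os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # replace with the ID of the GPU you want to use
model = models.resnet152(pretrained=True)
model.cuda()

save_dir = 'resnet152.pt'   # save path

# Save + load the whole model
torch.save(model, save_dir)

# CUDA_VISIBLE_DEVICES only takes effect before CUDA is initialized,
# so the loading parts below belong in a separate run
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'   # replace with the IDs of the GPUs you want to use
loaded_model = torch.load(save_dir)
loaded_model = nn.DataParallel(loaded_model).cuda()

# Save + load the model weights
torch.save(model.state_dict(), save_dir)

os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'   # replace with the IDs of the GPUs you want to use
loaded_model = models.resnet152()   # the model structure must be defined first
loaded_model.load_state_dict(torch.load(save_dir))
loaded_model = nn.DataParallel(loaded_model).cuda()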

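Multi-GPU save + single-GPU load

The core problem: how to remove "module" from the key names of the weight dictionary so that the models stay consistent.

When loading the whole model, simply take the module attribute of the loaded model.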

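import os
import torch
import torch.nn as nn
from torchvision import models

os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'   # replace with the IDs of the GPUs you want to use

save_dir = 'resnet152.pt'   # save path
model = models.resnet152(pretrained=True)
model = nn.DataParallel(model).cuda()

# Save + load the whole model
torch.save(model, save_dir)

os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # replace with the ID of the GPU you want to use
loaded_model = torch.load(save_dir).module   # unwrap the DataParallel wrapper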

When loading model weights, there are several options:

(1) When saving, save the weights of the model's module attribute

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'   # replace with the IDs of the GPUs you want to use
import torch
import torch.nn as nn
from torchvision import models

save_dir = 'resnet152.pth'   # save path
model = models.resnet152(pretrained=True)
model = nn.DataParallel(model).cuda()

# Save the weights of the wrapped model, without the "module." prefix
torch.save(model.module.state_dict(), save_dir)

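Parameters saved this way are identical to those saved from a single-GPU model and can be loaded directly.

(2) Load into nn.DataParallel on a single GPU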

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'   # replace with the IDs of the GPUs you want to use
import torch
import torch.nn as nn
from torchvision import models

save_dir = 'resnet152.pth'   # save path
model = models.resnet152(pretrained=True)
model = nn.DataParallel(model).cuda()

# Save + load the model weights
torch.save(model.state_dict(), save_dir)

os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # replace with the ID of the GPU you want to use
loaded_model = models.resnet152()   # the model structure must be defined first
# Wrap first so the parameter names carry the same "module." prefix as the saved keys
loaded_model = nn.DataParallel(loaded_model).cuda()
loaded_model.load_state_dict(torch.load(save_dir))

(3) Iterate over the dictionary and remove module

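import os
from collections import OrderedDict
import torch
from torchvision import models

os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # replace with the ID of the GPU you want to use

save_dir = 'resnet152.pth'   # path of the weights saved from the DataParallel model
loaded_dict = torch.load(save_dir)

new_state_dict = OrderedDict()
for k, v in loaded_dict.items():
    name = k[7:]  # every key starts with "module." (7 characters), so slicing from index 7 drops it
    new_state_dict[name] = v  # each new key keeps its original value

loaded_model = models.resnet152()   # the model structure must be defined first
loaded_model.load_state_dict(new_state_dict)
loaded_model = loaded_model.cuda()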

(4) Use replace to remove module

loaded_model = models.resnet152()    
loaded_dict = torch.load(save_dir)
loaded_model.load_state_dict({k.replace('module.', ''): v for k, v in loaded_dict.items()})
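Options (3) and (4) both assume something about the keys: slicing assumes every key carries the prefix, and replace removes "module." wherever it occurs in a key. A slightly more defensive variant is sketched below; strip_module_prefix and make_model are hypothetical helpers written for this example, and a small Sequential model stands in for ResNet152.

```python
from collections import OrderedDict
import torch
import torch.nn as nn

def strip_module_prefix(state_dict, prefix="module."):  # hypothetical helper
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]   # drop only the leading occurrence
        cleaned[key] = value
    return cleaned

def make_model():
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Weights as they would look after saving from a DataParallel model
saved = nn.DataParallel(make_model()).state_dict()

loaded_model = make_model()
result = loaded_model.load_state_dict(strip_module_prefix(saved))
print(result)   # reports missing or unexpected keys, if any
```

The same helper can be applied to a dictionary that has no prefix at all, in which case it returns the keys unchanged.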

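Multi-GPU save + multi-GPU load

Because both saving and loading use multiple GPUs, there is no mismatch in the layer-name prefix.

However, in the multi-GPU case there is a device-matching problem: saving the whole model also saves information such as the IDs of the GPUs in use. If this information does not match the GPUs in use at load time, an error may be raised or the program may not run as intended.

[1] Load the whole model, then wrap it with nn.DataParallel for multi-GPU training

This is quite likely to raise an error during training because the data and the model are on different devices.

[2] Load the whole model without wrapping it in nn.DataParallel

Let n be the number of GPUs used by the saved model; if fewer than n GPUs are specified, an error is raised.

In multi-GPU mode, storing and loading the model by its weights is recommended.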

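import os
import torch
import torch.nn as nn
from torchvision import models

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'   # replace with the IDs of the GPUs you want to use

save_dir = 'resnet152.pth'   # save path
model = models.resnet152(pretrained=True)
model = nn.DataParallel(model).cuda()

# Save + load the model weights (strongly recommended)
torch.save(model.state_dict(), save_dir)
loaded_model = models.resnet152()   # the model structure must be defined first
# Wrap before loading so the "module." prefix of the saved keys matches
loaded_model = nn.DataParallel(loaded_model).cuda()
loaded_model.load_state_dict(torch.load(save_dir))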

If only the whole model was saved, a new model can still be built by extracting its weights:

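# Load the whole model
loaded_whole_model = torch.load(save_dir)
loaded_model = models.resnet152()   # the model structure must be defined first
# The saved object is a DataParallel wrapper, so take the weights of its .module
loaded_model.load_state_dict(loaded_whole_model.module.state_dict())
loaded_model = nn.DataParallel(loaded_model).cuda()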

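4.4.5: Saving and Loading Other Parameters

Examples include the number of training epochs, the training loss, the optimizer state, and the state of the learning-rate scheduler.

These can be saved together in one file as a dictionary and then loaded along with the model.

Method [1]: saving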

torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'lr_scheduler': lr_scheduler.state_dict(),
        'epoch': epoch,
        'args': args,
    }, checkpoint_path)

Method [2]: loading

checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
epoch = checkpoint['epoch']
args = checkpoint['args']
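Putting the two snippets together, a full save and resume cycle might look like the sketch below. The tiny Linear model, the SGD and StepLR settings, the build helper and the temporary checkpoint path are all assumptions made so the example runs on its own; args is left out because it depends on the training script.

```python
import os
import tempfile
import torch
import torch.nn as nn

def build():  # hypothetical helper that creates a fresh model, optimizer and scheduler
    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    return model, optimizer, lr_scheduler

model, optimizer, lr_scheduler = build()
for epoch in range(2):                       # two toy epochs
    loss = model(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    lr_scheduler.step()

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "lr_scheduler": lr_scheduler.state_dict(),
    "epoch": epoch,
}, checkpoint_path)

# Resume: rebuild the objects, then restore their states
model2, optimizer2, lr_scheduler2 = build()
checkpoint = torch.load(checkpoint_path)
model2.load_state_dict(checkpoint["model"])
optimizer2.load_state_dict(checkpoint["optimizer"])
lr_scheduler2.load_state_dict(checkpoint["lr_scheduler"])
start_epoch = checkpoint["epoch"] + 1
print(start_epoch, optimizer2.param_groups[0]["lr"])
```

Restoring the optimizer and scheduler matters because they carry state of their own (momentum buffers, the current learning rate) that the model weights do not capture.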