scient
scient is a Python package of scientific-computing algorithms, with modules for natural language, image, neural network, optimization, machine learning, graph computation, and more.
The source code and binary installers for the latest released version are available at the [Python package index].
https://pypi.org/project/scient
You can install scient with pip:
pip install scient
Or, in the scient directory, execute:
python setup.py install
scient.neuralnet
Neural-network algorithms, including attention, transformer, bert, lstm, resnet, crf, dataset, fit, and more.
scient.neuralnet.transformer
Transformer module built on multi-head attention, providing Transformer, ViTransformer, Former, Encoder, and Decoder.
scient.neuralnet.transformer.ViTransformer
Vision Transformer: a transformer model applied to images.
scient.neuralnet.transformer.ViTransformer(n_class, patch_size=16, in_channels=3, in_shape=(224,224), embed_size=768,
    n_head=8, n_encode_layer=6, n_decode_layer=6, ff_size=2048, dropout=0.1, quantile=None,
    eps: float = 1e-9)
Parameters
n_class=1000: int, number of classes; ViTransformer's pretraining task is image classification
patch_size=16: int or (int, int), patch size
in_channels=3: int, number of image channels
in_shape=(224,224): image height and width
embed_size=768: int, length of the embedding vector
The following are standard transformer parameters:
n_head=8: int, number of attention heads
n_encode_layer=6: int, number of encoder layers
n_decode_layer=6: int, number of decoder layers
ff_size=2048: int, hidden size of the feed-forward layers
dropout=0.1: float, dropout rate
quantile=None: sparse-attention quantile; in multi-head attention, attention scores below this quantile are ignored
eps=1e-9: float, numerical-stability constant
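The quantile parameter controls sparse attention. As a rough sketch of the idea (not scient's actual implementation), attention scores below the per-row quantile can be masked out before the softmax, so weak key positions receive zero weight:

```python
import torch

# Hypothetical illustration of quantile-based sparse attention.
scores = torch.randn(2, 8, 5, 5)  # (batch, n_head, query, key) attention scores
# Threshold each query's scores at its own 0.5 quantile.
threshold = torch.quantile(scores, 0.5, dim=-1, keepdim=True)
sparse = scores.masked_fill(scores < threshold, float('-inf'))
attn = sparse.softmax(dim=-1)     # masked positions get weight exactly 0
```

Masking with -inf before the softmax (rather than zeroing afterward) keeps each query's surviving weights normalized to sum to 1.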
scient.neuralnet.transformer.ViTransformer.forward(source, pad_mask)
Parameters
source: torch.Tensor, input sequence
pad_mask: torch.Tensor, mask marking pad-filled positions
Returns
torch.Tensor of shape batch_size * n_class
scient.neuralnet.transformer.ViTransformer.encode(source, pad_mask)
Parameters
Same as forward
Returns
torch.Tensor of shape batch_size * (n_patch+1) * embed_size
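The encoder's sequence length follows directly from the patch grid plus one BOS token. A quick sanity check of the shape arithmetic (plain Python, no scient required; the helper name is illustrative):

```python
def encode_output_shape(batch_size, in_shape=(224, 224), patch_size=16, embed_size=768):
    """Shape of the encode output: one token per patch, plus the BOS token."""
    n_patch = (in_shape[0] // patch_size) * (in_shape[1] // patch_size)
    return (batch_size, n_patch + 1, embed_size)

print(encode_output_shape(32))              # (32, 197, 768)
print(encode_output_shape(32, (160, 160)))  # (32, 101, 768)
```

The second call matches the 160x160 training images used in the example further below: a 10x10 patch grid gives 100 patches, hence a sequence of 101 tokens.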
Algorithms
ViTransformer (ViT), short for Vision Transformer, is a deep-learning model that applies the Transformer architecture to computer-vision tasks.
The biggest difference between ViTransformer and the Transformer is that ViTransformer maps the image to vectors with a PatchEmbedding and a PositionEmbedding, then feeds the embedding through the Transformer's Encoder module. The figure below shows the Transformer architecture; the green box marks the parts ViTransformer reuses, and the red box marks where ViTransformer differs from the Transformer.
ViTransformer replaces the red-boxed part with the PatchEmbedding + PositionEmbedding shown in the following figure.
- PatchEmbedding
Following the Transformer's use on text, ViTransformer first splits the image into patches of equal size, then treats each patch, in order, as a token; the flattened patches form the input sequence. The splitting is implemented with a convolution: the kernel size and stride control the patch size and the number of patches.
class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, in_shape=[224,224], embed_size=768):
        super().__init__()
        self.n_patch = (in_shape[0] // patch_size) * (in_shape[1] // patch_size)  # n_patch = 196
        self.projection = nn.Conv2d(in_channels, embed_size, kernel_size=patch_size, stride=patch_size)
    def forward(self, x):
        x = self.projection(x)   # convolution: [1,3,224,224] -> [1,768,14,14]
        x = x.flatten(2)         # [1,768,196]
        x = x.transpose(1, 2)    # [1,196,768]
        return x
n_patch is the number of patches the image is split into, and projection performs the convolutional splitting. With the code above, a 3 * 224 * 224 image becomes a 196 * 768 matrix, where 196 is the sequence length n_patch and 768 is embed_size.
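The same transformation can be traced step by step outside the class, using the default sizes above:

```python
import torch
import torch.nn as nn

# in_channels=3, embed_size=768, patch_size=16, as in PatchEmbedding's defaults
projection = nn.Conv2d(3, 768, kernel_size=16, stride=16)
image = torch.randn(1, 3, 224, 224)          # one RGB image
patches = projection(image)                  # (1, 768, 14, 14): a 14x14 patch grid
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens
```

Because kernel size and stride both equal the patch size, each output position of the convolution sees exactly one non-overlapping 16x16 patch.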
- PositionEmbedding
PositionEmbedding is a matrix of shape (n_patch+1) * embed_size; why it is n_patch + 1 is explained below.
PositionEmbedding=nn.Parameter(torch.zeros(1, n_patch + 1, embed_size))
- BOS token
BOS stands for Beginning of Sequence. When preparing text corpora for a Transformer, a placeholder token is added at the start of a sentence, sometimes also at the end, or between the sentences of a pair; in text training each of these placeholders has its specific purpose. ViTransformer prepends such a placeholder to the patch sequence, and after the sequence passes through the Encoder, the vector at that position is used as the image's feature for computing the prediction loss, as shown in the figure below:
The bos below is prepended as the first position of the PatchEmbedding output, so that the PatchEmbedding and PositionEmbedding have matching sizes; adding the two gives the image's embedding.
bos=nn.Parameter(torch.zeros(1, 1, embed_size))
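Putting the pieces together, prepending the BOS token makes the patch sequence line up with the n_patch+1 position embedding. A minimal sketch with dummy tensors (in the real model bos and the position embedding are learnable parameters):

```python
import torch

n_patch, embed_size = 196, 768
patch_embed = torch.randn(1, n_patch, embed_size)    # output of PatchEmbedding
bos = torch.zeros(1, 1, embed_size)                  # placeholder BOS token
pos_embed = torch.zeros(1, n_patch + 1, embed_size)  # position embedding
sequence = torch.cat([bos, patch_embed], dim=1)      # (1, 197, 768)
embedding = sequence + pos_embed                     # final image embedding
```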
The following is complete code that builds a ViTransformer model with scient and trains it on a dataset:
import torch
import torchvision.transforms as tt
from torchvision.datasets import ImageFolder
from scient.neuralnet import transformer, fit

data_path = 'imagewoof2-160'

#%% data
train_tfms = tt.Compose([tt.RandomCrop(160, padding=4, padding_mode='reflect'),
                         tt.RandomHorizontalFlip(),
                         tt.ToTensor(),
                         tt.Normalize(mean=(0.4914, 0.4822, 0.4465),
                                      std=(0.2023, 0.1994, 0.2010),
                                      inplace=True)])
valid_tfms = tt.Compose([tt.Resize([160,160]),
                         tt.ToTensor(),
                         tt.Normalize(mean=(0.4914, 0.4822, 0.4465),
                                      std=(0.2023, 0.1994, 0.2010))])
# create ImageFolder datasets
data_train = ImageFolder(data_path+'/train', train_tfms)
data_eval = ImageFolder(data_path+'/val', valid_tfms)
# batch size
batch_size = 32
# data loaders for the training and validation sets
train_loader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, shuffle=True)
eval_loader = torch.utils.data.DataLoader(data_eval, batch_size=batch_size, shuffle=False)

#%% model
model = transformer.ViTransformer(n_class=10, patch_size=16, in_shape=(160,160), embed_size=768)

#%% training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
max_lr = 0.0001
weight_decay = 1e-4
n_iter = 10
optimizer = torch.optim.Adam(model.parameters(), max_lr, weight_decay=weight_decay)
# set up one-cycle learning rate scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=n_iter,
                                                steps_per_epoch=len(train_loader))
loss_func = torch.nn.CrossEntropyLoss()

def perform_func(y_hat, y):  # accuracy on the evaluation set
    y_hat, y = torch.cat(y_hat), torch.cat(y)
    _, y_hat = y_hat.max(dim=1)
    return round((y_hat == y).sum().item() / len(y), 5)

def grad_func(p):  # gradient clipping
    torch.nn.utils.clip_grad_value_(p, 0.1)

model = fit.set(model, optimizer=optimizer, scheduler=scheduler, grad_func=grad_func,
                loss_func=loss_func, perform_func=perform_func, n_iter=n_iter, device=device)
model.fit(train_loader, eval_loader, mode=('input','target'))
Dataset download:
imagewoof dataset download
The training log shows that after only 10 iterations, accuracy on the evaluation set rises from 0.15169 to 0.41511, demonstrating ViTransformer's strong ability to learn from images.
train iter 0: avg_batch_loss=2.26922 perform=0.15711: 100%|██████████| 283/283 [06:01<00:00, 1.28s/it]
eval iter 0: avg_batch_loss=2.29831 perform=0.15169: 100%|██████████| 123/123 [01:02<00:00, 1.96it/s]
train iter 1: avg_batch_loss=2.21478 perform=0.18072: 100%|██████████| 283/283 [04:19<00:00, 1.09it/s]
eval iter 1: avg_batch_loss=2.30398 perform=0.16899: 100%|██████████| 123/123 [01:00<00:00, 2.03it/s]
train iter 2: avg_batch_loss=2.09721 perform=0.22504: 100%|██████████| 283/283 [04:16<00:00, 1.10it/s]
eval iter 2: avg_batch_loss=2.35796 perform=0.16772: 100%|██████████| 123/123 [01:01<00:00, 2.01it/s]
train iter 3: avg_batch_loss=1.98703 perform=0.25939: 100%|██████████| 283/283 [04:43<00:00, 1.00s/it]
eval iter 3: avg_batch_loss=2.01205 perform=0.27895: 100%|██████████| 123/123 [01:08<00:00, 1.80it/s]
train iter 4: avg_batch_loss=1.89123 perform=0.30603: 100%|██████████| 283/283 [04:45<00:00, 1.01s/it]
eval iter 4: avg_batch_loss=1.96699 perform=0.28607: 100%|██████████| 123/123 [01:02<00:00, 1.96it/s]
train iter 5: avg_batch_loss=1.82591 perform=0.32565: 100%|██████████| 283/283 [04:16<00:00, 1.10it/s]
eval iter 5: avg_batch_loss=1.94186 perform=0.29294: 100%|██████████| 123/123 [00:59<00:00, 2.07it/s]
train iter 6: avg_batch_loss=1.74809 perform=0.36332: 100%|██████████| 283/283 [04:19<00:00, 1.09it/s]
eval iter 6: avg_batch_loss=1.75617 perform=0.36777: 100%|██████████| 123/123 [01:03<00:00, 1.94it/s]
train iter 7: avg_batch_loss=1.64716 perform=0.40698: 100%|██████████| 283/283 [04:20<00:00, 1.09it/s]
eval iter 7: avg_batch_loss=1.74583 perform=0.38763: 100%|██████████| 123/123 [01:00<00:00, 2.04it/s]
train iter 8: avg_batch_loss=1.55075 perform=0.44110: 100%|██████████| 283/283 [04:32<00:00, 1.04it/s]
eval iter 8: avg_batch_loss=1.67609 perform=0.40773: 100%|██████████| 123/123 [01:00<00:00, 2.03it/s]
train iter 9: avg_batch_loss=1.49680 perform=0.46880: 100%|██████████| 283/283 [04:17<00:00, 1.10it/s]
eval iter 9: avg_batch_loss=1.65652 perform=0.41511: 100%|██████████| 123/123 [01:01<00:00, 2.01it/s]