Natural Language Processing with PyTorch: Chapter 1

Latest recommended article published on 2024-03-30 13:49:27

Ankely


Category: Natural Language Processing. Tags: pytorch, natural language processing

Article link: https://blog.csdn.net/weixin_45644640/article/details/107681051


Chapter 1. Introduction

- Learning objectives:
The Supervised Learning Paradigm
Observation and Target Encoding
One-Hot Representation
TF Representation
TF-IDF Representation
Target Encoding
Computational Graphs
PyTorch Basics

Learning objectives:

Explain what a computational graph is
Know the basics of PyTorch

The Supervised Learning Paradigm

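Figure 1-1. The supervised learning paradigm.
The figure above shows the supervised learning paradigm. The part to pay attention to is backpropagation, the process that iteratively updates the parameters. Each iteration has a forward pass and a backward pass: the forward pass computes the predictions and the loss, and the backward pass propagates the gradient of the loss so that the parameters can be updated.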
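The forward pass, backward pass, and parameter update can be sketched as a short training loop. This is only an illustration: the toy data (targets following y = 2x + 1), the choice of torch.nn.Linear as the model, the SGD optimizer, and the learning rate are all assumptions made here, not part of the original text.

```python
import torch

torch.manual_seed(0)

# Toy data, assumed for illustration: targets follow y = 2x + 1
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 1

model = torch.nn.Linear(1, 1)                      # parameters: one weight, one bias
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()          # clear gradients left over from the previous step
    prediction = model(x)          # forward pass: compute predictions
    loss = loss_fn(prediction, y)  # forward pass: compute the loss
    loss.backward()                # backward pass: compute gradients of the loss
    optimizer.step()               # update the parameters using those gradients
    losses.append(loss.item())

print(losses[0], losses[-1])       # the loss should shrink over the iterations
```

Each pass through the loop is one iteration of the paradigm in the figure: predict, measure the loss, propagate gradients, update.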

Observation and Target Encoding

The observations below are text. Applying the process in Figure 1-1 to them gives the following.
[Image]

One-Hot Representation

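A one-hot representation starts from a zero vector whose length is the size of the vocabulary. If a word occurs in the sentence, the corresponding position in the vector is set to 1; all other positions stay 0.
Consider the following two sentences:
Time flies like an arrow.
Fruit flies like a banana.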

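We tokenize the sentences, drop punctuation, and lowercase every word to build the vocabulary {time, fruit, flies, like, a, an, arrow, banana}. Writing 1[w] for the one-hot representation of a token/word w, we get the following vectors:

[Image]
For example, the phrase 'like a banana' gives a 3x8 matrix in which each row is an 8-dimensional one-hot vector. The rows can also be 'collapsed' into a single binary encoding, which for this phrase is [0,0,0,1,1,0,0,1].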
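A minimal sketch of the one-hot matrix and its collapsed form for 'like a banana', using NumPy. The helper name one_hot and the word_to_index mapping are introduced here for illustration; the vocabulary order follows the set listed in the text.

```python
import numpy as np

vocab = ["time", "fruit", "flies", "like", "a", "an", "arrow", "banana"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a word (hypothetical helper)."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

phrase = "like a banana".split()
matrix = np.stack([one_hot(w) for w in phrase])  # one row per word
collapsed = matrix.max(axis=0)                   # element-wise OR over the rows

print(matrix.shape)        # (3, 8)
print(collapsed.tolist())  # [0, 0, 0, 1, 1, 0, 0, 1]
```

Taking the maximum over rows is the same as a logical OR because every entry is 0 or 1.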

TF Representation

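The sentence 'Fruit flies like time flies a fruit' has the following TF representation: [1,2,2,1,1,0,0,0]. Each entry is the number of times the corresponding vocabulary word occurs in the sentence (or corpus). TF(w) denotes the TF of a word w.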
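The TF vector for that sentence can be checked by counting words directly. This sketch assumes the same vocabulary order as in the one-hot section and simple whitespace tokenization.

```python
from collections import Counter

vocab = ["time", "fruit", "flies", "like", "a", "an", "arrow", "banana"]
sentence = "Fruit flies like time flies a fruit"

# Lowercase, split on whitespace, and count each token
counts = Counter(sentence.lower().split())

# Counter returns 0 for words that never occur
tf = [counts[word] for word in vocab]
print(tf)  # [1, 2, 2, 1, 1, 0, 0, 0]
```

The words "an", "arrow", and "banana" do not occur in the sentence, so the last three entries are 0.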

Example 1-1. Generating a 'collapsed' one-hot or binary representation using scikit-learn

# CountVectorizer counts how often each word occurs in the training text
from sklearn.feature_extraction.text import CountVectorizer
# seaborn is a plotting library; it accepts pandas DataFrames or NumPy arrays as input
import seaborn as sns

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
# binary=True records only whether a word occurs (1) or not (0)
one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
# Column labels; the default tokenizer drops one-letter tokens such as "a"
vocab = one_hot_vectorizer.get_feature_names_out()
# Draw the matrix as a heatmap
sns.heatmap(one_hot, annot=True, cbar=False, xticklabels=vocab, yticklabels=['Sentence 1', 'Sentence 2'])


TF-IDF Representation

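IDF penalizes common tokens and rewards rare tokens in the vector representation. The IDF(w) of a token w is defined with respect to a corpus as:
$\mathrm{IDF}(w) = \log \frac{N}{n_w}$
where ${n_w}$ is the number of documents containing w and N is the total number of documents. The TF-IDF score is the product TF(w) * IDF(w).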
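A hand-rolled sketch of TF-IDF, assuming the common definition IDF(w) = log(N / n_w) with the natural logarithm. The two-sentence corpus and the helper names idf and tf are illustrative. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will not match this sketch.

```python
import math

corpus = ["time flies flies like an arrow",
          "fruit flies like a banana"]
documents = [doc.split() for doc in corpus]
N = len(documents)                                   # total number of documents
vocab = sorted({word for doc in documents for word in doc})

def idf(word):
    """log(N / n_w), where n_w is the number of documents containing the word."""
    n_w = sum(1 for doc in documents if word in doc)
    return math.log(N / n_w)

def tf(word, doc):
    """Raw count of the word in one tokenized document."""
    return doc.count(word)

tfidf = [[tf(w, doc) * idf(w) for w in vocab] for doc in documents]

for word in vocab:
    print(word, round(idf(word), 3))
```

Words that occur in every document ("flies", "like") get an IDF of 0, so they contribute nothing to the TF-IDF vector, while words unique to one document get log(2).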

Example 1-2. Generating TF-IDF representation using scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns

# corpus is the list of sentences defined in Example 1-1
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
# Column labels for the heatmap
vocab = tfidf_vectorizer.get_feature_names_out()
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab, yticklabels=['Sentence 1', 'Sentence 2'])


Target Encoding

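Given a text, the model predicts one label from a fixed set. A common approach is to assign each label a unique index, but this becomes a problem when the number of output labels is very large.
Given a text, the model predicts a numeric value. The numeric target can be encoded by binning it into categorical 'buckets' (0-18, 19-25, 25-30) and then treated as an ordinal classification problem.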
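Both encodings can be sketched in a few lines. The label names and the helper age_to_bin are made up for illustration. The bin edges follow the buckets in the text, with the assumption that the overlapping value 25 belongs to the middle bucket.

```python
import bisect

# Case 1: categorical labels -> unique integer indices
labels = ["positive", "negative", "neutral", "negative"]   # hypothetical labels
label_to_index = {label: i for i, label in enumerate(sorted(set(labels)))}
encoded = [label_to_index[label] for label in labels]
print(label_to_index)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)         # [2, 0, 1, 0]

# Case 2: numeric target -> ordered bins 0-18, 19-25, 26-30
upper_bounds = [18, 25, 30]   # inclusive upper edge of each bin (assumed)

def age_to_bin(age):
    """Return the index of the first bin whose upper edge is >= age."""
    return bisect.bisect_left(upper_bounds, age)

print([age_to_bin(age) for age in (5, 18, 19, 25, 26, 30)])  # [0, 0, 1, 1, 2, 2]
```

Because the bin indices keep the order of the underlying values, the binned target can be treated as an ordinal class.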

Computational Graphs

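A computational graph is an abstraction that models mathematical expressions.
In deep learning, implementations of the computational graph do additional bookkeeping in order to support automatic differentiation.
How is a computational graph built? Taking y = wx + b as an example, we get the graph below.
[Image]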
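A small sketch of the graph for y = wx + b in PyTorch. The concrete values of x, w, and b are assumptions chosen so the result is easy to check by hand; the intermediate name product is introduced only to make the multiplication node visible.

```python
import torch

x = torch.tensor(3.0)                       # input, no gradient needed
w = torch.tensor(2.0, requires_grad=True)   # parameter
b = torch.tensor(1.0, requires_grad=True)   # parameter

product = w * x      # multiplication node
y = product + b      # addition node

print(y.item())      # 2 * 3 + 1 = 7.0
print(y.grad_fn)     # the recorded operation that produced y

y.backward()         # walk the graph backward from y
print(w.grad.item()) # dy/dw = x = 3.0
print(b.grad.item()) # dy/db = 1.0
```

The grad_fn attribute is the bookkeeping that links each result to the operation that created it, which is what lets backward() compute the derivatives.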

PyTorch Basics

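PyTorch is an optimized tensor manipulation library. A tensor of order zero is a single number, or scalar; a tensor of order one is an array of numbers, or vector; a tensor of order two is an array of vectors, or matrix.

This section covers the following:

Creating tensors
Operations on tensors
Indexing, slicing, and joining tensors
Computing gradients with tensors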

Create Tensors

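Create a tensor of a given size with torch.Tensor (the values are whatever is in uninitialized memory, so they look random)
Initialize randomly from the uniform distribution on [0, 1) or from the standard normal distribution
Fill every entry with the same scalar, for example all zeros or all ones, or use the fill_() method to fill with another value. Note: any PyTorch method with a trailing underscore (_) is an in-place operation, meaning it modifies the existing object instead of creating a new one
Create a tensor from a list
Create a tensor from a NumPy array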

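import numpy as np
import torch

# Helper function describe(x): summarizes a tensor x (its type, shape, and values)
def describe(x):
    print('Type: {}'.format(x.type()))
    print('Shape/size: {}'.format(x.shape))
    print('Values: {}'.format(x))

# 1. Create a tensor of a given size (values are uninitialized memory)
torch.Tensor(2, 3)

# 2. Create tensors from the uniform or standard normal distribution
x = torch.rand(2, 3)
y = torch.randn(2, 3)

# 3. Create a tensor of all zeros
zeroTe = torch.zeros(2, 3)
# Create a tensor of all ones
oneTe = torch.ones(2, 3)
# Use fill_() to set every entry to 5; this is in-place, so oneTe itself is now all 5s
fiveTe = oneTe.fill_(5)

# Create a tensor from a list
torch.Tensor([[1, 2, 3], [4, 5, 6]])

# Create a tensor from a NumPy array; the result is a DoubleTensor because the array is float64
npy = np.random.randn(2, 3)
torch.from_numpy(npy)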
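Two details from this section are easy to check in code: the difference between an in-place method and its out-of-place counterpart, and the behaviour of torch.from_numpy. This is a small sketch with made-up values; the variable names are illustrative.

```python
import numpy as np
import torch

x = torch.ones(2, 3)
y = x.add(1)         # out-of-place: returns a new tensor, x is unchanged
result = x.add_(1)   # in-place: modifies x and returns x itself

print(result is x)   # True
print(y is x)        # False

npy = np.zeros((2, 3))       # NumPy arrays of floats are float64 by default
t = torch.from_numpy(npy)    # the tensor shares memory with the array
npy[0, 0] = 5.0              # so this change is visible through t

print(t.dtype)               # torch.float64
print(t[0, 0].item())        # 5.0
```

Because from_numpy shares memory with the source array, an in-place change on either side shows up on the other.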

Tensor Types and Size

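The common types are FloatTensor, DoubleTensor, and LongTensor. There are three ways to set or change the type:

Use a type-specific tensor constructor
Pass the dtype argument to torch.tensor()
Call a casting method such as float()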

# 1. Use a type-specific constructor: FloatTensor, LongTensor
x = torch.FloatTensor([[1,2,3],[4,5,6]])
describe(x)

y = torch.LongTensor([[1,2,3],[4,5,6]])
describe(y)

# 2. Use torch.tensor with the dtype argument
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.int64)
describe(x)

# 3. Use the float() method
y = x.float()
describe(y)
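The three approaches can be compared by inspecting the dtype attribute directly. This sketch also shows the general-purpose to() method as a fourth option, which is an addition not covered in the text above.

```python
import torch

a = torch.FloatTensor([[1, 2, 3]])                 # type-specific constructor
b = torch.tensor([[1, 2, 3]], dtype=torch.int64)   # dtype argument
c = b.float()                                      # casting method, returns a new tensor
d = b.to(torch.float64)                            # general-purpose cast (addition)

print(a.dtype)  # torch.float32
print(b.dtype)  # torch.int64
print(c.dtype)  # torch.float32
print(d.dtype)  # torch.float64
```

Casting does not change the original tensor: b is still int64 after c and d are created.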

Tensor Operations

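Tensor operations include:

Basic math: the +, -, *, and / operators and the equivalent functions such as .add()
Operations along a chosen dimension, and transpose
Indexing, slicing, and joining
Efficient linear algebra operations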

# 1. Math operations
x = torch.Tensor([[1,2,3],[4,5,6]])
describe(torch.add(x,x))
describe(x+x)

# 2. Operations along a dimension, and transpose
x = torch.arange(6)
# view() changes the shape of the tensor
x = x.view(2,3)
# Sum along one dimension
describe(torch.sum(x,dim=0))
describe(torch.sum(x,dim=1))
# Transpose
describe(torch.transpose(x,0,1))

# 3. Indexing, slicing, and joining
# Slicing and indexing
x = torch.arange(6).view(2,3)
describe(x[:1,:2])
describe(x[0,1])
# Non-contiguous indexing with index_select
indices = torch.LongTensor([0,2])
describe(torch.index_select(x,dim=1,index=indices))

indices = torch.LongTensor([0,0])
describe(torch.index_select(x,dim=0,index=indices))

row_indices = torch.arange(2).long()
col_indices = torch.LongTensor([0,1])
describe(x[row_indices,col_indices])
# Concatenation
x = torch.arange(6).view(2,3)
describe(torch.cat([x,x],dim=0))
describe(torch.cat([x,x],dim=1))

# 4. Linear algebra
x1 = torch.arange(6).view(2,3)
# Convert to float so that both operands of mm have the same type
x1 = x1.float()
x2 = torch.ones(3,2)
x2[:,1] += 1
describe(x2)
describe(torch.mm(x1,x2))
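The matrix product in the linear algebra example can be worked out by hand: x1 is [[0,1,2],[3,4,5]] and every row of x2 is [1,2], so each output row is the row sum of x1 followed by twice that sum. A short check, which also shows the @ operator as an equivalent spelling of torch.mm for 2D tensors:

```python
import torch

x1 = torch.arange(6).view(2, 3).float()   # [[0, 1, 2], [3, 4, 5]]
x2 = torch.ones(3, 2)
x2[:, 1] += 1                             # every row becomes [1, 2]

product = torch.mm(x1, x2)
expected = torch.tensor([[3.0, 6.0],      # 0+1+2 = 3,  2*3  = 6
                         [12.0, 24.0]])   # 3+4+5 = 12, 2*12 = 24

print(product)
print(torch.equal(product, expected))     # True
print(torch.equal(product, x1 @ x2))      # True
```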

Tensors and Computational Graphs

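The requires_grad flag indicates whether gradients should be computed for a tensor. When it is set to True, bookkeeping operations are enabled that track the tensor's gradient as well as its gradient function. The backward pass is started by calling the backward() method on a tensor, typically the value produced by the loss function.

A gradient is the slope of a function's output with respect to its input. Every parameter in the computational graph has a gradient, which can be thought of as that parameter's contribution to the error signal. In PyTorch, the gradient of a node is accessed through the .grad member variable, and optimizers use .grad to update the parameter values.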

x = torch.ones(2,2,requires_grad = True)
y = (x+2)*(x+5)+3
z = y.mean()
z.backward()
print(x.grad)
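The printed gradient can be derived by hand. Expanding gives y = x^2 + 7x + 13, so dy/dx = 2x + 7, and z is the mean over 4 elements, so dz/dx = (2x + 7) / 4. With every entry of x equal to 1 this is 9 / 4 = 2.25. A short check of that derivation (the name manual is introduced here for illustration):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = (x + 2) * (x + 5) + 3     # each entry: 3 * 6 + 3 = 21
z = y.mean()
z.backward()

# Hand-derived gradient: (2x + 7) / number of elements
manual = (2 * x.detach() + 7) / x.numel()

print(z.item())                        # 21.0
print(x.grad)                          # every entry is 2.25
print(torch.allclose(x.grad, manual))  # True
```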


Exercise

1 Create a 2D tensor and then add a dimension of size 1 inserted at dimension 0.

x= torch.arange(4).view(2,2)
describe(x)
x = x.unsqueeze(0)
describe(x)

2 Remove the extra dimension you just added to the previous tensor.

x= x.squeeze(0)
describe(x)

3 Create a random tensor of shape 5x3 in the interval [3, 7)

x= 3+torch.rand(5,3)*(7-3)


4 Create a tensor with values from a normal distribution (mean=0, std=1).

x = torch.rand(2,2)
x.normal_()

5 Retrieve the indexes of all the nonzero elements in the tensor torch.Tensor([1, 1, 1, 0, 1]).

x= torch.Tensor([1,1,1,0,1])
torch.nonzero(x)

6 Create a random tensor of size (3,1) and then horizontally stack 4 copies together.

# Method 1: concatenate four copies along dimension 1
x = torch.rand(3,1)
x = torch.cat([x]*4,dim=1)
describe(x)

# Method 2: expand the size-1 dimension to 4
a = torch.rand(3, 1)
a = a.expand(3, 4)
describe(a)
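The two methods give the same values but differ in memory use: torch.cat allocates a new tensor, while expand returns a view over the original data. A sketch that makes the difference visible (variable names are illustrative):

```python
import torch

a = torch.rand(3, 1)
stacked = torch.cat([a] * 4, dim=1)   # copies the data into a new (3, 4) tensor
expanded = a.expand(3, 4)             # view: no data is copied

print(torch.equal(stacked, expanded))         # True: same values
print(expanded.data_ptr() == a.data_ptr())    # True: shares storage with a
print(stacked.data_ptr() == a.data_ptr())     # False: separate storage
print(expanded.stride())                      # stride 0 along the expanded dimension
```

Since the expanded tensor reuses the same memory for every column, writing to it in place is best avoided; call clone() first if an independent copy is needed.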

7 Return the batch matrix-matrix product of two 3-dimensional matrices (a=torch.rand(3,4,5), b=torch.rand(3,5,4)).

a = torch.rand(3,4,5)
b = torch.rand(3,5,4)
# torch.bmm takes two 3D tensors of sizes (b, n, m) and (b, m, p); the result has size (b, n, p)
torch.bmm(a,b)

8 Return the batch matrix-matrix product of a 3D matrix and a 2D matrix (a=torch.rand(3,4,5), b=torch.rand(5,4)).

a = torch.rand(3, 4, 5) 
b = torch.rand(5, 4) 
torch.bmm(a, b.unsqueeze(0).expand(a.size(0), * b.size()))


b = torch.rand(5, 4) 
bl = b.unsqueeze(0)
describe(bl)


describe(bl.expand(a.size(0),*b.size()))
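As an alternative to expanding b by hand, torch.matmul broadcasts a 2D matrix across the batch dimension of a 3D tensor. The sketch below compares the two approaches; it is offered as a cross-check, not as the intended solution to the exercise.

```python
import torch

a = torch.rand(3, 4, 5)
b = torch.rand(5, 4)

# Approach from the exercise: add a batch dimension to b, then expand it to match a
via_bmm = torch.bmm(a, b.unsqueeze(0).expand(a.size(0), *b.size()))

# Alternative: matmul broadcasts b over the batch dimension
via_matmul = torch.matmul(a, b)

print(via_bmm.shape)                                     # torch.Size([3, 4, 4])
print(torch.allclose(via_bmm, via_matmul, atol=1e-6))    # True
```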

