part2


qq_44855257

Article link: https://blog.csdn.net/qq_44855257/article/details/104301284

Text preprocessing

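Text preprocessing has four steps: read the text, tokenize it, build a vocabulary that maps each token to an index, and convert the text from token sequences to index sequences.
Tokenization turns each sentence into a sequence of tokens (words or characters).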

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        raise ValueError('unknown token type: ' + token)

# `lines` is the list of text lines produced by the read step (not shown here)
tokens = tokenize(lines)
tokens[0:2]

Build a vocabulary that maps every token to a unique index.

import collections

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # maps each token to its frequency
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        self.idx_to_token += [token for token, freq in self.token_freqs
                        if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # dict-like Counter: token -> number of occurrences

Convert each sentence from a sequence of words to a sequence of indices.

vocab = Vocab(tokens)
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
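
To see the four steps end to end without the helper classes, here is a minimal sketch on a two-line toy corpus. The corpus and the variable names are made up for illustration, and index 0 is assumed to be reserved for unknown tokens, as in the `Vocab` class.

```python
import collections

# Toy corpus standing in for the text read in step 1 (hypothetical data).
lines = ["the time machine", "the time traveller"]

# Step 2: tokenize each line on single spaces.
tokens = [line.split(' ') for line in lines]

# Step 3: count tokens and assign indices; index 0 is reserved for unknown tokens.
counter = collections.Counter(tk for line in tokens for tk in line)
idx_to_token = ['<unk>'] + list(counter)
token_to_idx = {tk: i for i, tk in enumerate(idx_to_token)}

# Step 4: convert token sequences to index sequences.
indices = [[token_to_idx.get(tk, 0) for tk in line] for line in tokens]
print(indices)  # [[1, 2, 3], [1, 2, 4]]
```

Any token that is not in the vocabulary falls back to index 0 through `dict.get`, which mirrors what `Vocab.__getitem__` does with `self.unk`.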

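Tokenization can also be done with spaCy or NLTK, which avoids some of the mistakes that naive splitting on spaces makes, for example with punctuation and contractions.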

Language models

P(w1,w2,w3,w4)=P(w1)P(w2∣w1)P(w3∣w1,w2)P(w4∣w1,w2,w3)
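The conditional probability of w2 given w1 is estimated from counts:
P(w2∣w1)=n(w1,w2)/n(w1)
where n(w1) is the number of texts whose first word is w1, and n(w1,w2) is the number of texts whose first word is w1 and whose second word is w2.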
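
The count-based estimate can be sketched in a few lines. The toy corpus and the helper name `cond_prob` are assumptions made for this example; the counting follows the definition of n(w1) and n(w1,w2) given above.

```python
import collections

# Hypothetical toy corpus: each inner list is one tokenized text.
texts = [
    ["the", "time", "machine"],
    ["the", "time", "traveller"],
    ["the", "machine", "stopped"],
    ["a", "time", "machine"],
]

# n(w1): number of texts whose first word is w1.
n_first = collections.Counter(t[0] for t in texts if len(t) >= 1)
# n(w1, w2): number of texts whose first two words are (w1, w2).
n_first_two = collections.Counter((t[0], t[1]) for t in texts if len(t) >= 2)

def cond_prob(w1, w2):
    """Estimate P(w2 | w1) = n(w1, w2) / n(w1); returns 0.0 if w1 was never seen."""
    if n_first[w1] == 0:
        return 0.0
    return n_first_two[(w1, w2)] / n_first[w1]

print(cond_prob("the", "time"))  # 2 of the 3 texts starting with "the" continue with "time"
```
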
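The model can be simplified with the Markov assumption: a word depends only on the n words before it, which is called an n-th order Markov chain. An n-gram model is based on a Markov chain of order n-1, so for n=1, 2, 3 (unigram, bigram, trigram) the factorization becomes, respectively: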
P(w1,w2,w3,w4)=P(w1)P(w2)P(w3)P(w4)
P(w1,w2,w3,w4)=P(w1)P(w2∣w1)P(w3∣w2)P(w4∣w3)
P(w1,w2,w3,w4)=P(w1)P(w2∣w1)P(w3∣w1,w2)P(w4∣w2,w3)
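
The three factorizations follow one rule: each word is conditioned on at most the previous n-1 words. The sketch below prints the factorization for a given n; the function name `ngram_factorization` is made up for this illustration.

```python
def ngram_factorization(words, n):
    """Return the n-gram factorization of P(words) as a string.

    Each word is conditioned on at most the previous n - 1 words.
    """
    factors = []
    for t, w in enumerate(words):
        context = words[max(0, t - (n - 1)):t]
        if context:
            factors.append(f"P({w}|{','.join(context)})")
        else:
            factors.append(f"P({w})")
    return ''.join(factors)

words = ['w1', 'w2', 'w3', 'w4']
for n in (1, 2, 3):
    print(n, ngram_factorization(words, n))
```

For n = 1, 2, 3 this reproduces the three lines above, in the same order.
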
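Random sampling of sequence data: each example is a subsequence taken from a random position, so two adjacent minibatches are not necessarily adjacent in the original sequence.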

import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
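    # Subtract 1 because for a sequence of length n, X can cover at most its first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: number of non-overlapping examples
    example_indices = [i * num_steps for i in range(num_examples)]  # index in corpus_indices of each example's first character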
    random.shuffle(example_indices)

    def _data(i):
        # Return the subsequence of length num_steps that starts at i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    for i in range(0, num_examples, batch_size):
        # Pick batch_size random examples each time
        batch_indices = example_indices[i: i + batch_size]  # index of the first character of each example in this batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)
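
To check the index arithmetic without a GPU or tensors, here is a list-based sketch of the same idea. The name `random_batches` and the toy sequence are assumptions for this example; it is a simplified stand-in, not the generator above.

```python
import random

def random_batches(corpus_indices, batch_size, num_steps):
    """List-based sketch of random sampling (no tensors, no device handling)."""
    num_examples = (len(corpus_indices) - 1) // num_steps
    starts = [i * num_steps for i in range(num_examples)]
    random.shuffle(starts)
    for i in range(0, num_examples, batch_size):
        batch_starts = starts[i:i + batch_size]
        X = [corpus_indices[j:j + num_steps] for j in batch_starts]
        # Y is X shifted one position to the right: the next-character targets.
        Y = [corpus_indices[j + 1:j + num_steps + 1] for j in batch_starts]
        yield X, Y

random.seed(0)  # fixed seed only to make the illustration repeatable
seq = list(range(30))
batches = list(random_batches(seq, batch_size=2, num_steps=6))
for X, Y in batches:
    print('X:', X)
    print('Y:', Y)
```

With 30 items and `num_steps=6` there are (30 - 1) // 6 = 4 examples, so two minibatches of two examples each.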

Consecutive sampling: two adjacent minibatches are also adjacent in position in the original sequence.

def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the sequence that is kept
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y
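
The adjacency property is easy to verify with plain lists. The following is a sketch under the same assumptions as before (hypothetical name `consecutive_batches`, toy sequence); it mirrors the reshape into `batch_size` contiguous rows.

```python
def consecutive_batches(corpus_indices, batch_size, num_steps):
    """List-based sketch of consecutive sampling (no tensors)."""
    row_len = len(corpus_indices) // batch_size
    # Split the sequence into batch_size contiguous rows of equal length.
    rows = [corpus_indices[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    batch_num = (row_len - 1) // num_steps
    for b in range(batch_num):
        i = b * num_steps
        X = [row[i:i + num_steps] for row in rows]
        Y = [row[i + 1:i + num_steps + 1] for row in rows]
        yield X, Y

seq = list(range(30))
batches = list(consecutive_batches(seq, batch_size=2, num_steps=6))
for X, Y in batches:
    print('X:', X)
    print('Y:', Y)
```

Row 0 of the second minibatch starts at 6, right after row 0 of the first minibatch ends at 5. This is why a hidden state can be carried over between minibatches with this sampling scheme.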

Recurrent neural networks

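Each character is represented as a one-hot vector: with a vocabulary of size N, the vector has length N, with a 1 at the character's index and 0 everywhere else.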

def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i, 0]] = 1
    return result

vocab_size = 5  # illustrative value only; in practice this is the size of the vocabulary
x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))

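To guard against exploding gradients, clip the gradient: concatenate the gradients of all parameters into one vector g and choose a threshold θ. The clipped gradient min(θ/‖g‖, 1)·g has an L2 norm of at most θ.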

def grad_clipping(params, theta, device):
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)
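
The effect of clipping can be checked on plain numbers. This is a sketch with lists instead of tensors; `clip_gradients` and the sample gradients are made up for the illustration.

```python
import math

def clip_gradients(grads, theta):
    """Rescale a list of gradient vectors so their joint L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > theta:
        scale = theta / norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, norm

grads = [[3.0, 4.0], [12.0]]  # joint L2 norm is sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_gradients(grads, theta=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

The norm is computed over all parameters jointly, not per parameter, so the direction of the overall gradient is preserved and only its length shrinks.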

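Correction:
A recurrent neural network handles sequences of different lengths by applying the same set of parameters over and over, so the number of parameters does not depend on the length of the input sequence.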
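
A toy scalar recurrence makes this concrete. The parameter values and the name `run_rnn` are assumptions for this sketch; a real RNN uses weight matrices, but the reuse pattern is the same.

```python
import math

# Hypothetical scalar parameters; a real RNN uses weight matrices instead.
params = {'w_xh': 0.5, 'w_hh': 0.9, 'b_h': 0.0}

def run_rnn(inputs, params, h=0.0):
    """Apply h_t = tanh(w_xh * x_t + w_hh * h_{t-1} + b_h) to every input in turn."""
    states = []
    for x in inputs:
        h = math.tanh(params['w_xh'] * x + params['w_hh'] * h + params['b_h'])
        states.append(h)
    return states

short = run_rnn([1.0, 0.0, -1.0], params)  # sequence of length 3
long = run_rnn([1.0] * 10, params)         # sequence of length 10
print(len(params), len(short), len(long))  # 3 3 10
```

The same three parameters process both sequences; only the number of loop iterations changes.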
