Day 1: Learning the Trax Library as an NLP Beginner

听话_让我抗抗

Tags: natural language processing

Article link: https://blog.csdn.net/qq_41803946/article/details/118496338

The official introduction to Trax:

Trax is an end-to-end library for deep learning that focuses on clear code and speed. It is actively used and maintained in the Google Brain team.

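In short, Trax is a deep learning library that aims for speed and clear code, and it is maintained by the Google Brain team.

I recently took the NLP course from Andrew Ng's team, which is where I first ran into Trax, but I could not find good Chinese-language resources for it (maybe I just did not look hard enough).

So I am taking notes as I learn.

My current study material is the Trax repository on GitHub:

https://github.com/google/trax

and the official documentation:

https://trax-ml.readthedocs.io/en/latest/

Let's get started!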

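1. How to install Trax on Windows

Trax depends on jaxlib, and at the time of writing jaxlib officially supports only Linux and macOS. (Windows users object!)

For the sake of my fellow Windows users, I eventually found a workaround.

To install Trax on Windows, see:

https://blog.csdn.net/qq_41803946/article/details/118469298

My computer is not great, and I really did not want to run a virtual machine just to learn this library; that would be too painful.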

2. Trax's gradient tool

Suppose $f(x)=2x^{2}$.

Then its derivative is $f'(x)=4x$.

How do we get this derivative in Trax?

import trax
def f(x):
  return 2.0 * x * x

grad_f = trax.fastmath.grad(f)  # grad returns the derivative of f; grad_f is itself a function

print(type(grad_f))

print(f'grad(2x^2) at 1 = {grad_f(1.0)}')

# Output:
#<class 'function'>
#grad(2x^2) at 1 = 4.0

Quick and simple, isn't it? (Mom no longer has to worry that I can't compute derivatives.)
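
A common way to sanity-check an automatic derivative is to compare it with a finite-difference estimate. The sketch below is plain Python and does not need Trax; numerical_grad and eps are names chosen for this illustration and are not part of the Trax API.

```python
def f(x):
    return 2.0 * x * x


def numerical_grad(fn, x, eps=1e-5):
    """Central-difference estimate of fn'(x). Hypothetical helper, not a Trax function."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


approx = numerical_grad(f, 1.0)
print(f"numerical grad(2x^2) at 1 ~= {approx:.4f}")
```

For a quadratic the central difference agrees with the analytic value 4x up to floating-point rounding, so the estimate at x = 1 should match the 4.0 that grad_f returned.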

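3. NumPy in Trax

Trax is built on JAX and TensorFlow, so it can readily use handy tools from both libraries while aiming to stay fast.

Below are some basic matrix operations and an example of calling a math function.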

from trax.fastmath import numpy as fastnp
trax.fastmath.use_backend('jax')  # Choose the backend: 'jax' or 'tensorflow-numpy'.

# A hand-written matrix
matrix  = fastnp.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f'matrix = \n{matrix}')
# A vector of ones
vector = fastnp.ones(3)
print(f'vector = {vector}')
# Vector-matrix dot product
product = fastnp.dot(vector, matrix)
print(f'product = {product}')
# Call a math function
tanh = fastnp.tanh(product)
print(f'tanh(product) = {tanh}')

# Output:
#matrix = 
#[[1 2 3]
# [4 5 6]
# [7 8 9]]
#vector = [1. 1. 1.]
#product = [12. 15. 18.]
#tanh(product) = [1. 1. 1.]

This part is quite friendly: the API reads essentially like NumPy and is not much different from TensorFlow.
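
To see where the numbers in that output come from, here is the same computation in plain Python (a sketch that does not use Trax). It also shows why tanh prints as 1: the function saturates for large inputs, and the gap to 1 is far below what 32-bit floats, the JAX default, can represent.

```python
import math

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vector = [1.0, 1.0, 1.0]

# product[j] = sum_i vector[i] * matrix[i][j]; with a vector of ones these are the column sums.
product = [sum(vector[i] * matrix[i][j] for i in range(3)) for j in range(3)]
print(product)  # [12.0, 15.0, 18.0]

# tanh saturates: each value is within 1e-10 of 1.
tanh = [math.tanh(p) for p in product]
gaps = [1.0 - t for t in tanh]
print(gaps)
```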

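4. Loading and preprocessing data

Because Trax builds on TensorFlow, it can train directly on TensorFlow Datasets. Here we start by loading the data the way the official example does.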

# Two streams are created because two tasks need them: one supplies training samples, the other evaluation samples
train_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=True)()
eval_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=False)()
print(next(train_stream))  # Look at one sample.

#(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", 0)

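Next comes preprocessing, which is mainly tokenization against a vocabulary: each review in the data is converted into an array of indices, one per token, where each index is the token's position in the vocabulary.

For example, suppose the vocabulary maps words to indices like this:

word	index
I	1
happy	2
am	3
.	4

Then the sentence "I am happy" is preprocessed into [1, 3, 2].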
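
The lookup in that table can be sketched in a few lines of plain Python. This is a toy whole-word tokenizer for illustration only; the tokenize function and the vocab dictionary are made up here, and the subword vocabulary used with Trax later in this post splits text into smaller pieces than whole words.

```python
# Toy vocabulary taken from the table above.
vocab = {"I": 1, "happy": 2, "am": 3, ".": 4}


def tokenize(sentence, vocab):
    """Split on whitespace and map each word to its vocabulary index."""
    return [vocab[word] for word in sentence.split()]


ids = tokenize("I am happy", vocab)
print(ids)  # [1, 3, 2]
```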

There are a few other steps as well; see the code for details. (The en_8k.subword file can be downloaded from https://github.com/google/trax/tree/master/trax/data/testdata.)

# With the trax.data module you can build an input-processing pipeline
data_pipeline = trax.data.Serial(
    # Tokenize the data (turn text into arrays of token IDs); this step needs the vocabulary file 'en_8k.subword'
    trax.data.Tokenize(vocab_dir='./', vocab_file='en_8k.subword', keys=[0]),

    # Shuffle the examples
    trax.data.Shuffle(),
    # Drop examples longer than max_length
    trax.data.FilterByLength(max_length=2048, length_keys=[0]),
    # Batch the data; examples of similar length go into the same bucket
    trax.data.BucketByLength(boundaries=[  32, 128, 512, 2048],
                             batch_sizes=[256,  64,  16,    4, 1],
                             length_keys=[0]),
    # Add loss weights (a weight of 1 for every example in the batch)
    trax.data.AddLossWeights()
  )

# Run the pipeline on both streams
train_batches_stream = data_pipeline(train_stream)
eval_batches_stream = data_pipeline(eval_stream)

# Get one batch.
# The first array holds the tokenized reviews, the second holds each review's label (the true sentiment),
# and the third holds the loss weights (all 1).
example_batch = next(train_batches_stream)
print(example_batch)

# Check that the shapes line up
print(f'shapes = {[x.shape for x in example_batch]}')


# Output:
#(array([[3032, 2791,  136, ...,    0,    0,    0],
#       [ 428,  663, 3306, ...,    0,    0,    0],
#       [4047,    2,    4, ...,    0,    0,    0],
#       [8180,    2,   28, ...,    0,    0,    0]]), array([1, 0, 0, 0], dtype=int64), 
#        array([1., 1., 1., 1.], dtype=float32))
#
#shapes = [(4, 2048), (4,), (4,)]
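
The idea behind the boundaries and batch_sizes arguments can be sketched in plain Python. This is a simplified model written for this post, not the Trax implementation: bucket_for is a hypothetical helper, and details such as how Trax pads a batch may differ.

```python
boundaries = [32, 128, 512, 2048]
batch_sizes = [256, 64, 16, 4, 1]


def bucket_for(length):
    """Return (boundary, batch_size) of the first bucket whose boundary fits `length`."""
    for boundary, batch_size in zip(boundaries, batch_sizes):
        if length <= boundary:
            return boundary, batch_size
    # Longer than every boundary: falls into the last bucket, which has no upper bound.
    return None, batch_sizes[-1]


for n in (20, 100, 300, 1500):
    print(n, bucket_for(n))
```

Short texts are grouped into large batches and long texts into small ones, which keeps the number of tokens per batch roughly comparable. The example batch above has shape (4, 2048), which matches the bucket with boundary 2048 and batch size 4.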

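5. First step in building an NLP model: creating the Embedding layer

Take a binary classification problem as an example: predicting whether a piece of text is positive or negative.

For example, "I am happy." should be predicted as positive (1).

"I am sad." should be predicted as negative (0).

The overall neural network architecture is shown in the figure above.

Here X1 to Xn are the token IDs of one tokenized text. (At this point X is just a vector of indices with no meaning of its own; it has to be converted into embedding vectors before it becomes meaningful, that is, before relationships between words or between texts can be measured.)

If this is unclear, I will write a separate post to explain it; here the focus is on how to create the layer.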

from trax import layers as tl

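# First create an input vector x (think of it as a text of 15 tokens).
x = fastnp.arange(15)
print(f'x = {x}')

# Create an embedding layer. vocab_size is the size of the vocabulary; it should not be picked arbitrarily
# but derived from the corpus, or better still taken from an existing vocabulary.
# d_feature is the embedding dimension, which you can choose yourself.
embedding = tl.Embedding(vocab_size=20, d_feature=32)

# Note that the embedding layer has trainable weights, so it must be initialized with an input signature (shape and dtype)
embedding.init(trax.shapes.signature(x))

# Run the embedding layer: y = embedding(x).
y = embedding(x)

# y is the output of the embedding layer
print(f'shape of y = {y.shape}')
# print(y)


# Output: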
#x = [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
#shape of y = (15, 32)
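
Conceptually, an embedding layer is a table with one row per vocabulary entry, and applying it means looking up the row for each token ID. The sketch below mimics that in plain Python with random rows; table and embed are illustrative names, and a real layer learns its rows during training.

```python
import random

random.seed(0)
vocab_size, d_feature = 20, 32

# One row of d_feature numbers per vocabulary entry (random here, learned in a real model).
table = [[random.gauss(0.0, 1.0) for _ in range(d_feature)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up the embedding row for each token ID."""
    return [table[t] for t in token_ids]


x = list(range(15))
y = embed(x, table)
print(len(y), len(y[0]))  # 15 32
```

This is why 15 token IDs come out with shape (15, 32): each ID is replaced by a 32-dimensional vector.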

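6. Building the neural network model

The model built here consists of an input, an Embedding layer, a Mean layer that averages the embeddings, and a Dense output layer followed by LogSoftmax.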

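# A Serial layer chains layers one after another, here Embedding, Mean, Dense and LogSoftmax.
# Different networks put different layers inside Serial;
# to try a different architecture, this is the main place to change.
model = tl.Serial(
    # Inside a model, the Embedding layer takes a single line of code
    tl.Embedding(vocab_size=8192, d_feature=256), # Embedding layer; its weights (input to embedding) are trainable
    tl.Mean(axis=1),  # Mean layer: averages the embeddings of all tokens in a text into one vector (no trainable weights)

    tl.Dense(2),      # Output layer with two units; its weights (mean to output) are trainable
    tl.LogSoftmax()   # Produces log-probabilities.
)


# Print the model structure
print(model)

# Output: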
#Serial[
#  Embedding_8192_256
#  Mean
#  Dense_2
#  LogSoftmax
#]
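
What the layers after the embedding do to a single text can be sketched in plain Python. The function names and the tiny hand-picked numbers below are assumptions made for illustration; they are not taken from Trax or from the trained model.

```python
import math


def mean_over_tokens(embeddings):
    """Average a list of equal-length vectors into one vector."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def dense(vec, weights, bias):
    """Compute vec @ weights + bias, where weights has shape (len(vec), n_out)."""
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weights), bias)]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]


embeddings = [[1.0, 2.0], [3.0, 4.0]]   # 2 tokens, d_feature = 2
weights = [[1.0, 0.0], [0.0, 1.0]]      # identity matrix, chosen for readability
bias = [0.0, 0.0]

pooled = mean_over_tokens(embeddings)   # [2.0, 3.0]
log_probs = log_softmax(dense(pooled, weights, bias))
probs = [math.exp(lp) for lp in log_probs]
print(pooled, probs)
```

Averaging turns a variable number of token vectors into one fixed-size vector, which is what lets a single Dense layer classify texts of any length.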

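7. Creating and running the training task

Training in Trax is driven by the training.Loop() container, which needs the following:

1. The model to train (model)

2. A training task (which training data to use and how to train on it)

3. An evaluation task (which evaluation data to use and what to measure, usually loss and accuracy)

4. An output directory for the results

The code is as follows: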

from trax.supervised import training

train_task = training.TrainTask(
    # Training data
    labeled_data=train_batches_stream,
    # Use cross-entropy as the loss
    loss_layer=tl.WeightedCategoryCrossEntropy(),
    # Adam optimizer with learning rate 0.01
    optimizer=trax.optimizers.Adam(0.01),
    # Run the evaluation task every 500 steps
    n_steps_per_checkpoint=500,
)

# Add the evaluation task (its metrics are what gets printed to the console).
eval_task = training.EvalTask(
    # Evaluation data
    labeled_data=eval_batches_stream,
    # Report the loss and the accuracy on the evaluation data
    metrics=[tl.WeightedCategoryCrossEntropy(), tl.WeightedCategoryAccuracy()],
    n_eval_batches=20  # For less variance in eval numbers.
)

# Define the training loop; results and the model are saved under output_dir
output_dir ="E:\\mashine learning\\chengxu\\NLP-own\\trax学习\\model"

training_loop = training.Loop(model,    # 1. The model the loop trains
                              train_task,# 2. The training task
                              eval_tasks=[eval_task],# 3. The evaluation task(s)
                              output_dir=output_dir) # 4. Where results are saved

# Run 2000 training steps (2000 batches).
training_loop.run(2000)


# Output:
# Step      1: Ran 1 train steps in 0.78 secs
# Step      1: train WeightedCategoryCrossEntropy |  1.33800304
# Step      1: eval  WeightedCategoryCrossEntropy |  0.71843582
# Step      1: eval      WeightedCategoryAccuracy |  0.56562500
# 
# Step    500: Ran 499 train steps in 5.77 secs
# Step    500: train WeightedCategoryCrossEntropy |  0.62914723
# Step    500: eval  WeightedCategoryCrossEntropy |  0.49253047
# Step    500: eval      WeightedCategoryAccuracy |  0.74062500
# 
# Step   1000: Ran 500 train steps in 5.03 secs
# Step   1000: train WeightedCategoryCrossEntropy |  0.42949259
# Step   1000: eval  WeightedCategoryCrossEntropy |  0.35451687
# Step   1000: eval      WeightedCategoryAccuracy |  0.83750000
# 
# Step   1500: Ran 500 train steps in 4.80 secs
# Step   1500: train WeightedCategoryCrossEntropy |  0.41843575
# Step   1500: eval  WeightedCategoryCrossEntropy |  0.35207348
# Step   1500: eval      WeightedCategoryAccuracy |  0.82109375
# 
# Step   2000: Ran 500 train steps in 5.35 secs
# Step   2000: train WeightedCategoryCrossEntropy |  0.38129005
# Step   2000: eval  WeightedCategoryCrossEntropy |  0.33760912
# Step   2000: eval      WeightedCategoryAccuracy |  0.85312500
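
The two metrics in this log can be read as a weighted average of the negative log-probability of the true class and a weighted fraction of correct predictions. The plain-Python sketch below follows that reading; it is a simplified illustration with made-up numbers, not the Trax source, and the function names are chosen here.

```python
import math


def weighted_cross_entropy(log_probs, targets, weights):
    """Weighted mean of -log p(target) over a batch."""
    total = sum(-lp[t] * w for lp, t, w in zip(log_probs, targets, weights))
    return total / sum(weights)


def weighted_accuracy(log_probs, targets, weights):
    """Weighted fraction of examples whose most likely class equals the target."""
    correct = sum(w for lp, t, w in zip(log_probs, targets, weights)
                  if max(range(len(lp)), key=lp.__getitem__) == t)
    return correct / sum(weights)


# Three made-up examples with two classes each.
log_probs = [[math.log(0.9), math.log(0.1)],
             [math.log(0.4), math.log(0.6)],
             [math.log(0.7), math.log(0.3)]]
targets = [0, 1, 1]
weights = [1.0, 1.0, 1.0]

loss = weighted_cross_entropy(log_probs, targets, weights)
acc = weighted_accuracy(log_probs, targets, weights)
print(f"loss = {loss:.4f}, accuracy = {acc:.4f}")
```

With all weights equal to 1, as AddLossWeights produces in this pipeline, both metrics reduce to plain averages over the batch.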

8. Testing on a single sample

# Take the token vector of one review
example_input = next(eval_batches_stream)[0][0]
# Detokenize to recover the original text
example_input_str = trax.data.detokenize(example_input, vocab_dir="./", vocab_file='en_8k.subword')
print(f'example input_str: {example_input_str}')
# Run one forward pass
sentiment_log_probs = model(example_input[None, :])  # Add batch dimension.
# Convert the log-probabilities to probabilities and print them
print(f'Model returned sentiment probabilities: {fastnp.exp(sentiment_log_probs)}')

# Output:
#example input_str: Toy Soldiers is an okay action movie but what really stands out is the amount of effort that the scriptwriters and director put into portraying American counter-terrorist forces accurately. Just check out the end credits--there are more than a dozen US military officers and officials listed. The movie accurately portrays the FBI as having control of the hostage situation but turning it over the US Army's Delta Force (who are unnamed in the movie as the Pentagon was still denying their existence at this time) once the President waived the Posse Commitatus Act of US Code. The US Army forces at the end are accurately dressed and armed for the time. And even the use of an AH-64 Apache for air support--which might seem a bit over the top, is not terribly unrealistic. Far more expensive and frankly better movies have portrayed American counter-terrorist forces with far less accuracy.
#Model returned sentiment probabilities: [[0.74171495 0.25828496]]
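
To turn those probabilities into a label, take the index of the largest one. The sketch below is plain Python and uses the two numbers printed above as a stand-in for the model output; the labels dictionary assumes 0 means negative and 1 means positive, which matches the negative sample with label 0 shown earlier.

```python
import math

# Stand-in for the model output: log-probabilities rebuilt from the printed values.
log_probs = [math.log(0.74171495), math.log(0.25828496)]

probs = [math.exp(lp) for lp in log_probs]
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

labels = {0: "negative", 1: "positive"}  # assumed label meaning
print(predicted, labels[predicted])
```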

9. Putting the code together:

import trax
from trax.fastmath import numpy as fastnp
trax.fastmath.use_backend('jax')  # Choose the backend: 'jax' or 'tensorflow-numpy'.
from trax import layers as tl
from trax.supervised import training

# 1. Data loading and preprocessing
def data_load():

    # Load the samples.
    # Two streams are created because two tasks need them: one supplies training samples, the other evaluation samples
    train_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=True)()
    eval_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=False)()
    print(next(train_stream))  # Look at one sample.

    # With the trax.data module you can build an input-processing pipeline
    data_pipeline = trax.data.Serial(
        # Tokenize the data (turn text into arrays of token IDs); this step needs the vocabulary file 'en_8k.subword'
        trax.data.Tokenize(vocab_dir='./', vocab_file='en_8k.subword', keys=[0]),

        # Shuffle the examples
        trax.data.Shuffle(),
        # Drop examples longer than max_length
        trax.data.FilterByLength(max_length=2048, length_keys=[0]),
        # Batch the data; examples of similar length go into the same bucket
        trax.data.BucketByLength(boundaries=[  32, 128, 512, 2048],
                                 batch_sizes=[256,  64,  16,    4, 1],
                                 length_keys=[0]),
        # Add loss weights (a weight of 1 for every example in the batch)
        trax.data.AddLossWeights()
      )
    # Run the pipeline on both streams
    train_batches_stream = data_pipeline(train_stream)
    eval_batches_stream = data_pipeline(eval_stream)
    example_batch = next(train_batches_stream)
    print(example_batch)
    print(f'shapes = {[x.shape for x in example_batch]}')  # Check that the shapes line up
    print("--"*100)
    return train_batches_stream,eval_batches_stream


# 2. Define the four things the Loop container needs
def loop_task(train_batches_stream, eval_batches_stream):
    # 1. Define the model structure
    model = tl.Serial(

        tl.Embedding(vocab_size=8192, d_feature=256),  # Embedding layer; its weights (input to embedding) are trainable
        tl.Mean(axis=1),  # Mean layer: averages the embeddings of all tokens in a text into one vector (no trainable weights)

        tl.Dense(2),  # Output layer with two units; its weights (mean to output) are trainable
        tl.LogSoftmax()  # Produces log-probabilities.
    )
    # Print the model structure
    print(model)

    # 2. Add the training task.
    train_task = training.TrainTask(
        # Training data
        labeled_data=train_batches_stream,
        # Use cross-entropy as the loss
        loss_layer=tl.WeightedCategoryCrossEntropy(),
        # Adam optimizer with learning rate 0.01
        optimizer=trax.optimizers.Adam(0.01),
        # Run the evaluation task every 500 steps
        n_steps_per_checkpoint=500,
    )

    # 3. Add the evaluation task (its metrics are what gets printed to the console).
    eval_task = training.EvalTask(
        # Evaluation data
        labeled_data=eval_batches_stream,
        # Report the loss and the accuracy on the evaluation data
        metrics=[tl.WeightedCategoryCrossEntropy(), tl.WeightedCategoryAccuracy()],
        n_eval_batches=20  # For less variance in eval numbers.
    )

    # 4. Results and the model are saved under output_dir
    output_dir ="E:\\mashine learning\\chengxu\\NLP-own\\trax学习\\model"

    return model,train_task,eval_task,output_dir

# 3. Train
def loop_train(model, train_task, eval_task, output_dir,steps=2000):
    training_loop = training.Loop(model,  # 1. The model the loop trains
                                  train_task,  # 2. The training task
                                  eval_tasks=[eval_task],  # 3. The evaluation task(s)
                                  output_dir=output_dir)  # 4. Where results are saved

    # Run `steps` training steps (2000 batches by default).
    training_loop.run(steps)

# 4. Test on a single sample
def test_one(eval_batches_stream,model):
    # Take the token vector of one review
    example_input = next(eval_batches_stream)[0][0]
    # Detokenize to recover the original text
    example_input_str = trax.data.detokenize(example_input, vocab_dir="./", vocab_file='en_8k.subword')
    print(f'example input_str: {example_input_str}')
    # Run one forward pass
    sentiment_log_probs = model(example_input[None, :])  # Add batch dimension.
    # Convert the log-probabilities to probabilities and print them
    print(f'Model returned sentiment probabilities: {fastnp.exp(sentiment_log_probs)}')

if __name__ == '__main__':
    # 1. Load and preprocess the data
    train_batches_stream, eval_batches_stream = data_load()

    # 2. Define the four things training needs
    model, train_task, eval_task, output_dir=loop_task(train_batches_stream, eval_batches_stream)

    # 3. Train
    loop_train(model, train_task, eval_task, output_dir,2000)

    # 4. Test on one sample
    test_one(eval_batches_stream,model)

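I am only a beginner, so there may well be mistakes in many places; experts, please go easy on me!

I expect to keep studying and posting, but that's it for today!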
