Since this is an .ipynb, I just ran it straight from cmd.
First, install matchzoo and pandas with pip. Use a Chinese mirror, otherwise the download will take forever and still not finish.
A note: the matchzoo you get from pip install may have unusable modules (processor_units in my case), so I recommend installing the matchzoo library manually XD

```shell
pip install -i http://pypi.douban.com/simple/ matchzoo
pip install -i http://pypi.douban.com/simple/ pandas
```
```python
import matchzoo as mz
import pandas as pd
```
Next up: when running this I hit a FileNotFoundError: 'train.csv' does not exist
(honestly, if I had just installed the library manually from GitHub back then, none of this would have happened ;D )

```python
data_pack = mz.datasets.toy.load_data()
```

The fix is to drop the toy folder into C:\Users\<your username>\AppData\Local\Programs\Python\Python37\Lib\site-packages\matchzoo\datasets.
The toy folder lives under MatchZoo/matchzoo/datasets/ on GitHub.
```python
data_pack.left.head()
```

Result (train.csv can be opened with Excel):
![](https://i-blog.csdnimg.cn/blog_migrate/2872eda5a6a79ee29611c8a18837abfd.png)
left.head() shows the first few rows of columns B and C of the file, with duplicate rows dropped.
```python
data_pack.right.head()
```

![](https://i-blog.csdnimg.cn/blog_migrate/53d8aeec2b341c72177968106dbff244.png)
Likewise, right.head() shows the first few rows of columns D and E of the file.
```python
data_pack.relation.head()
```

Result:
![](https://i-blog.csdnimg.cn/blog_migrate/f808f3750559cac0911ad36d63c6e969.png)
relation.head() shows the first few rows of the "relation" table.
```python
data_pack.frame().head()
```

Result:
![](https://i-blog.csdnimg.cn/blog_migrate/6437758ceb247534195a53a6a06869a5.png)
frame().head() shows the first few rows of the complete joined table.
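To make the structure concrete, here is a minimal sketch of the kind of join frame() performs, using made-up ids and texts rather than the real toy data: each relation row gets expanded with its matching left and right entries.

```python
# Toy stand-ins for data_pack.left / .right / .relation (hypothetical data).
left = {"L-0": "how are glacier caves formed", "L-1": "how are cyclones formed"}
right = {"R-0": "a cave formed within the ice of a glacier",
         "R-1": "a storm over warm ocean water"}
relation = [("L-0", "R-0", 1.0), ("L-1", "R-1", 0.0)]

# frame()-style join: one output row per relation entry, pulling in both texts.
frame = [
    {"id_left": l, "text_left": left[l],
     "id_right": r, "text_right": right[r], "label": label}
    for l, r, label in relation
]
```

The real frame is a pandas DataFrame, but the row-per-relation shape is the same idea.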
```python
type(data_pack.frame)
```

Result:

```python
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1
frame().head()
```

Result:
This adds 1 to every value in the label column; calling frame() again shows the updated labels.
Slicing the dataset (sort of)

```python
data_slice = data_pack[5:10]
data_slice.relation
```

Result: shows rows 5 to 9 of the relation table.
```python
data_slice.left
data_slice.right
```

Shows the left and right tables of the slice.
```python
data_pack.frame[5:10]
```

Shows the sliced frame, pretty much the same thing.

```python
data_slice.frame() == data_pack.frame[5:10]
```

This confirms that slicing with data_pack[5:10] and then calling frame() gives the same result as data_pack.frame[5:10].
```python
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
```

len(data_pack) is 100, which is exactly the number of data rows in train.csv:

```python
print(int(len(data_pack)))
# 100
```

data_pack.shuffle(inplace=True) shuffles the order: it reorders the data pack by permuting the relation table, and inplace=True makes it operate on the original object directly.
train_slice is data_pack from the start up to row 80; test_slice is data_pack from row 80 to the end.
(I didn't get the point of this at first: it is just a standard 80/20 train/test split.)
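The split above can be sketched with a plain list standing in for the 100-row DataPack:

```python
import random

# Stand-in for a 100-row DataPack: just a list of row ids.
rows = list(range(100))

random.seed(0)        # fixed seed so the sketch is reproducible
random.shuffle(rows)  # mirrors data_pack.shuffle(inplace=True)

num_train = int(len(rows) * 0.8)
train_slice = rows[:num_train]  # first 80% for training
test_slice = rows[num_train:]   # remaining 20% for testing
```

Every row ends up in exactly one of the two slices, just in a random order.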
Adding text-length columns

```python
data_slice.apply_on_text(len).frame()
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()
data_slice.append_text_length().frame()
```

The results of each, in order, are below.
(Why does data_slice.apply_on_text(len).frame() show no change here?)
With rename, the 'left_length' and 'right_length' columns are added to the header.
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame() and data_slice.append_text_length().frame() give the same result (probably?).
one_hot_encode_label

```python
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()
```

In one_hot_encode_label(num_classes=3), changing num_classes from 3 to 4 makes the label display as [0, 1, 0, 0].
So what is this for?
- 2019.8.23 update: it produces the labels used in backpropagation to optimize the parameters. num_classes is the number of classes, and the position of the 1 marks which class the sample belongs to.
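A minimal sketch of what one-hot encoding does to a label (plain Python, not the matchzoo implementation):

```python
def one_hot(label, num_classes):
    """Turn an integer class label into a one-hot vector of length num_classes."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# label 1 with num_classes=3 -> [0, 1, 0]
# label 1 with num_classes=4 -> [0, 1, 0, 0], matching what the tutorial shows
```

This is why bumping num_classes from 3 to 4 just appends another 0 to the displayed label.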
Building a dataset with pandas (sort of)

```python
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()
```

![](https://i-blog.csdnimg.cn/blog_migrate/a503c681fa9e777df85001afca1996a7.png)
Unpacking a dataset (sort of)

```python
x, y = data_pack[:3].unpack()
```

This unpacks the first three entries of data_pack.
x holds the id_left, text_left, id_right, text_right columns (a mapping from column name to values, as far as I can tell);
y is an array of the label values.
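A rough sketch of the unpack() idea with made-up rows: features get grouped by column name, labels are split off on their own.

```python
# Rows as they might look in data_pack[:3].frame() (hypothetical data).
rows = [
    {"id_left": "L-0", "text_left": "how are glacier caves formed",
     "id_right": "R-0", "text_right": "ice of a glacier", "label": 1},
    {"id_left": "L-1", "text_left": "how are cyclones formed",
     "id_right": "R-1", "text_right": "warm ocean water", "label": 0},
]

# unpack()-style split: x maps each feature column to its list of values,
# y collects the labels separately.
x = {key: [row[key] for row in rows]
     for key in ("id_left", "text_left", "id_right", "text_right")}
y = [row["label"] for row in rows]
```

This matches the (x, y) shape that model.fit(x_train, y_train) expects later on.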
```python
mz.datasets.list_available()
```

Returns a list of the available datasets.
```python
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()
```

Loads the toy dataset.
```python
toy_dev_classification, classes = mz.datasets.toy.load_data(stage='train', task='classification')
toy_dev_classification.frame().head()
```

Loads the toy dataset with stage='train'.
mz.datasets.toy.load_data() takes two parameters:
- stage: one of train, dev, and test (default: train)
- task: could be one of ranking, classification, or a matchzoo.engine.BaseTask instance (default: ranking)

![](https://i-blog.csdnimg.cn/blog_migrate/cf4c859072d9b0c6236aeab389b974db.png)

toy_dev_classification is a DataPack object.
classes:
```python
wiki_dev_entailment_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_entailment_rank.frame().head()
```

Loads the wiki_qa dataset with stage='dev'.
```python
snli_test_classification, classes = mz.datasets.snli.load_data(stage='test', task='classification')
snli_test_classification.frame().head()
```

Loads the snli test set with stage='test'.
I got a FileNotFoundError: '.matchzoo\datasets\snli\snli_1.0\snli_1.0_test.txt' here; you have to add snli_1.0_test.txt yourself.
SNLI dataset download: https://www.nyu.edu/projects/bowman/multinli/snli_1.0.zip
Drop the snli_1.0 folder into C:\Users\<your user ID>\.matchzoo\datasets\snli\ and you're set.
classes:
['entailment', 'contradiction', 'neutral', '-']
These are just concrete examples for the three stages, a quick look at the datasets; nothing more to them, I think?
```python
mz.preprocessors.list_available()
```

Lists the available preprocessors.
A training example on the toy dataset

```python
preprocessor = mz.models.Naive.get_default_preprocessor()
```

From the matchzoo.models.naive module: "Naive model with a simplest structure for testing purposes."
get_default_preprocessor() gets a default preprocessor.
```python
train_raw = mz.datasets.toy.load_data('train', 'ranking')
test_raw = mz.datasets.toy.load_data('test', 'ranking')
```

Loads the toy train & test sets for a ranking task.
```python
preprocessor.fit(train_raw)
```

fit() builds the preprocessing context from the training data.
preprocessor.context result:
```python
train_preprocessed = preprocessor.transform(train_raw)
test_preprocessed = preprocessor.transform(test_raw)
```

preprocessor.transform(train_raw) result:
preprocessor.transform(test_raw) result:
transform: applies the transformation to the data (the docstring mentions creating tri-letter representations).
Returns: Transformed data as a DataPack object.
```python
model = mz.models.Naive()
model.guess_and_fill_missing_params()
model.build()
model.compile()
```

Creates the Naive model.
```python
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
x_test, y_test = test_preprocessed.unpack()
model.evaluate(x_test, y_test)
```

Unpacks the transformed train and test sets, then fits and evaluates on them respectively.
model.fit(x_train, y_train) result:
model.evaluate(x_test, y_test) result:
```python
data_pack = mz.datasets.toy.load_data()
data_pack.frame().head()
```

Reloads the toy dataset.
processor_units.TokenizeUnit()

```python
tokenizer = mz.processor_units.TokenizeUnit()
data_pack.apply_on_text(tokenizer.transform, inplace=True)
data_pack.frame[:5]
```

If you get AttributeError: module 'matchzoo' has no attribute 'processor_units', the pip-installed matchzoo is too old; download the latest matchzoo folder from GitHub and drop it into C:\Users\<your username>\AppData\Local\Programs\Python\Python37\Lib\site-packages\matchzoo.
TokenizeUnit, as the name suggests, tokenizes the text.
data_pack.frame[:5] result:
processor_units.LowercaseUnit()

```python
lower_caser = mz.processor_units.LowercaseUnit()
data_pack.apply_on_text(lower_caser.transform, inplace=True)
data_pack.frame[:5]
```

LowercaseUnit, as the name suggests, converts the text to lowercase.
data_pack.frame[:5] result:
Chaining processor units

```python
data_pack = mz.datasets.toy.load_data()
chain = mz.chain_transform([mz.processor_units.TokenizeUnit(),
                            mz.processor_units.LowercaseUnit()])
data_pack.apply_on_text(chain, inplace=True)
data_pack.frame[:5]
```

This tokenizes and lowercases in one pass.
data_pack.frame[:5] result:
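The chaining idea can be sketched with plain functions standing in for the processor units:

```python
def tokenize(text):
    """Stand-in for TokenizeUnit: split on whitespace."""
    return text.split()

def lowercase(tokens):
    """Stand-in for LowercaseUnit: lowercase each token."""
    return [tok.lower() for tok in tokens]

def chain_transform(units):
    """Compose a list of transforms into one function, applied left to right."""
    def chained(text):
        for unit in units:
            text = unit(text)
        return text
    return chained

pipeline = chain_transform([tokenize, lowercase])
# "How Are Glacier Caves Formed" -> ["how", "are", "glacier", "caves", "formed"]
```

Note the order matters: each unit receives the previous unit's output, which is why lowercase here operates on a token list rather than a string.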
```python
mz.processor_units.VocabularyUnit.__base__
```

Result:
<class 'matchzoo.processor_units.processor_units.StatefulProcessorUnit'>

```python
vocab_unit = mz.processor_units.VocabularyUnit()
texts = data_pack.frame()[['text_left', 'text_right']]
all_tokens = texts.sum().sum()
vocab_unit.fit(all_tokens)
```

vocab_unit = mz.processor_units.VocabularyUnit(): creates the VocabularyUnit.
texts = data_pack.frame()[['text_left', 'text_right']]: takes the text_left and text_right columns of the frame.
all_tokens = texts.sum().sum() result:
texts.sum().sum() should be one flat list of all the tokens from the left and right texts.
vocab_unit.fit(all_tokens): fits the unit on the tokens.
```python
for vocab in 'how', 'are', 'glacier':
    print(vocab, vocab_unit.state['term_index'][vocab])
```

vocab_unit.state['term_index'][vocab] returns the integer id assigned to the word vocab (not an occurrence count).
As shown:
```python
data_pack.apply_on_text(vocab_unit.transform, inplace=True)
data_pack.frame()[:5]
```

vocab_unit.transform: some flavor of word-to-id conversion...
data_pack.frame()[:5] result:
```python
data_pack = mz.datasets.toy.load_data()
vocab_unit = mz.build_vocab_unit(data_pack)
data_pack.apply_on_text(vocab_unit.transform).frame[:5]
```

mz.build_vocab_unit(data_pack): builds a vocabulary unit from the given data pack.
Returns: A built vocabulary unit.
data_pack.apply_on_text(vocab_unit.transform).frame[:5]: vocab_unit.transform is the function to apply; it converts every word to its id, one id per word.
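A minimal sketch of the word-to-id table a vocabulary unit builds (the real VocabularyUnit also reserves entries for padding and out-of-vocabulary words, which I'm skipping here):

```python
# A flat token list, like the output of texts.sum().sum() above (made-up tokens).
tokens = ["how", "are", "glacier", "caves", "formed", "how", "are"]

# Build a term -> id table: each distinct token gets a unique integer id.
term_index = {}
for tok in tokens:
    if tok not in term_index:
        term_index[tok] = len(term_index) + 1  # leave 0 free for padding

# transform-style step: replace every token with its id.
ids = [term_index[tok] for tok in tokens]
```

Repeated words ("how", "are") map back to the ids they were assigned on first sight, so the same word always becomes the same id.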
```python
data = mz.datasets.toy.load_data()
preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False)
data = preprocessor.fit_transform(data, verbose=0)
```

preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False): creates a DSSM preprocessor with word hashing turned off.
data = preprocessor.fit_transform(data, verbose=0): calls fit then transform.
verbose controls logging:
- verbose = 0: no log output to stdout
- verbose = 1: progress bar
- verbose = 2: one line per epoch
```python
model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.guess_and_fill_missing_params(verbose=0)
model.build()
model.compile()
```

input_shapes: the model's input shapes (sort of). "Dependent on the model and data. Should be set manually."
model.guess_and_fill_missing_params(verbose=0): "Use this method to automatically fill-in hyper parameters. This involves some guessing so the parameter it fills could be wrong. For example, the default task is Ranking, and if we do not set it to Classification manually for data packs prepared for classification, then the shape of the model output and the data will mismatch."
In short, it guesses and fills in whatever parameters are missing.
model.build() and model.compile() do exactly what they say.
```python
term_index = preprocessor.context['vocab_unit'].state['term_index']
hashing_unit = mz.processor_units.WordHashingUnit(term_index)
data_generator = mz.DynamicDataGenerator(hashing_unit.transform, data, batch_size=1)
model.fit_generator(data_generator)
```

preprocessor.context['vocab_unit'].state['term_index'] result:
mz.processor_units.WordHashingUnit(term_index): performs the word hashing.
Result:
mz.DynamicDataGenerator(hashing_unit.transform, data, batch_size=1):
DynamicDataGenerator(): "Data generator with preprocess unit inside."
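A rough sketch of the two ideas at play here, in plain Python rather than the matchzoo classes: DSSM-style letter-trigram word hashing, and a generator that applies a preprocess function batch by batch instead of up front.

```python
def letter_trigrams(word):
    """DSSM-style word hashing: mark word boundaries, then slide a 3-char window."""
    marked = "#" + word + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def dynamic_batches(transform, rows, batch_size=2):
    """Yield batches with the preprocess step applied on the fly,
    mimicking a data generator with a preprocess unit inside."""
    for start in range(0, len(rows), batch_size):
        yield [transform(row) for row in rows[start:start + batch_size]]

trigrams = letter_trigrams("good")  # ['#go', 'goo', 'ood', 'od#']
batches = list(dynamic_batches(letter_trigrams, ["ice", "cave"], batch_size=1))
```

Hashing at generation time (rather than during fit_transform) keeps the preprocessed data small, since the trigram vectors only exist one batch at a time.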
fit_generator(): see keras.models.Model.fit_generator(...) for more details (that's literally all the official docs say...)
model.fit_generator(data_generator) result: