Since this is an .ipynb, I just ran it straight from cmd.
First, install matchzoo and pandas with pip. Use a Chinese mirror, otherwise the download will take forever and still not finish.
A note: the matchzoo you get from pip install may have unusable modules (processor_units in my case), so I recommend installing the matchzoo library manually XD

```shell
pip install -i http://pypi.douban.com/simple/ matchzoo
pip install -i http://pypi.douban.com/simple/ pandas
```
```python
import matchzoo as mz
import pandas as pd
```
Next up: when running this I hit a FileNotFoundError: 'train.csv' does not exist
(honestly, if I had just installed the library manually from GitHub back then, none of this would have happened ;D )

```python
data_pack = mz.datasets.toy.load_data()
```

The fix is to drop the toy folder into C:\Users\<your username>\AppData\Local\Programs\Python\Python37\Lib\site-packages\matchzoo\datasets.
The toy folder lives under MatchZoo/matchzoo/datasets/ on GitHub.
```python
data_pack.left.head()
```

Result (train.csv can be opened with Excel):
![](https://i-blog.csdnimg.cn/blog_migrate/2872eda5a6a79ee29611c8a18837abfd.png)
left.head() shows the first few rows of columns B and C of the file, with duplicate rows dropped.
```python
data_pack.right.head()
```

![](https://i-blog.csdnimg.cn/blog_migrate/53d8aeec2b341c72177968106dbff244.png)
Likewise, right.head() shows the first few rows of columns D and E of the file.
```python
data_pack.relation.head()
```

Result:
![](https://i-blog.csdnimg.cn/blog_migrate/f808f3750559cac0911ad36d63c6e969.png)
relation.head() shows the first few rows of the "relation" table.
```python
data_pack.frame().head()
```

Result:
![](https://i-blog.csdnimg.cn/blog_migrate/6437758ceb247534195a53a6a06869a5.png)
frame().head() shows the first few rows of the complete joined table.
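To make the structure concrete, here is a minimal sketch of the kind of join frame() performs, using made-up ids and texts rather than the real toy data: each relation row gets expanded with its matching left and right entries.

```python
# Toy stand-ins for data_pack.left / .right / .relation (hypothetical data).
left = {"L-0": "how are glacier caves formed", "L-1": "how are cyclones formed"}
right = {"R-0": "a cave formed within the ice of a glacier",
         "R-1": "a storm over warm ocean water"}
relation = [("L-0", "R-0", 1.0), ("L-1", "R-1", 0.0)]

# frame()-style join: one output row per relation entry, pulling in both texts.
frame = [
    {"id_left": l, "text_left": left[l],
     "id_right": r, "text_right": right[r], "label": label}
    for l, r, label in relation
]
```

The real frame is a pandas DataFrame, but the row-per-relation shape is the same idea.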
```python
type(data_pack.frame)
```

Result:

```python
frame = data_pack.frame
data_pack.relation['label'] = data_pack.relation['label'] + 1
frame().head()
```

Result:
This adds 1 to every value in the label column; calling frame() again shows the updated labels.
Slicing the dataset (sort of)

```python
data_slice = data_pack[5:10]
data_slice.relation
```

Result: shows rows 5 to 9 of the relation table.
```python
data_slice.left
data_slice.right
```

Shows the left and right tables of the slice.
```python
data_pack.frame[5:10]
```

Shows the sliced frame, pretty much the same thing.

```python
data_slice.frame() == data_pack.frame[5:10]
```

This confirms that slicing with data_pack[5:10] and then calling frame() gives the same result as data_pack.frame[5:10].
```python
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
```

len(data_pack) is 100, which is exactly the number of data rows in train.csv:

```python
print(int(len(data_pack)))
# 100
```

data_pack.shuffle(inplace=True) shuffles the order: it reorders the data pack by permuting the relation table, and inplace=True makes it operate on the original object directly.
train_slice is data_pack from the start up to row 80; test_slice is data_pack from row 80 to the end.
(I didn't get the point of this at first: it is just a standard 80/20 train/test split.)
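The split above can be sketched with a plain list standing in for the 100-row DataPack:

```python
import random

# Stand-in for a 100-row DataPack: just a list of row ids.
rows = list(range(100))

random.seed(0)        # fixed seed so the sketch is reproducible
random.shuffle(rows)  # mirrors data_pack.shuffle(inplace=True)

num_train = int(len(rows) * 0.8)
train_slice = rows[:num_train]  # first 80% for training
test_slice = rows[num_train:]   # remaining 20% for testing
```

Every row ends up in exactly one of the two slices, just in a random order.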
Adding text-length columns

```python
data_slice.apply_on_text(len).frame()
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame()
data_slice.append_text_length().frame()
```

The results of each, in order, are below.
(Why does data_slice.apply_on_text(len).frame() show no change here?)
With rename, the 'left_length' and 'right_length' columns are added to the header.
data_slice.apply_on_text(len, rename=('left_length', 'right_length')).frame() and data_slice.append_text_length().frame() give the same result (probably?).
one_hot_encode_label

```python
data_pack.relation['label'] = data_pack.relation['label'].astype(int)
data_pack.one_hot_encode_label(num_classes=3).frame().head()
```

In one_hot_encode_label(num_classes=3), changing num_classes from 3 to 4 makes the label display as [0, 1, 0, 0].
So what is this for?
- 2019.8.23 update: it produces the labels used in backpropagation to optimize the parameters. num_classes is the number of classes, and the position of the 1 marks which class the sample belongs to.
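A minimal sketch of what one-hot encoding does to a label (plain Python, not the matchzoo implementation):

```python
def one_hot(label, num_classes):
    """Turn an integer class label into a one-hot vector of length num_classes."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# label 1 with num_classes=3 -> [0, 1, 0]
# label 1 with num_classes=4 -> [0, 1, 0, 0], matching what the tutorial shows
```

This is why bumping num_classes from 3 to 4 just appends another 0 to the displayed label.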
Building a dataset with pandas (sort of)

```python
data = pd.DataFrame({
    'text_left': list('ARSAARSA'),
    'text_right': list('arstenus')
})
my_pack = mz.pack(data)
my_pack.frame()
```

![](https://i-blog.csdnimg.cn/blog_migrate/a503c681fa9e777df85001afca1996a7.png)
Unpacking a dataset (sort of)

```python
x, y = data_pack[:3].unpack()
```

This unpacks the first three entries of data_pack.
x holds the id_left, text_left, id_right, text_right columns (a mapping from column name to values, as far as I can tell);
y is an array of the label values.
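A rough sketch of the unpack() idea with made-up rows: features get grouped by column name, labels are split off on their own.

```python
# Rows as they might look in data_pack[:3].frame() (hypothetical data).
rows = [
    {"id_left": "L-0", "text_left": "how are glacier caves formed",
     "id_right": "R-0", "text_right": "ice of a glacier", "label": 1},
    {"id_left": "L-1", "text_left": "how are cyclones formed",
     "id_right": "R-1", "text_right": "warm ocean water", "label": 0},
]

# unpack()-style split: x maps each feature column to its list of values,
# y collects the labels separately.
x = {key: [row[key] for row in rows]
     for key in ("id_left", "text_left", "id_right", "text_right")}
y = [row["label"] for row in rows]
```

This matches the (x, y) shape that model.fit(x_train, y_train) expects later on.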
```python
mz.datasets.list_available()
```

Returns a list of the available datasets.
```python
toy_train_rank = mz.datasets.toy.load_data()
toy_train_rank.frame().head()
```

Loads the toy dataset.
```python
toy_dev_classification, classes = mz.datasets.toy.load_data(stage='train', task='classification')
toy_dev_classification.frame().head()
```

Loads the toy dataset with stage='train'.
mz.datasets.toy.load_data() takes two parameters:
- stage: one of train, dev, and test (default: train)
- task: could be one of ranking, classification, or a matchzoo.engine.BaseTask instance (default: ranking)

![](https://i-blog.csdnimg.cn/blog_migrate/cf4c859072d9b0c6236aeab389b974db.png)

toy_dev_classification is a DataPack object.
classes:
```python
wiki_dev_entailment_rank = mz.datasets.wiki_qa.load_data(stage='dev')
wiki_dev_entailment_rank.frame().head()
```

Loads the wiki_qa dataset with stage='dev'.
```python
snli_test_classification, classes = mz.datasets.snli.load_data(stage='test', task='classification')
snli_test_classification.frame().head()
```

Loads the snli test set with stage='test'.
I got a FileNotFoundError: '.matchzoo\datasets\snli\snli_1.0\snli_1.0_test.txt' here; you have to add snli_1.0_test.txt yourself.
SNLI dataset download: https://www.nyu.edu/projects/bowman/multinli/snli_1.0.zip
Drop the snli_1.0 folder into C:\Users\<your user ID>\.matchzoo\datasets\snli\ and you're set.
classes:
['entailment', 'contradiction', 'neutral', '-']
These are just concrete examples for the three stages, a quick look at the datasets; nothing more to them, I think?
```python
mz.preprocessors.list_available()
```

Lists the available preprocessors.
A training example on the toy dataset

```python
preprocessor = mz.models.Naive.get_default_preprocessor()
```

From the matchzoo.models.naive module: "Naive model with a simplest structure for testing purposes."
get_default_preprocessor() gets a default preprocessor.
```python
train_raw = mz.datasets.toy.load_data('train', 'ranking')
test_raw = mz.datasets.toy.load_data('test', 'ranking')
```

Loads the toy train & test sets for a ranking task.
```python
preprocessor.fit(train_raw)
```

fit() builds the preprocessing context from the training data.
preprocessor.context result:
```python
train_preprocessed = preprocessor.transform(train_raw)
test_preprocessed = preprocessor.transform(test_raw)
```

preprocessor.transform(train_raw) result:
preprocessor.transform(test_raw) result:
transform: applies the transformation to the data (the docstring mentions creating tri-letter representations).
Returns: Transformed data as a DataPack object.
```python
model = mz.models.Naive()
model.guess_and_fill_missing_params()
model.build()
model.compile()
```

Creates the Naive model.
```python
x_train, y_train = train_preprocessed.unpack()
model.fit(x_train, y_train)
x_test, y_test = test_preprocessed.unpack()
model.evaluate(x_test, y_test)
```

Unpacks the transformed train and test sets, then fits and evaluates on them respectively.
model.fit(x_train, y_train) result:
model.evaluate(x_test, y_test) result:
```python
data_pack = mz.datasets.toy.load_data()
data_pack.frame().head()
```

Reloads the toy dataset.
processor_units.TokenizeUnit()

```python
tokenizer = mz.processor_units.TokenizeUnit()
data_pack.apply_on_text(tokenizer.transform, inplace=True)
data_pack.frame[:5]
```

If you get AttributeError: module 'matchzoo' has no attribute 'processor_units', the pip-installed matchzoo is too old; download the latest matchzoo folder from GitHub and drop it into C:\Users\<your username>\AppData\Local\Programs\Python\Python37\Lib\site-packages\matchzoo.
TokenizeUnit, as the name suggests, tokenizes the text.
data_pack.frame[:5] result:
processor_units.LowercaseUnit()

```python
lower_caser = mz.processor_units.LowercaseUnit()
data_pack.apply_on_text(lower_caser.transform, inplace=True)
data_pack.frame[:5]
```

LowercaseUnit, as the name suggests, converts the text to lowercase.
data_pack.frame[:5] result:
Chaining processor units

```python
data_pack = mz.datasets.toy.load_data()
chain = mz.chain_transform([mz.processor_units.TokenizeUnit(),
                            mz.processor_units.LowercaseUnit()])
data_pack.apply_on_text(chain, inplace=True)
data_pack.frame[:5]
```

This tokenizes and lowercases in one pass.
data_pack.frame[:5] result:
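The chaining idea can be sketched with plain functions standing in for the processor units:

```python
def tokenize(text):
    """Stand-in for TokenizeUnit: split on whitespace."""
    return text.split()

def lowercase(tokens):
    """Stand-in for LowercaseUnit: lowercase each token."""
    return [tok.lower() for tok in tokens]

def chain_transform(units):
    """Compose a list of transforms into one function, applied left to right."""
    def chained(text):
        for unit in units:
            text = unit(text)
        return text
    return chained

pipeline = chain_transform([tokenize, lowercase])
# "How Are Glacier Caves Formed" -> ["how", "are", "glacier", "caves", "formed"]
```

Note the order matters: each unit receives the previous unit's output, which is why lowercase here operates on a token list rather than a string.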
```python
mz.processor_units.VocabularyUnit.__base__
```

Result:
<class 'matchzoo.processor_units.processor_units.StatefulProcessorUnit'>

```python
vocab_unit = mz.processor_units.VocabularyUnit()
texts = data_pack.frame()[['text_left', 'text_right']]
all_tokens = texts.sum().sum()
vocab_unit.fit(all_tokens)
```

vocab_unit = mz.processor_units.VocabularyUnit(): creates the VocabularyUnit.
texts = data_pack.frame()[['text_left', 'text_right']]: takes the text_left and text_right columns of the frame.
all_tokens = texts.sum().sum() result:
texts.sum().sum() should be one flat list of all the tokens from the left and right texts.
vocab_unit.fit(all_tokens): fits the unit on the tokens.
```python
for vocab in 'how', 'are', 'glacier':
    print(vocab, vocab_unit.state['term_index'][vocab])
```

vocab_unit.state['term_index'][vocab] returns the integer id assigned to the word vocab (not an occurrence count).
As shown:
```python
data_pack.apply_on_text(vocab_unit.transform, inplace=True)
data_pack.frame()[:5]
```

vocab_unit.transform: some flavor of word-to-id conversion...
data_pack.frame()[:5] result:
```python
data_pack = mz.datasets.toy.load_data()
vocab_unit = mz.build_vocab_unit(data_pack)
data_pack.apply_on_text(vocab_unit.transform).frame[:5]
```

mz.build_vocab_unit(data_pack): builds a vocabulary unit from the given data pack.
Returns: A built vocabulary unit.
data_pack.apply_on_text(vocab_unit.transform).frame[:5]: vocab_unit.transform is the function to apply; it converts every word to its id, one id per word.
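A minimal sketch of the word-to-id table a vocabulary unit builds (the real VocabularyUnit also reserves entries for padding and out-of-vocabulary words, which I'm skipping here):

```python
# A flat token list, like the output of texts.sum().sum() above (made-up tokens).
tokens = ["how", "are", "glacier", "caves", "formed", "how", "are"]

# Build a term -> id table: each distinct token gets a unique integer id.
term_index = {}
for tok in tokens:
    if tok not in term_index:
        term_index[tok] = len(term_index) + 1  # leave 0 free for padding

# transform-style step: replace every token with its id.
ids = [term_index[tok] for tok in tokens]
```

Repeated words ("how", "are") map back to the ids they were assigned on first sight, so the same word always becomes the same id.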
```python
data = mz.datasets.toy.load_data()
preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False)
data = preprocessor.fit_transform(data, verbose=0)
```

preprocessor = mz.preprocessors.DSSMPreprocessor(with_word_hashing=False): creates a DSSM preprocessor with word hashing turned off.
data = preprocessor.fit_transform(data, verbose=0): calls fit then transform.
verbose controls logging:
- verbose = 0: no log output to stdout
- verbose = 1: progress bar
- verbose = 2: one line per epoch
```python
model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.guess_and_fill_missing_params(verbose=0)
model.build()
model.compile()
```

input_shapes: the model's input shapes (sort of). "Dependent on the model and data. Should be set manually."
model.guess_and_fill_missing_params(verbose=0): "Use this method to automatically fill-in hyper parameters. This involves some guessing so the parameter it fills could be wrong. For example, the default task is Ranking, and if we do not set it to Classification manually for data packs prepared for classification, then the shape of the model output and the data will mismatch."
In short, it guesses and fills in whatever parameters are missing.
model.build() and model.compile() do exactly what they say.
```python
term_index = preprocessor.context['vocab_unit'].state['term_index']
hashing_unit = mz.processor_units.WordHashingUnit(term_index)
data_generator = mz.DynamicDataGenerator(hashing_unit.transform, data, batch_size=1)
model.fit_generator(data_generator)
```

preprocessor.context['vocab_unit'].state['term_index'] result:
mz.processor_units.WordHashingUnit(term_index): performs the word hashing.
Result:
mz.DynamicDataGenerator(hashing_unit.transform, data, batch_size=1):
DynamicDataGenerator(): "Data generator with preprocess unit inside."
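A rough sketch of the two ideas at play here, in plain Python rather than the matchzoo classes: DSSM-style letter-trigram word hashing, and a generator that applies a preprocess function batch by batch instead of up front.

```python
def letter_trigrams(word):
    """DSSM-style word hashing: mark word boundaries, then slide a 3-char window."""
    marked = "#" + word + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def dynamic_batches(transform, rows, batch_size=2):
    """Yield batches with the preprocess step applied on the fly,
    mimicking a data generator with a preprocess unit inside."""
    for start in range(0, len(rows), batch_size):
        yield [transform(row) for row in rows[start:start + batch_size]]

trigrams = letter_trigrams("good")  # ['#go', 'goo', 'ood', 'od#']
batches = list(dynamic_batches(letter_trigrams, ["ice", "cave"], batch_size=1))
```

Hashing at generation time (rather than during fit_transform) keeps the preprocessed data small, since the trigram vectors only exist one batch at a time.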
fit_generator(): see keras.models.Model.fit_generator(...) for more details (that's literally all the official docs say...)
model.fit_generator(data_generator) result: