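Basic Usage of MatchZoo
Introduction
While researching text similarity methods recently, I came across MatchZoo, a general-purpose text matching toolkit designed to make it easy to implement, compare, and share the latest deep text matching models. It looked interesting, so I took a look at how to use it. This post is a brief record of that process: running the example code in the tutorials directory. If you have never used MatchZoo, the example code in the README.md on GitHub can be confusing at first; after working through the tutorial code, though, I realized that it captures the essence of MatchZoo. I was using it in November 2019, and this post may well become outdated as new versions are released: when I was looking for material, much of what other blog posts described already differed greatly from what I saw in my own tests.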
Installation
As the GitHub page notes, MatchZoo can be installed in two ways:
- Install from PyPI
pip install matchzoo
- Install from the GitHub source
git clone https://github.com/NTMC-Community/MatchZoo.git
cd MatchZoo
python setup.py install
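Models
MatchZoo currently ships many models. Because they are wrapped behind a common interface, using any of them is fairly simple and takes two main steps: first, create the model; second, set the model's parameters. The steps that follow make this clearer.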
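The two-step pattern (create an object, then fill in its parameters before building it) can be sketched without MatchZoo itself. ToyModel below is a hypothetical stand-in, not a MatchZoo class; only the parameter names task and mlp_num_units are borrowed from the tutorial code, and the behavior shown is an assumption made for illustration.

```python
class ToyModel:
    """Hypothetical model: parameters are filled in after construction."""

    REQUIRED = ("task", "mlp_num_units")

    def __init__(self):
        # Step 1: creating the model only sets up empty parameter slots.
        self.params = {name: None for name in self.REQUIRED}
        self.built = False

    def completed(self):
        # True once every required parameter has a value.
        return all(value is not None for value in self.params.values())

    def build(self):
        # Refuse to build while any parameter is still unset.
        if not self.completed():
            missing = [k for k, v in self.params.items() if v is None]
            raise ValueError(f"missing parameters: {missing}")
        self.built = True


model = ToyModel()                 # step 1: create the model
model.params["task"] = "ranking"   # step 2: set its parameters
model.params["mlp_num_units"] = 3
model.build()
```

Separating construction from configuration lets the same model class be reused with different settings, and lets values computed elsewhere (for example during preprocessing) be merged in before the model is built.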
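Running Quick Start
- Switch to the MatchZoo-master\tutorials directory (I installed from source, so I have this folder; a PyPI install also includes it, though under a different path, so you can search for quick_start.ipynb to locate the file). Then run the jupyter notebook command to start Jupyter Notebook.
If Jupyter is not installed on your machine, or you have not used it before, see an introduction to Jupyter Notebook.
- Workflow
a. Define the task
b. Prepare the data
c. Preprocess the data
d. Create the model
e. Train and evaluate
f. Predict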
The code below is a simple MatchZoo demo that covers the whole workflow above (it is extracted from the tutorial).

    import matchzoo as mz
    print(mz.__version__)

    # Define the task. Two kinds are available: Ranking and Classification.
    task = mz.tasks.Ranking()
    print(task)

    # Prepare the data. The toy dataset ships with the source tree; I am not
    # sure whether the pip package includes it.
    # train_raw is MatchZoo's own data structure,
    # matchzoo.data_pack.data_pack.DataPack.
    train_raw = mz.datasets.toy.load_data(stage='train', task=task)
    test_raw = mz.datasets.toy.load_data(stage='test', task=task)

    # Preprocess the data. BasicPreprocessor selects the preprocessing scheme,
    # which has two steps: fit and transform.
    # fit collects useful information into preprocessor.context and does not
    # modify the input DataPack.
    # transform changes neither the context nor the input DataPack; it returns
    # a new, transformed DataPack.
    # transform runs Tokenize => Lowercase => PuncRemoval and similar steps,
    # which should be customizable.
    preprocessor = mz.preprocessors.BasicPreprocessor()
    preprocessor.fit(train_raw)  # initialize the preprocessor's inner state
    train_processed = preprocessor.transform(train_raw)
    test_processed = preprocessor.transform(test_raw)

    # Create the model and set its parameters
    # (mz.models.list_available() lists the available models).
    model = mz.models.DenseBaseline()
    model.params['task'] = task
    model.params['mlp_num_units'] = 3
    model.params.update(preprocessor.context)
    model.params.completed()
    model.build()
    model.compile()
    model.backend.summary()

    # Train, evaluate, predict
    x, y = train_processed.unpack()
    test_x, test_y = test_processed.unpack()
    model.fit(x, y, batch_size=32, epochs=5)
    model.evaluate(test_x, test_y)
    model.predict(test_x)

    # Save and reload the model
    model.save('my-model')
    loaded_model = mz.load_model('my-model')
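The fit/transform contract described in the comments above can be illustrated with a small self-contained sketch. ToyPreprocessor is a hypothetical class written for this post, not MatchZoo's BasicPreprocessor, and the context keys term_index and vocab_size are illustrative names only. It assumes whitespace tokenization and strips punctuation from the ends of each token.

```python
import string


class ToyPreprocessor:
    """Hypothetical stand-in illustrating the fit/transform contract."""

    def __init__(self):
        self.context = {}  # filled by fit(), read by transform()

    @staticmethod
    def _clean(text):
        # Tokenize on whitespace, lowercase, strip surrounding punctuation.
        tokens = [tok.lower().strip(string.punctuation) for tok in text.split()]
        return [tok for tok in tokens if tok]

    def fit(self, texts):
        # Collect the vocabulary into context; the input is left untouched.
        vocab = sorted({tok for text in texts for tok in self._clean(text)})
        # Index 0 is reserved for tokens that were not seen during fit().
        self.context["term_index"] = {tok: i + 1 for i, tok in enumerate(vocab)}
        self.context["vocab_size"] = len(vocab) + 1
        return self

    def transform(self, texts):
        # Build and return a new list; neither texts nor context is modified.
        index = self.context["term_index"]
        return [[index.get(tok, 0) for tok in self._clean(text)] for text in texts]


train_texts = ["how are glacier caves formed?", "how did apollo creed die"]
test_texts = ["How are caves formed"]

preprocessor = ToyPreprocessor().fit(train_texts)
train_ids = preprocessor.transform(train_texts)
test_ids = preprocessor.transform(test_texts)
print(train_ids)
print(test_ids)
```

Fitting only on the training data and then reusing the same context for the test data keeps the two sets encoded consistently, which is why the demo calls fit once and transform twice.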
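- Running it in Jupyter Notebook
The example notebook contains more than the code above: it also covers the data structures and different ways of reading data. To make this easier to follow, here is the output I got from running the code in Jupyter Notebook; you can run it yourself as well.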
MatchZoo Quick Start
import matchzoo as mz
print(mz.__version__)
d:\software\python\python37\lib\site-packages\tqdm\std.py:651: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
from pandas import Panel
Using TensorFlow backend.
2.2.0
Define Task
There are two types of tasks available in MatchZoo: mz.tasks.Ranking and mz.tasks.Classification. We will use a ranking task for this demo.
task = mz.tasks.Ranking()
print(task)
Ranking Task
Prepare Data
train_raw = mz.datasets.toy.load_data(stage='train', task=task)
test_raw = mz.datasets.toy.load_data(stage='test', task=task)
type(train_raw)
matchzoo.data_pack.data_pack.DataPack
DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three pandas.DataFrame objects:
train_raw.left.head()
id_left | text_left |
---|---|
Q1 | how are glacier caves formed? |
Q2 | How are the directions of the velocity and for... |
Q5 | how did apollo creed die |
Q6 | how long is the term for federal judges |
Q7 | how a beretta model 21 pistols magazines works |
train_raw.right.head()
id_right | text_right |
---|---|
D1-0 | A partly submerged glacier cave on Perito More... |
D1-1 | The ice facade is approximately 60 m high |
D1-2 | Ice formations in the Titlis glacier cave |
D1-3 | A glacier cave is a cave formed within the ice... |
D1-4 | Glacier caves are often called ice caves , but... |
train_raw.relation.head()
index | id_left | id_right | label |
---|---|---|---|
0 | Q1 | D1-0 | 0.0 |
1 | Q1 | D1-1 | 0.0 |
2 | Q1 | D1-2 | 0.0 |
3 | Q1 | D1-3 | 1.0 |
4 | Q1 | D1-4 | 0.0 |
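The three tables are linked by id: each relation row points to one left text and one right text. A minimal sketch of that linkage follows, using plain dictionaries rather than pandas so that it runs without any dependencies. The helper to_frame is hypothetical, not a MatchZoo function; the rows are copied from the output above.

```python
# Three toy tables mirroring the left / right / relation layout.
left = {"Q1": "how are glacier caves formed?"}
right = {
    "D1-1": "The ice facade is approximately 60 m high",
    "D1-2": "Ice formations in the Titlis glacier cave",
}
relation = [
    {"id_left": "Q1", "id_right": "D1-1", "label": 0.0},
    {"id_left": "Q1", "id_right": "D1-2", "label": 0.0},
]


def to_frame(left, right, relation):
    """Join each relation row with the two texts it refers to."""
    rows = []
    for rel in relation:
        rows.append({
            "id_left": rel["id_left"],
            "text_left": left[rel["id_left"]],
            "id_right": rel["id_right"],
            "text_right": right[rel["id_right"]],
            "label": rel["label"],
        })
    return rows


frame = to_frame(left, right, relation)
for row in frame:
    print(row)
```

Storing each text once and referring to it by id avoids repeating the same query for every candidate document; the flat, joined view is only built when it is needed.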
It is also possible to convert a DataPack into a single pandas.DataFrame that holds all information.
train_raw.frame().head()
index | id_left | text_left | id_right | text_right | label |
---|---|---|---|---|---|
0 | Q1 | how are glacier caves formed? | D1-0 | A partly submerged glacier cave on Perito More... | 0.0 |
1 | Q1 | how are glacier caves formed? | D1-1 | The ice facade is approximately 60 m high | 0.0 |
2 | Q1 | how are glacier caves formed? | D1-2 | Ice formations in the Titlis glacier cave | 0.0 |