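I came across an interesting project today, magpie, a tool for multi-label text classification, and could not resist taking it for a spin...
- Project address
Git repository: https://github.com/inspirehep/magpie
- Project overview
magpie is a deep learning tool built on top of keras. It trains on a corpus that has already been labeled with categories and then predicts labels for unlabeled text. Take news articles tagged [Military], [Entertainment], [Real estate] and so on: you feed the labeled articles to the program to train a model, and when you later hand it a new article, it tells you the probability that the article belongs to each category. The same problem can of course also be tackled with Word2vec-based features, fastText, textCNN and similar approaches.
- Third-party libraries used
The setup.py file lists the third-party libraries the project depends on; adjust them as needed.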
install_requires=[
'nltk~=3.2',
'numpy~=1.12',
'scipy~=0.18',
'gensim~=0.13',
'scikit-learn~=0.18',
'keras~=2.0',
'h5py~=2.6',
],
If you do not have this environment yet, you can install everything the way the project recommends:
pip install git+https://github.com/inspirehep/magpie.git@v2.0
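Alternatively, download the project and install it with setup.py. I already have plenty of third-party libraries installed, including all of the above, so I did not install magpie into my Python environment; I simply copied the magpie package folder into my own project.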
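If, like me, you skip the pip install and just copy the folder, it helps to confirm that the dependencies from setup.py are really present. Below is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `check_installed` is my own and is not part of magpie.

```python
from importlib import metadata

# Distribution names as they appear in magpie's setup.py.
REQUIRED = ["nltk", "numpy", "scipy", "gensim", "scikit-learn", "keras", "h5py"]


def check_installed(names):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in check_installed(REQUIRED).items():
        print("%-14s %s" % (name, version or "MISSING"))
```

This only reports what is installed; comparing against the `~=` version ranges is left out on purpose.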
- A look at the training data
Categories: cat data/hep-categories.labels shows 5 categories in total
Astrophysics
Experiment-HEP
Gravitation and Cosmology
Phenomenology-HEP
Theory-HEP
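Training data: under data/hep-categories, each sample consists of two files, a .lab file with the category labels and a .txt file with the text. For example:
Contents of a .lab file, which here marks two categories: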
Astrophysics
Theory-HEP
Contents of the matching .txt file, the text that carries the labels above:
DBI Inflation in N=1 Supergravity
It was recently demonstrated that, ....(remaining text omitted)
I counted 2000 files in total, which means 1000 labeled samples.
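The count above can be reproduced with a few lines of Python. This is a rough sketch based only on the layout described here (one .txt plus one .lab per sample, one label per line in the .lab file); `summarize_corpus` is a hypothetical helper of mine, not something magpie provides.

```python
import os
from collections import Counter


def summarize_corpus(data_dir):
    """Count samples that have both a .txt and a .lab file, and tally their labels."""
    names = os.listdir(data_dir)
    txt = {os.path.splitext(n)[0] for n in names if n.endswith(".txt")}
    lab = {os.path.splitext(n)[0] for n in names if n.endswith(".lab")}
    paired = sorted(txt & lab)  # keep only complete pairs

    label_counts = Counter()
    for stem in paired:
        with open(os.path.join(data_dir, stem + ".lab")) as handle:
            for line in handle:
                label = line.strip()
                if label:  # skip blank lines
                    label_counts[label] += 1
    return len(paired), label_counts


if __name__ == "__main__":
    data_dir = "data/hep-categories"  # path used in this post
    if os.path.isdir(data_dir):
        total, counts = summarize_corpus(data_dir)
        print("labeled samples:", total)
        for label, count in counts.most_common():
            print(label, count)
```

Because one sample can carry several labels, the label counts will usually add up to more than the number of samples.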
- Training
from magpie import Magpie
magpie = Magpie()
magpie.train_word2vec('data/hep-categories', vec_dim=100)  # train a word2vec model
magpie.fit_scaler('data/hep-categories')  # fit the scaler
magpie.init_word_vectors('data/hep-categories', vec_dim=100)  # initialize the word vectors (in magpie's source this appears to repeat the two calls above)
labels = ['Gravitation and Cosmology', 'Experiment-HEP', 'Theory-HEP']  # categories to train on (3 of the 5 listed above)
magpie.train('data/hep-categories', labels, test_ratio=0.2, epochs=5)  # train for 5 epochs, holding out 20% of the data for testing
print(magpie.predict_from_text('Stephen Hawking studies black holes'))  # predict
- Training results
Epoch 5/5
64/800 [=>............................] - ETA: 4s - loss: 0.3070 - top_k_categorical_accuracy: 1.0000
128/800 [===>..........................] - ETA: 3s - loss: 0.3598 - top_k_categorical_accuracy: 1.0000
192/800 [======>.......................] - ETA: 3s - loss: 0.3554 - top_k_categorical_accuracy: 1.0000
256/800 [========>.....................] - ETA: 2s - loss: 0.3560 - top_k_categorical_accuracy: 1.0000
320/800 [===========>..................] - ETA: 2s - loss: 0.3428 - top_k_categorical_accuracy: 1.0000
384/800 [=============>................] - ETA: 2s - loss: 0.3444 - top_k_categorical_accuracy: 1.0000
448/800 [===============>..............] - ETA: 1s - loss: 0.3419 - top_k_categorical_accuracy: 1.0000
512/800 [==================>...........] - ETA: 1s - loss: 0.3331 - top_k_categorical_accuracy: 1.0000
576/800 [====================>.........] - ETA: 1s - loss: 0.3408 - top_k_categorical_accuracy: 1.0000
640/800 [=======================>......] - ETA: 0s - loss: 0.3402 - top_k_categorical_accuracy: 1.0000
704/800 [=========================>....] - ETA: 0s - loss: 0.3420 - top_k_categorical_accuracy: 1.0000
768/800 [===========================>..] - ETA: 0s - loss: 0.3437 - top_k_categorical_accuracy: 1.0000
800/800 [==============================] - 4s - loss: 0.3423 - top_k_categorical_accuracy: 1.0000 - val_loss: 0.3859 - val_top_k_categorical_accuracy: 1.0000
[('Gravitation and Cosmology', 0.99533552), ('Theory-HEP', 0.64760602), ('Experiment-HEP', 0.1928149)]
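The last line is the prediction: a list of (label, score) pairs in which each score is independent, so several labels can be likely at once. To turn it into a final set of labels you have to pick a cutoff yourself. Here is a small sketch; the 0.5 threshold and the helper `labels_above` are my own assumptions, not something magpie prescribes.

```python
def labels_above(predictions, threshold=0.5):
    """Keep the labels whose score is at least `threshold` (0.5 is an arbitrary choice)."""
    return [label for label, score in predictions if score >= threshold]


# Output copied from the run above.
predictions = [
    ("Gravitation and Cosmology", 0.99533552),
    ("Theory-HEP", 0.64760602),
    ("Experiment-HEP", 0.1928149),
]

print(labels_above(predictions))
# prints ['Gravitation and Cosmology', 'Theory-HEP']
```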
- Saving the model
magpie.save_word2vec_model('/save/my/embeddings/here')
magpie.save_scaler('/save/my/scaler/here', overwrite=True)
magpie.save_model('/save/my/model/here.h5')
- Loading the model
magpie = Magpie(
keras_model='/save/my/model/here.h5',
word2vec_model='/save/my/embeddings/here',
scaler='/save/my/scaler/here',
labels=['Gravitation and Cosmology', 'Experiment-HEP', 'Theory-HEP']  # use the same label list the model was trained with
)
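- Problems encountered
1. Warning: You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
FIX: python -m pip install --upgrade pip
2. Error: type object 'NewBase' has no attribute 'is_abstract'
FIX: upgrade six with pip install six --upgrade. If the version still appears unchanged afterwards, find out which path six is actually being imported from and get rid of that copy,
or remove that path at runtime: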
import sys
sys.path.remove('/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python')
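To find out which copy of six Python actually picks up, you can ask the imported module where it lives. A minimal sketch; `module_origin` is a hypothetical helper name, and the code only relies on the standard `__file__` and `__version__` attributes.

```python
import importlib


def module_origin(name):
    """Import `name` and return (file path, version); either may be None if unavailable."""
    module = importlib.import_module(name)
    return getattr(module, "__file__", None), getattr(module, "__version__", None)


if __name__ == "__main__":
    try:
        path, version = module_origin("six")
        print("six %s loaded from %s" % (version, path))
    except ImportError:
        print("six is not importable in this environment")
```

If the printed path points at a system directory rather than your site-packages, that is the stale copy shadowing the upgraded one.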
3. How to switch the keras backend
vim ~/.keras/keras.json
{"image_data_format":"channels_last","epsilon":1e-07,"floatx":"float32","backend":"tensorflow"}
Change the value of backend to tensorflow or theano.
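If you would rather not edit the file by hand, the same change can be scripted. The sketch below rewrites only the backend key and leaves the other settings alone; `set_keras_backend` is a helper I made up for illustration, and it accepts just the two backends mentioned above.

```python
import json
import os


def set_keras_backend(config_path, backend):
    """Set the "backend" key in a keras.json file and return the resulting config."""
    if backend not in ("tensorflow", "theano"):
        raise ValueError("unsupported backend: %r" % (backend,))

    config = {}
    if os.path.exists(config_path):
        with open(config_path) as handle:
            config = json.load(handle)  # keep the existing settings

    config["backend"] = backend
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config


# Example call (not executed here, since it would modify your real config):
# set_keras_backend(os.path.expanduser("~/.keras/keras.json"), "theano")
```

As far as I know, Keras 2 also honors the KERAS_BACKEND environment variable, which is handy for switching backends for a single run without touching the file.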