Text Classification with Magpie


大太阳小白

Tags: keras, python, deep learning, text classification, magpie

Article link: https://blog.csdn.net/weixin_41579863/article/details/79718120


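Today I came across an interesting tool, magpie, for multi-label text classification (one text can belong to several categories at once), and I couldn't resist taking it for a spin...

First, where to find the project

Git repository: https://github.com/inspirehep/magpie

Project overview

magpie is a deep learning tool built on top of keras. It trains on a corpus whose documents are already labeled with categories, then predicts categories for unlabeled text. Take news articles labeled [Military], [Entertainment], [Real Estate], and so on: you feed these labeled articles to the program to train a model, and when you later hand it a new article, it reports the probability that the article belongs to each category. Of course, the same problem can also be tackled with Word2vec, fastText, textCNN, and similar approaches.

Third-party libraries used

The setup.py file lists the third-party libraries the project depends on; adjust the versions as needed.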

install_requires=[
        'nltk~=3.2',
        'numpy~=1.12',
        'scipy~=0.18',
        'gensim~=0.13',
        'scikit-learn~=0.18',
        'keras~=2.0',
        'h5py~=2.6',
    ],
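
The `~=` in these entries is pip's "compatible release" operator: `~=2.0` accepts any 2.x release at or above 2.0 but rejects 3.0. As a rough illustration of that rule, here is a minimal sketch; `parse_version` and `is_compatible` are hypothetical helpers written for this post (they are not part of pip or magpie) and only handle plain dotted numeric versions.

```python
def parse_version(text):
    """Turn '3.2.5' into (3, 2, 5). Only plain dotted numbers are supported."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(installed, spec):
    """Approximate the '~=' (compatible release) rule for simple versions.

    '~=X.Y' means: installed >= X.Y, and installed shares the leading
    components of the spec once the spec's last component is dropped.
    """
    have = parse_version(installed)
    want = parse_version(spec)
    prefix = want[:-1]
    return have >= want and have[:len(prefix)] == prefix


# Checking a few keras versions against 'keras~=2.0'
for version in ["2.0.8", "2.1.3", "3.0.0"]:
    print(version, is_compatible(version, "2.0"))
```

Real version strings can carry suffixes such as `rc1` or `.post1`, which this sketch does not try to handle.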

If you don't have this environment set up yet, you can install everything the official way.

pip install git+https://github.com/inspirehep/magpie.git@v2.0

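Alternatively, download the project and install it with setup.py. I already had plenty of third-party libraries installed, including the ones above, so I did not install magpie into my Python environment at all; I simply copied the magpie module folder into my own project.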

A look at the training data

Categories: cat data/hep-categories.labels shows 5 categories in total

Astrophysics
Experiment-HEP
Gravitation and Cosmology
Phenomenology-HEP
Theory-HEP

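Training data: under data/hep-categories. Each sample consists of two files, a .lab file holding its category labels and a .txt file holding the text. For example:

The .lab file looks like this, marking two categories: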

Astrophysics
Theory-HEP

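The .txt file looks like this, holding the text that belongs to the categories above:

DBI Inflation in N=1 Supergravity
It was recently demonstrated that, .... (the text is long, so the rest is omitted)

I counted 2000 files in total, which means 1000 labeled samples.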
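
To double-check the layout described above, the pairs and label frequencies can be counted with a few lines of Python. This is only a sketch: `summarize_corpus` is a hypothetical helper of mine, and it assumes each .lab file holds one label per line, as in the example shown earlier.

```python
import os
from collections import Counter


def summarize_corpus(corpus_dir):
    """Count .txt/.lab pairs and tally how often each label occurs."""
    names = os.listdir(corpus_dir)
    txt_stems = {name[:-4] for name in names if name.endswith(".txt")}
    lab_stems = {name[:-4] for name in names if name.endswith(".lab")}
    # A sample only counts if both of its files are present.
    paired = sorted(txt_stems & lab_stems)

    label_counts = Counter()
    for stem in paired:
        with open(os.path.join(corpus_dir, stem + ".lab")) as handle:
            # One label per line; blank lines are skipped.
            label_counts.update(line.strip() for line in handle if line.strip())
    return len(paired), label_counts
```

Calling `summarize_corpus('data/hep-categories')` returns the number of complete samples together with a per-label count, which also makes it easy to spot a .txt file that is missing its .lab partner.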

Training

from magpie import Magpie
magpie = Magpie()
magpie.train_word2vec('data/hep-categories', vec_dim=100)  # train a word2vec model
magpie.fit_scaler('data/hep-categories')  # fit the scaler
magpie.init_word_vectors('data/hep-categories', vec_dim=100)  # initialize the word vectors
labels = ['Gravitation and Cosmology', 'Experiment-HEP', 'Theory-HEP']  # the categories to train on
magpie.train('data/hep-categories', labels, test_ratio=0.2, epochs=5)  # train for 5 epochs, holding out 20% of the data for testing
print(magpie.predict_from_text('Stephen Hawking studies black holes'))  # predict

Training output

Epoch 5/5
 64/800 [=>............................] - ETA: 4s - loss: 0.3070 - top_k_categorical_accuracy: 1.0000
128/800 [===>..........................] - ETA: 3s - loss: 0.3598 - top_k_categorical_accuracy: 1.0000
192/800 [======>.......................] - ETA: 3s - loss: 0.3554 - top_k_categorical_accuracy: 1.0000
256/800 [========>.....................] - ETA: 2s - loss: 0.3560 - top_k_categorical_accuracy: 1.0000
320/800 [===========>..................] - ETA: 2s - loss: 0.3428 - top_k_categorical_accuracy: 1.0000
384/800 [=============>................] - ETA: 2s - loss: 0.3444 - top_k_categorical_accuracy: 1.0000
448/800 [===============>..............] - ETA: 1s - loss: 0.3419 - top_k_categorical_accuracy: 1.0000
512/800 [==================>...........] - ETA: 1s - loss: 0.3331 - top_k_categorical_accuracy: 1.0000
576/800 [====================>.........] - ETA: 1s - loss: 0.3408 - top_k_categorical_accuracy: 1.0000
640/800 [=======================>......] - ETA: 0s - loss: 0.3402 - top_k_categorical_accuracy: 1.0000
704/800 [=========================>....] - ETA: 0s - loss: 0.3420 - top_k_categorical_accuracy: 1.0000
768/800 [===========================>..] - ETA: 0s - loss: 0.3437 - top_k_categorical_accuracy: 1.0000
800/800 [==============================] - 4s - loss: 0.3423 - top_k_categorical_accuracy: 1.0000 - val_loss: 0.3859 - val_top_k_categorical_accuracy: 1.0000
[('Gravitation and Cosmology', 0.99533552), ('Theory-HEP', 0.64760602), ('Experiment-HEP', 0.1928149)]
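
The last line of the output is the prediction: a list of (label, score) pairs. The scores do not sum to 1, so each one looks like an independent per-label confidence rather than a share of a single distribution. One simple way to turn them into a final set of labels is a cutoff. The 0.5 threshold and the `select_labels` helper below are my own assumptions for illustration, not something magpie provides.

```python
def select_labels(predictions, threshold=0.5):
    """Keep the labels whose score is at or above the threshold."""
    return [label for label, score in predictions if score >= threshold]


# The prediction printed at the end of the training run above.
predictions = [
    ("Gravitation and Cosmology", 0.99533552),
    ("Theory-HEP", 0.64760602),
    ("Experiment-HEP", 0.1928149),
]
print(select_labels(predictions))
# ['Gravitation and Cosmology', 'Theory-HEP']
```

A sensible threshold depends on the data, so it is worth tuning on held-out samples rather than taking 0.5 for granted.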

Saving the model

magpie.save_word2vec_model('/save/my/embeddings/here')
magpie.save_scaler('/save/my/scaler/here', overwrite=True)
magpie.save_model('/save/my/model/here.h5')

Loading the model

magpie = Magpie(
    keras_model='/save/my/model/here.h5',
    word2vec_model='/save/my/embeddings/here',
    scaler='/save/my/scaler/here',
    labels=['cat', 'dog', 'cow']
)
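
The `labels` list here is a placeholder. Presumably it should be the same list, in the same order, that was passed to `train`, since that is what gives the model's outputs their meaning. The three save calls above do not obviously store the labels, so one option is to write them to a small file next to the model. The sketch below assumes exactly that; the helper names and the idea of a separate JSON file are mine, not magpie's.

```python
import json


def save_labels(labels, path):
    """Write the training labels to a JSON file, preserving their order."""
    with open(path, "w") as handle:
        json.dump(labels, handle)


def load_labels(path):
    """Read the labels back in the order they were saved."""
    with open(path) as handle:
        return json.load(handle)
```

The loaded list can then be handed to the constructor as `labels=load_labels(...)`, with the path pointing at wherever the file was saved.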

Problems I ran into

1. Warning: You are using pip version 8.1.1, however version 9.0.1 is available.

You should consider upgrading via the 'python -m pip install --upgrade pip' command.

FIX: python -m pip install --upgrade pip

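2. Error: type object 'NewBase' has no attribute 'is_abstract'

FIX: upgrade six with pip install six --upgrade. If the version appears unchanged afterwards, trace which path the six module is actually being imported from and remove that copy

or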

import sys
sys.path.remove('/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python')
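
To find out which copy of a module Python actually picks up, it is enough to import it and look at its `__file__` attribute. A minimal sketch follows; `module_location` is a hypothetical helper name, and `json` is used in the demo line only because it ships with Python.

```python
import importlib


def module_location(name):
    """Import a module by name and return the file it was loaded from."""
    module = importlib.import_module(name)
    # Built-in modules have no __file__, so fall back to None.
    return getattr(module, "__file__", None)


# Substitute 'six' here to see which copy of six is being imported.
print(module_location("json"))
```

If the printed path sits under the system Python directory rather than the environment that pip upgraded, that explains why the upgrade seemed to have no effect.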

3. How to switch the keras backend

vim ~/.keras/keras.json
{"image_data_format":"channels_last","epsilon":1e-07,"floatx":"float32","backend":"tensorflow"}

Change the backend value to either tensorflow or theano
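
The same edit can be scripted instead of done by hand in vim. The sketch below rewrites only the "backend" field and leaves the other settings alone; `set_keras_backend` is a hypothetical helper, and it assumes the config file already exists at the path shown above.

```python
import json
import os


def set_keras_backend(backend, config_path=None):
    """Rewrite the "backend" field of keras.json, keeping the other settings."""
    if config_path is None:
        config_path = os.path.expanduser("~/.keras/keras.json")
    with open(config_path) as handle:
        config = json.load(handle)
    config["backend"] = backend
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

For example, `set_keras_backend("theano")` switches to theano the next time keras is imported. As far as I know, Keras 2 also honors a KERAS_BACKEND environment variable that overrides this file, which is handy for a one-off run.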
