NLP for Beginners, Task 4: Text Classification with Deep Learning (1), Study Notes

Task 4: Text Classification with Deep Learning (1)


Author: 2tong
Unlike traditional machine learning, deep learning handles both feature extraction and classification.
Using FastText as an example, this note studies deep-learning-based text classification.

FastText Basics

FastText is a classic deep-learning word-representation method, and it is very simple: an Embedding layer maps each word into a dense space, the embeddings of all words in a sentence are averaged, and the result is used for classification.
FastText is therefore a three-layer neural network: an input layer, a hidden layer, and an output layer.
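The three-layer structure above can be sketched as a minimal forward pass (illustrative only, not the official implementation; weights here are random): an embedding lookup, an average over the sentence, then a linear layer with softmax.

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 2000, 100, 5
rng = np.random.default_rng(0)
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))   # input layer: word embeddings
W = rng.normal(size=(EMBED_DIM, NUM_CLASSES))          # output layer weights
b = np.zeros(NUM_CLASSES)                              # output layer biases

def fasttext_forward(token_ids):
    """Map a list of token ids to class probabilities."""
    hidden = embedding[token_ids].mean(axis=0)         # hidden layer: average of embeddings
    logits = hidden @ W + b
    exp = np.exp(logits - logits.max())                # numerically stable softmax
    return exp / exp.sum()

probs = fasttext_forward([3, 17, 512])
```

Averaging makes the sentence representation independent of length, which is a large part of why FastText trains so quickly.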

Comparison with TF-IDF

For text classification, FastText generally outperforms TF-IDF:
- FastText builds a document vector by averaging word Embeddings, so similar sentences end up close together and fall into the same class
- The Embedding space FastText learns is low-dimensional, so training is fast

Related paper

1. Bag of Tricks for Efficient Text Classification

FastText Implementation and Open Source
Keras implementation of the FastText network structure
>>> from keras.models import Sequential
>>> from keras.layers import Embedding
>>> from keras.layers import GlobalAveragePooling1D
>>> from keras.layers import Dense
>>> VOCAB_SIZE=2000
>>> EMBEDDING_DIM=100
>>> MAX_WORDS=500
>>> CLASS_NUM=5
>>> model=Sequential()
>>> model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_WORDS))
>>> model.add(GlobalAveragePooling1D())
>>> model.add(Dense(CLASS_NUM, activation='softmax'))
>>> model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
>>> print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 500, 100)          200000
_________________________________________________________________
global_average_pooling1d_1 ( (None, 100)               0
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505
=================================================================
Total params: 200,505
Trainable params: 200,505
Non-trainable params: 0
_________________________________________________________________
None
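The parameter counts in the summary above can be verified by hand: the Embedding layer holds VOCAB_SIZE × EMBEDDING_DIM weights, the averaging layer has no weights at all, and the Dense layer has EMBEDDING_DIM × CLASS_NUM weights plus CLASS_NUM biases. A quick check:

```python
VOCAB_SIZE, EMBEDDING_DIM, CLASS_NUM = 2000, 100, 5

embedding_params = VOCAB_SIZE * EMBEDDING_DIM          # 2000 * 100 = 200000
pooling_params = 0                                     # averaging has no trainable weights
dense_params = EMBEDDING_DIM * CLASS_NUM + CLASS_NUM   # 100*5 weights + 5 biases = 505

total = embedding_params + pooling_params + dense_params
print(total)  # 200505, matching "Total params" in the summary
```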
Official open-source version

Reference: https://github.com/facebookresearch/fastText/tree/master/python
It can be installed either with pip or from source.
Here we install with pip:

pip install fasttext --user
Text classification with FastText

Here we use the official open-source FastText for text classification. When tuning parameters, focus on two things:
- Read the documentation to understand roughly what each parameter means, and which parameters increase model complexity
- Evaluate the model's accuracy on a validation set to determine whether it is overfitting or underfitting
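The tuning loop described above can be sketched as a plain hold-out grid search. The `train_and_score` callback here is hypothetical: it stands in for "train fastText with these parameters and return the macro F1 on the held-out validation set". The toy scoring function below is purely for illustration.

```python
from itertools import product

def tune(param_grid, train_and_score):
    """Try every parameter combination and keep the one with the best validation score."""
    best_params, best_score = None, float('-inf')
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(params)          # e.g. train fastText, return macro F1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in for training + validation scoring (peaks at lr=0.5, wordNgrams=2).
grid = {'lr': [0.1, 0.5, 1.0], 'wordNgrams': [1, 2]}
fake_score = lambda p: -(p['lr'] - 0.5) ** 2 + 0.1 * p['wordNgrams']

best, score = tune(grid, fake_score)
```

In practice each call to `train_and_score` would be a `fasttext.train_supervised(...)` run followed by `f1_score` on the validation rows, as in the snippets below.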

Initial version
>>> import pandas as pd
>>> from sklearn.metrics import f1_score
>>> train_df = pd.read_csv('/home/zhangtong/Games/Text_Categorization/data/train_set.csv', sep='\t', nrows=15000)
>>> train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
>>> train_df[['text','label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')
>>> import fasttext
>>> model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,verbose=2, minCount=1, epoch=25, loss="hs")
Read 9M words
Number of words:  5341
Number of labels: 14
Progress: 100.0% words/sec/thread:  946342 lr:  0.000000 avg.loss:  0.145202 ETA:   0h 0m 0s
>>> val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
>>> print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
0.8243054609686247
Adjusting parameters

Parameter overview:

    input             # training file path (required)
    lr                # learning rate [0.1]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [1]
    minCountLabel     # minimal number of label occurrences [1]
    minn              # min length of char ngram [0]
    maxn              # max length of char ngram [0]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    loss              # loss function {ns, hs, softmax, ova} [softmax]
    bucket            # number of buckets [2000000]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]
    t                 # sampling threshold [0.0001]
    label             # label prefix ['__label__']
    verbose           # verbose [2]
    pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []

Trying different parameters:

>>> import pandas as pd
>>> from sklearn.metrics import f1_score
>>> train_df = pd.read_csv('/home/zhangtong/Games/Text_Categorization/data/train_set.csv', sep='\t', nrows=15000)
>>> train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
>>> train_df[['text','label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')
>>> import fasttext
>>> model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,verbose=2, minCount=1, epoch=25, loss="softmax")
Read 9M words
Number of words:  5341
Number of labels: 14
Progress: 100.0% words/sec/thread:  917699 lr:  0.000000 avg.loss:  0.121626 ETA:   0h 0m 0s
>>> val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
>>> print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
0.8771684912925092
Tuning with a validation set
Common ways to split data for cross-validation:
  1. train_test_split
  2. Standard Cross Validation
  3. Stratified k-fold cross validation
  4. Leave-one-out Cross-validation
  5. Shuffle-split cross-validation
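As a small sketch of option 3 above, scikit-learn's StratifiedKFold keeps the label distribution the same across folds, which matters for imbalanced label sets like this task's 14 news categories. The data below is synthetic, for illustration only:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy dataset: 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ratios = []
for train_idx, val_idx in skf.split(X, y):
    ratios.append(y[val_idx].mean())  # fraction of class 1 in each validation fold

print(ratios)  # every fold preserves the 20% positive rate
```

A plain (non-stratified) k-fold on the same data could leave some folds with far fewer positives, making the per-fold macro F1 estimates much noisier.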