Text Classification

bbzz2

This post tests fastText, Facebook's open-source neural model for text classification.
Installing the fasttext Python package:

pip install fasttext
   

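Step 1: obtain the text to classify. I used the Tsinghua University news corpus directly; the download link is in the third post of this text series.
Data format: one sample per line, the sample text followed by its label.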
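To make the format concrete, here is a minimal sketch of how one line is assembled and parsed back. The helper names `make_line` and `parse_line` are illustrative and not part of fastText; the tab separator and the `__label__` prefix follow the preparation script in this post, and the tokens are assumed to be segmented already.

```python
LABEL_PREFIX = "__label__"


def make_line(tokens, label):
    """Join segmented tokens with spaces, then append a tab and the prefixed label."""
    # Tabs and newlines inside the text would break the one-sample-per-line layout.
    text = " ".join(tokens).replace("\t", " ").replace("\n", " ")
    return text + "\t" + LABEL_PREFIX + label + "\n"


def parse_line(line):
    """Split a line back into (text, label); the label is the last tab-separated field."""
    text, _, tag = line.rstrip("\n").rpartition("\t")
    return text, tag.replace(LABEL_PREFIX, "", 1)


line = make_line(["清华", "大学", "新闻"], "edu")
print(repr(line))
print(parse_line(line))
```

Keeping the label in the last field means the text can be recovered with a single split from the right.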

import os

import jieba

basedir = "/home/li/corpus/news/"
dir_list = ['affairs','constellation','economic','edu','ent','fashion','game','home','house','lottery','science','sports','stock']

# Generate the fastText training and test sets
with open("news_fasttext_train.txt", "w", encoding="utf-8") as ftrain, \
        open("news_fasttext_test.txt", "w", encoding="utf-8") as ftest:
    for e in dir_list:
        indir = os.path.join(basedir, e)
        for count, file in enumerate(os.listdir(indir), start=1):
            # Use at most the first 19999 files of each class
            if count >= 20000:
                break
            with open(os.path.join(indir, file), "r", encoding="utf-8") as fr:
                text = fr.read()
            # Segment with jieba; tabs and newlines are replaced so each sample stays on one line
            seg_text = jieba.cut(text.replace("\t", " ").replace("\n", " "))
            outline = " ".join(seg_text) + "\t__label__" + e + "\n"

            # Files 1-9999 of each class go to the training set, files 10000-19999 to the test set
            if count < 10000:
                ftrain.write(outline)
            else:
                ftest.write(outline)

Prepared data: Baidu Netdisk download
news_fasttext_train.txt
news_fasttext_test.txt

# _*_coding:utf-8 _*_
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
   

Step 2: classify with fastText, using its Python package.

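import fasttext

# Train the model. fasttext.supervised is the API of the older fasttext Python package (0.8.x);
# current releases replace it with fasttext.train_supervised.
classifier = fasttext.supervised("news_fasttext_train.txt", "news_fasttext.model", label_prefix="__label__")

# Load a previously trained model instead
# classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')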

   

# Evaluate the model on the test set
result = classifier.test("news_fasttext_test.txt")
print(result.precision)
print(result.recall)

   

0.92240420242
0.92240420242
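The `fasttext.supervised` call above belongs to the older Python package. With the current official bindings the rough equivalent looks like the sketch below. It assumes a recent `fasttext` release is installed; `train_and_test` and `strip_prefix` are illustrative names, and the shape of the `test_label` result (a dictionary keyed by prefixed label) is my understanding of the bindings rather than something verified in this post.

```python
LABEL_PREFIX = "__label__"


def strip_prefix(label, prefix=LABEL_PREFIX):
    """Remove the fastText label prefix, e.g. '__label__edu' -> 'edu'."""
    return label[len(prefix):] if label.startswith(prefix) else label


def train_and_test(train_path, test_path, model_path="news_fasttext.model.bin"):
    """Train, save, and evaluate a classifier with the current official bindings."""
    import fasttext  # imported lazily so strip_prefix works without the package

    model = fasttext.train_supervised(input=train_path, label=LABEL_PREFIX)
    model.save_model(model_path)  # reload later with fasttext.load_model(model_path)

    # test returns (number of samples, precision@1, recall@1)
    n, precision, recall = model.test(test_path)

    # Assumed: per-label metrics keyed by the prefixed label
    per_label = model.test_label(test_path)
    per_label = {strip_prefix(k): v for k, v in per_label.items()}
    return model, (n, precision, recall), per_label
```

In these bindings `model.predict(text)` returns a pair of labels and probabilities, and the labels keep the `__label__` prefix, which is why the sketch strips it before comparing against plain class names.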

fastText appears to report only overall precision and recall, so per-class statistics require some code of your own.

labels_right = []
texts = []
with open("news_fasttext_test.txt", encoding="utf-8") as fr:
    for line in fr:
        # Each line is "<segmented text>\t__label__<class>"
        text, label = line.rstrip("\n").rsplit("\t", 1)
        labels_right.append(label.replace("__label__", ""))
        texts.append(text)

# predict returns one list of labels per text; keep the top label
labels_predict = [e[0] for e in classifier.predict(texts)]

text_labels = list(set(labels_right))
text_predict_labels = list(set(labels_predict))
print(text_predict_labels)
print(text_labels)

A = dict.fromkeys(text_labels, 0)          # correct predictions per class
B = dict.fromkeys(text_labels, 0)          # test samples per class
C = dict.fromkeys(text_predict_labels, 0)  # predictions per class
for right, predicted in zip(labels_right, labels_predict):
    B[right] += 1
    C[predicted] += 1
    if right == predicted:
        A[right] += 1

print(A)
print(B)
print(C)
# Precision, recall, and F-score per class
for key in B:
    p = A[key] / C[key] if C.get(key) else 0.0  # precision: correct / predicted as this class
    r = A[key] / B[key]                         # recall: correct / actually in this class
    f = p * r * 2 / (p + r) if p + r else 0.0
    print("%s:\tp:%f\tr:%f\tf:%f" % (key, p, r, f))


   

Per-class counts from the experiment

Predicted labels: ['affairs', 'fashion', 'lottery', 'house', 'science', 'sports', 'game', 'economic', 'ent', 'edu', 'home', 'constellation', 'stock']
Test-set labels: ['affairs', 'fashion', 'house', 'sports', 'game', 'economic', 'ent', 'edu', 'home', 'stock', 'science']
A: {'science': 8415, 'affairs': 8257, 'fashion': 3173, 'house': 9491, 'sports': 9739, 'game': 9506, 'economic': 9235, 'ent': 9665, 'edu': 9491, 'home': 9315, 'stock': 9015}
B: {'science': 10000, 'affairs': 10000, 'fashion': 3369, 'house': 10000, 'sports': 10000, 'game': 10000, 'economic': 10000, 'ent': 10000, 'edu': 10000, 'home': 10000, 'stock': 10000}
C: {'affairs': 8562, 'fashion': 3585, 'lottery': 96, 'science': 9088, 'edu': 10068, 'sports': 10099, 'game': 10151, 'economic': 10131, 'ent': 10798, 'house': 10000, 'home': 10103, 'constellation': 432, 'stock': 10256}

# Experimental results

science:    p:0.925946  r:0.841500  f:0.881706
affairs:    p:0.964377  r:0.825700  f:0.889667
fashion:    p:0.885077  r:0.941822  f:0.912568
house:      p:0.949100  r:0.949100  f:0.949100
sports:     p:0.964353  r:0.973900  f:0.969103
game:       p:0.936459  r:0.950600  f:0.943477
economic:   p:0.911559  r:0.923500  f:0.917490
ent:        p:0.895073  r:0.966500  f:0.929416
edu:        p:0.942690  r:0.949100  f:0.945884
home:       p:0.922003  r:0.931500  f:0.926727
stock:      p:0.878998  r:0.901500  f:0.890107
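As a sanity check, one row can be recomputed by hand from the counts in A, B, and C. The sketch below takes precision as correct / predicted and recall as correct / actual, and uses the science counts (8415 correct, 10000 in the test set, 9088 predicted). The function name `prf` is illustrative.

```python
def prf(correct, actual, predicted):
    """Return (precision, recall, F1) from the counts of a single class."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts for the science class, taken from dictionaries A, B, and C
p, r, f = prf(correct=8415, actual=10000, predicted=9088)
print("science: p=%.6f r=%.6f f=%.6f" % (p, r, f))
```

Since F1 simplifies to 2A / (B + C), the F-score is the same whichever of the two ratios is labelled precision.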
   

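The results show that fastText classifies well: with no parameter tuning at all, the per-class F-scores range from about 0.88 to 0.97. However, the predictions somehow include an extra class, constellation. Could it be... still looking into the cause.
Correction (2016/11/7): dictionary B shows that the test set contains no lottery or constellation samples. Both classes have fewer than 10000 articles, so taking the first 10000 of each class for training left nothing for the test set. When preparing the data in Step 1, either size the training/test split for these two classes according to how many articles they have or, more bluntly, drop them because they do not reach the required count.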
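Both options from the correction can be sketched in one helper. This is a hypothetical illustration, not the code used for the experiment: `split_class`, the 50/50 ratio, and the `min_files` and `max_files` thresholds are all assumptions chosen for the example.

```python
def split_class(files, train_ratio=0.5, min_files=2000, max_files=20000):
    """Split one class's file list into (train, test) in proportion to its size.

    Returns None when the class has fewer than min_files files, meaning the
    class should be dropped from both sets.
    """
    files = sorted(files)[:max_files]  # sorted so the split is reproducible
    if len(files) < min_files:
        return None
    cut = int(len(files) * train_ratio)
    return files[:cut], files[cut:]


# Synthetic file names, used only to show the behaviour
demo = ["%05d.txt" % i for i in range(5000)]
train, test = split_class(demo)
print(len(train), len(test))
print(split_class(demo[:100]))
```

Splitting by ratio guarantees that every class kept in training also appears in the test set, which avoids predicted labels that have no test samples.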
