In June 2018 the SMP conference ran a competition: given news articles crawled from Toutiao, classify each one as written by a human author, machine translation, automatic summarization, or a machine author. I had not worked in this area for years, but after seeing a friend's introduction I wanted to use that data to practice an algorithm or two.
As it happened, I came across an expert online who had written about a similar text-classification competition he once entered and had shared his source code on GitHub, so I cloned his code for testing. Being a beginner I could barely follow it, and since I had Python 3 installed, it broke everywhere. After some debugging I rewrote the data-loading part, then picked a few simple features and fed them to xgboost. The best entries in the competition scored 99%; with only some fifty feature dimensions, mainly counts of the different POS tags found in each article after lexical analysis, the boosted xgboost model still reached a score of 95%. The best mainstream methods today are probably all deep learning, but this algorithm is simple, so I am posting it here as a reference for calling xgboost. The code is messy, but an introduction to xgboost together with the key calls below should convey the basic usage.
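Before the full script, here is a minimal sketch of the key xgboost calls, assuming the feature matrix comes out of the readtrain function below. The file name 'training.txt', the label encoding, and the parameter values are placeholders for illustration, not the settings used in the competition.

# Minimal xgboost usage sketch (file name and parameters are assumed, not tuned).
import numpy as np
import xgboost as xgb

labels, contents, ids, weights = readtrain('training.txt', posd)
X = np.array(weights)                                 # one feature vector per article
label_names = sorted(set(labels))                     # the four source classes
y = np.array([label_names.index(l) for l in labels])  # encode labels as 0..3

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softmax',   # multi-class classification
    'num_class': len(label_names),
    'max_depth': 6,
    'eta': 0.3,
    'eval_metric': 'merror',
}
bst = xgb.train(params, dtrain, num_boost_round=100)
pred = bst.predict(xgb.DMatrix(X))                    # predicted class indices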
# -*- coding: utf-8 -*-
import sys

import jieba
import jieba.posseg as pseg
import numpy as np
import xgboost as xgb

def readtrain(path, posd):
    """Load the training file and return labels, contents, ids and feature vectors."""
    labels = []
    contents = []
    ids = []
    weights = []
    with open(path, 'r') as f:
        for line in f:
            # Each line is a JSON-like record; drop the surrounding braces and
            # the trailing newline, then split the fields by hand.
            line = line[1:-2]
            tagwithv = line.split('", "')
            # Non-ASCII text is stored as \uXXXX escapes; round-trip through
            # latin1 + unicode-escape to decode it.
            label = bytes(tagwithv[0].split('": "')[1].rstrip(' "'), 'latin1').decode('unicode-escape')
            content = bytes(tagwithv[1].split('": "')[1].rstrip(' "\\'), 'latin1').decode('unicode-escape')
            art_id = int(tagwithv[2].split(':')[1].strip())
            labels.append(label)
            contents.append(content)
            ids.append(art_id)
            weights.append(getweights(content, posd))
            if art_id % 1000 == 999:
                print('Now %d articles have been processed!' % art_id)
            # break
    print('Totally there are %d articles in the training set' % len(labels))
    trains = (labels, contents, ids, weights)
    return trains
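# posd is assumed to map a lower-cased jieba POS tag, or its first letter as a
# fallback, to an index in the 50-slot counter used by getweights below, e.g.
# (values illustrative): posd = {'n': 0, 'v': 1, 'a': 2, 'ns': 3, ...}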
def getweights(content, posd):
    """Turn one article into a flat feature vector of simple counts."""
    weights = []
    weights.append(sys.getsizeof(content))  # 0. size of the string object in bytes (not the encoded length)
    weights.append(len(content))             # 1. number of characters
    words = pseg.cut(content)                # jieba segmentation with POS tags
    wnum = 0      # 2. number of words
    engnum = 0    # 3. number of English words
    digitnum = 0  # 4. number of digit words
    chnum = 0     # 5. number of Chinese words
    posn = [0] * 50  # one counter per POS tag tracked in posd
    for w in words:
        wnum += 1
        # print(w.word + " " + w.flag)
        # Classify the word by its first character.
        if is_chinese(w.word[0]):
            chnum += 1
        if is_alphabet(w.word[0]):
            engnum += 1
        if is_number(w.word[0]):
            digitnum += 1
        # Count the full POS tag if tracked, otherwise fall back to its first letter.
        if w.flag.lower() in posd:
            posn[posd[w.flag.lower()]] += 1
        elif w.flag.lower()[0] in posd:
            posn[posd[w.flag.lower()[0]]] += 1
    weights.append(wnum)
    weights.append(engnum)
    weights.append(digitnum)
    weights.append(chnum)
    weights = weights + posn  # 6-*: POS tag counts
    return weights
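# The finished vector is [object size, characters, words, English words,
# digit words, Chinese words] followed by the 50 POS counts: 56 dimensions
# per article, matching the "some fifty features" mentioned above.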
# Determine whether a unicode character is a Chinese character
def is_chinese(uchar):