In June 2018 the SMP conference ran a competition: given news articles crawled from Toutiao, classify each one as written by a human author, machine translation, automatic summarization, or a machine author. I had not worked in this area for years, but after seeing a friend's introduction I wanted to use that data to practice an algorithm or two.
As it happened, I came across an expert online who had written about a similar text-classification competition he once entered and had shared his source code on GitHub, so I cloned his code for testing. Being a beginner I could barely follow it, and since I had Python 3 installed, it broke everywhere. After some debugging I rewrote the data-loading part, then picked a few simple features and fed them to xgboost. The best entries in the competition scored 99%; with only some fifty feature dimensions, mainly counts of the different POS tags found in each article after lexical analysis, the boosted xgboost model still reached a score of 95%. The best mainstream methods today are probably all deep learning, but this algorithm is simple, so I am posting it here as a reference for calling xgboost. The code is messy, but an introduction to xgboost together with the key calls below should convey the basic usage.
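Before the full script, here is a minimal sketch of the key xgboost calls, assuming the feature matrix comes out of the readtrain function below. The file name 'training.txt', the label encoding, and the parameter values are placeholders for illustration, not the settings used in the competition.

# Minimal xgboost usage sketch (file name and parameters are assumed, not tuned).
import numpy as np
import xgboost as xgb

labels, contents, ids, weights = readtrain('training.txt', posd)
X = np.array(weights)                                 # one feature vector per article
label_names = sorted(set(labels))                     # the four source classes
y = np.array([label_names.index(l) for l in labels])  # encode labels as 0..3

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softmax',   # multi-class classification
    'num_class': len(label_names),
    'max_depth': 6,
    'eta': 0.3,
    'eval_metric': 'merror',
}
bst = xgb.train(params, dtrain, num_boost_round=100)
pred = bst.predict(xgb.DMatrix(X))                    # predicted class indices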
# -*- coding: utf-8 -*-
import sys

import jieba
import jieba.posseg as pseg
import numpy as np
import xgboost as xgb

def readtrain(path, posd):
    """Load the training file and return labels, contents, ids and feature vectors."""
    labels = []
    contents = []
    ids = []
    weights = []
    with open(path, 'r') as f:
        for line in f:
            # Each line is a JSON-like record; drop the surrounding braces and
            # the trailing newline, then split the fields by hand.
            line = line[1:-2]
            tagwithv = line.split('", "')
            # Non-ASCII text is stored as \uXXXX escapes; round-trip through
            # latin1 + unicode-escape to decode it.
            label = bytes(tagwithv[0].split('": "')[1].rstrip(' "'), 'latin1').decode('unicode-escape')
            content = bytes(tagwithv[1].split('": "')[1].rstrip(' "\\'), 'latin1').decode('unicode-escape')
            art_id = int(tagwithv[2].split(':')[1].strip())
            labels.append(label)
            contents.append(content)
            ids.append(art_id)
            weights.append(getweights(content, posd))
            if art_id % 1000 == 999:
                print('Now %d articles have been processed!' % art_id)
            # break
    print('Totally there are %d articles in the training set' % len(labels))
    trains = (labels, contents, ids, weights)
    return trains
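# posd is assumed to map a lower-cased jieba POS tag, or its first letter as a
# fallback, to an index in the 50-slot counter used by getweights below, e.g.
# (values illustrative): posd = {'n': 0, 'v': 1, 'a': 2, 'ns': 3, ...}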
def getweights(content, posd):
    """Turn one article into a flat feature vector of simple counts."""
    weights = []
    weights.append(sys.getsizeof(content))  # 0. size of the string object in bytes (not the encoded length)
    weights.append(len(content))             # 1. number of characters
    words = pseg.cut(content)                # jieba segmentation with POS tags
    wnum = 0      # 2. number of words
    engnum = 0    # 3. number of English words
    digitnum = 0  # 4. number of digit words
    chnum = 0     # 5. number of Chinese words
    posn = [0] * 50  # one counter per POS tag tracked in posd
    for w in words:
        wnum += 1
        # print(w.word + " " + w.flag)
        # Classify the word by its first character.
        if is_chinese(w.word[0]):
            chnum += 1
        if is_alphabet(w.word[0]):
            engnum += 1
        if is_number(w.word[0]):
            digitnum += 1
        # Count the full POS tag if tracked, otherwise fall back to its first letter.
        if w.flag.lower() in posd:
            posn[posd[w.flag.lower()]] += 1
        elif w.flag.lower()[0] in posd:
            posn[posd[w.flag.lower()[0]]] += 1
    weights.append(wnum)
    weights.append(engnum)
    weights.append(digitnum)
    weights.append(chnum)
    weights = weights + posn  # 6-*: POS tag counts
    return weights
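# The finished vector is [object size, characters, words, English words,
# digit words, Chinese words] followed by the 50 POS counts: 56 dimensions
# per article, matching the "some fifty features" mentioned above.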
# Determine whether a unicode character is a Chinese character
def is_chinese(uchar):