Text Classification

Latest recommended article published on 2022-04-15 18:10:57

涤生（bluez）

Column: 数据科学入门到精通 (Data Science: From Beginner to Mastery). Tags: data science

Article link: https://blog.csdn.net/weixin_40903057/article/details/95367579

Reading and preprocessing the corpus

import pandas as pd
df_news=pd.read_table(r'C:\Users\CDAer\Desktop\data\car.txt',
                   names=['category','theme','url','content'])

import jieba
# Extract the news content and convert it to a list
content_list=df_news['content'].values.tolist()
# Load the stop words
stopwords=pd.read_csv(r'C:\Users\CDAer\Desktop\data\stopwords.txt',sep='\t',
                     quoting=3,names=['stopword'])
# A set makes the membership test in the loop much faster than a list
stopwords_set=set(stopwords['stopword'].values.tolist())
# Tokenize each article; drop newlines, single-character tokens, and stop words
contents_clean=[]
for line in content_list:
    seg=jieba.lcut(line)
    line_clean=''
    for word in seg:
        if word =='\n' or len(word)<=1:
            continue
        elif word in stopwords_set:
            continue
        else:
            line_clean=line_clean+' '+word
    contents_clean.append(line_clean)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\CDAer\AppData\Local\Temp\jieba.cache
Loading model cost 1.112 seconds.
Prefix dict has been built succesfully.
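
The filtering rule in the loop above can be isolated into a small function, which makes it easy to check on a few tokens before running it over the whole corpus. This is only a sketch: `clean_tokens`, `min_len`, and the sample tokens are hypothetical names and data, the token list stands in for the output of `jieba.lcut(line)`, and unlike the loop above it also strips surrounding whitespace from each token.

```python
def clean_tokens(tokens, stopwords, min_len=2):
    """Keep tokens that are at least min_len characters long and not stop words."""
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) < min_len:
            continue  # drops newlines, blanks, and single-character tokens
        if token in stopwords:
            continue
        kept.append(token)
    # Joining once avoids the leading space produced by repeated concatenation
    return " ".join(kept)


# Hypothetical tokens standing in for the output of jieba.lcut(line)
sample_tokens = ["全新", "的", "概念车", "\n", "以及", "车展"]
sample_stopwords = {"以及"}
cleaned = clean_tokens(sample_tokens, sample_stopwords)
print(cleaned)
```

With these sample inputs the single-character token, the newline, and the stop word are dropped, and the remaining tokens are joined with single spaces.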

df_data=pd.DataFrame({'contents_clean':contents_clean,'category':df_news['category']})

df_data.head()

	contents_clean	category
0	经销商电话试驾订车杭州滨江区江陵保常自魄白云大道北广州市天河区黄...	汽车
1	呼叫热线服务邮箱	汽车
2	品牌二月公布最新概念车效果图日内瓦车展品牌带来全新概念车性能车型...	汽车
3	清仓甩卖一汽夏利威志低至启新中国一汽强势推出一汽夏利威志清仓 ...	汽车
4	日内瓦车展见到高尔夫家族成员高尔夫敞篷版全新敞篷车众多高尔夫车迷 ...	汽车

Preparing the labels

df_data.category.unique()

array(['汽车', '财经', '科技', '健康', '体育', '教育', '文化', '军事', '娱乐', '时尚'],
      dtype=object)

label_map={'汽车':1,'财经':2,'科技':3,'健康':4,'教育':5,'文化':6,'军事':7,'娱乐':8,'体育':9,'时尚':10}

df_data['category']=df_data['category'].map(label_map)

df_data.head()

	contents_clean	category
0	经销商电话试驾订车杭州滨江区江陵保常自魄白云大道北广州市天河区黄...	1
1	呼叫热线服务邮箱	1
2	品牌二月公布最新概念车效果图日内瓦车展品牌带来全新概念车性能车型...	1
3	清仓甩卖一汽夏利威志低至启新中国一汽强势推出一汽夏利威志清仓 ...	1
4	日内瓦车展见到高尔夫家族成员高尔夫敞篷版全新敞篷车众多高尔夫车迷 ...	1
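
A hand-written mapping silently produces missing values if a category is left out, because `Series.map` returns NaN for keys that are not in the dictionary. A minimal sketch of a safer encoding step follows. `build_label_map` and `encode_labels` are hypothetical helpers, and the ids they assign follow order of first appearance, so they will not necessarily match the hand-written `label_map` above.

```python
def build_label_map(categories, start=1):
    """Assign consecutive integer ids to categories in order of first appearance."""
    label_map = {}
    for category in categories:
        if category not in label_map:
            label_map[category] = start + len(label_map)
    return label_map


def encode_labels(categories, label_map):
    """Map categories to ids, failing loudly if any category has no id."""
    missing = sorted(set(categories) - set(label_map))
    if missing:
        raise KeyError(f"categories without a label: {missing}")
    return [label_map[category] for category in categories]


# Hypothetical sample; in the article this would be df_data['category']
sample = ["汽车", "财经", "汽车", "科技"]
label_map = build_label_map(sample)
encoded = encode_labels(sample, label_map)
print(label_map, encoded)
```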

Splitting into training and test sets

from sklearn.feature_extraction.text import TfidfVectorizer

vec=TfidfVectorizer(analyzer='word',max_features=2000,lowercase=False)
vec.fit(df_data['contents_clean'].values)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=2000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

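# new_data is never assigned in the original listing; presumably it is the
# whole corpus transformed by the fitted vectorizer
new_data=vec.transform(df_data['contents_clean'].values)
new_data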

<5000x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 214219 stored elements in Compressed Sparse Row format>
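
To make the numbers inside that sparse matrix less opaque, here is a pure-Python sketch of the weighting that the settings printed above imply (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`): raw term counts, multiplied by idf = ln((1 + n) / (1 + df)) + 1, then each row scaled to unit length. `tfidf_rows` is a hypothetical helper. It splits on whitespace only and ignores `max_features` and the `token_pattern` filter, so it illustrates the formula rather than replacing `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF for whitespace-tokenized docs: raw counts, smoothed idf, L2 norm."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows, idf


# Hypothetical two-document corpus, already segmented and space-joined
rows, idf = tfidf_rows(["车展 概念车 车展", "车展 股票"])
for row in rows:
    print(row)
```

A term that occurs in every document gets an idf of exactly 1 under this formula, so it is down-weighted relative to rarer terms but never removed.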

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(df_data['contents_clean'].values,
                                df_data['category'],test_size=0.2,random_state=1)

type(X_train)

numpy.ndarray

len(X_train[1])
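
The idea behind the split can be sketched without scikit-learn: shuffle the row indices with a fixed seed, then cut off a share of them for testing. `shuffle_split` is a hypothetical helper. It rounds the test share up, which is an assumption of this sketch, and it will not reproduce the exact partition that `train_test_split(..., random_state=1)` returns.

```python
import math
import random


def shuffle_split(samples, labels, test_size=0.2, seed=1):
    """Shuffle indices with a fixed seed and hold out a test_size share."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(indices))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, idx):
        return [sequence[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical data: ten documents with alternating labels
docs = [f"doc{i}" for i in range(10)]
tags = [i % 2 for i in range(10)]
train_docs, test_docs, train_tags, test_tags = shuffle_split(docs, tags)
print(len(train_docs), len(test_docs))
```

Applied to the 5000 articles in the matrix above, a 0.2 test share corresponds to 4000 training and 1000 test documents.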

Vectorizing the training and test sets

from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer?

vec=TfidfVectorizer(analyzer='word',max_features=2000,lowercase=False)
vec.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=2000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

TfidfVectorizer.fit?

x=vec.transform(X_train)

x_test=vec.transform(X_test)
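
Fitting on `X_train` only and then calling `transform` on both sets means the vocabulary comes from the training data alone; words that appear only in the test set are simply ignored. The sketch below illustrates that behavior with plain counts. `fit_vocabulary` and `count_vector` are hypothetical helpers, and the alphabetical tie-break between equally frequent terms is an assumption of this sketch, not something taken from scikit-learn.

```python
from collections import Counter


def fit_vocabulary(train_docs, max_features):
    """Keep the max_features most frequent training terms and index them."""
    counts = Counter(term for doc in train_docs for term in doc.split())
    # Most frequent first; ties broken alphabetically to stay deterministic
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:max_features]
    return {term: idx for idx, term in enumerate(sorted(term for term, _ in top))}


def count_vector(doc, vocabulary):
    """Count only the terms the vocabulary knows; skip everything else."""
    vector = [0] * len(vocabulary)
    for term in doc.split():
        idx = vocabulary.get(term)
        if idx is not None:
            vector[idx] += 1
    return vector


# Hypothetical training documents and one unseen test document
vocabulary = fit_vocabulary(["车展 概念车 车展", "股票 基金 股票"], max_features=2)
test_vector = count_vector("车展 足球 股票 股票", vocabulary)
print(vocabulary, test_vector)
```

Here "足球" never occurred in training, so it contributes nothing to the test vector, and the two low-frequency training terms are cut by `max_features`.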

Training and testing the naive Bayes classifier

from sklearn.naive_bayes import MultinomialNB

cls=MultinomialNB()
cls.fit(x,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

cls.score(x_test,y_test)

0.806
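
For a classifier, `score` returns the mean accuracy on the given data, so 0.806 means about 81% of the test articles were assigned the correct category. To show what happens inside, here is a pure-Python sketch of multinomial naive Bayes with additive smoothing (`alpha=1.0`, as in the output above). `train_nb`, `predict_nb`, and `accuracy` are hypothetical helpers, and the sketch works on raw term counts, whereas the article feeds TF-IDF weights to `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocabulary = sorted({term for doc in docs for term in doc.split()})
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {term: math.log((term_counts[label][term] + alpha) / total)
                               for term in vocabulary},
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown terms are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for term in doc.split():
            if term in params["log_likelihood"]:
                score += params["log_likelihood"][term]
        scores[label] = score
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    """Share of documents whose predicted label equals the true label."""
    hits = sum(predict_nb(model, doc) == label for doc, label in zip(docs, labels))
    return hits / len(labels)


# Hypothetical toy corpus: label 1 for cars, label 2 for finance
model = train_nb(["车展 概念车", "车展 试驾", "股票 基金", "基金 利率"], [1, 1, 2, 2])
toy_accuracy = accuracy(model, ["概念车 试驾", "利率 股票"], [1, 2])
print(toy_accuracy)
```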
