Chinese Word Segmentation for Data Analysis

Latest recommended article published on 2022-10-06 09:56:45

weixin_39759881


Article tags: Chinese word segmentation

Article link: https://blog.csdn.net/weixin_39759881/article/details/111708403


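This article introduces Chinese word segmentation with the jieba package and the Baidu AI Open Platform, where the natural language processing API is called after obtaining an AccessToken. The main steps are converting the dictionary-style results into a DataFrame, counting word frequencies, and drawing a word cloud. In a real analysis, synonyms also need to be handled.

Summary generated automatically by CSDN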

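The jieba package is generally regarded as the best choice for Chinese word segmentation in Python. Alternatively, we can use Baidu's AI Open Platform: obtain an AccessToken, then call its natural language processing API to segment Chinese text.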

Main steps:

# Import the requests package
import requests

# client_id is the AK and client_secret is the SK, both obtained from the official site
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=【AK from the official site】&client_secret=【SK from the official site】'
response = requests.get(host)

# Use the AK and SK to obtain the access_token
if response:
    access_token = response.json()['access_token']

# Set the header parameters as required
headers = {'content-type': 'application/json'}

# Build the url
url = 'https://aip.baidubce.com/rpc/2.0/nlp/v1/lexer?charset=UTF-8&access_token=' + access_token
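As a hedged sketch, the two URLs can also be built from parameter dictionaries rather than assembled by hand, which keeps the AK and SK out of string literals and leaves escaping to the standard library. The helper names `build_token_url` and `build_lexer_url` and the placeholder credentials are assumptions for illustration; the endpoints are the ones used above.

```python
from urllib.parse import urlencode

TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"
LEXER_ENDPOINT = "https://aip.baidubce.com/rpc/2.0/nlp/v1/lexer"


def build_token_url(api_key: str, secret_key: str) -> str:
    """Return the OAuth token URL for an AK/SK pair (hypothetical helper)."""
    params = {
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    }
    return f"{TOKEN_ENDPOINT}?{urlencode(params)}"


def build_lexer_url(access_token: str) -> str:
    """Return the lexer URL carrying the access token (hypothetical helper)."""
    params = {"charset": "UTF-8", "access_token": access_token}
    return f"{LEXER_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials, not real keys.
token_url = build_token_url("YOUR_AK", "YOUR_SK")
lexer_url = build_lexer_url("YOUR_TOKEN")
print(token_url)
print(lexer_url)
```

The resulting strings can be passed to `requests.get` and `requests.post` exactly as `host` and `url` are above.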

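# Read the text file to analyze
# (the path is reproduced as it appears in the source; point it at your own file)
with open('C:est.txt', 'rb') as f:
    news = f.read().decode('utf-8')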

import json  # needed to serialize the request body

response = requests.post(url,
                         data=json.dumps({'text': news}),  # the news variable goes here
                         headers=headers)
if response:
    result = response.json()['items']
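The `if response:` checks skip the assignment silently when a call fails, so a later line raises a `NameError` far from the actual cause. A minimal sketch of a stricter check follows. It assumes that a failed call returns JSON carrying `error_code` and `error_msg` fields instead of `items`; the helper name `extract_items` and both sample payloads are made up for illustration.

```python
def extract_items(payload: dict) -> list:
    """Return the token list from a lexer response, or raise if it reports an error."""
    if "error_code" in payload:
        raise RuntimeError(
            f"lexer call failed: {payload['error_code']} {payload.get('error_msg', '')}"
        )
    return payload["items"]


# Hypothetical payloads shaped like a success and a failure response.
ok_payload = {"text": "示例", "items": [{"item": "示例", "ne": "", "pos": "n"}]}
bad_payload = {"error_code": 110, "error_msg": "Access token invalid or no longer valid"}

print(extract_items(ok_payload))
try:
    extract_items(bad_payload)
except RuntimeError as exc:
    print(exc)
```

With the real call this would be used as `result = extract_items(response.json())`.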

Note: the result is a list of dictionaries, each holding the attributes of one token.
Data of this kind can be converted into a table with a DataFrame.

import pandas as pd

# result already holds the list stored under 'items', so pass it directly
df = pd.DataFrame(result)
df.head()

# For segmentation we generally need only the 'item', 'ne', and 'pos' columns, so select just those three
df_new = df[['item', 'ne', 'pos']]
df_new

# Keep person names (ne == 'PER') and common nouns (pos == 'n')
df_n = df_new[(df_new['ne'] == 'PER') | (df_new['pos'] == 'n')]
df_n

Count word frequencies in the filtered results

df_n_count = df_n['item'].value_counts()
df_n_count.head(20)
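To make the shape of the data concrete, here is a small sketch of the same DataFrame steps run on hand-written records. The records are invented for illustration and only mimic the `item`, `ne`, and `pos` fields; real responses carry more fields.

```python
import pandas as pd

# Invented records that mimic the lexer's per-token dictionaries.
sample_items = [
    {"item": "老罗", "ne": "PER", "pos": ""},
    {"item": "直播", "ne": "", "pos": "vn"},
    {"item": "手机", "ne": "", "pos": "n"},
    {"item": "老罗", "ne": "PER", "pos": ""},
    {"item": "便宜", "ne": "", "pos": "a"},
    {"item": "手机", "ne": "", "pos": "n"},
    {"item": "老罗", "ne": "PER", "pos": ""},
]

sample_df = pd.DataFrame(sample_items)[["item", "ne", "pos"]]
# Keep person names (ne == 'PER') and common nouns (pos == 'n').
sample_nouns = sample_df[(sample_df["ne"] == "PER") | (sample_df["pos"] == "n")]
sample_counts = sample_nouns["item"].value_counts()
print(sample_counts)
```

`value_counts()` returns a Series indexed by word and sorted by descending count, which is why `head(20)` gives the twenty most frequent words.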

Draw a word cloud from the segmentation results

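import pyecharts as pe

# Note: this is the pyecharts 0.5.x interface
wd = pe.WordCloud('Lao Luo Sales Word Cloud')  # chart title
words = df_n_count.index  # the words shown in the cloud
words_size = df_n_count.values  # the count of each word, which determines its size

# Draw the chart
wd.add("", words, words_size, shape='cardioid',
       word_size_range=[20, 100], rotate_step=10)
# Parameters: '' is the legend name, not needed here; words and words_size are the values assigned above
# shape is the shape of the cloud: 'circle', 'cardioid', 'diamond', 'triangle-forward', 'triangle', 'pentagon', or 'star'
# word_size_range is the range of word sizes; rotate_step is the rotation step for words, 45° by default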
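The call above follows the pyecharts 0.5.x interface. As a sketch resting on several assumptions, the same chart in the pyecharts 1.x style might look like the following: the chart class is imported from `pyecharts.charts`, the title moves to `set_global_opts`, and words and sizes are passed together as (word, count) pairs. The helper names, the output file name, and the sample counts are hypothetical.

```python
def to_data_pair(counts: dict) -> list:
    """Convert {word: count} into (word, int) pairs with plain Python types."""
    return [(str(word), int(count)) for word, count in counts.items()]


def render_word_cloud(counts: dict, title: str, path: str = "wordcloud.html") -> str:
    """Render a word cloud to an HTML file and return its path (needs pyecharts 1.x)."""
    # Imported here so the data preparation above works without pyecharts installed.
    from pyecharts import options as opts
    from pyecharts.charts import WordCloud

    chart = (
        WordCloud()
        .add("", to_data_pair(counts), word_size_range=[20, 100], shape="cardioid")
        .set_global_opts(title_opts=opts.TitleOpts(title=title))
    )
    return chart.render(path)


# Invented counts; with the Series from the article, pass df_n_count.to_dict().
pairs = to_data_pair({"老罗": 3, "手机": 2})
print(pairs)
```

Converting each count with `int()` is a deliberate choice: the counts in a pandas Series are NumPy integers, and plain Python integers are the safer input when the chart options are serialized.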

(Note: some of these words are synonyms and were not merged. In a real analysis they should be mapped to a single term before counting.)
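One way to handle the synonym issue, sketched under the assumption that a hand-maintained mapping is available, is to normalize each word to a canonical form before counting. The mapping and word list below are invented examples, and `merge_synonyms` is a hypothetical helper.

```python
from collections import Counter

# Invented synonym table: variant -> canonical form.
SYNONYMS = {
    "罗永浩": "老罗",
    "罗老师": "老罗",
}


def merge_synonyms(words, mapping):
    """Count words after replacing each variant with its canonical form."""
    return Counter(mapping.get(word, word) for word in words)


words = ["老罗", "罗永浩", "手机", "罗老师", "手机"]
merged = merge_synonyms(words, SYNONYMS)
print(merged.most_common())
```

With pandas, the same mapping can be applied through `df_n['item'].replace(SYNONYMS)` before calling `value_counts()`.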
References:
CSDN, 《利用百度的词法分析区分数据》 ("Using Baidu's lexical analysis to classify data")

