Text Processing Tools for Deep Learning (NLTK, Text-Processing, TextBlob, jieba)

Latest recommended article published on 2023-12-26 11:06:21

子耶


Category: DL. Tags: NLTK textblob jieba text-processing

Article link: https://blog.csdn.net/qq_36962569/article/details/80024927


1.1 NLTK

Text-processing features provided:
1. Classification (used relatively rarely)
2. Tokenization (splitting text into words)
3. Stemming
4. Tagging (e.g. part-of-speech tagging)
5. Parsing (syntactic analysis)
6. Semantic reasoning

Stemming and lemmatization:
1. Stemming

before
'And I also like eating apple'
after (a stem is not always a complete, meaningful word)
['and', 'I', 'also', 'like', 'eat', 'appl']

2. Lemmatization

before
'And I also like eating apple'
after (each word is restored to its base form, so the meaning is preserved)
['And', 'I', 'also', 'like', 'eat', 'apple']

Installation

pip install nltk
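
The stemming and lemmatization contrast in this section can be reproduced with a minimal sketch. It assumes NLTK's `PorterStemmer` and `WordNetLemmatizer` (the article does not say which classes produced its output) and uses a plain whitespace split in place of a real tokenizer. The lemmatizer additionally assumes the WordNet corpus has been fetched with `nltk.download('wordnet')`. The helper names `stem_tokens` and `lemmatize_tokens` are illustrative only.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer


def stem_tokens(sentence):
    """Reduce each whitespace-separated token to its Porter stem."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in sentence.split()]


def lemmatize_tokens(sentence):
    """Map each token to its dictionary form, treating every token as a verb.

    Assumption: the WordNet corpus is installed (nltk.download('wordnet')).
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos="v") for token in sentence.split()]


# Stemming needs no extra data; note that a stem may not be a real word.
print(stem_tokens("eating apple"))  # ['eat', 'appl']
```

Calling `lemmatize_tokens("eating apple")` after downloading WordNet should keep `apple` intact, which is the practical difference between the two techniques.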

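1.2 Text-Processing

Features provided:
1. Stemming & Lemmatization
2. Sentiment Analysis
3. Tagging and Chunk Extraction
4. Phrase Extraction & Named Entity Recognition

Usage

Accessing the Text Processing API with curl (Windows 10):
1. Download curl from https://curl.haxx.se/download.html
2. Extract the archive to a location of your choice
3. Open the bin folder inside the extracted folder; curl can be run from there
4. To run curl from any directory, add the path of the bin folder to the PATH environment variable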
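
The same API can also be reached from Python without curl. The sketch below is only an illustration under stated assumptions: it assumes the public sentiment endpoint `http://text-processing.com/api/sentiment/`, which accepts a form-encoded `text` field via POST and replies with JSON; the endpoint may have changed or be rate-limited. The function names `build_sentiment_request` and `fetch_sentiment` are hypothetical helpers, not part of any library.

```python
import json
from urllib import parse, request

# Assumption: the sentiment endpoint of the text-processing.com web API.
API_URL = "http://text-processing.com/api/sentiment/"


def build_sentiment_request(text, url=API_URL):
    """Build (but do not send) a form-encoded POST request carrying the text."""
    body = parse.urlencode({"text": text}).encode("utf-8")
    return request.Request(url, data=body, method="POST")


def fetch_sentiment(text, timeout=10):
    """Send the request and decode the JSON reply; requires network access."""
    req = build_sentiment_request(text)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request offline before sending anything.
req = build_sentiment_request("great movie")
print(req.get_method(), req.full_url, req.data)
```

Separating request construction from sending keeps the payload easy to inspect and lets the network call be skipped when the service is unavailable.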

1.3 TextBlob

Features provided:
1. Noun phrase extraction
2. Part-of-speech tagging
3. Sentiment analysis
4. Classification (Naive Bayes, Decision Tree)
5. Language translation and detection powered by Google Translate (deprecated in newer TextBlob releases)
6. Tokenization (splitting text into words and sentences)
7. Word and phrase frequencies
8. Word inflection (singularization and pluralization) and lemmatization
9. Spelling correction

Installation

pip install textblob

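1.4 jieba

Features provided:
1. Chinese word segmentation (including parallel segmentation and custom dictionaries)
2. Part-of-speech tagging
3. Keyword extraction

Installation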

pip install jieba

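Keyword extraction example

import jieba.analyse

path = 'ext.txt'
# Pass encoding='utf-8' explicitly; the platform default (e.g. gbk) may fail to decode the file
with open(path, 'r', encoding='utf-8') as file_in:
    content = file_in.read()

# The stop-word file must also be saved as UTF-8
jieba.analyse.set_stop_words('ChineseSplitWords.txt')
# Extract the top 100 keywords together with their weights
tags = jieba.analyse.extract_tags(content, topK=100, withWeight=True)
for word, weight in tags:
    # Weights are fractional; multiply by 10000 and truncate to print integers
    print(word + '\t' + str(int(weight * 10000)))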
