Python Data Analysis, Part 6: Text Data Analysis

Author: 从defeat

Category: Python Data Analysis. Tags: python, natural language processing

Article link: https://blog.csdn.net/weixin_37956420/article/details/104407495

1. NLTK, a Python Text Analysis Tool

NLTK (Natural Language Toolkit)

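NLTK is one of the most widely used Python libraries for NLP (natural language processing). It is an open-source project with built-in tokenization and classification features and strong community support.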

1.1 Installing NLTK

pip install nltk

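Next, install the corpora by running the following in a Python session started from the command line. If the download fails, the data packages can be downloaded offline and installed manually.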

import nltk
nltk.download()
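Calling nltk.download() with no arguments opens the interactive downloader. As a minimal sketch, assuming you only need specific packages, the helper below checks whether each resource is already installed and downloads only the missing ones. The function name ensure_nltk_resources and its parameters are hypothetical, introduced here for illustration; nltk.data.find and nltk.download are the NLTK calls it relies on.

```python
def ensure_nltk_resources(resources, download_dir=None):
    """Download each missing NLTK resource and return the ids that were fetched.

    `resources` is a list of (resource_path, package_id) pairs, for example
    ("corpora/brown", "brown"). This helper is a hypothetical illustration,
    not part of NLTK itself.
    """
    # Imported inside the function so that defining the helper
    # does not itself require NLTK to be installed.
    import nltk

    fetched = []
    for resource_path, package_id in resources:
        try:
            # Raises LookupError when the resource is not on nltk.data.path
            nltk.data.find(resource_path)
        except LookupError:
            nltk.download(package_id, download_dir=download_dir)
            fetched.append(package_id)
    return fetched
```

For instance, ensure_nltk_resources([("corpora/brown", "brown"), ("tokenizers/punkt", "punkt")]) would fetch the two packages used later in this article. For an offline install, the unpacked data directory can be made visible by appending its location to the nltk.data.path list.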

1.2 Text Preprocessing

1.2.1 Tokenization

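Tokenization splits a sentence into linguistically meaningful words. English words can be separated on spaces, but Chinese has no such delimiter, so segmentation is harder and calls for a dedicated Chinese segmentation tool such as jieba (结巴分词). Special characters can be handled with regular expressions.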
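To illustrate the point about regular expressions, here is a minimal sketch that uses only the standard library re module. The pattern and the helper name simple_tokenize are assumptions made for this example, not part of NLTK or jieba; the pattern keeps hyphens and apostrophes inside words and drops other special characters.

```python
import re

# Hypothetical pattern for illustration: runs of letters/digits,
# optionally joined by internal hyphens or apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def simple_tokenize(text):
    """Return word tokens, discarding punctuation and other special characters."""
    return TOKEN_PATTERN.findall(text)


sentence = "Python is a widely used high-level programming language!!! #nlp @user"
tokens = simple_tokenize(sentence)
print(tokens)
```

This simple approach only suits space-delimited text; for real English text a tokenizer such as nltk.word_tokenize handles many more cases, and Chinese text still needs a segmentation tool.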

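import nltk
from nltk.corpus import brown
# The Brown corpus (compiled at Brown University) must be downloaded first,
# e.g. with nltk.download('brown')

# List the categories in the corpus
print(brown.categories())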

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

# Inspect the size of the Brown corpus
print('{} sentences in total'.format(len(brown.sents())))
print('{} words in total'.format(len(brown.words())))

57340 sentences in total
1161192 words in total

sentence = "Python is a widely used high-level programming language for general-purpose programming."
tokens = nltk.word_tokenize(sentence) # requires the punkt tokenizer model (punkt_tab in recent NLTK releases)
print(tokens)

['Python', 'is', 'a', 'widely', 'used', 'high-level', 'programming', 'language', 'for', 'general-purpose', 'programming', '.']

Chinese word segmentation with jieba

# Install with: pip install jieba
import jieba

seg_list = jieba.cut("欢迎进入大学", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode: lists every possible word

seg_list = jieba.cut("欢迎进入大学", cut_all=False)
print("Precise mode: " + "/ ".join(seg_list))  # precise mode: the most likely segmentation

1.2.2 Word Form Normalization

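English words such as "look", "looked", and "looking" appear in different inflected forms depending on context. Treating them as unrelated words distorts corpus statistics, so word forms need to be normalized, typically by stemming or by lemmatization.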
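To see why normalization matters, the sketch below counts word forms before and after normalization. It uses a toy suffix stripper written for this example only (the name naive_normalize and the suffix list are assumptions); it is not the Porter algorithm and would mishandle many real words.

```python
from collections import Counter

# Toy suffix list for illustration only; NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def naive_normalize(word):
    """Lowercase the word and strip the first matching suffix, keeping at least three letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


tokens = ["look", "looked", "looking", "looks", "Look"]

raw_counts = Counter(tokens)
normalized_counts = Counter(naive_normalize(t) for t in tokens)

print(len(raw_counts))       # number of distinct forms before normalization
print(normalized_counts)     # counts after all forms are mapped to one key
```

Before normalization the five tokens count as five different words; afterwards they collapse into a single entry, which is the effect that stemming and lemmatization aim for on real text.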

Stemming

# PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
