[Python Study Notes]: Text Processing Examples (Part 2)

Latest recommended article published on 2024-03-09 14:12:32

姜子牙大侠

Category: python  Tags: python, programming language

Article link: https://blog.csdn.net/Jiangziyadizi/article/details/129133936

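This article walks through applications of Python in text processing: creating word clouds and lexical dispersion plots with NLTK, converting text to numbers and building document-term matrices with CountVectorizer and TF-IDF, generating N-grams, using TextBlob to extract noun phrases, run sentiment analysis, and translate and detect languages, and computing word-word co-occurrence matrices, among other NLP examples.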

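Text processing is a very common task in Python. This article collects a range of text extraction and NLP examples, split across two posts. It is worth bookmarking, since these snippets come up often.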

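Creating a word cloud from a corpus
NLTK lexical dispersion plot
Converting text to numbers with CountVectorizer
Creating a document-term matrix with TF-IDF
Generating N-grams for a given sentence
Specifying a vocabulary with bigrams in sklearn CountVectorizer
Extracting noun phrases with TextBlob
How to compute a word-word co-occurrence matrix
Sentiment analysis with TextBlob
Language translation with Goslate
Language detection and translation with TextBlob
Getting definitions and synonyms with TextBlob
Getting a list of antonyms with TextBlob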

13 Creating a word cloud from a corpus

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
# 'testing.txt' is not one of the files shipped with the webtext corpus;
# 'firefox.txt' is used here as sample data
wt_words = webtext.words('firefox.txt')
data_analysis = FreqDist(wt_words)

# Keep only words longer than three characters
filter_words = {word: count for word, count in data_analysis.items() if len(word) > 3}

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plot the word cloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
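
The frequency-filtering step above can be tried without downloading a corpus. The following is a minimal sketch using only the standard library; the helper name `filter_frequencies`, its `min_len` parameter, and the sample sentence are illustrative assumptions, not part of the original post.

```python
from collections import Counter

def filter_frequencies(tokens, min_len=4):
    """Count tokens and keep those with at least min_len characters.

    Hypothetical helper: mirrors the `len(m) > 3` filter used above.
    """
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if len(word) >= min_len}

# Made-up sample tokens standing in for a corpus
tokens = "the data team cleans data and the data team plots data".split()
freqs = filter_frequencies(tokens)
print(freqs)
```

The resulting dict maps each word to its count, which is the shape of input that `WordCloud().generate_from_frequencies` takes in the code above.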

14 NLTK lexical dispersion plot

import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

# Words to locate in the text
words = ['data', 'science', 'dataset']

nltk.download('webtext')
# 'testing.txt' is not one of the files shipped with the webtext corpus;
# 'firefox.txt' is used here as sample data
wt_words = webtext.words('firefox.txt')
 
points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]
 
if points:
    x, y = zip(*points)
else:
    x = y = ()
 
plt.plot(x, y, "rx")
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
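
The point list that drives the plot can also be built in a single pass over the tokens. This is a sketch under the assumption that the target words are unique; `dispersion_points` and the sample sentence are hypothetical names and data introduced here for illustration.

```python
def dispersion_points(tokens, targets):
    """Return (token offset, target row) pairs for every target occurrence.

    Hypothetical helper: a dict lookup replaces the nested loop over
    every (token, target) combination.
    """
    row = {word: i for i, word in enumerate(targets)}
    return [(offset, row[tok]) for offset, tok in enumerate(tokens) if tok in row]

# Made-up sample tokens standing in for a corpus
tokens = "data science needs data and a dataset".split()
points = dispersion_points(tokens, ['data', 'science', 'dataset'])
print(points)
```

Each pair is an x (word offset) and y (row of the target word) coordinate, so `zip(*points)` yields the two sequences passed to `plt.plot`.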

15 Converting text to numbers with CountVectorizer
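
Before the scikit-learn code, it may help to see what "converting text to numbers" amounts to. The sketch below is a pure-Python approximation, not the library's implementation: it assumes lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary, which matches the documented defaults of `CountVectorizer`. The function name `count_vectorize` and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

def count_vectorize(docs):
    """Build a sorted vocabulary and a per-document term-count matrix.

    Hypothetical helper approximating a bag-of-words count matrix.
    """
    # Lowercase, then keep tokens made of at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

vocab, matrix = count_vectorize([
    "Go is compiled",
    "Python is a language, Python is dynamic",
])
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the matrix grows wide and sparse as the vocabulary grows.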

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."
 
df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})
 
# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.