This article collects 25 Python text-processing examples worth bookmarking. Text processing is one of the most common tasks in Python, so these snippets are likely to come in handy sooner or later.
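Contents
1. Extract PDF content
2. Extract Word content
3. Extract web page content
4. Read JSON data
5. Read CSV data
6. Remove punctuation from a string
7. Remove stop words with NLTK
8. Correct spelling with TextBlob
9. Word tokenization with NLTK and TextBlob
10. Extract a list of word stems from a sentence or phrase with NLTK
11. Lemmatize a sentence or phrase with NLTK
12. Find the frequency of each word in a text file with NLTK
13. Create a word cloud from a corpus
14. NLTK lexical dispersion plot
15. Convert text to numbers with CountVectorizer
16. Create a document-term matrix with TF-IDF
17. Generate N-grams for a given sentence
18. Specify a vocabulary with bigrams in sklearn CountVectorizer
19. Extract noun phrases with TextBlob
20. Compute a word-word co-occurrence matrix
21. Sentiment analysis with TextBlob
22. Language translation with Goslate
23. Language detection and translation with TextBlob
24. Get definitions and synonyms with TextBlob
25. Get a list of antonyms with TextBlob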
1. Extract PDF content
# pip install PyPDF2
# PyPDF2 3.x API: PdfReader replaces the older PdfFileReader.
from PyPDF2 import PdfReader

# Open the PDF file in binary mode; the file is closed when the block exits.
with open("test.pdf", "rb") as pdf:
    # Create a PDF reader object.
    pdf_reader = PdfReader(pdf)

    # Check the total number of pages in the PDF file.
    print("Total number of Pages:", len(pdf_reader.pages))

    # Get a page object. The index is zero-based and must be smaller
    # than the page count, otherwise an IndexError is raised.
    page = pdf_reader.pages[200]

    # Extract the text from that page.
    print(page.extract_text())
2. Extract Word content
# pip install python-docx
import docx
from docx.opc.exceptions import PackageNotFoundError


def main():
    try:
        # Create a Word reader object.
        doc = docx.Document('test.docx')
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        # Join the paragraphs once, after the loop.
        data = '\n'.join(fullText)
        print(data)
    except (IOError, PackageNotFoundError):
        # python-docx raises PackageNotFoundError for a missing or invalid file.
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()
3. Extract web page content
# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request(
    'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
    headers={'User-Agent': 'Mozilla/5.0'},
)
webpage = urlopen(req).read()

# Parse the HTML.
soup = BeautifulSoup(webpage, 'html.parser')

# Format the parsed HTML.
strhtm = soup.prettify()

# Print the first 500 characters.
print(strhtm[:500])

# Extract the title and a meta tag value.
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag values.
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values.
for x in soup.find_all('p'):
    print(x.text)
4. Read JSON data
# pip install requests
import json

import requests

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract the content of a specific node.
print(res['quiz']['sport'])

# Dump the data as a string.
data = json.dumps(res)
print(data)
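When the JSON text is already in memory (for example, read from a local file), the standard `json` module can parse it without any network request. The following is a minimal sketch; the payload and its field names are hypothetical and only mimic the `quiz` / `sport` nesting used above.

```python
import json

# Hypothetical payload; the keys and values are illustrative assumptions.
RAW = '{"quiz": {"sport": {"q1": {"question": "Which team won?", "answer": "Team A"}}}}'

# Parse the JSON string into nested dictionaries.
res = json.loads(RAW)

# Walk down to a specific node.
sport = res["quiz"]["sport"]
print(sport["q1"]["answer"])

# Serialize back to a string, indented for readability.
pretty = json.dumps(res, indent=2)
print(pretty)
```

`json.loads` works on strings, while `json.load` reads from an open file object, so the same lookup code applies to either source.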
5. Read CSV data
import csv

# newline='' lets the csv module handle line endings itself.
with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the first (header) row
    for row in reader:
        print(row)
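If the header row should be used rather than skipped, `csv.DictReader` maps each row to a dictionary keyed by column name. The following is a minimal sketch under stated assumptions: the column names `name` and `score` are hypothetical, and an in-memory `io.StringIO` stands in for `test.csv` so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample data standing in for test.csv (column names are assumptions).
SAMPLE_CSV = "name,score\nAlice,90\nBob,85\n"


def read_rows(text):
    """Return each CSV row as a dict keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = read_rows(SAMPLE_CSV)
for row in rows:
    print(row["name"], row["score"])
```

All values come back as strings, so numeric columns still need an explicit conversion such as `int(row["score"])`.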
6. Remove punctuation from a string
import re
import string

data = "Stuning even for the non - gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the listed special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object that deletes every character in string.punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
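The two methods above do not remove the same characters: the regex only strips the characters listed in its bracket class, while `translate()` with `string.punctuation` strips all ASCII punctuation, including hyphens and apostrophes. A small sketch of the second approach wrapped in a helper; the function name `strip_punctuation` and the sample sentence are illustrative assumptions.

```python
import string


def strip_punctuation(text):
    """Remove every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


cleaned = strip_punctuation("Hello, world! It's 5 o'clock.")
print(cleaned)  # Hello world Its 5 oclock
```

Because apostrophes are removed too, contractions such as "It's" collapse to "Its"; the regex variant is the better fit when only a chosen subset of characters should go.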
7. Remove stop words with NLTK
from nltk.corpus import stopwords

data = ['Stuning even for the non - gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a