This article collects 25 Python text processing examples worth bookmarking. Text processing is one of the most common jobs Python gets used for, so these snippets are worth keeping around; you will end up needing them.
Contents
1. Extract PDF content
2. Extract Word content
3. Extract webpage content
4. Read JSON data
5. Read CSV data
6. Remove punctuation from a string
7. Remove stop words with NLTK
8. Correct spelling with TextBlob
9. Word tokenization with NLTK and TextBlob
10. Stem the words of a sentence or phrase with NLTK
11. Lemmatize a sentence or phrase with NLTK
12. Find the frequency of each word in a text file with NLTK
13. Build a word cloud from a corpus
14. NLTK lexical dispersion plot
15. Convert text to numbers with CountVectorizer
16. Build a document-term matrix with TF-IDF
17. Generate N-grams for a given sentence
18. sklearn CountVectorizer vocabulary with bigrams
19. Extract noun phrases with TextBlob
20. Compute a word-word co-occurrence matrix
21. Sentiment analysis with TextBlob
22. Language translation with Goslate
23. Language detection and translation with TextBlob
24. Get definitions and synonyms with TextBlob
25. Get a list of antonyms with TextBlob
1. Extract PDF content

# pip install PyPDF2
import PyPDF2

# Creating a pdf file object.
pdf = open("test.pdf", "rb")

# Creating a pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking the total number of pages in the pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Creating a page object.
page = pdf_reader.getPage(200)

# Extract text from that specific page.
print(page.extractText())

# Closing the file object.
pdf.close()
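Note that PdfFileReader, numPages, getPage, and extractText belong to the legacy PyPDF2 API and no longer work in PyPDF2 3.x, where the project was renamed pypdf. A minimal sketch of the same steps with the current pypdf API, assuming the same local test.pdf:

# pip install pypdf
from pypdf import PdfReader

# Build a reader straight from the file path.
reader = PdfReader("test.pdf")

# Total number of pages.
print("Total number of Pages:", len(reader.pages))

# Extract text from a specific page (pages are 0-indexed).
page = reader.pages[0]
print(page.extract_text())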
2. Extract Word content

# pip install python-docx
import docx

def main():
    try:
        # Creating a word reader object.
        doc = docx.Document('test.docx')

        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
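doc.paragraphs only covers body text; text inside Word tables is exposed separately through doc.tables. A short sketch with the same library, assuming test.docx contains at least one table:

import docx

doc = docx.Document('test.docx')

# Walk every table, row, and cell; cell.text holds the cell's text.
for table in doc.tables:
    for row in table.rows:
        print('\t'.join(cell.text for cell in row.cells))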
3. Extract webpage content

# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parsing
soup = BeautifulSoup(webpage, 'html.parser')

# Formatting the parsed html
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag values
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag values
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values
for x in soup.find_all('p'):
    print(x.text)
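One quirk worth knowing: tag.string is None whenever an anchor wraps other tags, so the anchor loop above can print a stream of None values. A small sketch of a variant that pulls the visible text plus the link target instead:

# get_text() flattens nested tags; get('href') reads the attribute safely.
for a in soup.find_all('a'):
    print(a.get_text(strip=True), '->', a.get('href'))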
4. Read JSON data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract a specific node's content.
print(res['quiz']['sport'])

# Dump the data as a string
data = json.dumps(res)
print(data)
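The same structure parses just as easily from a file on disk: json.load deserializes directly from a file object. A minimal sketch, assuming a local example_2.json with the same layout as the file at the URL above:

import json

with open('example_2.json', 'r') as f:
    res = json.load(f)

# Same node access as above.
print(res['quiz']['sport'])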
5. Read CSV data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the header row
    for row in reader:
        print(row)
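When the first row is a header, csv.DictReader consumes it automatically and lets you address columns by name, so the next() call disappears. A sketch assuming test.csv has a header row (the 'name' column here is hypothetical):

import csv

with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)  # The first row becomes the keys.
    for row in reader:
        print(row['name'])  # 'name' stands in for a real header from test.csv.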
6. Remove punctuation from a string

import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
7. Remove stop words with NLTK

# pip install nltk, then download the stop word corpus once:
# import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords

data = ['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!']

# Build the English stop word set, then keep only the non-stop words.
stop_words = set(stopwords.words('english'))
output = []
for sentence in data:
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    output.append(' '.join(kept))
print(output)