(Roundup) 25 Python Text-Processing Examples Worth Bookmarking

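This article collects 25 Python text-processing examples, covering PDF and Word content extraction, reading JSON and CSV data, text cleaning, and NLP tasks such as word clouds, word-frequency analysis, and sentiment analysis. It is intended as a reference for developers.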

Today's post rounds up 25 Python text-processing examples worth bookmarking. Text processing is one of the most common tasks in Python, so keep this list handy; you will need it sooner or later.

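Contents

1 Extract PDF content
2 Extract Word content
3 Extract web page content
4 Read JSON data
5 Read CSV data
6 Remove punctuation from a string
7 Remove stop words with NLTK
8 Correct spelling with TextBlob
9 Word tokenization with NLTK and TextBlob
10 Stem the words of a sentence or phrase with NLTK
11 Lemmatize a sentence or phrase with NLTK
12 Find the frequency of each word in a text file with NLTK
13 Create a word cloud from a corpus
14 NLTK lexical dispersion plot
15 Convert text to numbers with CountVectorizer
16 Create a document-term matrix with TF-IDF
17 Generate N-grams for a given sentence
18 Vocabulary specification with sklearn CountVectorizer and bigrams
19 Extract noun phrases with TextBlob
20 How to compute a word-word co-occurrence matrix
21 Sentiment analysis with TextBlob
22 Language translation with Goslate
23 Language detection and translation with TextBlob
24 Get definitions and synonyms with TextBlob
25 Get a list of antonyms with TextBlob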

1 Extract PDF content

# pip install PyPDF2  (version 3.x; the older PdfFileReader API was removed)
from PyPDF2 import PdfReader

# Open the PDF in binary mode; the with-block closes it automatically.
with open("test.pdf", "rb") as pdf:
    # Create a PDF reader object.
    pdf_reader = PdfReader(pdf)

    # Check the total number of pages in the PDF file.
    num_pages = len(pdf_reader.pages)
    print("Total number of Pages:", num_pages)

    # Pick a page by zero-based index, clamped to the last page
    # so that short documents do not raise an IndexError.
    page_index = min(200, num_pages - 1)
    page = pdf_reader.pages[page_index]

    # Extract the text from that page.
    print(page.extract_text())

2 Extract Word content

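# pip install python-docx
import docx
from docx.opc.exceptions import PackageNotFoundError


def main():
    try:
        doc = docx.Document('test.docx')  # Create the Word reader object.
    except (OSError, PackageNotFoundError):
        # python-docx raises PackageNotFoundError for a missing or invalid file.
        print('There was an error opening the file!')
        return

    # Collect the text of every paragraph, then join once after the loop.
    full_text = [para.text for para in doc.paragraphs]
    data = '\n'.join(full_text)
    print(data)


if __name__ == '__main__':
    main()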

3 Extract web page content

# pip install beautifulsoup4  (provides the bs4 module)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()

# Parse the HTML.
soup = BeautifulSoup(webpage, 'html.parser')

# Format the parsed HTML.
strhtm = soup.prettify()

# Print the first 500 characters.
print(strhtm[:500])

# Extract the title and a meta tag value.
print(soup.title.string if soup.title else None)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag text.
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag text.
for x in soup.find_all('p'):
    print(x.text)
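For trying the same kind of extraction without a network connection or a third-party package, the sketch below uses the standard library's html.parser instead of BeautifulSoup. It is a swapped-in technique, not the approach used in the listing; the class name TitleAndLinkParser and the SAMPLE_HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the <title> text and every <a href="..."> value."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self.in_title:
            self.title += data


# Hypothetical page content, standing in for a downloaded document.
SAMPLE_HTML = """<html><head><title>Demo page</title></head>
<body><p>Hello</p><a href="/one">One</a><a href="/two">Two</a></body></html>"""

parser = TitleAndLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.links)
```

BeautifulSoup remains the more convenient choice for real pages, since it tolerates malformed markup and offers search helpers such as find_all.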

4 Read JSON data

# pip install requests
import json
import requests

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
r.raise_for_status()  # Stop early on an HTTP error response.
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump the data as a string.
data = json.dumps(res)
print(data)
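Indexing with res['quiz']['sport'] raises a KeyError as soon as one level is missing. The sketch below shows one way to walk nested keys defensively. The RAW payload is a hypothetical stand-in shaped like the lookup above, not the contents of the real file, and get_path is a helper name invented for this example.

```python
import json

# Hypothetical payload shaped like the nested "quiz" -> "sport" lookup.
RAW = '{"quiz": {"sport": {"q1": {"question": "Sample question?", "answer": "Sample answer"}}}}'


def get_path(doc, *keys, default=None):
    """Walk nested dicts key by key; return default if any key is missing."""
    current = doc
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


doc = json.loads(RAW)
sport = get_path(doc, "quiz", "sport")
missing = get_path(doc, "quiz", "history", default={})

# Pretty-print the node that was found.
print(json.dumps(sport, indent=2, ensure_ascii=False))
print(missing)
```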

5 Read CSV data

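import csv

# newline='' lets the csv module handle line endings itself.
with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    next(reader, None)  # Skip the header row (no error if the file is empty)
    for row in reader:
        print(row)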
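When the first row holds column names, csv.DictReader can use it as keys instead of skipping it. A minimal sketch follows, with an in-memory string standing in for test.csv; the column names name and score are made up for illustration.

```python
import csv
import io

# Hypothetical CSV content standing in for test.csv.
SAMPLE = "name,score\nalice,90\nbob,85\n"


def read_rows(text_stream):
    """Return each data row as a dict keyed by the header row."""
    return list(csv.DictReader(text_stream))


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    # Every value is read as a string; convert where numbers are needed.
    print(row["name"], int(row["score"]))
```

The same function works on a real file by passing the object returned from open('test.csv', newline='').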

6 Remove punctuation from a string

import re
import string

# Sample review text (misspellings are left as in the source).
data = "Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the listed special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object that deletes all ASCII punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
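The two methods do not strip the same characters: the regex removes only the symbols listed in its character class, while string.punctuation covers all ASCII punctuation, including hyphens and apostrophes. A small sketch on a made-up sample string shows the difference.

```python
import re
import string

# Short made-up sample containing a hyphen, a colon, an apostrophe and "!".
sample = "non-gamer: it's great!"

# Method 1: only the characters listed in the class are removed.
regex_result = re.sub(r'[!#?,.:";]', '', sample)

# Method 2: every character in string.punctuation is removed.
translate_result = sample.translate(str.maketrans('', '', string.punctuation))

print(regex_result)      # non-gamer it's great
print(translate_result)  # nongamer its great
```

Which one fits depends on whether hyphenated words and contractions should survive the cleaning step.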

7 Remove stop words with NLTK

from nltk.corpus import stopwords

data = ['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a
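The listing above breaks off mid-string in the source, so the author's own stop-word code is not available. As a stand-in, here is a minimal sketch of the general idea, not a reconstruction of the original. It assumes NLTK's English stop-word corpus has been downloaded (for example with nltk.download('stopwords')) and otherwise falls back to FALLBACK_STOPWORDS, a tiny hand-written set invented for this example. The helper names load_stopwords and remove_stopwords are also hypothetical.

```python
import re

# Tiny fallback list (an assumption for illustration; NLTK's list is far longer).
FALLBACK_STOPWORDS = {"a", "an", "the", "is", "it", "of", "to", "and",
                      "in", "was", "for", "this"}


def load_stopwords():
    """Return NLTK's English stop words, or the fallback set if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed.
        # LookupError: the stopwords corpus has not been downloaded.
        return FALLBACK_STOPWORDS


def remove_stopwords(text, stop_words):
    """Split text into word tokens and drop those found in stop_words."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in stop_words]


stop_words = load_stopwords()
print(remove_stopwords("This sound track was beautiful", stop_words))
```

The regex tokenizer keeps the sketch dependency-free; nltk.word_tokenize is the usual choice when the NLTK tokenizer data is installed.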
