This article collects 25 Python text processing examples worth bookmarking. Text processing is one of the most common jobs Python gets used for, so these snippets are worth keeping around; you will end up needing them.
Contents
1. Extract PDF content
2. Extract Word content
3. Extract webpage content
4. Read JSON data
5. Read CSV data
6. Remove punctuation from a string
7. Remove stop words with NLTK
8. Correct spelling with TextBlob
9. Word tokenization with NLTK and TextBlob
10. Stem the words of a sentence or phrase with NLTK
11. Lemmatize a sentence or phrase with NLTK
12. Find the frequency of each word in a text file with NLTK
13. Build a word cloud from a corpus
14. NLTK lexical dispersion plot
15. Convert text to numbers with CountVectorizer
16. Build a document-term matrix with TF-IDF
17. Generate N-grams for a given sentence
18. sklearn CountVectorizer vocabulary with bigrams
19. Extract noun phrases with TextBlob
20. Compute a word-word co-occurrence matrix
21. Sentiment analysis with TextBlob
22. Language translation with Goslate
23. Language detection and translation with TextBlob
24. Get definitions and synonyms with TextBlob
25. Get a list of antonyms with TextBlob
1. Extract PDF content

# pip install PyPDF2
import PyPDF2

# Creating a pdf file object.
pdf = open("test.pdf", "rb")

# Creating a pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking the total number of pages in the pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Creating a page object.
page = pdf_reader.getPage(200)

# Extract text from that specific page.
print(page.extractText())

# Closing the file object.
pdf.close()
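Note that PdfFileReader, numPages, getPage, and extractText belong to the legacy PyPDF2 API and no longer work in PyPDF2 3.x, where the project was renamed pypdf. A minimal sketch of the same steps with the current pypdf API, assuming the same local test.pdf:

# pip install pypdf
from pypdf import PdfReader

# Build a reader straight from the file path.
reader = PdfReader("test.pdf")

# Total number of pages.
print("Total number of Pages:", len(reader.pages))

# Extract text from a specific page (pages are 0-indexed).
page = reader.pages[0]
print(page.extract_text())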
2. Extract Word content

# pip install python-docx
import docx

def main():
    try:
        # Creating a word reader object.
        doc = docx.Document('test.docx')

        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
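doc.paragraphs only covers body text; text inside Word tables is exposed separately through doc.tables. A short sketch with the same library, assuming test.docx contains at least one table:

import docx

doc = docx.Document('test.docx')

# Walk every table, row, and cell; cell.text holds the cell's text.
for table in doc.tables:
    for row in table.rows:
        print('\t'.join(cell.text for cell in row.cells))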
3. Extract webpage content

# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parsing
soup = BeautifulSoup(webpage, 'html.parser')

# Formatting the parsed html
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag values
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag values
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag values
for x in soup.find_all('p'):
    print(x.text)
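One quirk worth knowing: tag.string is None whenever an anchor wraps other tags, so the anchor loop above can print a stream of None values. A small sketch of a variant that pulls the visible text plus the link target instead:

# get_text() flattens nested tags; get('href') reads the attribute safely.
for a in soup.find_all('a'):
    print(a.get_text(strip=True), '->', a.get('href'))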
4. Read JSON data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract a specific node's content.
print(res['quiz']['sport'])

# Dump the data as a string
data = json.dumps(res)
print(data)
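The same structure parses just as easily from a file on disk: json.load deserializes directly from a file object. A minimal sketch, assuming a local example_2.json with the same layout as the file at the URL above:

import json

with open('example_2.json', 'r') as f:
    res = json.load(f)

# Same node access as above.
print(res['quiz']['sport'])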
5. Read CSV data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the header row
    for row in reader:
        print(row)
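When the first row is a header, csv.DictReader consumes it automatically and lets you address columns by name, so the next() call disappears. A sketch assuming test.csv has a header row (the 'name' column here is hypothetical):

import csv

with open('test.csv', 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)  # The first row becomes the keys.
    for row in reader:
        print(row['name'])  # 'name' stands in for a real header from test.csv.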
6. Remove punctuation from a string

import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
7. Remove stop words with NLTK

# pip install nltk, then download the stop word corpus once:
# import nltk; nltk.download('stopwords')
from nltk.corpus import stopwords

data = ['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!']

# Build the English stop word set, then keep only the non-stop words.
stop_words = set(stopwords.words('english'))
output = []
for sentence in data:
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    output.append(' '.join(kept))
print(output)