Python Web Scraping Study, Day 25

Latest recommended article published 2024-04-06 23:41:57

可惜没有如果

Category: Study Notes. Tags: python

Article link: https://blog.csdn.net/qq_34194478/article/details/77387741


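First, the previous chapter had one more section on sending email with Python, but I couldn't get an SMTP (Simple Mail Transfer Protocol) client configured on my computer, so I couldn't do the exercises for that part.

Now on to the next chapter: reading documents.
This chapter focuses on document handling, including downloading files into a folder and reading documents to extract data. It also covers the different document encodings, so that a program can read non-English HTML pages.

Different websites may use different encodings to turn text into bits, so choosing the right way to decode matters a lot: decoding with the wrong encoding produces a string that looks nothing like the original.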
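To make the point about wrong decoding concrete, here is a small sketch that needs no network access. The sample phrase and the choice of `latin-1` as the "wrong" encoding are my own assumptions for illustration; the same bytes are decoded twice and only one result is readable.

```python
# Sample text chosen for illustration (not taken from any downloaded page)
text = "Война и мир"

# Encode once: this is what travels over the network
raw = text.encode("utf-8")

# Decode the same bytes two ways
right = raw.decode("utf-8")    # matches the encoding used above
wrong = raw.decode("latin-1")  # wrong guess: every byte becomes one character

print(right)
print(wrong)
```

`latin-1` never raises an error because it maps every possible byte to a character, which is exactly why a wrong guess can go unnoticed until someone reads the output.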

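Exercise 1: How an English site and a non-English site differ when read the default way
Running the code makes the difference obvious: the output of the first snippet is unreadable. The page is in Russian, but read() returns raw bytes, and printing bytes shows every non-ASCII byte as an escape sequence instead of a character. The default handling only gives legible output for English text.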

from urllib.request import urlopen

# Russian text: read() returns raw bytes, so print() shows them undecoded
html = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
print(html.read())

from urllib.request import urlopen

# English text, read the same way for comparison
html = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
print(html.read())

After decoding the bytes as UTF-8, the correct content shows up:

from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
# Decode the raw bytes as UTF-8 before printing
print(str(html.read(),'utf-8'))

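from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
# Name the parser explicitly so bs4 does not have to guess one
bsObj = BeautifulSoup(html, "html.parser")
content = bsObj.find('div',{'id':'mw-content-text'}).get_text()
# Encode the extracted text to UTF-8 bytes, then decode it back for printing
content = bytes(content,'utf-8')
print(content.decode('utf-8'))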

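If you do a lot of web scraping, especially on international sites, it is a good idea to look at the page's meta tag first and read the content with the encoding the site itself recommends.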
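As a rough sketch of that advice, the snippet below looks for a charset declaration in a page's meta tag and uses it to decode the bytes. The helper names `detect_meta_charset` and `decode_html`, the regular expression, and the sample page are all my own assumptions for illustration; a real page may declare its encoding in other ways (for example in the HTTP headers), which this sketch does not handle.

```python
import re

# Hypothetical pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form
META_CHARSET = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?\s*([\w-]+)''', re.IGNORECASE
)

def detect_meta_charset(raw, default="utf-8"):
    """Return the charset named in a meta tag, or `default` if none is found."""
    match = META_CHARSET.search(raw[:2048])  # only scan the start of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return default

def decode_html(raw):
    """Decode page bytes using the declared charset, falling back to UTF-8."""
    encoding = detect_meta_charset(raw)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:  # the declared name is not a codec Python knows
        return raw.decode("utf-8", errors="replace")

# Invented sample page, encoded the way its meta tag says
sample = (
    '<html><head><meta charset="windows-1251"></head>'
    "<body>Война и мир</body></html>"
).encode("windows-1251")

print(detect_meta_charset(sample))
print(decode_html(sample))
```

With a real response, the bytes from `urlopen(...).read()` would take the place of `sample`.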

Exercise 3: Fetch a CSV file from the web and print every row to the command line

from urllib.request import urlopen
from io import StringIO
import csv

# Download the file and decode it, dropping any non-ASCII bytes
data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode("ascii","ignore")
# Wrap the string so csv.reader can treat it like an open file
dataFile = StringIO(data)
csvFile = csv.reader(dataFile)
for row in csvFile:
    print(row)

The output looks like this:

['Name', 'Year']
["Monty Python's Flying Circus", '1970']
['Another Monty Python Record', '1971']
["Monty Python's Previous Record", '1972']
['The Monty Python Matching Tie and Handkerchief', '1973']
['Monty Python Live at Drury Lane', '1974']
['An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
['Monty Python Live at City Center', '1977']
['The Monty Python Instant Record Collection', '1977']
["Monty Python's Life of Brian", '1979']
["Monty Python's Cotractual Obligation Album", '1980']
["Monty Python's The Meaning of Life", '1983']
['The Final Rip Off', '1987']
['Monty Python Sings', '1989']
['The Ultimate Monty Python Rip Off', '1994']
['Monty Python Sings Again', '2014']

As the output shows, the reader object returned by csv.reader is iterable, and each item it yields is a Python list holding one row.
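Because the reader is just an iterator of lists, the header row can be peeled off with `next()` before looping over the data. This is a small offline sketch; the sample string reuses two rows from the output above rather than downloading the file, and the variable names are my own.

```python
import csv
from io import StringIO

# Two rows copied from the output above, used here instead of a download
sample = """Name,Year
Monty Python's Flying Circus,1970
Another Monty Python Record,1971
"""

reader = csv.reader(StringIO(sample))
header = next(reader)            # first row: the column names
rows = [row for row in reader]   # remaining rows, each a list of strings

print(header)
for name, year in rows:
    # Every field arrives as a string, so convert numbers explicitly
    print(int(year), name)
```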

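There is another kind of reader, DictReader: csv.DictReader returns each row of the CSV file as a Python dictionary, keyed by the field names taken from the header row.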

from urllib.request import urlopen
from io import StringIO
import csv

data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode("ascii","ignore")
dataFile = StringIO(data)
dictReader = csv.DictReader(dataFile)
print(dictReader.fieldnames)
for row in dictReader:
    print(row)
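The practical benefit of the dictionary rows is that a column can be picked by name instead of by position. A minimal offline sketch, assuming a three-row sample taken from the album list (the cutoff year 1989 is an arbitrary choice for illustration):

```python
import csv
from io import StringIO

# Three rows from the album list, used here instead of a download
sample = """Name,Year
Monty Python's Flying Circus,1970
Monty Python Sings,1989
Monty Python Sings Again,2014
"""

dict_reader = csv.DictReader(StringIO(sample))
albums = list(dict_reader)

# Select by column name; values are strings, so convert before comparing
recent = [row["Name"] for row in albums if int(row["Year"]) >= 1989]

print(dict_reader.fieldnames)
print(recent)
```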

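That covers text files; next up is reading PDF documents. The Python 3.x standard library has no PDF support, so a third-party library is needed. I use PDFMiner, which I installed from its Python source.

Exercise 4: Read any PDF into a string, then use StringIO to turn it into a file object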

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def readPDF(PDFfile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, PDFfile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outPutString = readPDF(pdfFile)
print(outPutString)
pdfFile.close()

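This example uses quite a few objects from pdfminer. Getting a solid grasp of pdfminer means reading its documentation; for lack of time I'm moving on for now.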
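The second half of the exercise, turning the extracted string into a file object, can be tried without pdfminer at all. In this sketch the `extracted` string is invented text standing in for whatever a PDF reader returns; only `StringIO` itself comes from the exercise.

```python
from io import StringIO

# Invented stand-in for the string a PDF extractor would return
extracted = "CHAPTER I\n\nFirst line of the chapter.\nSecond line of the chapter.\n"

# Wrap the string so it can be read like an open text file
text_file = StringIO(extracted)
first_line = text_file.readline().strip()

# Iterating continues from where readline() stopped; skip blank lines
remaining = [line.strip() for line in text_file if line.strip()]
text_file.close()

print(first_line)
print(remaining)
```

Anything that expects a file, such as `csv.reader` in the earlier exercise, can consume the wrapped string the same way.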

That's all for today. Checking in~
