Foreword: officially starting my Python web scraping journey
Chapter 1. Your First Web Scraper
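1. Installing the libraries
This chapter uses two libraries: urllib and the BeautifulSoup 4 library (often called BS4). urllib is part of the Python standard library; BS4 has to be installed separately. On Windows 10, install it by running pip install beautifulsoup4. The output looks like this: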
D:\PythonProject\webScraping>pip install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.5.1-py3-none-any.whl (83kB)
100% |████████████████████████████████| 92kB 67kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.1
D:\PythonProject\webScraping>
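A quick way to confirm that the installation worked is to ask Python whether both modules can be found. This is only a minimal sketch: the helper name is_installed is my own (hypothetical, not part of either library), and it relies on importlib.util.find_spec from the standard library. Note that the package is installed as beautifulsoup4 but imported as bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# urllib ships with Python; bs4 is the import name of the beautifulsoup4 package.
for name in ("urllib", "bs4"):
    print(name, "installed:", is_installed(name))
```

If bs4 is reported as not installed, the pip command was probably run for a different Python interpreter than the one executing the script.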
2. A web scraping example
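from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    """Return the first <h1> inside <body>, or None if it cannot be retrieved."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The server returned an error status, or it could not be reached at all.
        return None
    try:
        # Name the parser explicitly to avoid bs4's "no parser specified" warning.
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        # bsObj.body is None when the page has no <body> tag.
        return None
    return title


title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
# bsObj = BeautifulSoup(html.read())
# print(bsObj.h1)
if title is None:
    print("Title not found")
else:
    print(title)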
3. Program output
a. The source of exercise1.html is as follows
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>
b. The scraper's output is as follows
<h1>An Interesting Title</h1>
Process finished with exit code 0
4. Notes on exception handling
html = urlopen(url)
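Two things can go wrong in the call to urlopen():
1. The requested page is not found on the server (or the server fails while handling the request)
2. The server itself cannot be found
The two cases surface as follows:
Case 1: the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error", and urlopen() raises HTTPError.
Case 2: urlopen() raises URLError, the parent class of HTTPError. It does not return None, so a handler that catches only HTTPError lets this error propagate.
Finally, when writing a scraper, balance thorough exception handling against readability.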
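The two failure cases of urlopen() can be handled separately, as in the sketch below. It is an illustration under assumptions, not the book's code: fetch_html and its opener parameter are hypothetical names of my own, and opener exists only so that the error paths can be tried without a network connection. HTTPError must be caught before URLError because it is a subclass of URLError.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the page body as bytes, or None if the request fails.

    `opener` is a hypothetical injection point (defaults to urlopen) so the
    error handling can be exercised with stand-in functions.
    """
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # Case 1: the server answered, but with an error status (404, 500, ...).
        print("HTTP error:", e.code)
    except URLError as e:
        # Case 2: the server could not be reached (bad host name, no network).
        print("Server not reachable:", e.reason)
    return None
```

Keeping the network call in one small function like this leaves the parsing code free of nested try blocks, which is one way to trade off exception handling against readability.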