Web Scraping with Python: Chapter 1 Reading Notes

Preface: the official start of my Python web scraping journey

Chapter 1. Your First Web Scraper

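1. Installing the libraries

This chapter uses two libraries: urllib and the BeautifulSoup 4 library (often called BS4). urllib is part of the Python standard library; BS4 has to be installed separately. On Windows 10, install it by running pip install beautifulsoup4. The output looks like this: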

D:\PythonProject\webScraping>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.5.1-py3-none-any.whl (83kB)
    100% |████████████████████████████████| 92kB 67kB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.5.1

D:\PythonProject\webScraping>
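
To confirm that both libraries are visible to the interpreter after installation, a quick check along these lines may help. It is only a sketch: is_installed is a hypothetical helper name, and it relies on the standard-library importlib.util.find_spec, which looks a module up without importing it.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be located (hypothetical helper)."""
    return find_spec(module_name) is not None


# The beautifulsoup4 package is imported under the name "bs4".
for name in ("urllib", "bs4"):
    print(name, "available" if is_installed(name) else "missing")
```

If bs4 is reported as missing, pip most likely installed it for a different Python interpreter than the one running the script.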

2. A web scraping example

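from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    """Return the first <h1> tag inside <body>, or None on any failure."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # HTTPError: the server answered with an error status (404, 500, ...).
        # URLError: the server could not be reached at all.
        return None
    try:
        # Name the parser explicitly so bs4 does not have to guess one.
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError:
        # bsObj.body is None when the page has no <body> tag.
        return None
    return title


title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")

if title is None:
    print("Title not found")
else:
    print(title)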
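
The second try block in getTitle guards against pages that lack the expected tags. The sketch below illustrates the two cases offline, using made-up HTML strings instead of a real page; first_h1 is a hypothetical helper, and the behaviour shown assumes Python's built-in "html.parser" backend, which does not insert a missing <body> tag.

```python
from bs4 import BeautifulSoup


def first_h1(markup):
    """Return the first <h1> inside <body>, or None (hypothetical helper)."""
    try:
        soup = BeautifulSoup(markup, "html.parser")
        # A missing <h1> simply yields None here. A missing <body> makes
        # soup.body None, so ".h1" on it raises AttributeError.
        return soup.body.h1
    except AttributeError:
        return None


print(first_h1("<html><body><h1>An Interesting Title</h1></body></html>"))
print(first_h1("<html><body><p>No heading</p></body></html>"))
print(first_h1("<p>No body tag at all</p>"))
```

In both failure cases the caller receives None, which is why the main script only has to test the return value once.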


3. Program output

a. The source of the exercise1.html page is as follows

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>
b. The scraper prints the following

<h1>An Interesting Title</h1>
Process finished with exit code 0

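4. Notes on exception handling

html = urlopen(url)
The call to urlopen() can fail in two ways:
1. The requested page is not found on the server (or the server hits an error while producing it).
2. The server itself cannot be found.
The two cases surface as follows:
In the first case the server returns an HTTP error such as "404 Page Not Found" or "500 Internal Server Error", and urlopen() raises HTTPError.
In the second case urlopen() raises URLError. HTTPError is a subclass of URLError, so when the two are handled separately the HTTPError clause has to come first.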
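
The two failure modes above can be told apart roughly as in the following sketch. The names describe_failure and fetch are hypothetical helpers introduced for illustration, and the 5-second timeout is an arbitrary choice.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Turn a urlopen() exception into a short message (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, HTTPError):
        return "server returned HTTP {}".format(exc.code)
    if isinstance(exc, URLError):
        return "server could not be reached: {}".format(exc.reason)
    return "unexpected error: {!r}".format(exc)


def fetch(url, timeout=5):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except URLError as e:  # also catches HTTPError, its subclass
        print(describe_failure(e))
        return None
```

Calling fetch() on a page that does not exist on a live server should take the HTTPError branch, while a host name that does not resolve should take the URLError branch.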

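Finally, when writing a scraper, thorough exception handling has to be balanced against readability. Wrapping the checks in a reusable function such as getTitle keeps the main script short.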


                