Learning Web Scraping with "Web Scraping with Python" (《Python网络数据采集》), Part 1

Latest recommended article published 2024-07-17 14:00:00

独孤墨殇灬

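Chapter 1: A First Look at Web Scrapers

1.1 Network Connections

This section explains the basic principle by which a browser retrieves information, then gives an example of fetching a page's source code with Python.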

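# Import the urlopen function from the request module of the urllib library
from urllib.request import urlopen
# Open the target URL with urlopen and assign the response object to html
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
# Read the contents of html and print them
print(html.read())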
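
One detail worth knowing here: read() returns a bytes object, which is why the printed output starts with b'. Below is a minimal sketch of turning it into text. The byte string is a hard-coded stand-in for a real response, and the UTF-8 encoding is an assumption for illustration.

```python
# Stand-in for the value returned by html.read(); a real page would come from urlopen.
raw = b'<html><head><title>A Useful Page</title></head><body><h1>An Interesting Title</h1></body></html>'

# bytes -> str. UTF-8 is assumed here; a real page may declare another charset.
text = raw.decode('utf-8')

print(type(raw).__name__)
print(type(text).__name__)
print(text)
```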

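1.2 An Introduction to BeautifulSoup

BeautifulSoup formats and organizes messy web content by locating HTML tags, and presents the XML structure to us as simple, easy-to-use Python objects.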

1.21 Installing BeautifulSoup

I'm using Windows 10, so I just typed the following into PowerShell:

pip install bs4

and that was it.
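
A quick way to confirm the install worked is to ask Python whether it can locate the package. This is only a sketch using the standard library; the variable name bs4_available is made up for the example.

```python
import importlib.util

# find_spec returns None when the package cannot be imported from this interpreter.
bs4_available = importlib.util.find_spec('bs4') is not None

print('bs4 is installed' if bs4_available else 'bs4 is missing; run: pip install bs4')
```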

1.22 Running BeautifulSoup

This is the same example as before, only this time implemented with BeautifulSoup.

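# Import the urlopen function from the request module of the urllib library
from urllib.request import urlopen
# Import the BeautifulSoup class from the bs4 library (mind the capitalization)
from bs4 import BeautifulSoup
# Open the target URL with urlopen and assign the response object to html
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
# Parse the contents of html with BeautifulSoup and assign the result to bsObj
# (naming the parser explicitly avoids the "no parser specified" warning)
bsObj = BeautifulSoup(html.read(), 'html.parser')
# Print the h1 tag of bsObj
print(bsObj.h1)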

The main point is that BeautifulSoup can extract information from a web page.
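
To try this without a network connection, the same kind of extraction can be run on an inline HTML string. This is a minimal sketch: sample_html is a made-up page, and the built-in 'html.parser' is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, inlined so the example runs offline.
sample_html = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>Some body text.</div>
  </body>
</html>
"""

bsObj = BeautifulSoup(sample_html, 'html.parser')

# These three expressions all reach the same <h1> tag.
print(bsObj.h1)
print(bsObj.body.h1)
print(bsObj.html.body.h1)

# get_text() drops the markup and keeps only the text inside the tag.
print(bsObj.h1.get_text())
```

The shorter bsObj.h1 form is convenient, while the longer paths make it explicit where in the document the tag is expected to be.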

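1.23 Reliable Network Connections

This section is about anticipating the unreliable things a scraper may run into and guarding against them in advance.
To begin with, consider this line:

html = urlopen('http://www.pythonscraping.com/pages/page1.html')

Two main things can go wrong on this line:
1. The page does not exist on the server
2. The server does not exist

In the first case, urlopen raises an HTTPError. The error may be "404 Page Not Found", "500 Internal Server Error", and so on. We can handle it like this: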

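from urllib.error import HTTPError
# Try to run this line
try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
# If an HTTPError is raised
except HTTPError as e:
    # Print the error
    print(e)
    # Here you can return None, stop the program, or fall back to another plan
# Otherwise
else:
    # The program continues. Note: if the exception above was caught, this else block does not run.
    pass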
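
The caught HTTPError also carries the status code and the reason, which can help decide what to do next. Here is a rough sketch. describe_http_error is a hypothetical helper, the missing.html path is invented, and the errors are built by hand only so the example runs without a network connection.

```python
from urllib.error import HTTPError


def describe_http_error(e):
    """Return a short message based on the HTTP status code (hypothetical helper)."""
    if e.code == 404:
        return 'page not found'
    if 500 <= e.code < 600:
        return 'server error, maybe retry later'
    return 'HTTP error {}: {}'.format(e.code, e.reason)


# Built by hand for illustration; normally urlopen raises these for you.
not_found = HTTPError('http://www.pythonscraping.com/missing.html',
                      404, 'Not Found', None, None)
server_error = HTTPError('http://www.pythonscraping.com/pages/page1.html',
                         500, 'Internal Server Error', None, None)

print(describe_http_error(not_found))
print(describe_http_error(server_error))
```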

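If the server does not exist at all, i.e. the domain cannot be opened, urlopen does not return None; in Python 3 it raises a URLError. We can catch that exception to detect the problem:

from urllib.error import URLError
try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except URLError as e:
    print('The server could not be found')
else:
    # The program continues
    pass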
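
In Python 3's urllib, HTTPError is a subclass of URLError, so when both cases are handled together the more specific except clause has to come first. The sketch below wraps that in a hypothetical helper, safe_open. The opener parameter and the fake unreachable function are assumptions added only so the URLError branch can be exercised without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_open(url, opener=urlopen):
    """Return the response, or None if the request failed (hypothetical helper)."""
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error:', e.code)
    except URLError as e:
        # No usable answer at all, e.g. the domain name did not resolve.
        print('Server not reachable:', e.reason)
    return None


# Fake opener used only to trigger the URLError branch offline.
def unreachable(url):
    raise URLError('name resolution failed')


print(safe_open('http://www.pythonscraping.com/pages/page1.html', opener=unreachable))
```

Swapping the two except clauses would send every HTTPError into the URLError branch, since the first matching clause wins.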

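A similar problem shows up with BeautifulSoup: accessing a tag that does not exist returns None, and accessing a child tag on that None raises an AttributeError.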

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
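
The same check can be written without try/except by walking the tag names one at a time and stopping at the first missing one. This is only a sketch: get_nested is a hypothetical helper, not part of BeautifulSoup, and the HTML string is a made-up sample.

```python
from bs4 import BeautifulSoup


def get_nested(tag, *names):
    """Follow tag names one by one, returning None as soon as one is missing.

    Hypothetical helper; not part of BeautifulSoup.
    """
    current = tag
    for name in names:
        if current is None:
            return None
        # find() returns None when no matching descendant exists.
        current = current.find(name)
    return current


bsObj = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>',
                      'html.parser')

print(get_nested(bsObj, 'body', 'h1'))
print(get_nested(bsObj, 'nonExistingTag', 'anotherTag'))
```

The first call prints the h1 tag; the second prints None instead of raising an AttributeError.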

Putting the code above together so it is easier to read:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return None if the page or the server cannot be reached
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        return None
    # Return None if the expected tags are missing
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)