Python Web Scraping Beginner Notes, Chapter 1: Your First Web Scraper

Noobfurid

Last modified 2022-03-28 18:03:55


First published 2022-03-28 17:56:48

Article link: https://blog.csdn.net/Noobfurid/article/details/123801166


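These notes are for readers who already know some Python and want to get a simple scraper working quickly. Most of the content and code comes from Web Scraping with Python by Ryan Mitchell (Chinese edition: 《Python网络爬虫权威指南》). Discussion and corrections are welcome.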

Chapter 1: Your First Web Scraper

1.1 Connecting

1.2 An Introduction to BeautifulSoup

1.2.1 Installing BeautifulSoup

1.2.2 Running BeautifulSoup

1.2.3 Connecting Reliably and Handling Exceptions

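Chapter 1: Your First Web Scraper

1.1 Connecting

The most important part of a web scraper is making the network connection. Here we use the urlopen function from the request module of urllib. The example below opens a URL and prints the raw HTML that the server returns (as bytes; nothing has been parsed yet):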

from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
print(html.read())
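
Note that read() returns raw bytes rather than text. The following is a minimal sketch of decoding them, assuming the page is UTF-8 encoded. The helper name fetch_text is hypothetical, and the data: URL is only an offline stand-in for a real page so the sketch runs without network access:

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Open url and return the response body decoded to str."""
    # The with-block closes the response once the body has been read.
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
    return raw.decode(encoding)


# A data: URL carries its content inline, so this demo needs no network.
page = fetch_text("data:text/html,<h1>Hello</h1>")
print(page)
```

For a real site you would pass an http:// or https:// URL instead, and might need a different encoding.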

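1.2 An Introduction to BeautifulSoup

1.2.1 Installing BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML documents, which makes it well suited to extracting data from web pages. Install it with pip: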

pip install beautifulsoup4

We can keep the installed libraries in a virtual environment so that each project stays independent of the others:

virtualenv scrapingEnv

cd scrapingEnv/
source bin/activate
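
To confirm from inside Python that the environment is really active, one option is to compare sys.prefix with sys.base_prefix. This is a small sketch that assumes the standard venv module or a recent virtualenv release; the function name in_virtualenv is hypothetical:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```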

1.2.2 Running BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html.read(),'html.parser')
print(bs.h1)

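What happens: we import urlopen, open the page, and call html.read() to get its HTML content. BeautifulSoup can also take the file object returned by urlopen directly, without calling .read() first. The HTML is passed to the BeautifulSoup object, which turns it into a nested structure whose tags can be reached with dot notation. Here bs.h1, bs.body.h1 and bs.html.body.h1 all return the same tag.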
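
Both points can be tried without a network connection. In the sketch below, the SAMPLE markup is a hypothetical stand-in for a downloaded page, and io.BytesIO plays the role of the file-like object that urlopen returns:

```python
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page, kept inline so no network is needed.
SAMPLE = b"""<html>
<head><title>A Useful Page</title></head>
<body><h1>An Interesting Title</h1><div>Some text.</div></body>
</html>"""

# BytesIO mimics the file-like object returned by urlopen;
# BeautifulSoup reads from it without an explicit .read() call.
bs = BeautifulSoup(io.BytesIO(SAMPLE), "html.parser")

# All three expressions reach the same <h1> tag.
print(bs.h1)
print(bs.body.h1)
print(bs.html.body.h1)

# get_text() returns only the text inside the tag.
print(bs.h1.get_text())
```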

Creating a BeautifulSoup object:

bs = BeautifulSoup(html.read(),'html.parser')

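The first argument is the HTML text the object is based on; the second names the parser that BeautifulSoup should use to build the object. Common parsers are 'html.parser', 'lxml' and 'html5lib'.

'html.parser' ships with Python and needs no installation. lxml and html5lib must be installed separately, and both tolerate and repair some malformed HTML. lxml is generally faster than html.parser, while html5lib is the most forgiving of the three but also the slowest.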
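
Because lxml and html5lib may not be installed everywhere, a script can fall back to the built-in parser. The sketch below shows one way to do that; the helper name make_soup and the preference order are assumptions, not something the book prescribes:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in parsers that is installed.

    Returns a (soup, parser_name) pair.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately sloppy markup: the <p> and <b> tags are never closed.
soup, used = make_soup("<p>Unclosed <b>tags")
print("parser used:", used)
print(soup.p.get_text())
```

Since 'html.parser' is always available, the loop succeeds as long as it stays in the list.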

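1.2.3 Connecting Reliably and Handling Exceptions

Two things can go wrong when calling urlopen:

The page is not found on the server
The server is not found

In the first case the server returns an HTTP error and urlopen raises HTTPError, which can be handled like this: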

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # Return None, break out, or fall back to some other plan
else:
    # The program continues
    pass

If the server returns an HTTP error code, the program prints the error and the code in the else block is not executed.
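
An HTTPError also carries the status code and a short reason, which is often more useful than printing the whole exception. In this small sketch the exception is constructed by hand so it runs offline; the helper name describe_http_error and the example.com URL are hypothetical:

```python
from urllib.error import HTTPError


def describe_http_error(error):
    """Build a short message from the status code and reason of an HTTPError."""
    return f"HTTP {error.code}: {error.reason}"


# Constructed by hand so the sketch runs offline; urlopen raises the same
# exception type when a server answers with an error status such as 404.
sample = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
print(describe_http_error(sample))
```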

In the second case urlopen raises URLError, so we add a check for it:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')
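
The order of the except clauses matters: HTTPError is a subclass of URLError, so it has to be caught first. The sketch below wraps both checks in a hypothetical helper, safe_open. To stay runnable offline it uses a nonexistent file: URL to trigger URLError and a data: URL for the success case; both are stand-ins for real web addresses:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_open(url):
    """Return the open response, or None when the request fails."""
    try:
        return urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all (unknown host, refused connection, ...).
        print("URL error:", e.reason)
    return None


# HTTPError is a subclass of URLError, so it has to be caught first.
print(issubclass(HTTPError, URLError))

# Offline stand-ins: a file: URL that does not exist fails with URLError,
# while a data: URL opens successfully.
print(safe_open("file:///no/such/dir/page1.html"))
print(safe_open("data:text/html,<h1>ok</h1>").read())
```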

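If you access a tag that does not exist in a BeautifulSoup object, you get None. Accessing a child tag on that None, however, raises AttributeError. You can guard against both cases like this: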

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(html.read(),'html.parser')
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
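
The same behaviour can be observed on an inline page without any network access. The sketch below uses a hypothetical page with no table tag, and also shows an explicit None check as an alternative to catching the exception; the helper name child_or_none is an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical inline page so the sketch runs without a network connection.
bs = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")

# A tag that does not exist comes back as None ...
print(bs.table)

# ... and asking None for a child tag raises AttributeError.
try:
    bs.table.tr
except AttributeError:
    print("Tag was not found")


def child_or_none(soup, parent_name, child_name):
    """Return the child tag inside the parent tag, or None if either is missing."""
    parent = soup.find(parent_name)
    if parent is None:
        return None
    return parent.find(child_name)


print(child_or_none(bs, "body", "h1"))
print(child_or_none(bs, "table", "tr"))
```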

Reorganizing the code makes it easier to write and to read:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    # Step 1: fetch the page; give up if the page or the server cannot be reached
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        return None
    # Step 2: parse it; accessing .h1 on a missing body raises AttributeError
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")  # "lxml" requires: pip install lxml
        title = bsObj.body.h1
    except AttributeError:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

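In this example we created the getTitle function, which returns the page's title (the first h1 tag inside body). If there is a problem retrieving the page, it returns None. The parsing step sits in its own try block: if the page has no body tag, bsObj.body is None and accessing .h1 on it raises AttributeError, which is caught and also turned into None. If body exists but contains no h1, title is simply None, so the caller only ever has to check for None.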
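
Separating the parsing step from the download makes it possible to check it without a network connection. This is a sketch under that assumption; the function name title_from_html and the sample markup are hypothetical, and 'html.parser' is used so no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup


def title_from_html(markup):
    """Return the first <h1> inside <body>, or None if it cannot be found."""
    bs = BeautifulSoup(markup, "html.parser")
    try:
        return bs.body.h1
    except AttributeError:
        # bs.body was None, so the markup has no <body> tag at all.
        return None


# Page with a heading: the <h1> tag is returned.
print(title_from_html("<html><body><h1>An Interesting Title</h1></body></html>"))
# Page with a body but no heading: None.
print(title_from_html("<html><body><p>No heading here</p></body></html>"))
# Fragment without a body tag: AttributeError is caught, None is returned.
print(title_from_html("<p>No body tag</p>"))
```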
