Web Scraping with the requests and beautifulsoup4 Libraries

Latest recommended article published on 2024-07-28 11:40:44

csdndscs

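1. Introduction

A web crawler generally works in two stages:

Fetch the page content over a network connection, that is, the page source written in HTML. Libraries that can do this include urllib, urllib2 (Python 2 only), urllib3, wget, scrapy, and requests.
Process the fetched page content, for example with re (regular expressions) or beautifulsoup4.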
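
The two stages can be sketched as two small functions. This is only an illustration: the names fetch_html, extract_title, and sample_html are hypothetical, and the parsing stage is run on a literal string so the sketch works without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Stage 1: download the page source over HTTP."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_title(html):
    """Stage 2: parse the source and pull out the <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title is not None else None


# Stage 2 on a literal string (made-up sample data, no network needed).
sample_html = "<html><head><title>Example page</title></head><body></body></html>"
page_title = extract_title(sample_html)
print(page_title)

# With network access the two stages chain together:
# print(extract_title(fetch_html("http://www.baidu.com")))
```

Keeping the download and the parsing in separate functions means either stage can be swapped for another library from the lists above.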

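The rest of this article covers requests and beautifulsoup4, the most important and most widely used of these libraries.

First, install the requests and beautifulsoup4 libraries from the command line with pip or pip3: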

:\>pip install requests #or pip3 install requests
······
······
:\>pip install beautifulsoup4 #or pip3 install beautifulsoup4

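2. Using the requests library

get() is the most common way to fetch a web page. A call to requests.get() returns the page content wrapped in a Response object. The url argument must use the HTTP or HTTPS scheme, for example: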

>>> import requests
>>> r=requests.get("http://www.baidu.com") #open the Baidu home page with get()
>>> type(r)
<class 'requests.models.Response'> #a Response object is returned

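requests.get() represents the request, and the Response object it returns represents the response. Having the result as an object makes it easy to work with. Its attributes, shown below, are accessed in the form <a>.<b>.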

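>>> r=requests.get('http://www.baidu.com')
>>> r.status_code #the status code of the response
200
>>> r.text #inspect the returned content and check whether Chinese characters display correctly
(output omitted)
>>> r.encoding #the response header declares no charset, so requests falls back to ISO-8859-1 and Chinese text is garbled
'ISO-8859-1'
>>> r.encoding='utf-8' #change the encoding to utf-8
>>> r.text #after the change, Chinese characters in the content display correctly
(output omitted)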
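
Why changing r.encoding fixes the text can be seen without any network call: r.text is the raw bytes of the response decoded with the codec named by r.encoding. The sketch below reproduces the effect on a plain bytes value; the sample string is an arbitrary choice.

```python
# Bytes as a server would send them for UTF-8 encoded Chinese text.
raw = "百度一下".encode("utf-8")

# Wrong codec: each byte is turned into one Latin-1 character (garbled text).
garbled = raw.decode("iso-8859-1")

# Right codec: the original four characters come back.
fixed = raw.decode("utf-8")

print(len(raw), len(garbled), len(fixed))
print(fixed)
```

Each Chinese character here takes three bytes in UTF-8, so the garbled version is three times as long as the correct one.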

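Sometimes get() returns a status other than 200. For that case, see the article "What to do when requests.get() returns a status code other than [200] while scraping?"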

The json() method parses JSON data contained in the HTTP response body.
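
A minimal sketch of using json() defensively follows. The helper names json_or_none and fetch_json are hypothetical. The sketch relies on the fact that the decoding errors raised by json() are subclasses of ValueError.

```python
import requests


def json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # JSON decoding errors from requests subclass ValueError
        return None


def fetch_json(url, timeout=10):
    """Hypothetical helper: GET a URL and decode its JSON body."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return json_or_none(response)
```

Returning None for a non-JSON body is just one possible design; letting the exception propagate is equally reasonable when a JSON reply is mandatory.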

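The raise_for_status() method raises an exception after an unsuccessful response: if status_code is a 4xx or 5xx error code, it raises HTTPError, which can be caught with a try-except statement. Other codes, including non-200 ones such as redirects, do not raise. Exception handling avoids a pile of complicated if statements: calling this method as soon as the response arrives is enough to catch error statuses.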
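
The behaviour can be checked offline by building Response objects by hand and setting only status_code. This is a sketch for illustration (real code gets its Response from requests.get()), and status_outcome is a hypothetical name.

```python
import requests


def status_outcome(status_code):
    """Report what raise_for_status() does for a given status code."""
    response = requests.Response()      # built by hand, no network involved
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return f"HTTPError: {exc}"
    return "no exception"


for code in (200, 302, 404, 503):
    print(code, status_outcome(code))
```

Only the 404 and 503 cases raise; 200 and 302 pass through silently.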

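requests raises several common exceptions:

On a network problem, such as a failed DNS lookup or a refused connection, requests raises ConnectionError.
On an invalid HTTP response, requests raises HTTPError.
If the request times out, it raises Timeout.
If the request exceeds the configured maximum number of redirects, it raises TooManyRedirects.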
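
The exceptions listed above can be handled one by one, as in the sketch below. The function name describe_fetch and its get parameter are hypothetical; the parameter exists only so the handler can be exercised with a stand-in instead of a real network call.

```python
import requests


def describe_fetch(url, get=requests.get):
    """Return a short label describing the outcome of fetching url."""
    try:
        response = get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        # Checked first: ConnectTimeout is both a Timeout and a ConnectionError.
        return "timeout"
    except requests.exceptions.TooManyRedirects:
        return "too many redirects"
    except requests.exceptions.ConnectionError:
        return "connection error"
    except requests.exceptions.HTTPError:
        return "http error"
    return "ok"
```

All four classes derive from requests.exceptions.RequestException, so a single except clause for that base class is enough when the cause does not matter.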

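The function below is a recommended way to fetch the content of a web page: lines 2 to 9 of the code define the function, and lines 10 and 11 are test code.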

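import requests
def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status() #raise an exception on a 4xx or 5xx status
        r.encoding='utf-8' #decode as utf-8 regardless of the declared encoding
        return r.text
    except requests.RequestException: #network error, timeout, or error status
        return ""
url='http://www.baidu.com'
print(getHTMLText(url))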

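3. Using the beautifulsoup4 library

The beautifulsoup4 library, also called Beautiful Soup or bs4, is object-oriented, and its most important class is BeautifulSoup. After importing the BeautifulSoup class with a from-import statement, call BeautifulSoup() to create a BeautifulSoup object.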

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r=requests.get("http://www.baidu.com")
>>> r.encoding="utf-8" #exception handling is omitted to keep the code short
>>> soup=BeautifulSoup(r.text,"html.parser") #soup is a BeautifulSoup object; naming the parser avoids a warning
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </div> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="http://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div> </div> <div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必读</a>  <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a> 京ICP证030173号  <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div> </div> </body> </html>

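The BeautifulSoup object is a tree that contains every Tag element of the HTML page, such as <head> and <body>. More precisely, each main structure in the HTML becomes an attribute of the BeautifulSoup object and can be accessed directly in the form <a>.<b>, where <b> is the name of the HTML tag. The examples below show commonly used attributes of the BeautifulSoup object.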

>>> soup.head #<style> tag output omitted
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head>
>>> title=soup.title
>>> title
<title>百度一下，你就知道</title>
>>> type(title) #each attribute that corresponds to an HTML tag is of type Tag
<class 'bs4.element.Tag'>
>>> soup.p
<p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p>
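
The same attribute access can be tried offline on a small literal document. The HTML below is made-up sample data, not the Baidu page.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p id="first">First paragraph</p>
    <p id="second">Second paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)   # the <title> tag, as a Tag object
print(soup.p)       # only the first <p> in the document
print(soup.table)   # None: the document has no <table>
```

Because a missing tag yields None rather than an exception, it is worth checking the result before chaining further attribute access onto it.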

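Each tag is also an object in the beautifulsoup4 library, called a Tag object. In the example above, title is a Tag object. Every tag has the same general structure in HTML, for example <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>.

Here, the tag name inside the angle brackets (<>) is name, the other items inside the angle brackets are attrs, and the content between the opening and closing tags is string. These can therefore be read through the name, attrs, and string attributes of the Tag object, using the <a>.<b> syntax. A Tag has four commonly used attributes (name, attrs, contents, and string), illustrated below: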

>>> soup.a
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
>>> soup.a.name
'a'
>>> soup.a.attrs
{'href': 'http://news.baidu.com', 'name': 'tj_trnews', 'class': ['mnav']}
>>> soup.a.string
'新闻'
>>> title.name #the title variable was defined in the previous example
'title'
'title'
>>> title.string
'百度一下，你就知道'
>>> soup.p.contents
[' ', <a href="http://home.baidu.com">关于百度</a>, ' ', <a href="http://ir.baidu.com">About Baidu</a>, ' ']
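
Individual attribute values can also be read with dictionary-style indexing on the Tag. The sketch below parses one <a> tag copied from the output above and contrasts the tag's name with its HTML attribute that happens to be called name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>',
    "html.parser",
)
link = soup.a

print(link.name)          # the tag name: 'a'
print(link["name"])       # the HTML attribute called name: 'tj_trnews'
print(link["href"])       # a single attribute value
print(link.get("class"))  # class can hold several values, so it is a list
print(link.get("title"))  # None: get() does not raise for a missing attribute
print(link.contents)      # the list of direct children
```

Indexing with link["title"] would raise KeyError here, so get() is the safer choice for optional attributes.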

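Because HTML allows tags to be nested inside other tags, the value of the string attribute follows these rules:

If the tag contains no other tags, string returns its text content.
If the tag contains exactly one child tag and nothing else, string returns the content of the innermost tag.
If the tag contains more than one child (several tags, or text mixed with tags), string returns None, which is not the same as an empty string.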
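
The three cases can be checked on a small made-up fragment. The sketch also uses get_text(), which is not covered above, to collect the text when string is None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p id='plain'>text only</p>"
    "<p id='nested'><b>inside b</b></p>"
    "<p id='mixed'>before <b>bold</b> after</p>",
    "html.parser",
)
plain, nested, mixed = soup.find_all("p")

print(plain.string)      # no child tags: the text itself
print(nested.string)     # exactly one child tag: the text of that child
print(mixed.string)      # text mixed with a tag: None
print(mixed.get_text())  # all text inside the tag, joined together
```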

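The same tag can occur many times in an HTML document. The <a> tag, for example, appears in many places on the Baidu home page, and soup.a returns only the first one. To list every occurrence of a tag, or to reach one that is not the first, use the find() and find_all() methods of BeautifulSoup. Both traverse the whole HTML document and return the tags that match the given conditions.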

BeautifulSoup.find_all(name,attrs,recursive,string,limit)

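Purpose: finds the tags that match the arguments and returns them as a list.

The parameters are as follows.

name: searches by tag name, given as a string such as div or li.

attrs: searches by tag attribute value; the attribute names and values are given as a dictionary (JSON-like key-value pairs).

recursive: sets the search depth; use recursive=False to search only the direct children of the current tag.

string: searches the string content by keyword, written as string=...

limit: the maximum number of results to return; by default all results are returned.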
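
Each parameter can be tried on a small literal fragment, as in the sketch below. The HTML is made-up sample data, so the counts apply to this fragment only.

```python
import re

from bs4 import BeautifulSoup

html = """
<div id="nav">
  <a class="mnav" href="/news">News</a>
  <a class="mnav" href="/map">Map</a>
  <span><a class="more" href="/more">More products</a></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                            # every <a> in the tree
by_attrs = soup.find_all("a", attrs={"class": "mnav"})  # <a> tags with class mnav
direct = soup.div.find_all("a", recursive=False)        # direct children of <div> only
by_string = soup.find_all(string=re.compile("More"))    # matching strings, not tags
limited = soup.find_all("a", limit=1)                   # stop after the first match

print(len(by_name), len(by_attrs), len(direct), by_string, len(limited))
```

The <a> inside <span> is found by the plain search but skipped with recursive=False, because it is a grandchild of the <div>.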

>>> a=soup.find_all('a') #find all <a> tags
>>> len(a)
11
>>> soup.find_all('script')
[<script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script>]
>>> import re #the regular expression library, used here to match string fragments
>>> soup.find_all(string=re.compile('百度'))
['百度一下，你就知道', '关于百度', '使用百度前必读']
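
find() takes the same search arguments as find_all() but returns only the first match, or None when nothing matches. A short sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a href="/a">first</a> <a href="/b">second</a></p>',
    "html.parser",
)

first = soup.find("a")                            # first matching Tag
missing = soup.find("table")                      # None when there is no match
hrefs = [a["href"] for a in soup.find_all("a")]   # one value per match

print(first.string, missing, hrefs)
```

A common pattern is to loop over find_all() results and read one attribute from each, as done for hrefs here.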

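Source: Song Tian (嵩天), Li Xin (礼欣), Huang Tianyu (黄天羽). 《Python语言程序设计基础（第2版）》 (Fundamentals of Python Programming, 2nd edition), Higher Education Press (高等教育出版社).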