Python Web Scraping Basics (2): Toolkits. The download packages requests and urllib, and the parsing packages BeautifulSoup (bs4) and lxml.etree.xpath

天狼啸月1990

Last modified 2024-06-06 10:08:26

Category: web scraping. Tags: scraping toolkits

First published 2021-05-30 09:05:07

Article link: https://blog.csdn.net/qq_33419476/article/details/117394430

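1. HTML download toolkits

1.1 The urllib package

urllib.parse.quote(content) <-- A URL may contain only a subset of ASCII characters. Other characters (Chinese characters, for example) do not conform to the standard and must be percent-encoded first.
urllib.request.Request --> urlopen() on its own can send the most basic HTTP request, but to attach headers and other information, build the request with the Request class.

Signature: urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

headers: the request headers, as a dict. They are used to imitate a browser; the default User-Agent is Python-urllib/x.y. To imitate a desktop browser instead (the string below is a Chrome one):
headers = {    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
method: the HTTP method, e.g. 'GET', 'POST', 'PUT'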

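    import urllib.parse
    import urllib.request

    # Fetch and download the HTML page (content is the search keyword)
    url = 'https://baike.baidu.com/item/' + urllib.parse.quote(content)      # request URL
    # Request headers that imitate a browser, so the crawler is less likely to be blocked
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    # Build the request object from the URL and the headers
    req = urllib.request.Request(url=url, headers=headers, method='GET')
    response = urllib.request.urlopen(req)      # send the request and get the response
    text = response.read().decode('utf-8')      # read the response body and decode it to text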
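
To see what quote() and Request actually produce without touching the network, here is a minimal sketch. The helper name build_request and the keyword '爬虫' are made up for illustration, and the shortened User-Agent is only a placeholder.

```python
import urllib.parse
import urllib.request


def build_request(keyword):
    """Build (but do not send) a GET request for a Baidu Baike entry."""
    # quote() percent-encodes the UTF-8 bytes of every non-ASCII character
    url = 'https://baike.baidu.com/item/' + urllib.parse.quote(keyword)
    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder browser-like value
    return urllib.request.Request(url=url, headers=headers, method='GET')


req = build_request('爬虫')
print(req.full_url)                    # the encoded URL
print(req.get_method())                # the HTTP method
print(req.get_header('User-agent'))    # Request stores header names capitalized
```

Passing req to urllib.request.urlopen() would then send it, exactly as in the snippet above.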

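The urllib and urllib2 modules offer similar functionality. Put simply, urllib2 is an enhanced urllib: urllib2 is generally better, but urllib has some functions that urllib2 lacks. For simple downloads, urllib is more than enough. If you need HTTP authentication or cookies, or want to write extensions that handle your own protocols, urllib2 is probably the better choice. Python 2.x ships both urllib and urllib2, and the two standard libraries cannot replace each other. In Python 3.x, however, urllib2 was merged into urllib (as urllib.request and urllib.error), which is worth keeping in mind.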
————————————————
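Copyright notice: the passage above is from an original article by CSDN blogger 「IoneFine」, licensed under CC 4.0 BY-SA. When reposting, include the link to the original source and this notice.
Original link: https://blog.csdn.net/jiduochou963/article/details/87564467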

1.1.1 urllib error 1

The request to the URL built with urllib.parse.quote(content) fails with: Failed to establish a new connection: [Errno 61] Connection refused

Cause: the server had not been started (oops).

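1.2 The Requests package

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib, and it supports Python 3.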

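1.2.1 requests error 1

Calling .json() on a requests response object raises:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Cause: when scraping, some pages transmit their content as Unicode escape sequences, so the body cannot be parsed as JSON directly. More generally, this error means the body does not start with a valid JSON value, for example because it is empty or is an HTML page.

Fix:

Decode the body yourself, for example:
import requests

reps = requests.get(url=url)   # url is the address being requested
text = reps.content.decode("utf-8")
# or use this statement instead:  text = reps.content.decode("unicode_escape")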
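
The decode-then-parse idea can be wrapped in a small guard so that a non-JSON body does not crash the crawler. This is only a sketch using the standard json module; parse_json_body is a hypothetical helper, and the byte strings stand in for what reps.content would hold.

```python
import json


def parse_json_body(body: bytes):
    """Return the parsed JSON value, or None when the body is not JSON."""
    text = body.decode('utf-8', errors='replace').strip()
    if not text:
        return None          # empty body: nothing to parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None          # e.g. an HTML error page instead of JSON


ok = parse_json_body(b'{"name": "\\u4e2d"}')        # \uXXXX escapes are handled by json
html_page = parse_json_body(b'<html>not json</html>')
empty = parse_json_body(b'')
print(ok, html_page, empty)
```

When the function returns None, printing the first few hundred characters of the decoded body is usually the quickest way to see what the server really sent.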

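2. HTML parsing toolkits

2.1 The BeautifulSoup (bs4) package

Official documentation (Chinese): Beautiful Soup 4.12.0 documentation

BS4, whose full name is Beautiful Soup 4, provides simple, Pythonic functions for navigating, searching and modifying the parse tree.

It is a toolbox: when it parses a document into a soup, it automatically converts the input to Unicode, and it encodes output documents as UTF-8.

Tag object. A tag in the HTML. BeautifulSoup parses out the contents of each Tag, and a tag is reached as soup.name, where name is the tag name (e.g. soup.title)
BeautifulSoup object. The whole HTML document; it can be treated as a Tag object
NavigableString object. The text inside a tag.
Comment object. A special NavigableString that holds an HTML comment.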
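
A quick way to get a feel for the four object types is to parse a tiny document and inspect what comes back. The HTML string here is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p                              # a Tag, reached as soup.<tag name>
text_node, comment_node = p.contents    # the two direct children of <p>

print(type(soup).__name__)              # the document object itself
print(isinstance(p, Tag))
print(type(text_node).__name__)         # the plain text inside <p>
print(type(comment_node).__name__)      # the HTML comment
print(isinstance(comment_node, NavigableString))  # Comment subclasses NavigableString
```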

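Ways to create a BeautifulSoup object: from a string, from an online page, or from an HTML file

Converting a bs4.element.Tag to a string: cast it with str() (Python is great!)

Converting a str back to a bs4.element.Tag: parse the string again with BeautifulSoup

# br_soup = BeautifulSoup(str(br), 'lxml')

# print(type(br_soup))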
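
The round trip between Tag and str can be sketched as follows. The sample markup is made up, and html.parser is used here so that the re-parsed fragment is not wrapped in extra tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>text1</div>', 'html.parser')
br = soup.div                                     # a bs4.element.Tag

br_str = str(br)                                  # Tag -> str
br_soup = BeautifulSoup(br_str, 'html.parser')    # str -> BeautifulSoup object
br_tag = br_soup.div                              # pull the Tag back out of it

print(type(br_str).__name__, type(br_soup).__name__, type(br_tag).__name__)
print(br_str)
```

Note that re-parsing gives a BeautifulSoup object rather than a bare Tag, so the tag still has to be picked out of it. With the 'lxml' parser the fragment is additionally wrapped in html and body tags.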

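Source: "Python scraping: how to create a BeautifulSoup object" (CSDN blog)

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Three possible sources of HTML; use one of them
html = '<div>text1</div>'                                            # 1. a string
# html = urlopen("http://www.pythonscraping.com/pages/page3.html")   # 2. an online page
# html = open('c:\\aa.html')                                         # 3. a local HTML file

bsObj = BeautifulSoup(html, 'html.parser') # 'html.parser' is the parser; 'lxml' can be used as well
# Calling BeautifulSoup(...) works like a constructor in C++
e = bsObj.find('div')
print(e.text)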

2.1.1 BeautifulSoup_object.find(): extracting a tag

find() returns only the first matching tag below the current tag, as a single Tag object (or None when nothing matches).
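
Because find() hands back None when nothing matches, it is worth checking the result before calling methods on it. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li>one</li><li>two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find('li')      # only the first <li> is returned
missing = soup.find('table')    # no <table> in the document

print(first_li.get_text())
print(missing)

# Guard against None before using the result
table_text = missing.get_text() if missing is not None else ''
```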

2.1.2 BeautifulSoup_object.find_all(): extracting tags

find_all() returns every matching tag below the current tag, as a list of tags. For example,

title = soup.find_all('div', class_='basicInfo_item name')

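--> Note: only the class attribute needs the trailing underscore (class_), because class is a reserved word in Python

find_all() supports nested queries: it can be called on the bs4 object and also on any tag.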

for ul in soup.find_all(name = 'ul'):
    print(ul.find_all(name='li'))

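find_all(name, attrs, recursive, text, **kwargs)   (newer Beautiful Soup versions call the text argument string)

The name argument. Finds all tags whose name is name; text strings are ignored automatically. E.g. find_all(name='title') or find_all('title')
Keyword arguments. When searching by a named attribute, the value may be a string, a regular expression, a list, or True. E.g. find_all(attrs={'id': 'link2'}) or find_all(id='link2'); find_all(href=re.compile('elsie')); combined lookup: find_all('div', class_='abcd'). Passing several keyword arguments filters on several attributes at once: find_all(href=re.compile('elsie'), id='link1')
The attrs argument. Takes a dict and finds tags with special attributes that cannot be written as keyword arguments. E.g. find_all(attrs={'data-foo': 'value'})
The text argument. Searches the strings in the document; accepts a string, a regular expression, a list, or True. E.g. find_all(text='Elsie'); find_all(text=['Tillie', 'Elsie', 'Lacie']); find_all(text=re.compile('link'))
Mixing with other arguments. find_all('a', text='Elsie')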
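
The filters listed above can be tried side by side on a small document. The markup below is invented for the purpose, and the sketch uses string=, the newer spelling of the text= argument.

```python
import re
from bs4 import BeautifulSoup

html = '''
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3" data-foo="value">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find_all('a')                            # by tag name
by_id = soup.find_all(id='link2')                       # by keyword argument
by_regex = soup.find_all(href=re.compile('elsie'))      # by regular expression
by_class = soup.find_all('a', class_='sister')          # tag name + class
by_attrs = soup.find_all(attrs={'data-foo': 'value'})   # attribute passed in a dict
by_string = soup.find_all(string=['Tillie', 'Elsie'])   # returns the strings themselves
mixed = soup.find_all('a', string='Elsie')              # returns the tags whose text matches

for result in (by_name, by_id, by_regex, by_class, by_attrs, by_string, mixed):
    print(result)
```

Searching by string alone returns the text nodes, while combining it with a tag name returns the enclosing tags.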

    # Read the response and get the text
    text = response.read().decode('utf-8')
    # Parse the HTML page
    soup = BeautifulSoup(text, 'lxml')  # create the soup object from the HTML source

    intro_tag = soup.find_all('div', class_="lemma-summary")  # the summary block(s) of the Baike entry
    name_tag = soup.find_all('dt', class_="basicInfo-item name")  # find all matching dt tags; returns a list of tags
    value_tag = soup.find_all('dd', class_="basicInfo-item value")  # find all matching dd tags; returns a list of tags

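2.1.3 BeautifulSoup.select(): extracting tags

select() takes a CSS selector and returns a list of matching tags

By tag name. E.g. soup.select('title')
By class name. E.g. soup.select('.sister')
By id. E.g. soup.select('#link1')
Combined lookup. The tag name, class name and id keep the same notation and are simply separated by a space (a descendant selector). E.g. soup.select('p #link1')
Child tag lookup. soup.select('head>title'). Note: in older Beautiful Soup versions the child selector did not support attribute filters or combined lookups; versions from 4.7 on use Soup Sieve and do support them
Attribute lookup. Attributes can be added to the selector inside square brackets. The attribute belongs to the same node as the tag, so there must be no space between them, otherwise nothing will match

E.g. soup.select('a[href="http://example.com.elsie"]'). Attribute lookup can also be used inside a combined lookup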
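
Here is a small, self-contained run of the selector forms listed above. The document is invented for illustration, and select_one() is shown as a convenience for the first match.

```python
from bs4 import BeautifulSoup

html = '''
<html><head><title>Demo</title></head>
<body>
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('title')             # tag name
by_class = soup.select('.sister')         # class name
by_id = soup.select('#link1')             # id
combined = soup.select('p #link1')        # descendant of <p> with that id
child = soup.select('head > title')       # direct child
by_attr = soup.select('a[href="http://example.com/elsie"]')  # no space before [
first = soup.select_one('a.sister')       # first match only, or None

for result in (by_tag, by_class, by_id, combined, child, by_attr, first):
    print(result)
```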

2.1.4 BeautifulSoup_object: getting tag text and attribute values

<a class= "lemma-album layout-right nslog: 10000206" href= "url"> hello, world
    <img class= "picture" src= "url">
    </img>
</a>

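tag.get_text() --> returns the text inside the current tag, including the text inside its child nodes: "hello, world". tag.string returns the text of the current node, but when the node contains more than one child, .string is ambiguous and returns None.
tag.get('href'), tag.get('class') or tag['id'], tag.attrs['id'] --> return attribute values of this tag itself; they cannot return attribute values of child tags. To reach the attributes of child tags, see bs4_object.select()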
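
The differences between these accessors show up clearly on a trimmed-down version of the tag above. The attribute values in this sketch are placeholders.

```python
from bs4 import BeautifulSoup

html = ('<a class="lemma-album layout-right" href="url">hello, world'
        '<img class="picture" src="img-url"/></a>')
soup = BeautifulSoup(html, 'html.parser')
a = soup.a

print(a.get_text())      # text of the tag and of everything inside it
print(a.string)          # None here: <a> has two children (text and <img>)
print(a.get('href'))     # attribute value, or None if the attribute is missing
print(a['class'])        # class is multi-valued, so a list comes back
print(a.attrs['href'])   # attrs is the underlying dict of attributes
print(a.get('id'))       # missing attribute -> None (a['id'] would raise KeyError)
print(a.img['src'])      # a child's attributes are read from the child tag
```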

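--> How does BeautifulSoup get a tag's text without the text of its child nodes?

The contents attribute returns the direct children of the current tag as a list, so you can index into it to pick out the tag node or text you want.

# the list returned by contents
[<span class="title-prefix">潘建伟</span>, '人物履历']
print(i.find(class_='title-text').contents[1])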
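
Indexing into contents works when the position of the text is known. A slightly more general sketch keeps only the direct string children, whatever their position. The h2 wrapper here is an assumption, reconstructed from the contents list shown above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '<h2 class="title-text"><span class="title-prefix">潘建伟</span>人物履历</h2>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find(class_='title-text')

print(title.contents)        # direct children: the <span> tag and a string
print(title.contents[1])     # index-based access, as above

# Keep only the text that belongs to the tag itself, not to its children
own_text = ''.join(c for c in title.contents if isinstance(c, NavigableString))
print(own_text)
print(title.get_text())      # for comparison: includes the <span> text
```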

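2.1.5 BeautifulSoup_object: getting tags at the same level (sibling nodes)

next_sibling and next_siblings: the next sibling of the current node, and a generator over all following siblings
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching one.
previous_sibling and previous_siblings: the previous sibling of the current node, and a generator over all preceding siblings
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching one.
BeautifulSoup(sibling_html, 'html.parser') parses normally here, whereas lxml may give unexpected results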

sibling_soup = BeautifulSoup(sibling_html, 'html.parser')
br = sibling_soup.p
while br.next_sibling is not None:
    print(br)
    br = br.next_sibling
---------------------------------------------------------------
for tag in soup.select('div .col-md-4'):
    if tag.get_text() == 'Total':
        result = tag.next_sibling.get_text()

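--> Check whether each br among the returned siblings is actually a tag, because some sibling nodes are not tags (whitespace text between tags, for example).

for br in i.next_siblings:  # all siblings after the "人物履历" tag
    # print(br)
    if isinstance(br, bs4.element.Tag):  # check that br is a tag
        attrs = ' '.join(br.get('class', []))
        if attrs == 'para':
            br_text_list.append(br.get_text())
        elif re.match('para-title level', attrs):  # the next section title: stop
            break
    else:
        continue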
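
To make the sibling walk concrete, here is a runnable sketch on invented markup that mimics the layout above: a section title followed by paragraphs, then the next title.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sibling_html = '''
<div>
  <h2 class="para-title level-2">Career</h2>
  <div class="para">first paragraph</div>
  <div class="para">second paragraph</div>
  <h2 class="para-title level-2">Awards</h2>
  <div class="para">not collected</div>
</div>
'''
soup = BeautifulSoup(sibling_html, 'html.parser')
start = soup.find('h2')

print(repr(start.next_sibling))               # whitespace text, not a tag
print(start.find_next_sibling().get_text())   # skips text nodes, returns the next tag

collected = []
for node in start.next_siblings:
    if not isinstance(node, Tag):
        continue                    # ignore the whitespace between tags
    classes = node.get('class', [])
    if 'para-title' in classes:
        break                       # reached the next section title
    if 'para' in classes:
        collected.append(node.get_text())
print(collected)
```

Testing membership in the class list avoids having to join the class names into one string first.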

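2.1.6 BeautifulSoup_object: getting descendant and ancestor nodes

children returns a generator over the direct children; descendants walks the tree recursively and yields all descendant nodes.
parent returns the parent of an element; parents yields all of its ancestors.
find_parents() and find_parent(): the former returns all matching ancestors, the latter returns the nearest matching one (the direct parent when no filter is given).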
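
A short sketch of walking down and up the tree, on made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div id="outer"><ul><li><b>one</b></li><li>two</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

children = [c.name for c in ul.children]                              # direct children only
descendants = [d.name for d in ul.descendants if isinstance(d, Tag)]  # every nested tag
parent = ul.parent.name                                               # direct parent
ancestors = [p.name for p in ul.parents]                              # up to the document object
li_parents = soup.b.find_parents('li')                                # ancestors filtered by name
outer = soup.b.find_parent('div')                                     # nearest matching ancestor

print(children, descendants, parent, ancestors)
print(li_parents, outer['id'])
```

descendants also yields the text nodes, which is why the example filters on Tag.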

2.1.7 BeautifulSoup_object: removing, inserting and replacing nodes

Reference: the official Beautiful Soup documentation in Chinese, Beautiful Soup 4.12.0 documentation

Tag.clear() removes the contents of the current tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>

PageElement.extract() removes the current tag from the document tree and returns it as the result:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to</a>

i_tag
# <i>example.com</i>

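print(i_tag.parent)
# None

This method effectively produces two document trees: one rooted at the BeautifulSoup object that parsed the original document, the other rooted at the tag that was removed and returned. extract can be called again on the tag that was removed and returned.

Tag.decompose() removes the current node from the document tree and destroys it completely:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')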
a_tag = soup.a

soup.i.decompose()

a_tag
# <a href="http://example.com/">I linked to</a>

PageElement.replace_with() removes a piece of content from the document tree and puts a new tag or text node in its place:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or text node that was replaced, so it can be inspected or added back somewhere else in the document tree
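
The section title also mentions insertion, which the examples above do not cover. As a hedged sketch, Tag.append() and Tag.insert() are the usual ways to add nodes; the inserted strings below are made up.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

new_tag = soup.new_tag('b')
new_tag.string = ' (new)'
a_tag.append(new_tag)        # add as the last child of <a>
a_tag.insert(0, 'See: ')     # insert a string as the first child of <a>

print(a_tag)
```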

2.1.8 bs4 error 1

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Cause: BeautifulSoup(html, 'lxml') cannot be used because lxml is not installed, so bs4 has no lxml parser available

Fix: pip3 install lxml
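
If the code has to run on machines where lxml may be missing, one option is to fall back to the built-in parser. This is a sketch of that idea; make_soup is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # lxml is not installed: use the parser from the standard library
        return BeautifulSoup(html, 'html.parser')


soup = make_soup('<div>text1</div>')
print(soup.div.get_text())
```

The two parsers can build slightly different trees from broken HTML, so installing lxml remains the cleaner fix when the rest of the code assumes it.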

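2.2 The lxml.etree.HTML package

2.2.1 lxml.etree.xpath: extracting tags

/ --> selects direct children along a fixed path, loosely like find(); // --> selects matching nodes at any depth, like find_all(). Each is followed by a tag name. [@ ] --> the @ is followed by an attribute name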

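# Select tags by class attribute + extract the href attribute values
link_list = html_et.xpath('//div[@class="main-content"]//a/@href')

# Select tags by class attribute + extract the text of the tags
text_list = html_et.xpath('//div[@class="lemma-summary"]//text()')

# Fuzzy matching with starts-with
ele = html_et.xpath('//input[starts-with(@class, "tag")]')  # matches e.g. class="tagyou"

# Fuzzy matching on a suffix: ends-with() belongs to XPath 2.0 and lxml implements XPath 1.0,
# so the same test is written with substring()
ele = html_et.xpath('//input[substring(@class, string-length(@class) - string-length("tag") + 1) = "tag"]')  # matches e.g. class="youtag"

# Fuzzy matching with contains
ele = html_et.xpath('//input[contains(@class, "tag")]')  # matches e.g. class="youtagyou"

# Fuzzy matching: any attribute whose value equals "tag"
ele = html_et.xpath('//input[@*="tag"]')

# Locate an element by index
ele = html_et.xpath('/a/b/input[4]')

# Index positions can shift when elements change (input[4] becomes input[3]), so last() selects the last element instead
ele = html_et.xpath('/a/b/input[last()]')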
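
The expressions above can be checked against a small document. The markup and attribute values below are invented; the suffix test uses substring(), on the understanding that lxml supports XPath 1.0 only.

```python
from lxml import etree

html_text = '''
<html><body>
<div class="main-content">
  <a href="/item/a">A</a>
  <a href="/item/b">B</a>
</div>
<form>
  <input class="tagyou" name="first"/>
  <input class="youtag" name="second"/>
  <input class="youtagyou" name="third"/>
</form>
</body></html>
'''
html_et = etree.HTML(html_text)   # build the element tree

links = html_et.xpath('//div[@class="main-content"]//a/@href')
starts = html_et.xpath('//input[starts-with(@class, "tag")]/@name')
contains = html_et.xpath('//input[contains(@class, "tag")]/@name')
ends = html_et.xpath(
    '//input[substring(@class, string-length(@class) - string-length("tag") + 1) = "tag"]/@name'
)
last = html_et.xpath('//form/input[last()]/@name')

print(links, starts, contains, ends, last)
```

xpath() is a method of the parsed element, so it is called on html_et rather than on the etree module.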

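Before using lxml, make sure the HTML bytes have been decoded with the right encoding, e.g. code = html.decode('utf-8', 'ignore'); otherwise parsing can go wrong

--> Character encodings (charset) used in HTML source include GB2312, GBK, UTF-8, ISO-8859-1, etc.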
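
When the encoding of a page is not known in advance, one simple approach is to try a few candidates in order. This is a sketch only; decode_html and its default candidate list are assumptions, not part of lxml.

```python
def decode_html(raw: bytes, encodings=('utf-8', 'gbk', 'iso-8859-1')) -> str:
    """Decode raw HTML bytes with the first candidate encoding that succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: drop the bytes that cannot be decoded
    return raw.decode('utf-8', 'ignore')


from_utf8 = decode_html('中文'.encode('utf-8'))
from_gbk = decode_html('中文'.encode('gbk'))
print(from_utf8, from_gbk)
```

The order matters: iso-8859-1 accepts any byte sequence, so it belongs at the end of the list. A charset declared in the response headers or in a meta tag, when present, is a better guide than guessing.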

    # Read the response and get the text
    text = response.read().decode('utf-8')
    # Build the _Element object
    html = etree.HTML(text)
    # Match data with xpath; the result is a list of matched strings
    sen_list = html.xpath('//div[contains(@class,"lemma-summary")]//text()')
    # sen_list = html.xpath('//div[@class="lemma-summary"]//text()')
    # Filter the data: drop the newline-only items
    sen_list_after_filter = [item.strip('\n') for item in sen_list if item != '\n']
    # Join the list of strings into one string and return it
    return ''.join(sen_list_after_filter)
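
Wrapped in a function and fed a sample string, the same steps can be run offline. The function name extract_summary and the sample markup are made up for illustration.

```python
from lxml import etree


def extract_summary(text):
    """Return the text of the lemma-summary block as one string."""
    html = etree.HTML(text)
    # All text nodes below any div whose class contains "lemma-summary"
    sen_list = html.xpath('//div[contains(@class,"lemma-summary")]//text()')
    # Drop newline-only items and trim newlines from the rest
    sen_list_after_filter = [item.strip('\n') for item in sen_list if item != '\n']
    return ''.join(sen_list_after_filter)


sample = ('<div class="lemma-summary">\n'
          '<div class="para">Python is a <b>language</b>.</div>\n'
          '</div>')
summary = extract_summary(sample)
print(summary)
```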