Crawler Parsing Libraries (9. Beautiful Soup)

川野先生

Category column: Advanced Crawler Case Tutorials. Tags: crawler, python, regular expressions

Article link: https://blog.csdn.net/to_upper/article/details/123969352

The Beautiful Soup Parsing Library

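Beautiful Soup chapter overview

Beautiful Soup provides several kinds of selectors, which are often more convenient and more flexible to use than XPath.

Main contents of this chapter:
    1. Basic concepts of Beautiful Soup
    2. Installing Beautiful Soup
    3. Basic usage of Beautiful Soup
    4. Node selectors
    5. Method selectors
    6. CSS selectors
    7. Hands-on example: using requests together with Beautiful Soup to fetch and analyze HTML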

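Side-by-side comparison of the Beautiful Soup parsers

------------------------------------------------------------------------------------------------------------------------------------------------------
Parser                    Usage                                Advantages                                 Disadvantages
Python standard library   BeautifulSoup(code, 'html.parser')   Moderate speed, good fault tolerance       Poor fault tolerance before Python 2.7.3 / 3.2.2
lxml HTML parser          BeautifulSoup(code, 'lxml')          Fast, good fault tolerance                 Requires a C library to be installed
lxml XML parser           BeautifulSoup(code, 'xml')           Fast, the only parser supporting XML       Requires a C library to be installed
html5lib                  BeautifulSoup(code, 'html5lib')      Most fault tolerant, parses documents      Slow parsing
                                                               the way a browser does, and generates
                                                               HTML5-format documents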
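
The table can be turned into a small fallback helper. The following is only a sketch: `make_soup` and its `preferred` order are hypothetical names chosen for illustration, not part of Beautiful Soup. It assumes that bs4 raises `FeatureNotFound` when the requested parser is not installed, and it falls back to the built-in `html.parser`, which needs no extra installation.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not available on the machine, try the next one
            continue
    raise RuntimeError("no usable parser found")


# The <p> tag is deliberately left unclosed; every parser repairs it
soup, used = make_soup("<p>hello")
print(used, soup.p.string)
```

Which parser name gets printed depends on what is installed locally, so the same script may report `lxml` on one machine and `html.parser` on another.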

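9.1 Getting text and attributes with bs

Create a BeautifulSoup object, specify the lxml parser through the second argument of the BeautifulSoup class, and read selected content from the HTML.

from bs4 import BeautifulSoup
# Define a piece of HTML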
html = '''
<html>
    <head><title>这是一个演示页面</title></head>
    <body>
        <a href='a.html'>第一页</a>
        <p>
        <a href='b.html'>第二页</a>
    </body>
</html>
'''
soup = BeautifulSoup(html,'lxml')
# Get the text of the <title> tag
print('<' + soup.title.string + '>')
# Get the href attribute of the first <a> tag
print('[' + soup.a["href"] + ']')
# Print the HTML in prettified (indented) form
print(soup.prettify())

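9.2 Node selection (selectnode)

Get the name of a node:
print(soup.title.name)
Get the attributes of a node:
All attributes: print(soup.li.attrs)
One attribute value: print(soup.li.attrs['value2'])
One attribute, shorthand form: print(soup.li['value2'])
Get the content of a node:
print(soup.a.string)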

from bs4 import BeautifulSoup
html = '''
<html>
<head>
    <meta charset="UTF-8">
    <title>Beautiful Soup演示</title>
</head>
<body>
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world"><a href="https://geekori.com"> geekori.com</a></li>
        <li class="item2"><a href="https://www.jd.com"> 京东商城</a></li>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item4" ><a href="https://www.microsoft.com">微软</a></li>
        <li class="item5"><a href="https://www.google.com">谷歌</a></li>
    </ul>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')
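# Get the name of the title node
print(soup.title.name)
# Get all attribute names and values of the first li node
print(soup.li.attrs)
# Get the value of the value2 attribute of the first li node
print(soup.li.attrs["value2"])
# Get the value of the value1 attribute of the first li node
print(soup.li["value1"])
# Get the href attribute of the first a node
print(soup.a['href'])
# Get the text content of the first a tag
print(soup.a.string)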
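
Two details of these accessors are easy to trip over, and the sketch below illustrates them on a made-up snippet (the `<b>` element is added purely for the demonstration). It uses the built-in `html.parser` so that it runs without lxml. The points shown: `.string` is `None` when a node has more than one child, `get_text()` joins all nested text, and `.get()` avoids a `KeyError` for a missing attribute.

```python
from bs4 import BeautifulSoup

html = ('<ul><li class="item1" value1="1234">'
        '<a href="https://geekori.com">geekori.com</a> <b>site</b></li></ul>')
soup = BeautifulSoup(html, "html.parser")
li = soup.li

# <li> has three children (a, a space, b), so .string cannot pick one
print(li.string)
# get_text() concatenates every piece of text inside the node
print(li.get_text())
# li["value2"] would raise KeyError; .get() returns None or a default instead
print(li.get("value2"))
print(li.get("value2", "n/a"))
# class can hold several values, so bs4 returns it as a list
print(li["class"])
```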

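9.3 allchildnodes: getting child and descendant nodes

Direct children: the contents attribute or the children attribute gives the direct child nodes of the current node.
contents returns a list; children returns an iterator over the same nodes (a list_iterator in the bs4 version used here; the exact type may differ between versions).
All descendants: the descendants attribute returns a generator.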

from bs4 import BeautifulSoup
html = '''
<html>
<head>
    <meta charset="UTF-8">
    <title>Beautiful Soup演示</title>
    <tag1><a><b></b></a></tag1>
</head>
<body>
<div>
    <ul>
        <li class="item1" value = "hello world">
            <a href="https://geekori.com"> 
                geekori.com
            </a>
        </li>
        <li class="item2"><a href="https://www.jd.com"> 京东商城</a></li>

    </ul>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')
# Print all direct children of head
print(soup.head.contents)
print(soup.head.children)
print(type(soup.head.contents))
print(type(soup.body.div.ul.children))
print(type(soup.head.descendants))
# Iterate over all direct children of ul and print each one as text
for i, child in enumerate(soup.body.div.ul.contents):
    print(i, child)
print("-----------------------------")
# enumerate is not used for children here, so a separate variable i keeps the index
i = 1
# Iterate over all direct children of ul and print each one as text
for child in soup.body.div.ul.children:
    print('<{}>'.format(i), child,end=" ")
    i += 1
# Iterate over all descendants of ul and print each one as text
for i, child in enumerate(soup.body.div.ul.descendants):
    print('[{}]'.format(i), child, end=" ")
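
The loops above also print the whitespace between tags, because line breaks are stored as text nodes. A common way to skip them is to filter by node type. The sketch below is one possible approach on a shortened, made-up version of the list, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """
<ul>
    <li class="item1"><a href="https://geekori.com">geekori.com</a></li>
    <li class="item2"><a href="https://www.jd.com">JD</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# Direct children that are real tags
tag_children = [child for child in ul.children if isinstance(child, Tag)]
# Direct children that are text nodes (here only whitespace)
text_children = [child for child in ul.children if isinstance(child, NavigableString)]
print(len(ul.contents), len(tag_children), len(text_children))

# descendants walks into nested nodes as well
tag_names = [node.name for node in ul.descendants if isinstance(node, Tag)]
print(tag_names)
```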

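9.5 parentnodes: parent nodes

Use parent to get the direct parent of a node and its attributes, and parents to iterate over all of its ancestors.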

from bs4 import BeautifulSoup
html = '''
<html>
<head>
    <meta charset="UTF-8">
    <title>Beautiful Soup演示</title>
    <tag1><xyz><b></b></xyz></tag1>
</head>
<body>
<div>
    <ul>
        <li class="item1" value = "hello world">
            <a href="https://geekori.com"> 
                geekori.com
            </a>
        </li>
        <li class="item2"><a href="https://www.jd.com"> 京东商城</a></li>

    </ul>
</div>
</body>
</html>
'''
soup = BeautifulSoup(html,'lxml')
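# Get the direct parent of the a node
print(soup.a.parent)
# Get the class attribute of the direct parent of the a node
print(soup.a.parent['class'])
# parents is a generator, so this prints a generator object
print(soup.a.parents)
# Print the tag names of all ancestors of the a node
for parent in soup.a.parents:
    print('<',parent.name,'>')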
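
As a small application of parents, the ancestor names can be joined into a readable path. `ancestor_path` is a hypothetical helper written for this illustration, and the HTML is a compressed stand-in for the page above. Note that the outermost ancestor is the BeautifulSoup object itself, whose name is a placeholder rather than a real tag.

```python
from bs4 import BeautifulSoup

html = ("<html><body><div><ul><li class='item1'>"
        "<a href='https://geekori.com'>geekori.com</a>"
        "</li></ul></div></body></html>")
soup = BeautifulSoup(html, "html.parser")


def ancestor_path(tag):
    """Join the names of all ancestors of tag, outermost first."""
    names = [parent.name for parent in tag.parents]
    return " > ".join(reversed(names))


path = ancestor_path(soup.a)
print(path)
```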

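9.6 sibling: sibling nodes

The next_sibling attribute gives the next sibling of the current node;
the previous_sibling attribute gives the previous sibling.

from bs4 import BeautifulSoup
html = '''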
<html>
<head>
    <meta charset="UTF-8">
    <title>Beautiful Soup演示</title>
</head>
<body>
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world">
            <a href="https://geekori.com"> geekori.com</a>
        </li>
        <li class="item2"><a href="https://www.jd.com"> 京东商城</a></li>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item4" ><a href="https://www.microsoft.com">微软</a></li>
        <li class="item5"><a href="https://www.google.com">谷歌</a></li>
    </ul>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html,'lxml')
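# Get the second li node; soup.li.next_sibling itself is a text node (containing the \n character)
secondli = soup.li.next_sibling.next_sibling
# Print the markup of the second li node
print('The li node after the first li node:',secondli)
# Get the previous li sibling of the second li node and print the value of its class attribute
print('class attribute of the li node before the second li node:',secondli.previous_sibling.previous_sibling['class'])
# Print all nodes after the second li node, including the text nodes that hold line breaks
for sibling in secondli.next_siblings:
    print(type(sibling))
    if str.strip(sibling.string) == "":
        print("line break")
    else:
        print(sibling)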
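
Chaining next_sibling twice works only while exactly one whitespace node sits between the tags. As an alternative sketch, find_next_sibling and find_previous_sibling take a tag name and skip text nodes on their own. The list below is a made-up, shortened example parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li class="item1">one</li>
    <li class="item2">two</li>
    <li class="item3">three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.li

# The immediate sibling is the whitespace between the two li tags
print(repr(first.next_sibling))
# Jump straight to the next li, whatever text lies in between
second = first.find_next_sibling("li")
print(second["class"])
# All following li siblings
following = [li.get_text() for li in first.find_next_siblings("li")]
print(following)
# And back again
print(second.find_previous_sibling("li")["class"])
```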

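9.7 find_all_name: finding nodes by name

The find family of methods is very practical; the author has used them many times in crawler projects.

The name parameter specifies the node name; find_all selects every node whose name equals the value of name.
It returns a bs4.element.ResultSet object, which is iterable, so each matching node can be obtained by iterating over it.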

from bs4 import BeautifulSoup
html = '''
<html>
<head>
    <meta charset="UTF-8">
    <title>Beautiful Soup演示</title>
</head>
<body>
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world"><a href="https://geekori.com"> geekori.com</a></li>
        <li class="item2"><a href="https://www.jd.com"> 京东商城</a></li>        
    </ul>
    <ul>
    <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item4" ><a href="https://www.microsoft.com">微软</a></li>
        <li class="item5"><a href="https://www.google.com">谷歌</a></li>
    </ul>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html,'lxml')
# Find all ul nodes
ulTags = soup.find_all(name='ul')
# Print the type of ulTags
print(type(ulTags))
# Iterate to get the Tag object of each ul node
for ulTag in ulTags:
    print(ulTag)
print("--------------------------------------------------------------------")
# Nested query: select all ul nodes, then select all li nodes under each ul node
for ulTag in ulTags:
    # Select all li nodes under the current ul node
    liTags = ulTag.find_all(name='li')
    for liTag in liTags:
        print(liTag)
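
Besides a single name, find_all accepts a list of names and the limit and recursive parameters. The sketch below shows these three options on a made-up snippet, parsed with the built-in `html.parser`; it is an illustration of the parameters, not an exhaustive list.

```python
from bs4 import BeautifulSoup

html = ("<div><ul><li><a href='#1'>one</a></li>"
        "<li><a href='#2'>two</a></li></ul><p>text</p></div>")
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, in document order
both = soup.find_all(["a", "p"])
print([tag.name for tag in both])

# limit stops the search after the given number of matches
first_li = soup.find_all("li", limit=1)
print(len(first_li))

# recursive=False looks only at direct children; li is nested inside ul,
# so nothing is found directly under div
direct = soup.div.find_all("li", recursive=False)
print(direct)
```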

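9.8 find_all_attrs: finding nodes by attribute

Nodes can be searched by attribute through the attrs parameter. In particular, when the class attribute is queried with a keyword argument, an underscore must be appended (class_), because class is a reserved word in Python.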

from bs4 import BeautifulSoup
html = '''
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world"><a href="https://geekori.com"> geekori.com</a></li>
        <li class="item"><a href="https://www.jd.com"> 京东商城</a></li>        
    </ul>
    <button id="button1">确定</button>
    <ul>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item" ><a href="https://www.microsoft.com">微软</a></li>
        <li class="item2"><a href="https://www.google.com">谷歌</a></li>
    </ul>
</div>

'''
soup = BeautifulSoup(html,'lxml')
# Find all nodes whose class attribute is item
tags = soup.find_all(attrs={"class":"item"})
for tag in tags:
    print(tag)
# Find all nodes whose class attribute is item2
tags = soup.find_all(class_='item2')
print(tags)
# Find all nodes whose id attribute is button1
tags = soup.find_all(id='button1')
print(tags)
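
Two cases are worth a separate look: a node with several class values, and attributes whose names are not valid Python identifiers (for example data-id), which can only be passed through attrs. The snippet below is a made-up example for illustration, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = ('<div><p class="item active" data-id="7">a</p>'
        '<p class="item">b</p><p class="other">c</p></div>')
soup = BeautifulSoup(html, "html.parser")

# class_ matches a node if any one of its class values equals "item"
by_class = soup.find_all(class_="item")
print([p.get_text() for p in by_class])

# A string with a space is compared with the whole attribute value,
# so the order of the class names matters here
exact = soup.find_all(class_="item active")
print([p.get_text() for p in exact])

# data-id contains a hyphen and cannot be a keyword argument
by_data = soup.find_all(attrs={"data-id": "7"})
print([p.get_text() for p in by_data])
```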

9.9 find_all_text: getting node content by partial text

from bs4 import BeautifulSoup
import re
html = '''
<div>
    <xyz>Hello World, what's this?</xyz>
    <button>Hello, my button. </button>
    <a href='https://geekori.com'>geekori.com</a>
</div>
'''
soup = BeautifulSoup(html,'lxml')
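# Search for text nodes whose text is exactly geekori.com
# (string= is the current name of the argument that older bs4 releases called text=)
tags = soup.find_all(string='geekori.com')
print(tags)
# Search for all text nodes whose text contains Hello
tags = soup.find_all(string=re.compile('Hello'))
print(tags)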
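
A text search on its own returns the matching strings, not the tags that contain them. The sketch below shows two ways to get back to the tags: through .parent, or by combining a tag name with the text filter. It uses the `string=` argument (named `text=` in older bs4 releases) and the built-in `html.parser`; the HTML mirrors the example above.

```python
import re

from bs4 import BeautifulSoup

html = ("<div><xyz>Hello World, what's this?</xyz>"
        "<button>Hello, my button. </button>"
        "<a href='https://geekori.com'>geekori.com</a></div>")
soup = BeautifulSoup(html, "html.parser")

# Each result is a text node; .parent is the tag that encloses it
strings = soup.find_all(string=re.compile("Hello"))
parent_names = [s.parent.name for s in strings]
print(parent_names)

# With a tag name plus string=, the tags themselves are returned
buttons = soup.find_all("button", string=re.compile("Hello"))
print(buttons)
```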

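9.10 The find method

Differences between find and find_all:
		1. find returns the first node that matches; find_all returns all matching nodes
	 	2. find_all returns a bs4.element.ResultSet object; find returns a bs4.element.Tag object (or None when nothing matches)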

from bs4 import BeautifulSoup

html = '''
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world">
              <a href="https://geekori.com"> geekori.com</a>
        </li>
        <li class="item"><a href="https://www.jd.com"> 京东商城</a></li>        
    </ul>
    <ul>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item" ><a href="https://www.microsoft.com">微软</a></li>
        <li class="item2"><a href="https://www.google.com">谷歌</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html,'lxml')
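# Find the first node whose class attribute is item
tags = soup.find(attrs={'class': 'item'})
print(type(tags))
print(tags)
print("--------------------------------------------------")
# Find all nodes whose class attribute is item
tags = soup.find_all(attrs={'class':"item"})
print(type(tags))
for each in tags:
    print(each)

# Some additional methods
# 1. find_parent and find_parents:                       return the parent node / all ancestor nodes
# 2. find_next_sibling and find_next_siblings:           return the following sibling node(s)
# 3. find_previous_sibling and find_previous_siblings:   return the preceding sibling node(s)
# 4. find_all_next and find_next:                        return matching nodes after the current node
# 5. find_previous and find_all_previous:                return matching nodes before the current node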
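
Because find returns None when nothing matches, the result should be checked before it is used. The sketch below shows that guard together with find_parent and find_parents on a made-up snippet (the id "menu" is invented for the example), parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = ("<div><ul id='menu'><li class='item'>"
        "<a href='https://www.jd.com'>JD</a></li></ul></div>")
soup = BeautifulSoup(html, "html.parser")

# No <table> in the document, so find returns None instead of raising
missing = soup.find("table")
print(missing)

# Guard before indexing, otherwise None["href"] would raise TypeError
link = soup.find("a")
href = link["href"] if link is not None else None
print(href)

# Closest ancestor with the given name
menu = link.find_parent("ul")
print(menu["id"])
# All ancestors, nearest first
ancestors = [parent.name for parent in link.find_parents()]
print(ancestors)
```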

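9.11 CSSSelector (CSS selectors)

Commonly used basic CSS selectors:
        1. .classname:   selects nodes whose style name is classname, that is, nodes whose class attribute is classname
        2. nodename:     selects nodes whose node name is nodename
        3. #idname:      selects the node whose id attribute is idname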

from bs4 import BeautifulSoup
html = '''
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world"><a href="https://geekori.com"> geekori.com</a></li>
        <li class="item"><a href="https://www.jd.com"> 京东商城</a></li>        
    </ul>
    <button id="button1">确定</button>
    <ul>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item" ><a href="https://www.microsoft.com">微软</a></li>
        <li class="item2"><a href="https://www.google.com">谷歌</a></li>
    </ul>
</div>
'''

soup = BeautifulSoup(html,'lxml')
# Select all nodes whose class attribute is item
tags = soup.select('.item')
for tag in tags:
    print(tag)
# Select all nodes whose id attribute is button1
tags = soup.select('#button1')
print(tags)
# Select all a nodes except the first two
tags = soup.select('a')[2:]
for tag in tags:
    print(tag)
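
The three basic selectors can be combined, and select_one returns only the first match (or None). The following sketch shows a child combinator and an attribute-prefix selector on a shortened, made-up version of the list; it assumes a bs4 release that bundles CSS selector support through the soupsieve package, and uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = """<div><ul>
<li class="item1"><a href="https://geekori.com">geekori.com</a></li>
<li class="item"><a href="https://www.jd.com">JD</a></li>
</ul><button id="button1">OK</button></div>"""
soup = BeautifulSoup(html, "html.parser")

# Node name + class + child combinator: an <a> directly inside li.item
first = soup.select_one("li.item > a")
print(first["href"])

# Attribute selector: href values that start with the given prefix
www_links = soup.select('a[href^="https://www."]')
print([a.get_text() for a in www_links])

# select_one returns None when nothing matches
print(soup.select_one("#missing"))
```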

9.12 css_selector_nest: combining CSS selectors with method selectors

from bs4 import BeautifulSoup
html = '''
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world"><a href="https://geekori.com"> geekori.com</a></li>
        <li class="item">
           <a href="https://www.jd.com"> 京东商城</a>
           <a href="https://www.google.com">谷歌</a>
        </li>        
    </ul>
    <ul>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item" ><a href="https://www.microsoft.com">微软</a></li>
    </ul>
</div>
'''

soup = BeautifulSoup(html,'lxml')
# Select all nodes whose class attribute is item
tags = soup.select('.item')
# select returns a list-like result whose elements are Tag objects
print(type(tags))
for tag in tags:
    # Use a CSS selector to select all a nodes inside the current node
    aTags = tag.select('a')
    for aTag in aTags:
        print(aTag)
print("----------------------------------------")
for tag in tags:
    # Use a method selector to select all a nodes inside the current node
    aTags = tag.find_all(name='a')
    for aTag in aTags:
        print(aTag)

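9.13 css_selector_value: getting node attributes

Use a CSS selector to select specific a nodes and get their href attribute value and text content.
The example reads the values of the <a> tags in two different ways; note the difference between them.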

from bs4 import BeautifulSoup
html = '''
<div>
    <ul>
        <li class="item1" value1="1234" value2 = "hello world">
            <a href="https://geekori.com"> geekori.com</a>
        </li>
        <li class="item">
           <a href="https://www.jd.com"> 京东商城</a>
           <a href="https://www.google.com">谷歌</a>
        </li>        
    </ul>
    <ul>
        <li class="item3"><a href="https://www.taobao.com">淘宝</a></li>
        <li class="item" ><a href="https://www.microsoft.com">微软</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html,'lxml')
tags = soup.select('.item')
print(type(tags))
for tag in tags:
    aTags = tag.select('a')
    for aTag in aTags:
        print(aTag['href'],aTag.get_text())
print("————————————————————————————————————————————")
for tag in tags:
    aTags = tag.find_all(name='a')
    for aTag in aTags:
        # Get the href attribute value and text content of the a node
        print(aTag.attrs['href'],aTag.string)
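
Link text in real pages often carries leading or trailing whitespace, as some of the links above do. One possible way to clean it while extracting is get_text(strip=True) or the stripped_strings generator. The snippet is a made-up reduction of the example, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html = ('<li class="item"><a href="https://www.jd.com"> JD </a>'
        '<a href="https://www.google.com">Google</a></li>')
soup = BeautifulSoup(html, "html.parser")

# (href, text) pairs with surrounding whitespace removed
pairs = [(a["href"], a.get_text(strip=True)) for a in soup.select(".item a")]
print(pairs)

# stripped_strings yields every non-empty text fragment, already stripped
fragments = list(soup.li.stripped_strings)
print(fragments)
```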

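9.14 css_selector_auto: obtaining the selector automatically

Use the requests library to fetch the HTML of the JD.com (京东商城) home page, and use CSS selectors to get the link text of the navigation bar.
1. Open the JD.com home page and locate the content to be crawled.
2. Press F12 to open the developer tools, right-click the element, and choose Copy selector.

The XPath can be copied automatically in the same way, so it does not need to be written by hand.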

import requests
from bs4 import BeautifulSoup
# Set the request headers (User-Agent)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
}
# Fetch the HTML of the JD.com home page
# (headers must be passed by keyword; the second positional argument of requests.get is params)
result = requests.get('https://www.jd.com',headers=headers)
soup = BeautifulSoup(result.text,'lxml')
# Select the a node of the "秒杀" (flash sale) entry
aTag = soup.select('#navitems-group1 > li.fore1 > a')
print(aTag)
# Print the text content and href attribute of the "秒杀" a node
print(aTag[0].string,aTag[0]['href'])
print("----------------------------------------")
# Select the first ul node
group1 = soup.select('#navitems-group1')
# Select the second ul node
group2 = soup.select('#navitems-group2')
# Select the third ul node
group3 = soup.select('#navitems-group3')

# Get all a nodes in the first ul node
for value in group1:
    aTags = value.find_all(name='a')
    # Print the text content of each a node
    for each in aTags:
        print(each.string)

# Get all a nodes in the second ul node
for value in group2:
    aTags = value.find_all(name='a')
    # Print the text content of each a node
    for each in aTags:
        print(each.string)

# Get all a nodes in the third ul node
for value in group3:
    aTags = value.find_all(name='a')
    # Print the text content of each a node
    for each in aTags:
        print(each.string)
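
The three loops can also be written as one selector group. Since the live page may change or block scripted requests, the sketch below runs on a made-up HTML fragment that only imitates the id and class names used above; the link texts and the example.com URLs are invented. It uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the navigation bar, not the real page content
html = """
<ul id="navitems-group1">
  <li class="fore1"><a href="//flash.example.com">Flash sale</a></li>
  <li class="fore2"><a href="//coupon.example.com">Coupons</a></li>
</ul>
<ul id="navitems-group2"><li class="fore1"><a href="//market.example.com">Supermarket</a></li></ul>
<ul id="navitems-group3"><li class="fore1"><a href="//auction.example.com">Auction</a></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A comma joins several selectors; results come back in document order
links = soup.select("#navitems-group1 a, #navitems-group2 a, #navitems-group3 a")
texts = [a.get_text(strip=True) for a in links]
for a in links:
    print(a.get_text(strip=True), a["href"])
```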

9.15 Hands-on example: scraping the Kugou Music chart

Step analysis:
	 1. Analyze the URL: https://www.kugou.com/yy/rank/home/1-23784.html
                ...                                   2-23784.html
	 2. Content to extract: rank, song title, singer, duration

from bs4 import BeautifulSoup
import requests
import time
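# This example works without request headers (but adding them is better)
# Fetch the HTML of one page of the online hit-songs chart and extract the fields of interest
def get_info(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text,'lxml')
    # Song rank
    ranks = soup.select('span.pc_temp_num')
    # Song title (the text also contains the singer name)
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    # Song duration
    durations = soup.select('span.pc_temp_tips_r > span')
    # The loop variable is named duration so that it does not shadow the time module
    for rank,title,duration in zip(ranks,titles,durations):
        data = {
            'rank':rank.get_text().strip(),
            # The singer name is part of the title text, so string operations are enough to extract it
            'singer':title.get_text().split('-')[0].strip(),
            'song':title.get_text().split('-')[1].strip(),
            'time':duration.get_text().strip()
        }
        print(data)

if __name__ == '__main__':
    # Build the URLs of pages 1 to 10
    urls = ['https://www.kugou.com/yy/rank/home/{}-23784.html'.format(i) for i in range(1,11)]
    for url in urls:
        get_info(url)
        # Pause between requests to avoid putting load on the server
        time.sleep(1)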
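
The parsing step can be tried offline by separating it from the download. In the sketch below, `parse_rank_page` and `SAMPLE` are hypothetical: the HTML is invented to match the selectors used above and is not the real page, and the " - " separator between singer and song is an assumption. str.partition is used so that a hyphen inside a song title does not break the split.

```python
from bs4 import BeautifulSoup

# Invented sample that imitates the structure the selectors expect
SAMPLE = """
<div class="pc_temp_songlist"><ul>
<li><span class="pc_temp_num"> 1 </span><a href="#">Singer A - Song One</a>
    <span class="pc_temp_tips_r"><span> 3:58 </span></span></li>
<li><span class="pc_temp_num"> 2 </span><a href="#">Singer B - Song-Two</a>
    <span class="pc_temp_tips_r"><span> 4:12 </span></span></li>
</ul></div>
"""


def parse_rank_page(html):
    """Return one dict per song found in the given chart HTML."""
    soup = BeautifulSoup(html, "html.parser")
    ranks = soup.select("span.pc_temp_num")
    titles = soup.select("div.pc_temp_songlist > ul > li > a")
    durations = soup.select("span.pc_temp_tips_r > span")
    rows = []
    for rank, title, duration in zip(ranks, titles, durations):
        # Split only at the first " - " (assumed separator)
        singer, _, song = title.get_text(strip=True).partition(" - ")
        rows.append({
            "rank": rank.get_text(strip=True),
            "singer": singer,
            "song": song,
            "time": duration.get_text(strip=True),
        })
    return rows


rows = parse_rank_page(SAMPLE)
for row in rows:
    print(row)
```

Keeping the parser free of network calls makes it possible to check the selectors against a saved copy of a page before running the full crawl.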
