Python Full-Stack Development - Python Web Scraping - 06: The Beautiful Soup Parsing Library in Detail

落空空。

Categories: Python basics, python. Tags: python, web scraping

Article link: https://blog.csdn.net/kc44601/article/details/118281140

BeautifulSoup in Detail

1. Introduction

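BeautifulSoup is a Python library for parsing web pages: it extracts data from HTML or XML files.

It supports several parsers, covering HTML, XML, and HTML5.

It is flexible and convenient, which makes it a staple tool for web scraping.

With it, you can extract information from a page without writing regular expressions.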

2. Parsers

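Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses documents the way a browser does; produces HTML5-format documents	Slow; pure Python (no C extension needed), but the html5lib package must be installed separately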
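The parser is selected by the second argument to BeautifulSoup. A minimal sketch of falling back between the parsers in the table, assuming a hypothetical helper named make_soup (it is not part of bs4) and an arbitrary preference order:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed.

    make_soup is a hypothetical helper for illustration, not a bs4 API.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p class='title'>Hello</p>")
print(soup.p.string)  # Hello
```

Because html.parser ships with Python, the last entry in the default tuple should always be available.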

3. Installation

pip install BeautifulSoup4
pip install lxml

4. Basic Usage

4.1 Tag Selectors

4.1.1 `.string`: get the text content

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
"""

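# 1. Import the class
from bs4 import BeautifulSoup
# 2. Create the soup object
soup = BeautifulSoup(html, 'lxml')  # argument 1: the markup to parse; argument 2: the parser

# print(soup.prettify())  # pretty-print the tree; the parser has already filled in the missing closing tags

# Attribute-style access returns the tag itself together with everything inside it
print(soup.head)  # the whole <head> element, including the <head> tag
print(soup.p)  # only the first matching <p>

print(soup.title.string)  # title is a node; .string is an attribute that returns its text
print(soup.html.head.title.string)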

Output:

<head>
<title>The Dormouse's story</title>
</head>
<p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p>
The Dormouse's story
The Dormouse's story
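Two details of `.string` are easy to miss in the sample above. A small sketch, using a trimmed copy of the same markup and html.parser (chosen here only to avoid extra installs):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = """
<p class="title"><b><span>The Dormouse's story</span></b></p>
<a id="link1"><!-- Elsie --></a>
<a id="link2">Lacie</a>
"""
soup = BeautifulSoup(markup, "html.parser")

# <p> has a single chain of children (p > b > span > text),
# so .string reaches through the chain to the text.
title_text = soup.p.string

# The first <a> contains only an HTML comment; .string returns a Comment object.
first_link = soup.a.string
is_comment = isinstance(first_link, Comment)

print(title_text)   # The Dormouse's story
print(is_comment)   # True
```

Checking for Comment before using the value keeps commented-out markup from being mistaken for visible text.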

4.1.2 `.name`: get the tag's own name

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.title.name)  # the tag's own name --> title
print(soup.p.name)  # --> p

Output:

title
p

4.1.3 `.attrs`: get attribute values

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title asdas" name="abc" id = "qwe"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/123" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

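print(soup.p.attrs['name'])  # value of the name attribute on the first <p>

print(soup.p.attrs['id'])  # value of the id attribute on the first <p>

print(soup.p['id'])  # shorthand for the same lookup

print(soup.p['class'])  # class is multi-valued, so it comes back as a list

print(soup.a['href'])  # soup.a is the first <a>, so this is only the first link's href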

Output:

abc
qwe
qwe
['title', 'asdas']
http://example.com/123
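On real pages an attribute is often missing, and tag['attr'] then raises KeyError. A short sketch of the safer lookups, using the same p tag as above and html.parser (my choice, to keep the example dependency-free):

```python
from bs4 import BeautifulSoup

markup = '<p class="title asdas" name="abc" id="qwe"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

# .attrs is a dict holding every attribute on the tag.
print(p.attrs)

# tag.get() returns None (or a default you supply) instead of raising KeyError.
missing = p.get("href")
fallback = p.get("href", "no-link")

# has_attr() checks for presence without reading the value.
has_id = p.has_attr("id")

print(missing, fallback, has_id)  # None no-link True
```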

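4.2 Nested Selection

Child and descendant nodes

Chained access such as soup.p.a only works when each tag is nested inside the previous one.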

html = """
<html>
    <head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The abc Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.b)  # searches down through the tree, level by level, for the first <b>

Output:

<b>The abc Dormouse's story</b>

4.2.1 `.contents`: get a tag's child nodes

`.contents` returns a tag's direct children as a list.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

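# Attribute-style access returns only the first match, not every child. How do we get them all?
print(soup.p.a)
print("-----"*20)
# .contents returns the tag's direct children as a list
print(soup.p.contents)  # <a> is a child of <p>; this lists every direct child of the first <p>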
print("-----"*20)
for i in soup.p.contents:
    print(i)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
----------------------------------------------------------------------------------------------------
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
----------------------------------------------------------------------------------------------------

            Once upon a time there were three little sisters; and their names were
            
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>


<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 
            and
            
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

            and they lived at the bottom of a well.
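As the output shows, `.contents` mixes tags with whitespace-only text nodes. A minimal sketch of separating the two, on a shortened version of the markup and with html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = """
<p class="story">
    Once upon a time
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
    and
    <a id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(markup, "html.parser")

# Keep only element children, skipping text nodes.
child_tags = [child for child in soup.p.contents if isinstance(child, Tag)]
child_ids = [tag["id"] for tag in child_tags]

# Keep only non-blank text children, stripped of surrounding whitespace.
child_texts = [
    child.strip()
    for child in soup.p.contents
    if not isinstance(child, Tag) and child.strip()
]

print(child_ids)    # ['link1', 'link2', 'link3']
print(child_texts)  # ['Once upon a time', 'and']
```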

4.2.2 `.children`: get child nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# .children is an iterator over the same direct children that .contents holds
print(soup.p.children)  # an iterator, not a list

# for i in soup.p.children:
#     print(i)

print("----------------------"*5)    

# enumerate() pairs each item of an iterable with its index,
# so a for loop can print the position alongside the node
for i, child in enumerate(soup.p.children):  
    print(i, child)

Output:

<list_iterator object at 0x000001FEC6D6CAF0>
--------------------------------------------------------------------------------------------------------------
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

4.2.3 `.descendants`: get descendant nodes

`.descendants` returns a generator over every node below the tag, at any depth.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator, which is a special kind of iterator
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:

<generator object Tag.descendants at 0x000001FEC6D3D040>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
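The difference between `.children` and `.descendants` is easier to see on a one-line document with no indentation. A small sketch (the markup and the html.parser choice are mine, for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = '<p>one <a id="link1"><span>Elsie</span></a> two <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# Direct children only: two text nodes and two <a> tags.
children = list(soup.p.children)

# Every node below <p>, at any depth, in document order.
descendants = list(soup.p.descendants)

descendant_tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(len(children), len(descendants), descendant_tag_names)  # 4 7 ['a', 'span', 'a']
```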

Parent and ancestor nodes

4.2.4 `.parent`: get the parent node

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # .parent is the node that directly contains this tag

Output:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

4.2.5 `.parents`: get ancestor nodes

`.parents` returns a generator over all ancestors, from the nearest outward.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# print(soup.a.parents)  # a generator, which is a special kind of iterator

# print(list(soup.a.parents))  # list() accepts any iterable and returns a list of its items

for i, child in enumerate(soup.a.parents):
    print(i, child)

Output:

0 <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
1 <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>
2 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>
3 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>
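Printing only the name of each ancestor makes the chain much easier to read, and it explains why the last two entries above look identical: the final ancestor is the BeautifulSoup object itself, whose name is "[document]". A sketch on a compact document, parsed with html.parser as an assumption:

```python
from bs4 import BeautifulSoup

markup = "<html><body><p class='story'><a id='link1'><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

span = soup.span
# Walk outward from <span>, collecting the name of each ancestor.
ancestor_names = [ancestor.name for ancestor in span.parents]
print(ancestor_names)  # ['a', 'p', 'body', 'html', '[document]']
```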

Sibling nodes

4.2.6 `.next_siblings`: get the siblings that follow

`.previous_siblings`: get the siblings that come before

Both return a generator.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            <span>abcqweasd</span>
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.a.next_siblings)
print(list(enumerate(soup.a.next_siblings)))  # every sibling after the first <a>
print('---'*15)
print(list(enumerate(soup.a.previous_siblings)))  # every sibling before it, nearest first

Output:

<generator object PageElement.next_siblings at 0x000001FEC6D3D190>
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
---------------------------------------------
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            '), (1, <span>abcqweasd</span>), (2, '\n')]
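Because whitespace between tags counts as a sibling, `.next_sibling` rarely lands on the next tag directly. A minimal sketch of skipping to tag siblings with find_next_sibling() and find_next_siblings(), on shortened markup and with html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

markup = """
<p class="story">
    <a id="link1">Elsie</a>
    <a id="link2">Lacie</a>
    and
    <a id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

# .next_sibling is the very next node, which here is whitespace text, not a tag.
raw_next = first.next_sibling

# find_next_sibling() and find_next_siblings() skip ahead to matching tags.
next_tag = first.find_next_sibling("a")
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(raw_next.strip()))  # ''
print(next_tag["id"])          # link2
print(later_ids)               # ['link2', 'link3']
```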

4.3 Standard Selectors

find_all( name , attrs , recursive , text , **kwargs )

Searches the document by tag name, attributes, or text content.

4.3.1 `find_all()`: search by tag name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo-2</li>
            <li class="element">Bar-2</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# print(soup.find_all('ul'))  # every <ul> tag together with its contents
print(soup.find_all('div'))
# print(soup.find_all('ul')[0])

Output:

[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>
</div>, <div class="panel-heading">
<h4>Hello</h4>
</div>, <div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>]
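The output above contains nested copies of the same div elements because find_all() searches every level. A sketch of three ways to narrow the search (a list of names, limit, and recursive=False), on a condensed version of the markup and with html.parser as my assumed parser:

```python
from bs4 import BeautifulSoup

markup = """
<div class="panel">
    <div class="panel-heading"><h4>Hello</h4></div>
    <div class="panel-body">
        <ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
        <ul id="list-2"><li>Foo-2</li><li>Bar-2</li></ul>
    </div>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# A list of names matches any of them, in document order.
mixed = [tag.name for tag in soup.find_all(["h4", "ul"])]

# limit stops the search after that many matches.
first_two = [li.string for li in soup.find_all("li", limit=2)]

# recursive=False looks at direct children only.
panel = soup.find("div", class_="panel")
direct_divs = panel.find_all("div", recursive=False)
direct_classes = [div["class"][0] for div in direct_divs]

print(mixed)           # ['h4', 'ul', 'ul']
print(first_two)       # ['Foo', 'Bar']
print(direct_classes)  # ['panel-heading', 'panel-body']
```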

4.3.2 `.string`: get the text value

for ul in soup.find_all('ul'):
#     print(ul)
    for i in ul.find_all("li"):
#         print(i)
        print(i.string)

Output:

Foo
Bar
Jay
Foo-2
Bar-2

4.3.3 `get_text()`: get the text content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element2">Foo</li>
            <li class="element2">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

for ul in soup.find_all('ul'):
#     print(ul)
    for i in ul.find_all('li'):
#         print(i)
#         print(i.string)
        print(i.get_text())  # .string returns None when a tag has more than one child; get_text() still returns the text

Output:

Foo
Bar
Jay
Foo
Bar
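In the sample above every li holds a single text node, so `.string` and get_text() give the same result. A sketch of a case where they differ, using a made-up li with mixed content and html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

markup = '<li class="element">Foo <b>Bar</b> Jay</li>'
soup = BeautifulSoup(markup, "html.parser")
li = soup.li

# <li> has three children (text, <b>, text), so .string cannot pick one.
single = li.string

# get_text() joins every piece of text below the tag.
plain = li.get_text()

# separator and strip control how the pieces are joined.
joined = li.get_text(separator="|", strip=True)

print(single)  # None
print(plain)   # Foo Bar Jay
print(joined)  # Foo|Bar|Jay
```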

4.3.4 `find_all()`: search by attribute

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

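# Option 1: pass the attributes through attrs
# Syntax: attrs={'attribute': 'value'}
print(soup.find_all(attrs={'id': 'list-1'}))  # match on the id attribute

# print("-----"*10)
# print(soup.find_all(attrs={'name': 'elements'}))  # match on the name attribute
# print("-----"*10)


# for ul in soup.find_all(attrs={'name': 'elements'}):
#     print(ul)  # each matching tag in turn
#     print(ul.li.string)  # only the first <li>: attribute-style access stops at the first match
#     print('-----')
#     for li in ul:
#         print(li)  # the direct children of <ul>, whitespace text nodes included
#         print(li.string)



# Option 2: pass the attribute as a keyword argument
# Syntax: (attribute='value')
# print(soup.find_all(id='list-1'))

# Special case: the class attribute
# print(soup.find_all(class='element'))  # wrong: class is a Python keyword, so this is a SyntaxError
# print(soup.find_all(class_='element'))  # use class_ with a trailing underscore instead

# Option 3 (recommended): give both the tag name and the attributes as a dict
print(soup.find_all('li', {'class': 'element'}))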

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
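Attribute filters accept more than exact strings. A brief sketch of three cases that come up often: matching one class out of several, matching with a regular expression, and matching an attribute that is literally called name. The markup is a shortened copy of the sample, and html.parser is an assumed parser:

```python
import re

from bs4 import BeautifulSoup

markup = """
<ul class="list" id="list-1" name="elements"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(markup, "html.parser")

# class_ matches any single class token, so both <ul> tags match "list".
by_class = [ul["id"] for ul in soup.find_all("ul", class_="list")]

# A compiled regular expression is searched against the attribute value.
by_regex = [ul["id"] for ul in soup.find_all(id=re.compile(r"^list-\d+$"))]

# `name` is find_all()'s first parameter (the tag name), so an attribute that is
# literally called "name" has to be passed through attrs.
by_name_attr = [ul["id"] for ul in soup.find_all(attrs={"name": "elements"})]

print(by_class)      # ['list-1', 'list-2']
print(by_regex)      # ['list-1', 'list-2']
print(by_name_attr)  # ['list-1']
```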

4.3.5 `text=`: select by text content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

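# Syntax: text='the exact text to find'
# (Beautiful Soup 4.4.0 and later also accept this argument under the name string)
print(soup.find_all(text='Foo'))  # returns the matching text nodes, handy for counting occurrences
print(soup.find_all(text='Bar'))

print(len(soup.find_all(text='Foo')))  # count the matches

Output: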

['Foo', 'Foo']
['Bar', 'Bar']
2
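A text filter returns text nodes rather than tags, which is why the output above is a list of bare strings. A sketch of getting from those strings back to tags, assuming Beautiful Soup 4.4.0 or later (where the filter is spelled `string`) and html.parser as the parser:

```python
import re

from bs4 import BeautifulSoup

markup = "<ul><li>Foo</li><li>Bar</li><li>Foo-2</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# With only a string filter, find_all() returns text nodes, not tags.
exact = soup.find_all(string="Foo")

# A regular expression matches partial text.
partial = soup.find_all(string=re.compile("Foo"))

# .parent steps from a matched text node back to its enclosing tag.
parent_names = [text.parent.name for text in partial]

# Combining a tag name with string= returns the tags themselves.
tags = soup.find_all("li", string=re.compile("^Foo"))

print(exact)                            # ['Foo']
print(partial)                          # ['Foo', 'Foo-2']
print(parent_names)                     # ['li', 'li']
print([tag.get_text() for tag in tags]) # ['Foo', 'Foo-2']
```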

4.3.6 `find( name , attrs , recursive , text , **kwargs )`

find() returns the first matching element; find_all() returns every match.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))  # only the first match
print('---------'*5)
print(soup.find('li'))
print('---------'*5)
print(soup.find('page'))  # returns None when no such tag exists

Output:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
---------------------------------------------
<li class="element">Foo</li>
---------------------------------------------
None
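Since find() returns None for a missing tag, calling a method on the result without checking raises AttributeError. A minimal sketch of a guard, where text_of is a hypothetical helper written for this example (not a bs4 function) and html.parser is an assumed parser:

```python
from bs4 import BeautifulSoup

markup = '<ul id="list-1"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(markup, "html.parser")


def text_of(soup, name, default=""):
    """Return the stripped text of the first `name` tag, or `default` if absent.

    text_of is a hypothetical helper for illustration, not a bs4 API.
    """
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


found = text_of(soup, "li")
absent = text_of(soup, "page", "n/a")
print(found)   # Foo
print(absent)  # n/a
```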

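Summary

1. find_parents()
2. find_parent()
Difference: find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor (the direct parent when no filter is given).

3. find_next_siblings()
4. find_next_sibling()
Difference: find_next_siblings() returns all matching siblings that follow; find_next_sibling() returns the first one that follows.

5. find_previous_siblings()
6. find_previous_sibling()
Difference: find_previous_siblings() returns all matching siblings that come before; find_previous_sibling() returns the nearest one that comes before.

7. find_all_next()
8. find_next()
Difference: find_all_next() returns every matching node after the current node; find_next() returns the first matching node after it.

9. find_all_previous()
10. find_previous()
Difference: find_all_previous() returns every matching node before the current node; find_previous() returns the nearest matching node before it.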
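The methods in this list can be compared side by side from a single starting node. A sketch that starts at a middle li (the id "mid" and the markup are invented for illustration, and html.parser is an assumed parser):

```python
from bs4 import BeautifulSoup

markup = """
<div class="panel-body">
    <ul id="list-1"><li>Foo</li><li id="mid">Bar</li><li>Jay</li></ul>
    <ul id="list-2"><li>Foo-2</li></ul>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")
mid = soup.find("li", id="mid")

# Nearest matching ancestor, then all matching ancestors (nearest first).
parent_ul = mid.find_parent("ul")["id"]
ancestors = [tag.name for tag in mid.find_parents(["ul", "div"])]

# Nearest matching sibling on each side.
next_li = mid.find_next_sibling("li").string
prev_li = mid.find_previous_sibling("li").string

# Matching nodes anywhere later or earlier in the document, not only siblings.
later_lis = [li.string for li in mid.find_all_next("li")]
earlier_lis = [li.string for li in mid.find_all_previous("li")]

print(parent_ul, ancestors)    # list-1 ['ul', 'div']
print(prev_li, next_li)        # Foo Jay
print(later_lis, earlier_lis)  # ['Jay', 'Foo-2'] ['Foo']
```

Note that find_all_next() crosses into the second ul, while find_next_siblings() would stay inside the first.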

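4.4 CSS Selectors

Overview:

1. Class selector -- class
2. Tag selector -- <p></p>
3. ID selector -- id

For more detail, see a CSS selector reference.

Usage:

Pass a CSS selector string to select() to make the selection.

This approach is worth considering if you already know CSS selectors well.

Notes:

1. In a CSS selector, a tag name is written bare, a class name is prefixed with ".", and an id is prefixed with "#".

2. The method is soup.select(), and it returns a list.

3. Selectors separated by a space are applied from left to right: each one is matched among the descendants of the elements matched before it.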

html='''
<div class="pan">q321312321</div>
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Select by tag name: tag names are written bare, and conditions are separated by spaces
print(soup.select('ul li'))
print("----"*10)

# Prefix a class name with .
print(soup.select('.panel'))
print("----"*10)
# Conditions are separated by spaces
print(soup.select('.panel .panel-heading'))
print("----"*10)

# Selector types can be mixed
# For example, combine an id with a class
print(soup.select('#list-1 .element'))  # select() returns every element that matches
# print("----"*10)

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
----------------------------------------
[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]
----------------------------------------
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
----------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
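Beyond the space-separated form shown above, select() accepts other standard CSS syntax. A sketch of a few common variations, on condensed markup with html.parser as an assumed parser (CSS support in recent bs4 releases comes from the soupsieve package, which is normally installed alongside it):

```python
from bs4 import BeautifulSoup

markup = """
<div class="panel">
    <div class="panel-body">
        <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
        <ul class="list list-small" id="list-2"><li class="element">Foo-2</li></ul>
    </div>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_li = soup.select_one("#list-1 .element").get_text()

# "A > B" matches only direct children; "A B" matches any descendant.
direct_uls = [ul["id"] for ul in soup.select(".panel-body > ul")]
none_direct = soup.select(".panel > ul")

# Chained classes (no space) require both classes on the same element.
small = [ul["id"] for ul in soup.select("ul.list.list-small")]

# Attribute selectors work as they do in CSS.
by_attr = [ul["id"] for ul in soup.select('ul[id="list-2"]')]

print(first_li)     # Foo
print(direct_uls)   # ['list-1', 'list-2']
print(none_direct)  # []
print(small)        # ['list-2']
print(by_attr)      # ['list-2']
```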

Getting attribute values

Two ways to write it:

1. ul['id']

2. ul.attrs['id']

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

for ul in soup.select('ul'):
#     print(ul)
#     print(ul['id'])
#     print(ul['class']) 

    print(ul.attrs['id'])
    print(ul.attrs['class']) 

# The commented-out lines and the attrs lines are two spellings of the same lookups

Output:

list-1
['list']
list-2
['list', 'list-small']
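In a scraper the attribute and text lookups usually end up in one data structure. A minimal sketch that collects each ul into a dict; the record layout (id, classes, items) is an assumption made for this example, and html.parser is an assumed parser:

```python
from bs4 import BeautifulSoup

markup = """
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li></ul>
<ul class="list list-small" id="list-2"><li>Foo-2</li></ul>
"""
soup = BeautifulSoup(markup, "html.parser")

# One dict per <ul>: its id, its class list, and the text of each <li>.
records = [
    {
        "id": ul.get("id"),
        "classes": ul.get("class", []),
        "items": [li.get_text(strip=True) for li in ul.select("li")],
    }
    for ul in soup.select("ul")
]

for record in records:
    print(record)
```

Using ul.get() here rather than ul['id'] keeps the loop from failing on a list that has no id.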

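Summary

The lxml parser is the recommended choice.
Attribute-style tag selection (soup.p, soup.a) is fast but has weak filtering.
Use find() to match a single result and find_all() to match several.
If you are comfortable with CSS selectors, use select().
Remember the common ways to read attributes and text values.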
