Python Full-Stack Development - Python Web Scraping - 06: The Beautiful Soup Parsing Library in Detail

BeautifulSoup in Detail

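1. Introduction

BeautifulSoup is an efficient web page parsing library that extracts data from HTML or XML files.

It supports several parsers, covering HTML, XML, and HTML5.

This makes it a powerful tool for web scraping.

It is flexible and convenient, processes documents efficiently, and lets you choose the parser that suits the job.

With it you can extract information from a page without writing regular expressions.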
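
As a rough illustration of that last point, the sketch below pulls every link out of a small HTML fragment without a single regular expression. The fragment is made up for this example, and the built-in html.parser is used only so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# A made-up fragment, used only for illustration
html = '<p><a href="/a">first</a> and <a href="/b">second</a></p>'

soup = BeautifulSoup(html, "html.parser")

# find_all("a") returns every <a> tag; ["href"] reads the attribute value
links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['/a', '/b']
```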

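2. Parsers

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; lenient with malformed documents | Less lenient in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; lenient with malformed documents | Requires an external C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires an external C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires an external Python package |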
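
To make "lenient with malformed documents" concrete, here is a minimal sketch that feeds deliberately broken markup to the built-in parser. The input string is invented for the example; other parsers would repair the same input differently.

```python
from bs4 import BeautifulSoup

# Malformed on purpose: <a> is never closed and </p> has no opening tag
broken = "<a></p>"

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)
print(repaired)  # <a></a>
```

Swapping the second argument for "lxml" or "html5lib" (when those packages are installed) will generally produce a different tree, for example one wrapped in html and body tags, so it is worth naming the parser explicitly.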

3. Installation

pip install BeautifulSoup4
pip install lxml
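
A quick way to check the installation is sketched below. It assumes only that bs4 imports; lxml is treated as optional, and the fallback to html.parser is a choice made for this example rather than something the library does on its own in this form.

```python
import bs4

print(bs4.__version__)

# lxml is optional: fall back to the built-in parser when it is missing
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup = bs4.BeautifulSoup("<p>ok</p>", parser)
text = soup.p.string
print(parser, text)
```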

4. Basic Usage

4.1 Tag Selectors

4.1.1 .string: get the text content
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
"""

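# 1. Import the class
from bs4 import BeautifulSoup
# 2. Create the soup object
soup = BeautifulSoup(html, 'lxml')  # argument 1: the markup to parse; argument 2: the parser

# print(soup.prettify())  # pretty-print the tree; missing closing tags are filled in

# Selecting by tag name returns the tag itself together with everything inside it
print(soup.head)  # the whole <head> element, including the tag itself
print(soup.p)  # only the first match is returned

print(soup.title.string)  # title is a node; .string is an attribute that returns its text
print(soup.html.head.title.string)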

Output:

<head>
<title>The Dormouse's story</title>
</head>
<p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p>
The Dormouse's story
The Dormouse's story
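
.string only returns text when a tag has exactly one child. The sketch below, on a made-up fragment, shows the two cases that tend to surprise: a tag with mixed content gives None, and a tag whose only child is a comment gives a Comment object rather than ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Invented markup for illustration
html = '<p><b>bold</b> and plain</p><a href="#"><!-- Elsie --></a>'
soup = BeautifulSoup(html, "html.parser")

single = soup.b.string   # exactly one child string
mixed = soup.p.string    # a tag plus a string: ambiguous, so None
hidden = soup.a.string   # the only child is a comment

print(single)                       # bold
print(mixed)                        # None
print(isinstance(hidden, Comment))  # True
```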

4.1.2 .name: get the tag's name

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.title.name)  # the tag's own name --> title
print(soup.p.name)  # --> p

Output:

title
p

4.1.3 .attrs: read attribute values

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title asdas" name="abc" id = "qwe"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/123" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

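print(soup.p.attrs['name'])  # value of the name attribute of the first <p>

print(soup.p.attrs['id'])  # value of the id attribute of the first <p>

print(soup.p['id'])  # shorthand for the same lookup

print(soup.p['class'])  # class is multi-valued, so it comes back as a list

print(soup.a['href'])  # again, only the first <a> is used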

Output:

abc
qwe
qwe
['title', 'asdas']
http://example.com/123
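
Subscript access raises an error when the attribute is absent. A minimal sketch of the safer .get() lookup follows; the markup and the "n/a" default are made up for the example.

```python
from bs4 import BeautifulSoup

html = '<p class="title asdas" id="qwe">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute -> list
missing = p.get("name")          # absent attribute -> None, no exception
fallback = p.get("name", "n/a")  # optional default, as with dict.get

print(classes, missing, fallback)

try:
    p["name"]                    # subscript access raises for a missing attribute
    raised = False
except KeyError:
    raised = True
print(raised)
```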

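4.2 Nested Selection

Child and descendant nodes

Tag access can be chained, but each step must be a descendant of the previous one.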

html = """
<html>
    <head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The abc Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.b)  # searches down through the tree, level by level

Output:

<b>The abc Dormouse's story</b>
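
The sketch below shows chained access on an invented document, along with what happens when a step in the chain does not exist. The explicit None check is one way to avoid an AttributeError, not the only one.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Story</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

chained = soup.body.p.b.string  # each step searches inside the previous tag
print(chained)                  # bold

missing = soup.body.table       # there is no <table> in this document
print(missing)                  # None

# Chaining past a missing tag would fail, so check for None first
cell = missing.td if missing is not None else None
print(cell)                     # None
```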

4.2.1 .contents: get a tag's child nodes

.contents returns the tag's direct children as a list.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

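# A tag selector returns only the first match, not everything. How do we get the rest?
print(soup.p.a)
print("-----"*20)
# .contents returns the tag's direct children as a list
print(soup.p.contents)  # <a> is a child of <p>; this lists every child of the first <p>
print("-----"*20)
for i in soup.p.contents:
    print(i)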

Output:

<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
----------------------------------------------------------------------------------------------------
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
----------------------------------------------------------------------------------------------------

            Once upon a time there were three little sisters; and their names were
            
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>


<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 
            and
            
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

            and they lived at the bottom of a well.

4.2.2 .children: iterate over child nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# .children is an iterator over the same nodes as .contents
print(soup.p.children)  # prints the iterator object, not the nodes

# for i in soup.p.children:
#     print(i)

print("----------------------"*5)

# enumerate() pairs each item of an iterable with its index,
# which is handy in a for loop
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:

<list_iterator object at 0x000001FEC6D6CAF0>
--------------------------------------------------------------------------------------------------------------
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
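
As the output shows, the children include whitespace strings as well as tags. One possible way to keep only the tags is an isinstance check, sketched here on a shortened, made-up version of the paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">Once upon a time
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.p.children)  # strings and tags alike
tag_children = [c for c in soup.p.children if isinstance(c, Tag)]

print(len(all_children))                # 5
print([t["id"] for t in tag_children])  # ['link1', 'link2']
```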

4.2.3 .descendants: get all descendant nodes

.descendants returns a generator over every descendant node.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator (a special kind of iterator) over all descendants
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:

<generator object Tag.descendants at 0x000001FEC6D3D040>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
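
Because .descendants walks the tree depth-first and yields both tags and strings, it can help to split the two. A small sketch on invented markup without whitespace:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p><a id="link1"><span>Elsie</span></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Tags in document order: <a>, its nested <span>, then the second <a>
names = [d.name for d in soup.p.descendants if isinstance(d, Tag)]
# Text nodes in document order
texts = [str(d) for d in soup.p.descendants if isinstance(d, NavigableString)]

print(names)  # ['a', 'span', 'a']
print(texts)  # ['Elsie', 'Lacie']
```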

Parent and ancestor nodes

4.2.4 .parent: get the parent node

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # .parent returns the direct parent node

Output:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

4.2.5 .parents: get ancestor nodes

.parents returns a generator over all ancestors.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# print(soup.a.parents)  # a generator over the ancestors (a generator is a special kind of iterator)

# print(list(soup.a.parents))  # list() accepts any iterable and returns a list

for i, child in enumerate(soup.a.parents):
    print(i, child)

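Output (the last entry is the BeautifulSoup document object itself, so it prints the same markup as <html>):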

0 <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
1 <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>
2 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>
3 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>
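
Printing only the names of the ancestors makes the chain easier to read and shows why the last two entries above look identical. A minimal sketch on a made-up document:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a> to the top of the tree
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```

The final '[document]' entry is the BeautifulSoup object, which sits above the html tag.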

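Sibling nodes

4.2.6 .next_siblings and .previous_siblings

.next_siblings yields the siblings that follow a node; .previous_siblings yields the siblings that precede it, nearest first.

Both return a generator object.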

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            <span>abcqweasd</span>
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.a.next_siblings)
print(list(enumerate(soup.a.next_siblings)))  # all siblings after the first <a>
print('---'*15)
print(list(enumerate(soup.a.previous_siblings)))  # all siblings before it, nearest first

Output:

<generator object PageElement.next_siblings at 0x000001FEC6D3D190>
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
---------------------------------------------
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            '), (1, <span>abcqweasd</span>), (2, '\n')]
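
The sibling lists above contain whitespace strings as well as tags. When only the next tag is wanted, find_next_sibling() skips the strings, as this sketch on invented markup suggests.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the newline string between the two links
tag_next = first.find_next_sibling("a")  # skips strings, returns the next <a>

print(repr(raw_next))  # '\n'
print(tag_next["id"])  # link2
```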

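4.3 Standard Selectors

find_all( name , attrs , recursive , text , **kwargs )

Searches the document by tag name, attributes, or text content. (Beautiful Soup 4.4.0 and later call the text argument string; text is the older name for it.)

4.3.1 find_all(): search by tag name
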
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo-2</li>
            <li class="element">Bar-2</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# print(soup.find_all('ul'))  # every <ul> tag together with its contents
print(soup.find_all('div'))
# print(soup.find_all('ul')[0])

Output:

[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>
</div>, <div class="panel-heading">
<h4>Hello</h4>
</div>, <div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>]
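
find_all() also accepts a list of tag names, a limit, and recursive=False. The sketch below tries each on a trimmed-down, made-up version of the panel markup.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)            # stop after two matches
mixed = soup.find_all(["h4", "li"])                 # any of several tag names
direct = soup.div.find_all("div", recursive=False)  # direct children only

print([li.string for li in first_two])  # ['Foo', 'Bar']
print([t.name for t in mixed])          # ['h4', 'li', 'li', 'li']
print([d["class"] for d in direct])     # [['panel-heading'], ['panel-body']]
```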

4.3.2 .string: get the text value

for ul in soup.find_all('ul'):
#     print(ul)
    for i in ul.find_all("li"):
#         print(i)
        print(i.string)

Output:

Foo
Bar
Jay
Foo-2
Bar-2

4.3.3 get_text(): get the text content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element2">Foo</li>
            <li class="element2">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

for ul in soup.find_all('ul'):
#     print(ul)
    for i in ul.find_all('li'):
#         print(i)
#         print(i.string)
        print(i.get_text())  # .string is None when a tag has several children; get_text() still works

Output:

Foo
Bar
Jay
Foo
Bar
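
In the example above every li holds a single string, so .string and get_text() agree. The difference shows up once a tag has nested content, as in this made-up li:

```python
from bs4 import BeautifulSoup

html = '<li class="element">Foo <b>Bar</b></li>'
soup = BeautifulSoup(html, "html.parser")
li = soup.li

as_string = li.string                 # None: <li> has two children
joined = li.get_text()                # all text, concatenated as-is
cleaned = li.get_text("|", strip=True)  # strip each piece, join with a separator

print(as_string)  # None
print(joined)     # Foo Bar
print(cleaned)    # Foo|Bar
```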

4.3.4 find_all(): search by attribute

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

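# Form 1: pass the attributes through attrs
# Syntax: attrs={'attribute name': 'attribute value'}
print(soup.find_all(attrs={'id': 'list-1'}))  # match on the id attribute

# print("-----"*10)
# print(soup.find_all(attrs={'name': 'elements'}))  # match on the name attribute
# print("-----"*10)


# for ul in soup.find_all(attrs={'name': 'elements'}):
#     print(ul)  # each match, taken from the result list
#     print(ul.li.string)  # only the first <li>, because tag access returns the first match
#     print('-----')
#     for li in ul:
#         print(li)  # direct children of <ul>, whitespace strings included
#         print(li.string)



# Form 2: keyword arguments
# Syntax: (attribute_name='attribute value')
# print(soup.find_all(id='list-1'))

# Special case: class
# print(soup.find_all(class='element'))  # WRONG: class is a Python keyword, so this is a syntax error
# print(soup.find_all(class_='element'))  # use class_ with a trailing underscore instead

# Form 3 (recommended): give both the tag name and the attributes
print(soup.find_all('li', {'class': 'element'}))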

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
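
Two details of attribute matching are easy to miss: class_ matches a tag if any one of its classes equals the value, and attribute values can be matched with a compiled regular expression. A sketch on shortened, made-up markup:

```python
import re

from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("ul", class_="list")       # matches both <ul> tags
small = soup.find_all("ul", class_="list-small")    # matches only the second
by_regex = soup.find_all(id=re.compile(r"^list-"))  # id starts with "list-"

print([u["id"] for u in by_class])  # ['list-1', 'list-2']
print([u["id"] for u in small])     # ['list-2']
print([u["id"] for u in by_regex])  # ['list-1', 'list-2']
```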

4.3.5 text=: select by text value

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

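# Syntax: text='the text to look for' (spelled string= in Beautiful Soup 4.4.0 and later)
print(soup.find_all(text='Foo'))  # returns the matching strings, useful for counting
print(soup.find_all(text='Bar'))

print(len(soup.find_all(text='Foo')))  # number of matches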

Output:

['Foo', 'Foo']
['Bar', 'Bar']
2
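
The same search written with the newer string argument is sketched below, assuming Beautiful Soup 4.4.0 or later. The list is invented for the example; it also shows a regular expression match and how adding a tag name makes find_all() return tags instead of strings.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li><li>Foo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                # the matching strings
prefix = soup.find_all(string=re.compile(r"^Ba"))  # regex applied to the text
tags = soup.find_all("li", string="Foo")           # with a tag name, tags are returned

print(len(exact))                # 2
print([str(s) for s in prefix])  # ['Bar', 'Baz']
print([t.name for t in tags])    # ['li', 'li']
```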

4.3.6 find( name , attrs , recursive , text , **kwargs )

find() returns the first matching element; find_all() returns all of them.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))  # only the first match
print('---------'*5)
print(soup.find('li'))
print('---------'*5)
print(soup.find('page'))  # returns None if the tag does not exist

Output:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
---------------------------------------------
<li class="element">Foo</li>
---------------------------------------------
None
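
Since find() returns None for a missing tag, calling a method on the result without checking raises an AttributeError. One possible guard is sketched here; first_text is a hypothetical helper written for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=""):
    """Return the text of the first <name> tag, or default if there is none.

    Hypothetical helper for illustration only.
    """
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup("<ul><li> Foo </li></ul>", "html.parser")

found = first_text(soup, "li")
absent = first_text(soup, "page", default="(none)")

print(found)   # Foo
print(absent)  # (none)
```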

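Summary

1. find_parents()
2. find_parent()
Difference: find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor (the direct parent when no filter is given).

3. find_next_siblings()
4. find_next_sibling()
Difference: find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first one.

5. find_previous_siblings()
6. find_previous_sibling()
Difference: find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the nearest one.

7. find_all_next()
8. find_next()
Difference: find_all_next() returns every matching node after the current node; find_next() returns the first one.

9. find_all_previous()
10. find_previous()
Difference: find_all_previous() returns every matching node before the current node; find_previous() returns the nearest one.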
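
The methods listed above can be tried side by side on a small, invented list, starting from the middle item. This is only a sketch; the ids "outer" and "inner" exist purely to tell the two div tags apart.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="outer"><div id="inner"><ul>'
    "<li>Foo</li><li>Bar</li><li>Jay</li>"
    "</ul></div></div>"
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # start from the middle <li>

nearest = bar.find_parent("div")["id"]                          # closest <div> ancestor
all_divs = [d["id"] for d in bar.find_parents("div")]           # every <div> ancestor
after = [li.string for li in bar.find_next_siblings("li")]      # siblings after
before = [li.string for li in bar.find_previous_siblings("li")]  # siblings before
next_li = bar.find_next("li").string                            # first <li> later in the document
prev_lis = [li.string for li in bar.find_all_previous("li")]    # every <li> earlier in the document

print(nearest, all_divs)   # inner ['inner', 'outer']
print(before, after)       # ['Foo'] ['Jay']
print(next_li, prev_lis)   # Jay ['Foo']
```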

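4.4 CSS Selectors

Overview:
1. Class selector -- class
2. Tag selector -- <p></p>
3. ID selector -- id

For details, see: CSS selectors

Usage:

Pass a CSS selector directly to select() to make the selection.

If you already know CSS selectors well, this approach is worth considering.

Notes:
1. In a CSS selector, tag names are written bare, class names are prefixed with ., and ids are prefixed with #.

2. The method is soup.select(), which returns a list.

3. Separate the parts of a selector with spaces; a space means "descendant of", so matching proceeds level by level from left to right.
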
html='''
<div class="pan">q321312321</div>
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Select by tag name: write the name bare and separate levels with spaces
print(soup.select('ul li'))
print("----"*10)

# Prefix class names with .
print(soup.select('.panel'))
print("----"*10)
# Separate levels with spaces
print(soup.select('.panel .panel-heading'))
print("----"*10)

# Note: selector types can be mixed
# For example, select by id and class together
print(soup.select('#list-1 .element'))  # select() returns every element that matches
# print("----"*10)

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
----------------------------------------
[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]
----------------------------------------
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
----------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
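
Beyond the space-separated form, a few other selector shapes are sketched below on shortened, made-up markup: select_one() for the first match, the > child combinator, and an attribute selector. In recent Beautiful Soup releases select() relies on the soupsieve package, which pip normally installs alongside bs4.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-1 .element")     # first match, or None
small = soup.select("ul.list.list-small > li")  # <ul> with both classes, direct <li> children
by_attr = soup.select('ul[id="list-2"] li')     # attribute selector
missing = soup.select_one("table")              # nothing matches

print(first.get_text())                   # Foo
print([li.get_text() for li in small])    # ['Baz']
print([li.get_text() for li in by_attr])  # ['Baz']
print(missing)                            # None
```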

Getting attribute values

Two forms:

1. ul['id']

2. ul.attrs['id']

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

for ul in soup.select('ul'):
#     print(ul)
#     print(ul['id'])
#     print(ul['class']) 

    print(ul.attrs['id'])
    print(ul.attrs['class']) 

# The commented-out lines above show the alternative form

Output:

list-1
['list']
list-2
['list', 'list-small']
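
Attribute access and text extraction combine naturally with select(). As one possible pattern, the sketch below builds a dictionary from each list's id to the text of its items; the markup is invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li></ul>
<ul class="list list-small" id="list-2"><li>Baz</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each <ul> id to the text of its <li> items
items = {
    ul["id"]: [li.get_text() for li in ul.select("li")]
    for ul in soup.select("ul")
}
print(items)  # {'list-1': ['Foo', 'Bar'], 'list-2': ['Baz']}
```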

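Summary

  • The lxml parser is recommended
  • Tag selectors offer weak filtering but are fast
  • Use find() to match a single result and find_all() to match several
  • If you are familiar with CSS selectors, use select()
  • Remember the common ways to get attribute values and text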