Intro to Web Scraping: Extracting Structured HTML/XML Data (BeautifulSoup4)

What is Beautiful Soup?

Like the XPath approach covered in the previous post, Beautiful Soup is a Python HTML/XML parsing library that makes it easy to extract data from web pages, and it supports CSS selectors as well.

lxml only traverses the parts of the document it needs, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the full DOM tree, so its time and memory overhead is much higher and its performance is lower than lxml's. In exchange, Beautiful Soup is very easy to use for parsing HTML, with a friendly API, support for CSS selectors and the HTML parser in the Python standard library, plus support for lxml's XML parser. Beautiful Soup 3 is no longer being developed; new projects should use Beautiful Soup 4. Install it with pip: pip install beautifulsoup4. Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
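
A quick sanity check after installation (a minimal sketch; it only assumes the package installed into the active environment):

from bs4 import BeautifulSoup

# Parse a tiny fragment with the built-in parser to confirm the import works
print(BeautifulSoup('<b>hello</b>', 'html.parser').b.string)  # hello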

Comparison of extraction approaches

Tool                | Speed   | Ease of use | Installation difficulty
Regular expressions | Fastest | Hard        | None (built in)
BeautifulSoup       | Slow    | Easiest     | Easy
lxml                | Fast    | Easy        | Moderate

Parsers supported by Beautiful Soup

Parser                  | Usage                                | Advantages
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed, tolerant of malformed documents
lxml HTML parser        | BeautifulSoup(markup, 'lxml')        | Fast, tolerant of malformed documents
lxml XML parser         | BeautifulSoup(markup, 'xml')         | Fast, the only parser that supports XML
html5lib                | BeautifulSoup(markup, 'html5lib')    | Best error tolerance, parses the way a browser does, produces valid HTML5

bs4 relies on an external parser when parsing; the two most common choices are listed below (a short sketch of switching parsers follows the list):

  • Python standard library: BeautifulSoup(markup, 'html.parser'), built into Python, moderate speed, good error tolerance
  • lxml HTML parser: BeautifulSoup(markup, 'lxml'), fast, good error tolerance
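
A minimal sketch of switching parsers, assuming lxml and html5lib are installed alongside beautifulsoup4 (pip install lxml html5lib); otherwise stick with the built-in 'html.parser':

from bs4 import BeautifulSoup

markup = "<p class='title'><b>The Dormouse's story</b>"

# Each parser repairs the incomplete markup in its own way
for parser in ('html.parser', 'lxml', 'xml', 'html5lib'):
    soup = BeautifulSoup(markup, parser)
    print(parser, '->', soup)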

Using BeautifulSoup4

Basic usage

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# prettify() formats the output and fills in the missing closing tags
print(soup.prettify())
# Get the title tag
print(soup.title)
# Name of the title tag
print(soup.title.name)
# Text content of the title tag
print(soup.title.string)


Tag selectors
(1) Selecting elements

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get the title tag
print(soup.title)
# Type of the title tag (bs4.element.Tag)
print(type(soup.title))
# Get the head tag
print(soup.head)
# Get the p tag; only the first match is returned
print(soup.p)


(2) Getting the tag name

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get the tag name
print(soup.title.name)


(3) Getting attributes

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get an attribute via the attrs dictionary
print(soup.p.attrs['name'])
# Get an attribute by indexing the tag directly
print(soup.p['name'])


(4) Getting text content

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get the text of the first matching p tag
print(soup.p.string)
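
Note that .string only returns text when the tag has a single child; for a tag with several children, such as the second p above, it returns None, and get_text() is the safer choice. A short sketch, continuing from the soup built above:

# .string is None when the tag has more than one child; get_text() still works
story = soup.find('p', class_='story')
print(story.string)      # None
print(story.get_text())  # the full text of the paragraph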


(5) Nested selection

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Nested selection: the text of the title tag inside the head tag
print(soup.head.title.string)


(6) Child and descendant nodes

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# contents returns all child nodes of the p tag as a list
print(soup.p.contents)
# children also returns the child nodes, but as an iterator rather than a list
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
# descendants iterates over all descendant nodes recursively
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)


(7) Parent and ancestor nodes

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# parent returns the direct parent of the first a tag
# print(soup.a.parent)
# parents returns all ancestor nodes of the first a tag
print(list(enumerate(soup.a.parents)))


(8) Sibling nodes

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# All siblings that come after the first a tag
print(list(enumerate(soup.a.next_siblings)))
# All siblings that come before the first a tag
print(list(enumerate(soup.a.previous_siblings)))


Standard selectors
1. find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content and returns all matching elements.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Find all tags whose name is ul
print(soup.find_all('ul'))
# Type of the first matched ul tag
print(type(soup.find_all('ul')[0]))
# Loop over each ul and pull out its li tags
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
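
The signature above also lists recursive, which the example does not exercise. A brief sketch of a few more find_all options, reusing the soup built from the panel markup (standard Beautiful Soup behaviour, not specific to this page):

# A list of names matches any of them; limit caps the number of results
print(soup.find_all(['h4', 'li']))
print(soup.find_all('li', limit=2))
# recursive=False searches only direct children, not all descendants
body = soup.find('div', class_='panel-body')
print(body.find_all('ul', recursive=False))  # both ul tags: they are direct children
print(body.find_all('li', recursive=False))  # []: the li tags are nested deeper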


The attrs argument

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# attrs takes a dictionary of attribute names and values
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# Common attributes can also be passed as keyword arguments
print(soup.find_all(id="list-1"))
# Because class is a Python keyword, pass it as class_ instead
print(soup.find_all(class_="element"))


The text argument
It matches text nodes rather than tags, so it is not very convenient for locating elements: the result is a list of matching strings, not Tag objects.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# Match by text content; the result is a list of strings, not tags
print(soup.find_all(text='Foo'))


2. find(name, attrs, recursive, text, **kwargs)
find returns a single element: the first match, or None if nothing matches.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# The first ul tag
print(soup.find('ul'))
# It is a bs4.element.Tag object
print(type(soup.find('ul')))
# There is no <page> tag, so find returns None
print(soup.find('page'))


Other similar methods (a short sketch follows this list)
(1) find_parents() vs find_parent()

find_parents(): returns all ancestor nodes
find_parent(): returns the direct parent node

(2) find_next_siblings() vs find_next_sibling()

find_next_siblings(): returns all following sibling nodes
find_next_sibling(): returns the first following sibling node

(3) find_previous_siblings() vs find_previous_sibling()

find_previous_siblings(): returns all preceding sibling nodes
find_previous_sibling(): returns the first preceding sibling node

(4) find_all_next() vs find_next()

find_all_next(): returns all matching nodes that come after the current node
find_next(): returns the first matching node that comes after the current node

(5) find_all_previous() vs find_previous()

find_all_previous(): returns all matching nodes that come before the current node
find_previous(): returns the first matching node that comes before the current node
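
These methods take the same arguments as find() and find_all(); the plural form returns every match and the singular form returns only the first. A short sketch using an abbreviated version of the three-sisters markup from earlier:

from bs4 import BeautifulSoup

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
soup = BeautifulSoup(html, 'lxml')
first_a = soup.a

# Direct parent versus all ancestors
print(first_a.find_parent('p')['class'])              # ['story']
print([tag.name for tag in first_a.find_parents()])   # names of all ancestor nodes
# First following sibling versus all following siblings
print(first_a.find_next_sibling('a')['id'])           # link2
print([a['id'] for a in first_a.find_next_siblings('a')])  # ['link2', 'link3']
# find_next / find_all_next search everything after the node, not just siblings
print(first_a.find_next('a')['id'])                   # link2
print(len(first_a.find_all_next('a')))                # 2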

CSS selectors

Pass a CSS selector string directly to select() and it returns a list of matching tags.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# Descendant selection by class: .panel-heading elements inside .panel
print(soup.select('.panel .panel-heading'))
# li tags inside ul tags
print(soup.select('ul li'))
# Tags with class element inside the element whose id is list-2
print(soup.select('#list-2 .element'))
# Type of the first matched ul tag
print(type(soup.select('ul')[0]))
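
select() always returns a list; select_one() (part of the standard Beautiful Soup API) returns just the first match, similar to find(). A one-line sketch, continuing from the soup above:

# select_one returns the first matching tag, or None if nothing matches
print(soup.select_one('#list-1 .element'))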


(1) Getting attributes

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    # Both direct indexing and the attrs dictionary return the attribute value
    print(ul['id'])
    print(ul.attrs['id'])

(2) Getting text content

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    # get_text() returns the text content of the tag
    print(li.get_text())


Summary

1. Prefer the lxml parser; use html.parser when necessary.
2. Tag selection (soup.tag) is fast, but its filtering ability is weak.
3. Use find() and find_all() to match a single result or multiple results.
4. If you are comfortable with CSS selectors, use select().
5. Remember the common ways of getting attributes and text values; a short recap follows.
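
A brief recap sketch of the attribute and text accessors covered above (self-contained, using a fragment of the earlier markup):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="demo"><b>The Dormouse\'s story</b></p>', 'lxml')
p = soup.p
print(p.attrs)        # all attributes as a dict
print(p['class'])     # index access; class is multi-valued, so this is a list
print(p.get('name'))  # get() returns the value, or None if the attribute is missing
print(p.string)       # works here because p wraps a single nested text node
print(p.get_text())   # concatenated text of all descendants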
