Intro to Web Scraping: Extracting Structured HTML/XML Data (BeautifulSoup4)

What is Beautiful Soup?

Like the XPath approach covered in the previous post, Beautiful Soup is a Python HTML/XML parsing library that makes it easy to extract data from web pages, and it supports CSS selectors as well.

lxml only traverses the parts of the document it needs, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the full DOM tree, so its time and memory overhead is much higher and its performance is lower than lxml's. In exchange, Beautiful Soup is very easy to use for parsing HTML, with a friendly API, support for CSS selectors and the HTML parser in the Python standard library, plus support for lxml's XML parser. Beautiful Soup 3 is no longer being developed; new projects should use Beautiful Soup 4. Install it with pip: pip install beautifulsoup4. Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
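
A quick sanity check after installation (a minimal sketch; it only assumes the package installed into the active environment):

from bs4 import BeautifulSoup

# Parse a tiny fragment with the built-in parser to confirm the import works
print(BeautifulSoup('<b>hello</b>', 'html.parser').b.string)  # hello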

Comparison of extraction approaches

Tool                | Speed   | Ease of use | Installation difficulty
Regular expressions | Fastest | Hard        | None (built in)
BeautifulSoup       | Slow    | Easiest     | Easy
lxml                | Fast    | Easy        | Moderate

Parsers supported by Beautiful Soup

Parser                  | Usage                                | Advantages
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed, tolerant of malformed documents
lxml HTML parser        | BeautifulSoup(markup, 'lxml')        | Fast, tolerant of malformed documents
lxml XML parser         | BeautifulSoup(markup, 'xml')         | Fast, the only parser that supports XML
html5lib                | BeautifulSoup(markup, 'html5lib')    | Best error tolerance, parses the way a browser does, produces valid HTML5

bs4 relies on an external parser when parsing; the two most common choices are listed below (a short sketch of switching parsers follows the list):

  • Python standard library: BeautifulSoup(markup, 'html.parser'), built into Python, moderate speed, good error tolerance
  • lxml HTML parser: BeautifulSoup(markup, 'lxml'), fast, good error tolerance
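
A minimal sketch of switching parsers, assuming lxml and html5lib are installed alongside beautifulsoup4 (pip install lxml html5lib); otherwise stick with the built-in 'html.parser':

from bs4 import BeautifulSoup

markup = "<p class='title'><b>The Dormouse's story</b>"

# Each parser repairs the incomplete markup in its own way
for parser in ('html.parser', 'lxml', 'xml', 'html5lib'):
    soup = BeautifulSoup(markup, parser)
    print(parser, '->', soup)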

Using BeautifulSoup4

Basic usage

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
# prettify() formats the output and fills in the missing closing tags
print(soup.prettify())
# Get the title tag
print(soup.title)
# Name of the title tag
print(soup.title.name)
# Text content of the title tag
print(soup.title.string)


Tag selectors
(1) Selecting elements

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get the title tag
print(soup.title)
# Type of the title tag (bs4.element.Tag)
print(type(soup.title))
# Get the head tag
print(soup.head)
# Get the p tag; only the first match is returned
print(soup.p)


(2) Getting the tag name

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get the tag name
print(soup.title.name)


(3) Getting attributes

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get an attribute via the attrs dictionary
print(soup.p.attrs['name'])
# Get an attribute by indexing the tag directly
print(soup.p['name'])


(4) Getting text content

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Get the text of the first matching p tag
print(soup.p.string)
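
Note that .string only returns text when the tag has a single child; for a tag with several children, such as the second p above, it returns None, and get_text() is the safer choice. A short sketch, continuing from the soup built above:

# .string is None when the tag has more than one child; get_text() still works
story = soup.find('p', class_='story')
print(story.string)      # None
print(story.get_text())  # the full text of the paragraph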


(5) Nested selection

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="demo"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Nested selection: the text of the title tag inside the head tag
print(soup.head.title.string)


(6) Child and descendant nodes

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# contents returns all child nodes of the p tag as a list
print(soup.p.contents)
# children also returns the child nodes, but as an iterator rather than a list
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
# descendants iterates over all descendant nodes recursively
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)


(7) Parent and ancestor nodes

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# parent returns the direct parent of the first a tag
# print(soup.a.parent)
# parents returns all ancestor nodes of the first a tag
print(list(enumerate(soup.a.parents)))


(8) Sibling nodes

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# All siblings that come after the first a tag
print(list(enumerate(soup.a.next_siblings)))
# All siblings that come before the first a tag
print(list(enumerate(soup.a.previous_siblings)))


Standard selectors
1. find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content and returns all matching elements.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Find all tags whose name is ul
print(soup.find_all('ul'))
# Type of the first matched ul tag
print(type(soup.find_all('ul')[0]))
# Loop over each ul and pull out its li tags
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
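
The signature above also lists recursive, which the example does not exercise. A brief sketch of a few more find_all options, reusing the soup built from the panel markup (standard Beautiful Soup behaviour, not specific to this page):

# A list of names matches any of them; limit caps the number of results
print(soup.find_all(['h4', 'li']))
print(soup.find_all('li', limit=2))
# recursive=False searches only direct children, not all descendants
body = soup.find('div', class_='panel-body')
print(body.find_all('ul', recursive=False))  # both ul tags: they are direct children
print(body.find_all('li', recursive=False))  # []: the li tags are nested deeper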


The attrs argument

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# attrs takes a dictionary of attribute names and values
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
# Common attributes can also be passed as keyword arguments
print(soup.find_all(id="list-1"))
# Because class is a Python keyword, pass it as class_ instead
print(soup.find_all(class_="element"))


The text argument
It matches text nodes rather than tags, so it is not very convenient for locating elements: the result is a list of matching strings, not Tag objects.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# Match by text content; the result is a list of strings, not tags
print(soup.find_all(text='Foo'))


2. find(name, attrs, recursive, text, **kwargs)
find returns a single element: the first match, or None if nothing matches.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# The first ul tag
print(soup.find('ul'))
# It is a bs4.element.Tag object
print(type(soup.find('ul')))
# There is no <page> tag, so find returns None
print(soup.find('page'))


Other similar methods (a short sketch follows this list)
(1) find_parents() vs find_parent()

find_parents(): returns all ancestor nodes
find_parent(): returns the direct parent node

(2) find_next_siblings() vs find_next_sibling()

find_next_siblings(): returns all following sibling nodes
find_next_sibling(): returns the first following sibling node

(3) find_previous_siblings() vs find_previous_sibling()

find_previous_siblings(): returns all preceding sibling nodes
find_previous_sibling(): returns the first preceding sibling node

(4) find_all_next() vs find_next()

find_all_next(): returns all matching nodes that come after the current node
find_next(): returns the first matching node that comes after the current node

(5) find_all_previous() vs find_previous()

find_all_previous(): returns all matching nodes that come before the current node
find_previous(): returns the first matching node that comes before the current node
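
These methods take the same arguments as find() and find_all(); the plural form returns every match and the singular form returns only the first. A short sketch using an abbreviated version of the three-sisters markup from earlier:

from bs4 import BeautifulSoup

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''
soup = BeautifulSoup(html, 'lxml')
first_a = soup.a

# Direct parent versus all ancestors
print(first_a.find_parent('p')['class'])              # ['story']
print([tag.name for tag in first_a.find_parents()])   # names of all ancestor nodes
# First following sibling versus all following siblings
print(first_a.find_next_sibling('a')['id'])           # link2
print([a['id'] for a in first_a.find_next_siblings('a')])  # ['link2', 'link3']
# find_next / find_all_next search everything after the node, not just siblings
print(first_a.find_next('a')['id'])                   # link2
print(len(first_a.find_all_next('a')))                # 2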

CSS selectors

Pass a CSS selector string directly to select() and it returns a list of matching tags.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
# Descendant selection by class: .panel-heading elements inside .panel
print(soup.select('.panel .panel-heading'))
# li tags inside ul tags
print(soup.select('ul li'))
# Tags with class element inside the element whose id is list-2
print(soup.select('#list-2 .element'))
# Type of the first matched ul tag
print(type(soup.select('ul')[0]))
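
select() always returns a list; select_one() (part of the standard Beautiful Soup API) returns just the first match, similar to find(). A one-line sketch, continuing from the soup above:

# select_one returns the first matching tag, or None if nothing matches
print(soup.select_one('#list-1 .element'))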


(1) Getting attributes

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    # Both direct indexing and the attrs dictionary return the attribute value
    print(ul['id'])
    print(ul.attrs['id'])

(2) Getting text content

from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    # get_text() returns the text content of the tag
    print(li.get_text())


Summary

1. Prefer the lxml parser; use html.parser when necessary.
2. Tag selection (soup.tag) is fast, but its filtering ability is weak.
3. Use find() and find_all() to match a single result or multiple results.
4. If you are comfortable with CSS selectors, use select().
5. Remember the common ways of getting attributes and text values; a short recap follows.
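
A brief recap sketch of the attribute and text accessors covered above (self-contained, using a fragment of the earlier markup):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="demo"><b>The Dormouse\'s story</b></p>', 'lxml')
p = soup.p
print(p.attrs)        # all attributes as a dict
print(p['class'])     # index access; class is multi-valued, so this is a list
print(p.get('name'))  # get() returns the value, or None if the attribute is missing
print(p.string)       # works here because p wraps a single nested text node
print(p.get_text())   # concatenated text of all descendants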
