Python Web Scraping 2: BeautifulSoup

ilgfcyll

Article link: https://blog.csdn.net/ilgfcyll/article/details/106961663

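I. What Is BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML and XML files. It is widely used in web scraping, where it helps pull specific content out of an HTML document for analysis.

II. Basic Usage of BeautifulSoup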

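from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page to parse (this URL is only a placeholder)
html = urlopen("http://example.com").read()

# Returns a BeautifulSoup object parsed with lxml (the lxml package must be installed).
# The BeautifulSoup object represents the entire content of the document.
soup = BeautifulSoup(html, "lxml")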
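
To make the parsing step easy to try without any network access, here is a minimal sketch that parses an inline string instead. The HTML content and the variable names are made up for illustration, and the built-in "html.parser" is assumed so that the sketch does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used so that no network access is needed
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; "lxml" can be used instead if it is installed
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string      # text inside <title>
paragraph_text = soup.p.get_text()  # text inside the first <p>

print(title_text)
print(paragraph_text)
```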

III. Objects in BeautifulSoup

<html>
    <head>
        <title>
            The Dormouse's story
        </title>
    </head>
    <body>
           <p class="title">
               <b>
                   The Dormouse's story
               </b>
           </p>
           <p class="story">
               Once upon a time there were three little sisters; and their names were
               <a class="sister" href="http://example.com/elsie" id="link1">
               Elsie
               </a>
                   ,
               <a class="sister" href="http://example.com/lacie" id="link2">
                   Lacie
               </a>
                   and
               <a class="sister" href="http://example.com/tillie" id="link3">
                  Tillie
               </a>
                  ; and they lived at the bottom of a well.
           </p>
           <p class="story">
              ...
          </p>
    </body>
</html>

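1. Tag objects

A Tag object corresponds to a tag in the original XML or HTML document. Accessing a tag by name on the soup, such as soup.b, returns a Tag object.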

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')

# soup.b returns a Tag instance representing the whole tag: its name, attributes and content
>>> soup.b
<b class="boldest">Extremely bold</b>
>>> type(soup.b)
<class 'bs4.element.Tag'>

# tag is a Tag instance obtained from the BeautifulSoup object
>>> tag = soup.b
# Get the value of one attribute of the tag
>>> tag["class"]
['boldest']
# Get the tag's content
>>> tag.string
'Extremely bold'
# Get the tag's name
>>> tag.name
'b'
# Get all of the tag's attributes
>>> tag.attrs
{'class': ['boldest']}
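
The session above shows that class comes back as a list while other attributes are plain strings. The following sketch illustrates that difference and a safe way to read an attribute that may be missing. The extra class value and the id are invented for the example, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class values and an id
soup = BeautifulSoup('<b class="boldest big" id="demo">Extremely bold</b>', "html.parser")
tag = soup.b

classes = tag["class"]           # class is multi-valued, so a list is returned
tag_id = tag["id"]               # id is single-valued, so a string is returned
missing = tag.get("href")        # .get() returns None instead of raising KeyError
fallback = tag.get("href", "#")  # a default can be supplied, as with a dict

print(classes, tag_id, missing, fallback)
```

Indexing with tag["href"] would raise a KeyError here, so .get() is the gentler choice when an attribute is optional.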

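2. NavigableString objects

A string is the text content of a tag; BeautifulSoup represents it with a NavigableString object. For example, in <b class="boldest">Extremely bold</b> the NavigableString is Extremely bold. It can be obtained through tag.string.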

>>> soup = BeautifulSoup('<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>', 'lxml')
>>> tag = soup.a
>>> tag.string
'Lacie'
>>> tag.name
'a'
>>> tag.attrs
{'class': ['sister'], 'href': 'http://example.com/lacie', 'id': 'link2'}
>>> tag["href"]
'http://example.com/lacie'
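
As a small sketch of what tag.string actually returns, the code below checks the type of the result and shows the case where a tag has more than one child. The markup is made up for the example, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: one tag with a single string, one tag with mixed children
soup = BeautifulSoup(
    '<a href="http://example.com/lacie">Lacie</a><p>one <b>two</b></p>',
    "html.parser",
)

text = soup.a.string
is_navigable = isinstance(text, NavigableString)  # the bs4 string type
is_plain_str = isinstance(text, str)              # NavigableString subclasses str
plain = str(text)                                 # convert to an ordinary Python string

# <p> has two children (a string and a <b> tag), so .string cannot pick one
mixed = soup.p.string

print(is_navigable, is_plain_str, plain, mixed)
```

Converting with str() is useful when the text should outlive the parsed tree, since a NavigableString keeps a reference to the document it came from.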

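3. BeautifulSoup object

A BeautifulSoup object represents the entire content of a document. Accessing a tag name as an attribute, for example soup.b, returns the corresponding Tag object.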
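
A brief sketch of how the BeautifulSoup object relates to Tag: it can be treated much like a Tag itself, but it stands for the document rather than for a real element. The markup is a made-up snippet and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical minimal document
soup = BeautifulSoup("<html><body><p>hi</p></body></html>", "html.parser")

soup_is_tag = isinstance(soup, Tag)  # BeautifulSoup is a subclass of Tag
soup_name = soup.name                # special name used for the document object
p_is_tag = isinstance(soup.p, Tag)   # soup.<tag name> returns a Tag

print(soup_is_tag, soup_name, p_is_tag)
```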

4. Comment objects
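
A Comment is a special kind of NavigableString that holds the text of an HTML comment. A minimal sketch, using a made-up snippet and assuming the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup whose only content is an HTML comment
soup = BeautifulSoup("<b><!--Hey, this is a comment--></b>", "html.parser")

comment = soup.b.string
is_comment = isinstance(comment, Comment)         # the comment node type
is_string = isinstance(comment, NavigableString)  # Comment subclasses NavigableString
comment_text = str(comment)                       # the text without the <!-- --> markers

print(is_comment, is_string, comment_text)
```

Checking isinstance(node, Comment) is a common way to skip comments when collecting the visible text of a page.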

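IV. Navigating the Document Tree

Child nodes

A Tag may contain several strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over children. Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.

soup.tagName returns the first tag with that name together with everything inside it. For example: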

>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Look up a specific child node by tag name
>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.title
<title>The Dormouse's story</title>
# When several tags share a name, only the first one is returned
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Chain names with . to reach a tag nested inside another tag
>>> soup.head.title
<title>The Dormouse's story</title>
# soup.p returns the first <p> tag; that tag contains no <a>, so the result is None and the REPL prints nothing
>>> soup.p.a
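
Because dotted access can return None, as in the last step above, it helps to guard the result before using it. The sketch below shows one way to do that and uses find_all to reach a later <p> tag. The document is a shortened, made-up version of the sample, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.p.a  # None: the first <p> holds only a <b>
# Guard before using the result; attribute access on None would raise an error
first_link_text = first_link.string if first_link is not None else None

# find_all returns every matching tag, so later <p> tags can be reached too
paragraphs = soup.find_all("p")
second_link_id = paragraphs[1].a["id"]

print(first_link_text, second_link_id)
```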

# The .contents attribute returns a tag's child nodes
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a.contents
['Elsie']
>>> soup.html.contents
[<head><title>The Dormouse's story</title></head>, '\n', <body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>]
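
The .contents lists above mix strings and tags. As a rough sketch, the code below walks the children of one tag with .children, which yields the same nodes one at a time, and tells the two kinds apart. The markup is a made-up fragment of the sample, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: a <p> with two strings and two <a> tags
html_doc = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# .children is an iterator over the same nodes that .contents returns as a list
children = list(soup.p.children)
kinds = ["tag" if isinstance(child, Tag) else "string" for child in children]

# Keep only the child tags and read one attribute from each
link_ids = [child["id"] for child in children if isinstance(child, Tag)]

print(kinds)
print(link_ids)
```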

Tag attributes

# A tag's .contents attribute returns the tag's child nodes as a list:
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a.contents
['Elsie']
