Documentation Notes: The BeautifulSoup4 Library Documentation

清炒小瓜

Article link: https://blog.csdn.net/Cruise_liu/article/details/104289304

Documentation read: Beautiful Soup 4.2.0 documentation
Source: Beautiful Soup 4.2.0 documentation

Introduction

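Beautiful Soup 4 is a Python library for extracting data from HTML and XML files.
There are currently two main versions of Beautiful Soup for Python, distributed as BeautifulSoup (Beautiful Soup 3) and beautifulsoup4 (Beautiful Soup 4). Take care to install the right one.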

Installation

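Beautiful Soup 4 is published on PyPI and can be installed with easy_install or pip. The package name is beautifulsoup4. According to the 4.2.0 documentation, the package is compatible with both Python 2 and Python 3. If you have Anaconda installed, beautifulsoup4 is already included; you can check with conda list.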

$ easy_install beautifulsoup4

$ pip install beautifulsoup4
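A quick way to confirm that the install worked is to import the package and print its version. This is only a minimal check, and it assumes the package was installed into the same interpreter you are running:

```python
import bs4

# The import name is bs4, even though the package is installed as beautifulsoup4.
version = bs4.__version__
print(version)
```

If the import fails, the package most likely went into a different Python environment.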

Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library as well as third-party parsers such as lxml and html5lib.

$ easy_install lxml

$ pip install lxml

$ easy_install html5lib

$ pip install html5lib
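Because the third-party parsers are optional, code that prefers lxml may want a fallback to the built-in parser. The sketch below shows one way to do that; make_soup is a hypothetical helper name used only for illustration, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser.

    make_soup is a hypothetical helper used only for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from the same broken markup, so pinning one parser explicitly is usually the more predictable choice.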

Example

Take the following piece of HTML:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this markup with BeautifulSoup gives a BeautifulSoup object.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.prettify())

Looking at the parsed result, BeautifulSoup has tidied the HTML and printed it in a standard indented format.

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

A few simple ways to navigate the structured data

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Get all the text content of the document:

print(soup.get_text())
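get_text() also accepts a separator and a strip flag, which helps when the text fragments of nested tags run together. A small sketch on a made-up one-line document (the markup here is an assumption for illustration, not the html_doc above):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

# With no arguments, the text fragments are simply concatenated.
plain = soup.get_text()

# The separator is placed between fragments; strip=True trims each fragment first.
joined = soup.get_text(" | ", strip=True)

print(plain)
print(joined)
```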

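Getting started

Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or a file handle. The document is first converted to Unicode, with HTML entities converted to the corresponding Unicode characters, and is then parsed with the parser you specify. If you do not specify a parser, Beautiful Soup 4 picks a suitable one automatically.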

from bs4 import BeautifulSoup

with open("index.html") as fp:  # a file handle, closed automatically
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>data</html>", 'lxml')  # a string
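To see the two input styles and the entity conversion side by side without needing a file on disk, the sketch below uses io.StringIO as a stand-in for an open file. The markup is an assumption chosen for illustration:

```python
import io

from bs4 import BeautifulSoup

markup = "<p>Sacr&eacute; bleu!</p>"

# Parse directly from a string.
from_string = BeautifulSoup(markup, "html.parser")

# io.StringIO stands in for an open file here; it exposes read() like a file handle.
from_handle = BeautifulSoup(io.StringIO(markup), "html.parser")

# The &eacute; entity has been converted to the Unicode character.
text = from_string.p.get_text()
print(text)
print(str(from_string) == str(from_handle))
```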

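Beautiful Soup transforms a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into four kinds:

Tag: a tag object
NavigableString: a navigable string object
BeautifulSoup: the document object
Comment: comments and other special strings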

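Tag objects

A Tag provides many methods and attributes, such as its name and its attributes.
Get the name: tag.name
Change the name: tag.name = "blockquote"
Get attributes: tag.attrs or tag['class']
To change a value, simply assign to it.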
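The operations listed above can be sketched on a single tag. The markup and attribute values below are made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name      # the tag's name as parsed
css_classes = tag["class"]    # class is multi-valued, so this is a list
all_attrs = dict(tag.attrs)   # a copy of the attribute dictionary

# Both the name and the attributes are changed by plain assignment.
tag.name = "blockquote"
tag["id"] = "verybold"

print(tag)
```

The changes are reflected in the markup that Beautiful Soup generates for the tag.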

NavigableString objects

Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag.
Get the string: tag.string
Replace it: tag.string.replace_with("No longer bold")
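A short sketch of reading and replacing a tag's string, using made-up markup:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
tag = soup.b

before = tag.string
is_navigable = isinstance(before, NavigableString)

# A string cannot be edited in place, but it can be swapped for another one.
tag.string.replace_with("No longer bold")

print(tag)
```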

BeautifulSoup objects

A BeautifulSoup object represents the whole content of a document. It is not a real HTML or XML tag, so it has no name or attributes of its own. It is only given one special attribute, .name, whose value is "[document]".

Comment objects

Comments and other specially marked strings in a document are represented by Comment, which is a special kind of NavigableString.
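The following sketch checks both points on a small made-up document: the comment inside a tag comes back as a Comment, which is also a NavigableString, and the document object reports the special name:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string

print(type(comment).__name__)
print(isinstance(comment, NavigableString))
print(soup.name)
```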

Navigating the tree

You can navigate by tag name by chaining attributes, as in soup.tag.tag.

soup.title
# <title>The Dormouse's story</title>

soup.body.b
# <b>The Dormouse's story</b>
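Two details of tag-name navigation are easy to miss: it returns only the first matching tag, and it returns None when no such tag exists. A sketch with made-up markup:

```python
from bs4 import BeautifulSoup

markup = (
    "<html><head><title>T</title></head>"
    "<body><p><a>one</a><a>two</a></p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.body.a   # only the first <a> under <body>
missing = soup.body.table  # there is no <table>, so this is None

print(first_link)
print(missing)
```

Because a missing tag yields None, chaining further (for example soup.body.table.tr) would raise AttributeError.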

Getting child nodes

A tag's .contents attribute returns its child nodes as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

A tag's .children generator lets you iterate over its child nodes:

for child in title_tag.children:
    print(child)
    # The Dormouse's story

.contents and .children contain only a tag's direct children.
The .descendants attribute iterates recursively over all of a tag's descendants:

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
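The difference between direct children and all descendants shows up clearly when counting nodes. A sketch on a small made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
div = soup.div

# Direct children only: the single <p> tag.
direct = div.contents

# Every node below <div>: <p>, the string "one ", <b>, and the string "two".
everything = list(div.descendants)

print(len(direct), len(everything))
```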

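Getting the strings inside a tag

If a tag has exactly one child and that child is a NavigableString, .string returns that child.
If a tag contains more than one child, it is unclear which child .string should refer to, so .string is None.
If a tag contains several strings, you can loop over them with .strings.
The strings may include a lot of extra spaces or blank lines; .stripped_strings removes the surplus whitespace.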
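The four cases above can be seen on one small fragment. The markup is an assumption chosen so that each case appears once:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> one </p><p>two</p></div>", "html.parser")
div = soup.div

single = div.p.string               # the first <p> has exactly one string child
ambiguous = div.string              # <div> has two children, so there is no single string
raw = list(div.strings)             # every string, whitespace preserved
clean = list(div.stripped_strings)  # surrounding whitespace removed

print(repr(single), ambiguous, raw, clean)
```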

Getting parent tags

Use the .parent attribute to get an element's parent node.
Use an element's .parents attribute to walk up through all of its ancestors.
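A sketch of walking upward from a nested tag, using made-up markup. Note that the chain of ancestors ends at the BeautifulSoup object itself:

```python
from bs4 import BeautifulSoup

markup = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")
b_tag = soup.b

parent_name = b_tag.parent.name
ancestor_names = [ancestor.name for ancestor in b_tag.parents]

print(parent_name)
print(ancestor_names)
```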

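Getting sibling nodes

Use the .next_sibling and .previous_sibling attributes to look up sibling nodes. In real documents, a tag's .next_sibling and .previous_sibling are usually strings or whitespace.
The .next_siblings and .previous_siblings attributes iterate over a node's siblings.
The .next_element attribute points to whatever was parsed immediately afterwards (a string or a tag).
The .next_elements and .previous_elements iterators let you move forward or backward through the document in parse order, as if it were being parsed again.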
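The sketch below illustrates why a sibling is often whitespace rather than the next tag, and how .next_element differs from .next_sibling. The markup is made up, with a newline between the two links:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p><a>1</a>\n<a>2</a></p>", "html.parser")
first = soup.a

sibling = first.next_sibling  # the newline between the two tags
element = first.next_element  # the string inside the first <a>

# Filter the sibling iterator to keep only tags.
tag_siblings = [node for node in first.next_siblings if isinstance(node, Tag)]

print(repr(sibling), repr(element), tag_siblings)
```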

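Searching the tree

The most commonly used search methods are find() and find_all().
find_all( name , attrs , recursive , text , **kwargs )

name parameter
Finds all tags whose name is name; string objects are ignored automatically.
keyword parameters
If a keyword argument is not one of the built-in search parameter names, it is used as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.
CSS class
Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as an argument is a syntax error. Since Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ parameter.
text parameter
The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, a list, or True.
limit parameter
find_all() returns all matching results, which can be slow on a large tree. If you do not need all of them, use the limit parameter to cap the number of results. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and the results are returned.
recursive parameter
When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.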
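The parameters above can be tried on a trimmed version of the example document. One assumption to flag: this sketch passes string= rather than text=, because newer Beautiful Soup releases (4.4.0 and later) use string as the name of that parameter; on 4.2.0 itself the argument is called text.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                        # name
by_keyword = soup.find_all(id="link2")              # keyword argument -> attribute filter
by_class = soup.find_all("a", class_="sister")      # CSS class
by_string = soup.find_all(string=re.compile("^L"))  # strings matching a regular expression
limited = soup.find_all("a", limit=2)               # stop after two results
direct_only = soup.find_all("a", recursive=False)   # <a> is not a direct child of the document
direct_in_p = soup.p.find_all("a", recursive=False)

print(len(by_name), by_keyword, by_string, len(limited), direct_only)
```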

Beautiful Soup 4 also provides a shorthand for find_all(), which is to call the object itself like a function, but it is not recommended.
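For reference, the shorthand is calling a BeautifulSoup object or a tag as if it were a function. A sketch with made-up markup comparing it to the explicit call:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><a>two</a></p>", "html.parser")

explicit = soup.find_all("a")
shorthand = soup("a")       # same search as soup.find_all("a")
from_tag = soup.p("a")      # same search as soup.p.find_all("a")

print([tag.string for tag in shorthand])
```

The explicit form is easier to read and to search for in a codebase, which is a good reason to prefer it.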

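find( name , attrs , recursive , text , **kwargs )
The parameters are the same as for find_all(). The only difference is that find_all() returns a list, even when it contains just one element, while find() returns the result directly. When nothing matches, find_all() returns an empty list and find() returns None.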
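The difference in return values can be sketched as follows, on a one-tag document made up for illustration (nosuchtag is just a name that does not occur in it):

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")

as_list = soup.find_all("title", limit=1)  # a list holding one tag
as_tag = soup.find("title")                # the tag itself

no_list = soup.find_all("nosuchtag")       # empty list
no_tag = soup.find("nosuchtag")            # None

print(as_list, as_tag, no_list, no_tag)
```

Since find() can return None, check the result before accessing attributes on it.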