Python Library: BeautifulSoup

叶柖

Article link: https://blog.csdn.net/qq_38929220/article/details/83543073

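BeautifulSoup parses HTML content into a "soup" document (a parse tree), and it can turn a poorly formed web page into a complete HTML document.
But what does a complete HTML document look like? Before going further, here is a short introduction to HTML.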

HTML

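HTML is a hypertext markup language, not a programming language. It is commonly used together with CSS and JavaScript to build the user interfaces of web pages, web applications, and mobile applications.

Tags
Tags are a core building block of HTML. They usually appear in pairs, and the content of the element sits between the opening tag and the closing tag.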

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>

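The text between <html> and </html> describes the web page, and the text between <body> and </body> is the visible page content.
The head, <head>...</head>, contains the title.
The markup <title>This is a title</title> defines the page title shown by the browser.
Headings come in six levels, <h1> through <h6>, with the font size decreasing from <h1> to <h6>.
Paragraphs are written inside <p>...</p>.
<br> inserts a line break.
<a> creates a link.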

<a href="https://zh.wikipedia.org/">中文維基百科的連結！</a>

The href attribute holds the URL that the link points to.
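
Since reading href values is one of the most common scraping tasks, here is a minimal sketch of how the link above could be read with BeautifulSoup. It assumes the bs4 package is installed and uses Python's built-in 'html.parser'; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Same anchor markup as the example above.
html = '<a href="https://zh.wikipedia.org/">中文維基百科的連結！</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                 # the first <a> tag in the document
url = link["href"]            # dictionary-style access; raises KeyError if absent
text = link.get_text()        # the visible link text
missing = link.get("target")  # .get() returns None for an absent attribute

print(url)
print(text)
```

Using .get() instead of square brackets is usually the safer choice on real pages, where not every link carries every attribute.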

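Attributes
Understanding HTML attributes matters a great deal when writing Python web scrapers, because attributes are commonly used to locate elements.
1. id: a unique identifier for the element within the whole document, used to pick out that element.
2. class: a way to group similar elements together.
3. style: assigns presentational properties to a specific element.
4. title: attaches an additional description to the element.
5. lang: identifies the language of the element's content.

Example: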

<abbr id="ID" class="术语" style="color:purple;" title="超文本标记语言">HTML</abbr>

abbr is the abbreviation element.
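
To connect these attributes to scraping, the following sketch reads them back from the abbr example with BeautifulSoup. It assumes bs4 is installed and uses the built-in 'html.parser'; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<abbr id="ID" class="术语" style="color:purple;" title="超文本标记语言">HTML</abbr>'
soup = BeautifulSoup(html, "html.parser")
abbr = soup.abbr

abbr_id = abbr["id"]          # a plain string
abbr_classes = abbr["class"]  # class can hold several values, so bs4 returns a list
abbr_title = abbr["title"]
abbr_lang = abbr.get("lang")  # lang is not set on this element, so this is None
by_id = soup.find(id="ID")    # look the element up by its id

print(abbr_id, abbr_classes, abbr_title, abbr_lang)
```

Note that class comes back as a list while the other attributes are strings, which is easy to trip over when comparing values.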

Using BeautifulSoup

BeautifulSoup can parse poorly formed HTML into a complete HTML document and print the result with standard indentation.

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=shop><li>Price<li>Number</ul>'
>>> # Parse this HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<ul class="shop">
 <li>
  Price
  <li>
   Number
  </li>
 </li>
</ul>
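
The repaired tree can also be inspected programmatically. The sketch below, which assumes bs4 is installed, checks what 'html.parser' did with the two unclosed <li> tags: as in the printed output, the second item ends up nested inside the first. Other parsers such as lxml or html5lib, which have to be installed separately, may repair the same markup differently.

```python
from bs4 import BeautifulSoup

broken_html = "<ul class=shop><li>Price<li>Number</ul>"
soup = BeautifulSoup(broken_html, "html.parser")

items = soup.find_all("li")          # both <li> elements, in document order
first, second = items
second_parent = second.parent.name   # html.parser nests the second <li> in the first
ul_class = soup.ul["class"]          # the unquoted attribute value is still read

print(len(items), second_parent, ul_class)
```

If the nesting matters for your scraper, it is worth checking the parsed structure like this rather than assuming the markup was repaired the way a browser would repair it.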

The following HTML document is used to illustrate how it works:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

soup.title
# <title>The Dormouse's story</title>
soup.title.string
#"The Dormouse's story"
soup.title.parent.name
#'head'
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
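
Building on the lookups above, here is a small sketch that combines find_all with attribute access to collect every link in the story paragraph. It assumes bs4 is installed and reuses only part of the sample document; class_ is the keyword bs4 uses for the class attribute, because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's text to its URL, filtering by CSS class.
links = {a.get_text(): a["href"] for a in soup.find_all("a", class_="sister")}

# Collect the id of every <a> tag in document order.
ids = [a["id"] for a in soup.find_all("a")]

print(links)
print(ids)
```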

A Tag object corresponds to a tag in the HTML document and carries the same name and attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name
# 'b'
# Renaming a tag changes the generated HTML; a tag's attributes can be added, removed, or modified
tag.name = "blockquote"
tag
#<blockquote class="boldest">Extremely bold</blockquote>
tag['id'] = 1
tag
#<blockquote class="boldest" id="1">Extremely bold</blockquote>
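
The example above only adds an attribute. As a minimal sketch of the other two operations (modifying and deleting), assuming bs4 is installed and using made-up attribute values:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["id"] = "intro"      # add a new attribute
tag["class"] = "strong"  # overwrite an existing attribute
del tag["id"]            # delete an attribute

result = str(tag)        # serialize the modified tag back to HTML
attrs = tag.attrs        # all remaining attributes as a dict

print(result)
print(attrs)
```

Attributes behave like dictionary entries on the tag, so the usual dictionary idioms (assignment, del, .get()) carry over.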

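Because BeautifulSoup is written in pure Python while the regular expression module is implemented in C, scraping with BeautifulSoup is considerably slower than scraping with regular expressions. Its syntax, however, is much simpler and easier to read than regular expressions, and it is easy to pick up, so it is recommended for beginners.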
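
To illustrate the ease-of-use side of this trade-off (not the speed difference, which is not measured here), the sketch below extracts link URLs both ways from a small, made-up snippet. It assumes bs4 is installed; the regular expression is deliberately naive and stands in for what a beginner might write.

```python
import re
from bs4 import BeautifulSoup

# Two links written in slightly different styles (attribute order and quotes).
html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    "<a class='sister' href='http://example.com/lacie'>Lacie</a>"
)

# Regex: short, but tied to one exact attribute order and quoting style.
regex_hrefs = re.findall(r'<a href="([^"]+)"', html)

# BeautifulSoup: reads from the parsed tree, so these variations do not matter.
soup = BeautifulSoup(html, "html.parser")
soup_hrefs = [a["href"] for a in soup.find_all("a")]

print(regex_hrefs)
print(soup_hrefs)
```

The naive pattern misses the second link, while the parsed tree yields both. A more careful pattern could handle this case too, but it quickly becomes harder to read than the BeautifulSoup version.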

References:
HTML on Wikipedia: https://zh.wikipedia.org/wiki/HTML
BeautifulSoup 4.2.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
