打开 bs4 之门

最新推荐文章于 2024-01-23 17:55:23 发布

dsjdjsa

最新推荐文章于 2024-01-23 17:55:23 发布

阅读量3.4k

点赞数 3

分类专栏： Python-网页爬取文章标签： Python bs4 网页爬取

本文链接：https://blog.csdn.net/qq_21046135/article/details/71039587

版权

Python-网页爬取专栏收录该内容

9 篇文章 0 订阅

订阅专栏

一、BeautifulSoup4 库测试

文档地址：https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

安装方法：pip install beautifulsoup4

用 BeautifulSoup 解析一段HTML代码并输出

>>> from bs4 import BeautifulSoup
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
>>> soup = BeautifulSoup(html_doc, 'lxml')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>

二、bs4库解析器

Beautiful Soup支持 Python 标准库中的 HTM L解析器, 还支持一些第三方的解析器

文档推荐使用 lxml 作为解析器, 因为效率更高

这里写图片描述

三、BeautifulSoup 类

使用 BeautifulSoup 解析一段代码, 能够得到一个 BeautifulSoup 的对象，BeautifulSoup 对象表示的是一个文档的全部内容.

soup = BeautifulSoup(html_doc, 'lxml')

四、BeautifulSoup类的基本元素

BeautifulSoup
…………Tag : Tag 对象与 XML 或 HTML 原生文档中的 tag 相同，表示标签，是最基本的信息组织单元，分别用<>和</>开头和结尾
…………<Tag>.name : Name 表示标签的名字
…………<Tag>.attrs : Attributes 表示标签的属性，返回字典形式组织，
…………<Tag>.string : NavigableString 表示标签内非属性字符串
…………<Tag>.string : Comment 表示标签内的注释部分

Tag实例

>>> tag1 = soup.a
>>> print(tag1)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2 = soup.p
>>> print(tag2)
<p class="title"><b>The Dormouse's story</b></p>

Name实例

>>> tag1.name
'a'
>>> tag2.name
'p'

Attributes 实例

>>> tag1.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
>>> tag2.attrs
{'class': ['title']}

NavigableString 实例

>>> tag1.string
'Elsie'
>>> type(tag1.string)
<class 'bs4.element.NavigableString'>
>>> tag2.string
"The Dormouse's story"
>>> type(tag2.string)
<class 'bs4.element.NavigableString'>

Comment 实例（与 NavigableString 的区别）
Comment 对象是一个特殊类型的 NavigableString 对象

>>> new_soup = BeautifulSoup('<b><!--This is a comment--></b><p>This is a comment</p>')
>>> new_soup.b.string
'This is a comment'
>>> type(new_soup.b.string)
<class 'bs4.element.Comment'>
>>> new_soup.p.string
'This is a comment'
>>> type(new_soup.p.string)
<class 'bs4.element.NavigableString'>

如果 Comment 和 NavigableString 都有

>>> new_soup = BeautifulSoup('<b><!--This is a comment-->This is a NagableString</b><p>This is a comment</p>')
>>> tagb = new_soup.b
>>> tagb.name
'b'
>>> tagb.string
>>> print(tagb.string)
None
>>> tagb.string
>>> type(tagb.string)
<class 'NoneType'>

五、遍历文档树

从下例中知 BeautifulSoup.<Tag> 只能获得当前名字的第一个tag

>>> soup.html
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

如果想要通过名字得到比一个 tag 更多的内容的时候就需要用 Searching the tree 中描述的方法, 比如 : find_all()

>>> soup.find_all('p')
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

Tag 的属性

.contents 属性可以将 tag 的子节点以列表的方式输出

.children 生成器,可以对 tag 的子节点进行循环

.descendants 属性可以对所有 tag 的子孙节点进行递归循环

.string tag 只有一个 NavigableString 类型子节点, 那么这个tag 可以使用 .string 得到子节点, 如果 tag 包含了多个子节点, tag 就无法确定 .string 方法应该调用哪个子节点的内容, .string 的输出结果是None, 上面举出了例子

 .strings 如果 tag 中包含多个字符串可以使用 .strings 来循环获取

.contents 属性实例

>>> html = soup.html
>>> html.contents
[<head><title>The Dormouse's story</title></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>]

.children 属性实例

>>> for child in soup.children:
        print(child)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

>>> body = soup.body
>>> for child in body.children:
    print(child)

<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

.descendants 属性实例

>>> for child in html.descendants:
    print(child)


<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


>>> 
>>> for child in body.descendants:
    print(child)




<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...

.strings 属性实例

输出的字符串中可能包含了很多空格或空行, 使用 .stripped_strings 可以去除多余空白内容

>>> for string in soup.strings:
    print(string)


The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...


>>> for string in soup.stripped_strings:
    print(string)


The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
>>>

tag或字符串的属性

    .parent 获取某个元素的父节点

    .next_sibling 查询下一个兄弟节点

    .previous_sibling 查询前一个兄弟节点

    .next_element 属性指向解析过程中下一个被解析的对象(字符串或tag)
    .previous_element 属性刚好与 .next_element 相反,它指向当前被解析的对象的前一个解析对象

.parent 属性实例

>>> soup.parent
>>> print(soup.parent)
None
>>> html.parent
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>>> body.parent
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
>>> b.parent
<p class="title"><b>The Dormouse's story</b></p>
>>> p.parent
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>

.next_siblings 和 .previous_siblings 属性实例

>>> for sibling in soup.a.next_siblings:
    print(repr(sibling))


',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
>>> 
>>> 
>>> for sibling in soup.find(id = 'link3').previous_siblings:
    print(repr(sibling))


' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

.next_elements 和 .previous_elements属性实例

>>> for element in body.next_element:
    print(repr(element))


'\n'
>>> 
>>> 
>>> for element in html.next_element:
    print(repr(element))


<title>The Dormouse's story</title>
>>> 
>>> for element in html.next_elements:
    print(repr(element))


<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
"The Dormouse's story"
'\n'
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
'\n'
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
"The Dormouse's story"
'\n'
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'Once upon a time there were three little sisters; and their names were\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Elsie'
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'Lacie'
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

六、搜索文档树

find_all() 方法搜索当前 tag 的所有 tag 子节点, 并判断是否符合过滤器的条件

*find_all( name , attrs , recursive , text , **kwargs )*

name 可以查找所有名字为 name 的 tag, 字符串对象会被自动忽略掉

keyword 如果一个指定名字的参数不是搜索内置的参数名, 搜索时会把该参数当作指定名字 tag 的属性来搜索, 按照 CSS 类名搜索 tag 的功能非常实用,但标识 CSS 类名的关键字 class 在 Python 中是保留字, 使用 class 做参数会导致语法错误。 从 Beautiful Soup 的4.1.1版本开始, 可以通过class_ 参数搜索有指定 CSS 类名的 tag


limit 限制返回结果的数量

recursive 调用 tag 的 find_all() 方法时, Beautiful Soup 会检索当前 tag 的所有子孙节点, 如果只想搜索 tag 的直接子节点, 可以使用参数 recursive=False

name 参数实例

>>> soup.find_all('p')
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

keyword 参数实例

>>> soup.find_all(id = 'link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

>>> soup.find_all('a', class_ = 'sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

text 参数实例

>>> soup.find_all(text = 'Elsie')
['Elsie']

limit 参数实例

>>> soup.find_all('a', limit = 2)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.find_all('a', limit = 1)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

dsjdjsa

关注

3
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
打开 bs4 之门

一、BeautifulSoup4 库测试文档地址：https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html https://www.crummy.com/software/BeautifulSoup/bs4/doc/安装方法：pip install beautifulsoup4用BeautifulSoup
复制链接

扫一扫