I: Basic Concepts
Beautiful Soup is a Python library for extracting data from HTML and XML files.
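As a first taste of what "extracting data" means here, the following is a minimal sketch rather than a complete recipe. The HTML snippet and the variable names (`sample_html`, `demo_soup`, `page_title`, `link_urls`) are made up for illustration, and it uses Python's built-in `html.parser` so that it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet used only for illustration.
sample_html = """
<html><head><title>Demo page</title></head>
<body>
<a href="http://example.com/a">A</a>
<a href="http://example.com/b">B</a>
</body></html>
"""

# 'html.parser' ships with Python; the rest of this article uses 'lxml'.
demo_soup = BeautifulSoup(sample_html, "html.parser")

# Text inside the <title> tag.
page_title = demo_soup.title.string

# The href attribute of every <a> tag in the document.
link_urls = [a["href"] for a in demo_soup.find_all("a")]

print(page_title)
print(link_urls)
```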
II: Basics
1. Object types in bs4
The following example illustrates each type:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<h><!-- 在此处写注释 --></h>
"""
- Tag: an HTML or XML tag
soup = BeautifulSoup(html_doc, 'lxml')
# Tag
print(type(soup.html))  # <class 'bs4.element.Tag'>
- NavigableString: a navigable string, i.e. the text contained in a tag
soup = BeautifulSoup(html_doc, 'lxml')
# NavigableString
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>
- BeautifulSoup: the object that represents the whole parsed document
soup = BeautifulSoup(html_doc, 'lxml')
# BeautifulSoup: the document object
print(type(soup))  # <class 'bs4.BeautifulSoup'>
- Comment: an HTML comment, a special kind of NavigableString
soup = BeautifulSoup(html_doc, 'lxml')
# Comment
print(type(soup.h.string))  # <class 'bs4.element.Comment'>
print(soup.h.string)  # 在此处写注释 (the comment text, "write a comment here")
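The four types above are related by inheritance, which matters when you filter nodes by type. The sketch below checks this with `isinstance`; the snippet and the names `type_soup`, `bold_text`, and `comment` are illustrative assumptions, and it uses `html.parser` instead of `lxml` so it has no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up snippet mirroring the structure of the example document.
snippet = '<p class="title"><b>Bold text</b></p><h><!-- a comment --></h>'
type_soup = BeautifulSoup(snippet, "html.parser")

bold_text = type_soup.p.string   # <p> has one child, so .string reaches the text in <b>
comment = type_soup.h.string     # the only child of <h> is a comment

print(isinstance(type_soup.p, Tag))
print(isinstance(bold_text, NavigableString))
print(isinstance(comment, Comment))

# Comment is a subclass of NavigableString, so test for Comment first
# when you want to separate ordinary text from comments.
print(isinstance(comment, NavigableString))

# The BeautifulSoup object is itself a subclass of Tag.
print(isinstance(type_soup, Tag))
```

Because a Comment also passes the NavigableString check, code that collects "all text" may want to skip Comment instances explicitly.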
2. Traversing the document tree
The following example illustrates this:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">hhhhhhhhhhh</p>
<h><!-- 在此处写注释 --></h>
"""
soup = BeautifulSoup(html_doc, 'lxml')
(1). Traversing child nodes
- contents: returns a list of all direct child nodes
a = soup.head.contents
print(a)  # [<title>The Dormouse's story</title>]
- children: returns an iterator over the direct child nodes
a = soup.head.children
print(a)  # e.g. <list_iterator object at 0x000002ADC4A93710> (the address varies per run)
for i in a:
    print(i)  # <title>The Dormouse's story</title>
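To make the difference between `contents` and `children` concrete, here is a small sketch on a made-up fragment (the names `tree_html`, `tree_soup`, `child_list`, and `child_names` are hypothetical). It also shows that both cover direct children only, not grandchildren.

```python
from bs4 import BeautifulSoup

# Made-up fragment: <div> has two <p> children; <b> is a grandchild.
tree_html = "<div><p>one <b>two</b></p><p>three</p></div>"
tree_soup = BeautifulSoup(tree_html, "html.parser")
div = tree_soup.div

# .contents is a real list, so len() and indexing work.
child_list = div.contents

# .children is meant to be iterated over.
child_names = [child.name for child in div.children]

print(len(child_list))
print(child_names)

# <b> sits inside the first <p>, so it is not a direct child of <div>.
print(tree_soup.b in child_list)
```

Use `contents` when you need random access or a count, and `children` when a single pass over the nodes is enough.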
- descendants: returns a generator that walks all descendants (children, grandchildren, and so on)
a = soup.head