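I. Getting Started with Beautiful Soup
1. What Beautiful Soup is
1) Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".
2) A BeautifulSoup object corresponds to the entire content of one HTML/XML document.
3) Code example (importing the library, parsing, and getting a tag)
from bs4 import BeautifulSoup  # BeautifulSoup is written as one word here: it is the class being imported
soup = BeautifulSoup(demo, 'html.parser')  # create an instance; demo is a string holding the HTML text
soup.a  # soup.<tag> returns the first tag with that name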
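2. Basic elements of the BeautifulSoup class
1) Tag: a tag, the most basic unit of information; <> marks its start and </> marks its end.
2) Name: the tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
3) Attributes: the tag's attributes, organized as a dictionary. Format: <tag>.attrs
4) NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
5) Comment: a comment inside a tag's string; a special Comment type.
[Note: both 4 and 5 are obtained through the .string attribute, but the returned objects have different types.]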
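The five elements above can be inspected on a small document. The following is a minimal sketch; the HTML string is a made-up snippet (not the demo page used later), chosen so that one tag holds a comment and another holds plain text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: <b> contains only a comment, <p> contains plain text.
html = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p                            # Tag
print(tag.name)                         # Name: p
print(tag.attrs)                        # Attributes: {'class': ['title']}
print(type(tag.string).__name__)        # NavigableString
print(type(soup.b.string).__name__)     # Comment

# .string hides the comment markers, so check the type before trusting the text.
is_comment = isinstance(soup.b.string, Comment)
print(is_comment)                       # True
```

Because Comment is a subclass of NavigableString, a type check such as the one on the last lines is the usual way to tell real text from comment text.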
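3. Basic HTML format and tag traversal
Nested <>...</> pairs express containment, which gives the tags a tree structure.
import requests
from bs4 import BeautifulSoup
res = requests.get('https://python123.io/ws/demo.html')
demo = res.text  # .text is required: it is the response body as a string
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())
# Output: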
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
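1) Downward traversal:
① .contents: a list of child nodes; all direct children of <tag> are stored in a list
② .children: an iterator over the child nodes; like .contents, used to loop over the direct children
③ .descendants: an iterator over all descendant nodes, used to loop over them
soup.body.contents
# Result below; note that it includes the newline strings '\n'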
"""
['\n',
<p class="title"><b>The demo python introduces several python courses.</b></p>,
'\n',
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
'\n']
"""
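len(soup.body.contents)  # returns 5
# The child nodes are not only tags; strings count as nodes too
# Iterate over the direct children
for child in soup.body.children:
    print(child)
# Output (note the blank lines: one newline is a '\n' child node, the other is added by print())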
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
# Iterate over all descendants
for child in soup.body.descendants:
    print(child)
# Output:
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
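Since the child and descendant iterators return strings (including bare '\n') alongside tags, a loop often needs to keep only the tags. The sketch below shows one way to do that with an isinstance check; the HTML string is a shortened stand-in for the demo page, not the page itself.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page body (an assumption for illustration).
html = '<body>\n<p class="title">first</p>\n<p class="course">second</p>\n</body>'
soup = BeautifulSoup(html, 'html.parser')

all_children = list(soup.body.children)
# Keep only Tag objects, dropping the '\n' strings between them.
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children))                      # 5: '\n', <p>, '\n', <p>, '\n'
print([child.name for child in tag_children]) # ['p', 'p']
print(len(list(soup.body.descendants)))       # 7: the 5 children plus the text inside each <p>
```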
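2) Upward traversal
① .parent: the node's parent tag
② .parents: an iterator over the node's ancestor tags, used to loop over the ancestors
soup.html.parent  # returns the BeautifulSoup object itself, so it prints as the whole document
soup.parent  # returns None: the BeautifulSoup object has no parent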
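A loop over .parents walks from a tag up to the document root. The sketch below uses a small made-up document to show the order in which ancestors are visited and where the walk stops.

```python
from bs4 import BeautifulSoup

# Made-up minimal document (an assumption for illustration).
html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Ancestors are yielded nearest first; the last one is the BeautifulSoup object,
# whose name is the special string '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)             # ['p', 'body', 'html', '[document]']

print(soup.html.parent is soup)   # True
print(soup.parent)                # None
```

Code that climbs with .parent by hand (instead of iterating .parents) should check for None before reading .name.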
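3) Sibling traversal
① .next_sibling: the next sibling node in HTML document order
② .previous_sibling: the previous sibling node in HTML document order
③ .next_siblings: an iterator over all following sibling nodes in HTML document order
④ .previous_siblings: an iterator over all preceding sibling nodes in HTML document order
[Note 1]: Sibling traversal only happens among nodes under the same parent.
[Note 2]: A parent's text (strings) are siblings of its child tags, so a sibling is not necessarily a tag.
soup.a.next_sibling  # returns ' and '
soup.a.previous_sibling
# Result: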
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
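Note 2 matters in practice: the siblings of a tag are frequently strings. The following sketch, on a shortened stand-in for the course paragraph, lists every following sibling of the first link and then keeps only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the <p class="course"> paragraph (an assumption).
html = '<p>Intro text <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
print(repr(first.previous_sibling))   # 'Intro text '
print(repr(first.next_sibling))       # ' and '

following = list(first.next_siblings) # [' and ', <a id="link2">...</a>, '.']
tag_siblings = [node for node in following if isinstance(node, Tag)]
print([node['id'] for node in tag_siblings])  # ['link2']
```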
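4. The prettify() method
1) .prettify() adds a '\n' after each tag and its content in the HTML text, which makes the markup easier to read.
2) .prettify() also works on a single tag: <tag>.prettify()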
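As a small illustration of calling prettify() on a single tag, the sketch below formats one made-up paragraph and splits the result into lines. Exact indentation may vary between bs4 versions, so only the line contents are shown.

```python
from bs4 import BeautifulSoup

# Made-up one-line fragment (an assumption for illustration).
soup = BeautifulSoup('<p class="title"><b>demo</b></p>', 'html.parser')

pretty = soup.p.prettify()   # a str with one tag or text node per line
print(pretty)

# Strip the indentation to see which pieces were placed on separate lines.
pieces = [line.strip() for line in pretty.splitlines() if line.strip()]
print(pieces)
```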
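II. Information Markup and Extraction
1. Three forms of information markup compared
1) XML: the earliest general-purpose markup language; highly extensible but verbose. (Used for information exchange and transfer on the Internet.)
2) JSON: values are typed and well suited to processing by programs (especially JavaScript); more concise than XML. (Used for communication between mobile apps, the cloud, and nodes; has no comments.)
3) YAML: values carry no explicit type markers, the proportion of plain text is the highest, and readability is good. (Used for configuration files of all kinds of systems; supports comments and is easy to read.)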
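The difference between typed and untyped values can be seen by parsing the same record from XML and from JSON. This is only a sketch using the Python standard library; the record (name and age) is invented for illustration, and YAML is shown as a comment only because parsing it would need a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two markup forms.
xml_text = '<person><name>Tian</name><age>30</age></person>'
json_text = '{"name": "Tian", "age": 30}'
# A YAML form of the record would read:
#   name: Tian
#   age: 30

root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}
json_record = json.loads(json_text)

print(xml_record)    # {'name': 'Tian', 'age': '30'}  (every XML value is text)
print(json_record)   # {'name': 'Tian', 'age': 30}    (JSON keeps the number as an int)
```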
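3. General methods of information extraction
1) Method 1: fully parse the markup, then extract the key information. Works for XML, JSON, and YAML; requires a markup parser, for example tag-tree traversal with the bs4 library.
Pros: parsing is accurate. Cons: extraction is tedious and slow.
2) Method 2: ignore the markup and search for the key information directly. A text search function over the raw text is enough.
Pros: extraction is simple and fast. Cons: the accuracy of the result depends on the content.
3) Combined method: combine markup parsing with searching to extract the key information. Works for XML, JSON, and YAML plus search;
requires both a markup parser and a text search function.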
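A common instance of the combined method is collecting every link URL: search the parsed tree for <a> tags, then read each tag's href attribute. The sketch below runs on a shortened stand-in for the demo page and, for comparison, also applies method 2 (a plain regular-expression search over the raw text).

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (an assumption for illustration).
html = (
    '<p class="course">Courses: '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
)

# Combined method: parse the markup, search for <a> tags, read their href attribute.
soup = BeautifulSoup(html, 'html.parser')
parsed_links = [a.get('href') for a in soup.find_all('a')]

# Method 2: ignore the markup and search the raw text directly.
searched_links = re.findall(r'href="([^"]*)"', html)

print(parsed_links)
print(parsed_links == searched_links)   # True for this input
```

The two results agree here, but the regular expression would also pick up href attributes on other tags such as <link>, which is the accuracy trade-off described above.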
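4. The function involved: find_all(name, attrs, recursive, string, **kwargs)
<>.find_all(name, attrs, recursive, string, **kwargs): returns a list of the tags that match the arguments. Arguments after name are best passed by keyword.
name: search string matched against tag names
attrs: search string matched against tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; default True
string: search string matched against the string region inside <>...</>
Shorthand:
<tag>(...) is equivalent to <tag>.find_all(...), and soup(...) is equivalent to soup.find_all(...)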
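The parameters and the call shorthand can be compared side by side. This is a sketch on a shortened stand-in for the demo page; it relies only on the documented find_all arguments (attrs as a dict, keyword attribute filters, class_, and recursive).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (an assumption for illustration).
html = (
    '<p class="course">Courses: '
    '<a class="py1" href="#basic" id="link1">Basic Python</a> and '
    '<a class="py2" href="#advanced" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, 'html.parser')

# Calling the object is shorthand for find_all.
same_result = soup('a') == soup.find_all('a')

# Two ways to filter on an attribute: an attrs dict or a keyword argument.
by_attrs = soup.find_all('a', attrs={'id': 'link1'})
by_keyword = soup.find_all('a', id='link1')

# class is a Python keyword, so the keyword form is spelled class_.
by_class = soup.find_all('p', class_='course')

# recursive=False only looks at direct children of the object being searched.
top_level = soup.find_all('a', recursive=False)      # []: <a> is nested inside <p>
inside_p = soup.p.find_all('a', recursive=False)     # both links are direct children of <p>

print(same_result, len(by_attrs), len(by_class), len(top_level), len(inside_p))
```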
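5. Examples
# Unit 5: information organization and extraction
import re  # needed for the re.compile() searches below
import requests
from bs4 import BeautifulSoup
res = requests.get('https://python123.io/ws/demo.html')
demo = res.text  # .text is required: it is the response body as a string
soup = BeautifulSoup(demo, 'html.parser')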
soup.find_all('a')  # search by tag name
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
(output)
soup.find_all(['a', 'b'])  # search for several tag names at once
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
(output)
for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)
html head title body p b p a a
(output; each name is actually printed on its own line)
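for tag in soup.find_all(re.compile('b')):  # tag names that contain 'b'
    print(tag.name)
body b
(output)
soup.find_all('p', 'course')  # the second positional argument is attrs; a plain string there is matched against the class attribute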
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
(output)
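# Unit 5: information organization and extraction
import re  # needed for the re.compile() searches below
import requests
from bs4 import BeautifulSoup
res = requests.get('https://python123.io/ws/demo.html')
demo = res.text  # .text is required: it is the response body as a string
soup = BeautifulSoup(demo, 'html.parser')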
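soup.find_all(id='link1')  # returns [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
soup.find_all(id='link')  # returns []: a plain string must equal the whole attribute value
soup.find_all('a', recursive=False)  # returns []: <a> is not a direct child of soup
soup.find_all(id=re.compile('link'))  # a regular expression can match part of the value; returns: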
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
soup.find_all(string='python')  # returns []: no string is exactly 'python'
soup.find_all(string=re.compile('python'))  # returns:
['This is a python demo page', 'The demo python introduces several python courses.']
(output)
6. Extended methods
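The notes do not list the extended methods, so the following is only a sketch of what this heading presumably covers: the relatives of find_all that bs4 provides, such as find(), find_parent(), find_next_sibling(), and find_previous_sibling(). They accept the same filter arguments as find_all; the single-result versions return one node or None instead of a list. The HTML string is a shortened stand-in for the demo page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page (an assumption for illustration).
html = (
    '<p class="course">Courses: '
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')                         # first match only
missing = soup.find('table')                        # None when nothing matches
parent_p = first_link.find_parent('p')              # nearest ancestor named <p>
next_link = first_link.find_next_sibling('a')       # skips the ' and ' string
prev_link = next_link.find_previous_sibling('a')    # back to the first link

print(first_link['id'], missing, parent_p['class'], next_link['id'])
```

Plural forms (find_parents(), find_next_siblings(), find_previous_siblings()) return lists in the same way that find_all does.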