Study Notes 02 (Unit 4: Getting Started with Beautiful Soup & Unit 5: Information Markup and Extraction Methods)

Latest recommended article published 2023-02-09 10:18:24

穆藩6211

Column: Course Notes "Python Web Crawling and Information Extraction (taught by Song Tian)" | Tags: python html

Article link: https://blog.csdn.net/weixin_45033674/article/details/105750547

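I. Getting Started with Beautiful Soup
1. Understanding Beautiful Soup
1) Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".
2) A BeautifulSoup object corresponds to the entire content of an HTML/XML document.
3) Code example (importing the library, parsing, and getting a tag)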

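from bs4 import BeautifulSoup	# BeautifulSoup is written as one word here: this imports a class
soup = BeautifulSoup(demo, 'html.parser')	# create an instance; demo is the HTML text of the page
soup.a										# soup.<tag> returns the first tag with that name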

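2. Basic elements of the BeautifulSoup class

1) Tag: a tag, the most basic unit of information, opened with <> and closed with </>
2) Name: the tag's name; the name of <p>...</p> is 'p'. Format: <tag>.name
3) Attributes: the tag's attributes, organized as a dictionary. Format: <tag>.attrs
4) NavigableString: the non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string
5) Comment: the comment part of a string inside a tag, a special Comment type.
[Note: both 4 and 5 are obtained through the .string attribute, but their types differ.]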
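The five elements above can be checked on a tiny document. This is only a sketch: the markup in `html` is made up for illustration and is not the course's demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical sample markup, invented for this sketch
html = '<p class="title" id="p1">plain text</p><b><!--a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag: the first <p> in the document
print(tag.name)              # Name of the tag
print(tag.attrs)             # Attributes as a dictionary (class is stored as a list)
print(type(tag.string))      # NavigableString
print(type(soup.b.string))   # Comment, even though it also comes from .string

def is_real_text(node):
    """Return True for ordinary strings, False for comments."""
    return isinstance(node, NavigableString) and not isinstance(node, Comment)

print(is_real_text(tag.string), is_real_text(soup.b.string))
```

Because `.string` drops the `<!-- -->` markers, a type check like `is_real_text` is one way to tell a comment from normal text.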
3. Basic HTML structure and tag traversal
Pairs of <>...</> express containment, which gives the tags a tree structure

import requests
from bs4 import BeautifulSoup
res = requests.get('https://python123.io/ws/demo.html')
demo = res.text     # .text is required: BeautifulSoup needs the HTML string, not the Response object
soup = BeautifulSoup(demo,'html.parser')
print(soup.prettify())

# Output:
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

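1) Downward traversal:
① .contents: list of child nodes; stores all of <tag>'s children in a list
② .children: iterator over child nodes; like .contents, used for looping over the children
③ .descendants: iterator over all descendant nodes, used for looping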

soup.body.contents
# Result below; note that it includes the newline characters '\n'
"""
['\n',
 <p class="title"><b>The demo python introduces several python courses.</b></p>,
 '\n',
 <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>,
 '\n']
 """
len(soup.body.contents) 	# returns 5
# The children are not only tags; strings are included as well

# Iterate over the child nodes
for child in soup.body.children:
    print(child)
# Output below (note the blank lines: one newline is the '\n' string node itself, the other is added by print())


<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

# Iterate over all descendant nodes
for child in soup.body.descendants:
    print(child)
# Output:


<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
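Since the children and descendants include whitespace strings, it is often handy to keep only the tags. A minimal sketch, assuming a small made-up `html` string in place of the demo page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, invented for this sketch
html = "<body>\n<p>one</p>\n<p>two <a>link</a></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .contents holds both tags and the '\n' strings between them
all_children = soup.body.contents

# Keep only Tag objects among the direct children
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

# Same filter over every descendant, collecting the tag names in document order
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(all_children), len(tag_children), descendant_tags)
```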

2) Upward traversal
① .parent: the node's parent tag
② .parents: iterator over the node's ancestor tags, used for looping over the ancestors

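soup.html.parent # returns the BeautifulSoup object itself, which prints as the whole html document
soup.parent		# returns None: the soup object has no parent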
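Looping over `.parents` walks from a tag up to the document root. A short sketch on made-up markup (the `html` string is an assumption, not the demo page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = "<html><body><p><a href='#'>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a>; the last ancestor is the soup object itself,
# whose name is the special string '[document]'
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

# The top-level tag's parent is the soup object, and the soup object has no parent
print(soup.html.parent is soup)
print(soup.parent is None)
```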

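3) Sibling traversal
① .next_sibling: returns the next sibling node in HTML text order
② .previous_sibling: returns the previous sibling node in HTML text order
③ .next_siblings: iterator over all following sibling nodes in HTML text order
④ .previous_siblings: iterator over all preceding sibling nodes in HTML text order
[Note 1]: sibling traversal only happens among nodes under the same parent
[Note 2]: the text (strings) of the parent node are siblings of its child tags, so a sibling is not necessarily a tag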

soup.a.next_sibling # returns ' and '
soup.a.previous_sibling
# Returns:
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
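Because siblings can be plain strings, a type check is needed when only tags are wanted. A sketch with a shortened, made-up version of the course paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, invented for this sketch
html = "<p>Intro: <a id='link1'>Basic</a> and <a id='link2'>Advanced</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))      # a string node, not a tag
print(repr(first.previous_sibling))  # also a string node

# Keep only the tags among the following siblings
following_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
print(following_tags)
```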

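4. The prettify() method
1) .prettify() adds '\n' after each HTML tag <> and its content, which makes the text easier to read
2) .prettify() can also be called on a single tag: <tag>.prettify()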
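A small sketch of calling `.prettify()` on a single tag; the `html` string is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = '<p><a href="x">Basic</a></p>'
soup = BeautifulSoup(html, "html.parser")

pretty = soup.a.prettify()   # works on a tag as well as on the whole soup
print(pretty)

# Each tag and each piece of text ends up on its own line
lines = [line.strip() for line in pretty.splitlines() if line.strip()]
print(lines)
```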
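II. Information Markup and Extraction
1. The three forms of information markup, compared
1) XML: the earliest general-purpose markup language; highly extensible but verbose (used for exchanging and transferring information on the Internet)
2) JSON: information is typed and well suited to program processing (js); more concise than XML; no comments (used for communication between the cloud side and the nodes of mobile applications)
3) YAML: information is untyped, has the highest proportion of text content, and is very readable; supports comments (used for configuration files of all kinds of systems)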
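To make the comparison concrete, here is one made-up record written in each form. Only the standard library is used, so the YAML version is shown as hand-written text rather than parsed; the `record` contents are an assumption for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for this sketch
record = {"name": "Basic Python", "level": 1}

# JSON keeps the value types: level is still an int after a round trip
json_text = json.dumps(record)
restored = json.loads(json_text)

# XML wraps every value in a pair of tags; element text is always a string
root = ET.Element("course")
for key in ("name", "level"):
    ET.SubElement(root, key).text = str(record[key])
xml_text = ET.tostring(root, encoding="unicode")
level_from_xml = ET.fromstring(xml_text).find("level").text

# YAML (hand-written here, not parsed) is mostly plain text and allows comments
yaml_text = "name: Basic Python\nlevel: 1  # comments are allowed\n"

print(json_text)
print(xml_text)
print(yaml_text)
```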
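3. General methods of information extraction
1) Method 1: fully parse the markup format, then extract the key information. Works for XML, JSON, and YAML and requires a markup parser, for example tag-tree traversal with the bs4 library.
Pros: the information is parsed accurately. Cons: the extraction process is cumbersome and slow.
2) Method 2: ignore the markup and search for the key information directly, using nothing more than text search functions.
Pros: the extraction process is simple and fairly fast. Cons: the accuracy of the result depends on the content.
3) Combined method: combine format parsing with searching to extract the key information from XML, JSON, or YAML.
Requires both a markup parser and text search functions.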
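As a sketch of the difference, the snippet below pulls all link URLs from a made-up `html` string twice: once with a plain text search (method 2) and once by parsing first and then searching the tag tree (the combined method). The regular expression is an illustrative assumption and only handles double-quoted href values.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = ('<p><a href="http://a.example/1">one</a> '
        '<a href="http://a.example/2">two</a></p>')

# Method 2: ignore the markup and search the raw text
links_by_search = re.findall(r'href="([^"]*)"', html)

# Combined method: parse into a tag tree, search for <a> tags, read their href
soup = BeautifulSoup(html, "html.parser")
links_by_parse = [a.get("href") for a in soup.find_all("a")]

print(links_by_search)
print(links_by_parse)
```

The text search is shorter, but it would also match an href inside a comment or script, which is the accuracy trade-off noted above.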
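4. The function involved: find_all(name, attrs, recursive, string, **kwargs)
<>.find_all(name, attrs, recursive, string, **kwargs): returns a list of the tags that match the arguments. Arguments after name are best passed as keyword arguments.
name: search string for the tag name
attrs: search string for tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: search string for the string region between <>...</>
Usage:
<tag>(...) is equivalent to <tag>.find_all(...), and soup(...) is equivalent to soup.find_all(...)
5. Examples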
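The parameters and the call shorthand can be tried on a small made-up document (the `html` string is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = ('<div><p class="course">A <a id="link1" class="py1">x</a></p>'
        '<a id="link2" class="py2">y</a></div>')
soup = BeautifulSoup(html, "html.parser")

# Calling the soup (or a tag) directly is shorthand for find_all
same = soup("a") == soup.find_all("a")

# attrs as a dictionary, or as a keyword; class needs a trailing underscore
by_attrs = soup.find_all("a", attrs={"class": "py1"})
by_kwarg = soup.find_all(class_="py2")

# recursive=False only looks at direct children of <div>
direct_only = soup.div.find_all("a", recursive=False)

# string= matches text nodes and returns strings rather than tags
texts = soup.find_all(string="x")

print(same, by_attrs, by_kwarg, direct_only, texts)
```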

# Unit 5: organizing and extracting information
import requests
from bs4 import BeautifulSoup
res = requests.get('https://python123.io/ws/demo.html')
demo = res.text     # .text is required: BeautifulSoup needs the HTML string
soup = BeautifulSoup(demo,'html.parser')
soup.find_all('a') # search by tag name

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>] (output)

soup.find_all(['a','b']) 	# search for several tag names at once

[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>] (output)

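for tag in soup.find_all(True): # True matches every tag; a first positional argument is treated as the tag name
    print(tag.name)

html head title body p b p a a (output; actually printed one name per line)

import re   # needed for re.compile
for tag in soup.find_all(re.compile('b')): # matches tag names that contain 'b'
    print(tag.name)

body b (output; one name per line)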

soup.find_all('p','course') # two positional arguments: the tag name, then attrs; a string for attrs matches the class

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>] (output)

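# Unit 5: organizing and extracting information
import re   # needed for the re.compile() calls below
import requests
from bs4 import BeautifulSoup
res = requests.get('https://python123.io/ws/demo.html')
demo = res.text     # .text is required: BeautifulSoup needs the HTML string
soup = BeautifulSoup(demo,'html.parser')
soup.find_all(id='link1') # returns [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
soup.find_all(id='link')  # returns []: a plain string must match the attribute value exactly
soup.find_all('a',recursive=False) # returns []: <a> is not a direct child of the soup object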

soup.find_all(id=re.compile('link'))

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

soup.find_all(string='python' ) # returns []: a plain string must match the whole text exactly
soup.find_all(string=re.compile('python') ) # returns:

['This is a python demo page', 'The demo python introduces several python courses.'] (output)
6. Extended methods
[Image: table of extended methods]
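The table itself was an image and is not reproduced here. As a hedged sketch, the following shows a few find_all-style methods that bs4 provides; whether they match the original table exactly is an assumption, and the `html` string is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch
html = "<p><a id='link1'>Basic</a> and <a id='link2'>Advanced</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")             # first match only, instead of a list
missing = soup.find("table")       # None when nothing matches

parent_p = first.find_parent("p")          # nearest matching ancestor
second = first.find_next_sibling("a")      # next matching sibling tag
back = second.find_previous_sibling("a")   # previous matching sibling tag

print(first["id"], missing, parent_p.name, second["id"], back["id"])
```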
