MOOC_BIT_Python Web Scraping Notes_4 (Information Markup and Extraction Methods)

Latest recommended article published 2021-01-30 01:37:45

ExcitingYi

Tags: python

Article link: https://blog.csdn.net/ExcitingYi/article/details/106106855


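Three forms of information markup

Information markup in HTML (HyperText Markup Language): hypertext content such as audio, video, and images is embedded into text.
HTML organizes different kinds of information with predefined tags of the form <>…</>.

The three forms of information markup: XML, JSON, YAML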

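1. XML: eXtensible Markup Language. Information is expressed mainly through tags, and the format is very close to HTML, for example <img src="china.jpg" size="10">...</img>. Comments use the form <!-- ... -->. XML builds up information out of tags.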

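2. JSON: JavaScript Object Notation. Information is expressed as typed key-value pairs, "key" : "value", so every value carries a data type. One key with several values uses square brackets: "name" : ["XXX", "YYY"]. Key-value pairs can be nested, and nesting uses curly braces: "name" : {"newName" : "XX", "oldName" : "YY"}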

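3. YAML: YAML Ain't Markup Language. Information is expressed as untyped key-value pairs, for example name : XX. Indentation (spaces, not tabs) expresses ownership:

name :
    newName : XXX
    oldName : YYY

A leading "- " (dash followed by a space) expresses parallel items:

name :
    - XXX
    - YYY

A | marks a whole block of text, and # starts a comment:

text : |    # comment
    XXXXXXXXXXXX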

YAML does without the many square brackets and curly braces that JSON relies on.
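The XML and JSON fragments above can be read with Python's standard library. The sketch below is only an illustration: the element text "a photo" and the keys "alias" and "count" are made up for the example, and YAML is left out because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# XML: information lives in tags and attributes; every value arrives as a string.
img = ET.fromstring('<img src="china.jpg" size="10">a photo</img>')
print(img.tag, img.attrib, img.text)

# JSON: typed key-value pairs; [] holds several values, {} nests further pairs.
# "alias" and "count" are hypothetical keys added to show nesting and a number.
record = json.loads(
    '{"name": ["XXX", "YYY"], "alias": {"newName": "XX", "oldName": "YY"}, "count": 10}'
)
print(type(record["name"]), type(record["alias"]), type(record["count"]))
```

Note that size="10" comes back as the string '10' from XML, while the JSON value 10 comes back as an int.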

Comparison of the three forms:
XML example:

<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>No. 5 Zhongguancun South Street</streetAddr>
        <city>Beijing</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>
'''The share of useful information is low; most of the text is taken up by tags.'''

JSON example:

{
    "firstName" : "Tian",
    "lastName" : "Song",
    "address" : {
        "streetAddr" : "No. 5 Zhongguancun South Street",
        "city" : "Beijing",
        "zipcode" : "100081"
    },
    "prof" : ["Computer System", "Security"]
}

YAML example:

firstName : Tian
lastName : Song
address :
    streetAddr : No. 5 Zhongguancun South Street
    city : Beijing
    zipcode : 100081
prof :
    - Computer System
    - Security

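XML: the earliest general-purpose information markup language. It is highly extensible but verbose, and is mainly used for exchanging and transmitting information over the Internet.
JSON: well suited to processing by programs, and can itself be part of a program. It is used for communication between the cloud and the nodes of mobile applications (wherever programs handle interfaces), but it has no comments.
YAML: the information is untyped, the share of actual text is the highest of the three, and readability is good. It is widely used for configuration files of all kinds of systems, and it supports comments and is easy to read.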
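To make the comparison concrete, here is a minimal sketch that reads a shortened version of the person record (the address is omitted to keep it short) from both XML and JSON using only the standard library. The helper person_from_xml is hypothetical and written just for this example.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """<person>
  <firstName>Tian</firstName>
  <lastName>Song</lastName>
  <prof>Computer System</prof><prof>Security</prof>
</person>"""

json_text = '{"firstName": "Tian", "lastName": "Song", "prof": ["Computer System", "Security"]}'


def person_from_xml(text):
    """Turn the <person> element into a plain dict (hypothetical helper)."""
    root = ET.fromstring(text)
    return {
        "firstName": root.findtext("firstName"),
        "lastName": root.findtext("lastName"),
        "prof": [p.text for p in root.findall("prof")],
    }


from_xml = person_from_xml(xml_text)   # XML needs hand-written mapping code
from_json = json.loads(json_text)      # JSON maps straight onto dict/list/str
print(from_xml == from_json)           # same information, different markup
print(len(xml_text), len(json_text))   # rough view of the markup overhead
```

The JSON side needs one call, which is what "well suited to processing by programs" looks like in practice, while the XML side needs code that knows the tag names.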

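General methods of information extraction:
Method 1: parse the markup completely, then extract the key information. This needs a markup parser, for example walking the tag tree with the bs4 library. Advantage: the information is parsed accurately. Disadvantage: extraction is cumbersome and slow.
Method 2: ignore the markup and search for the key information directly, using specific functions to find specific content. Advantage: extraction is simple and fast. Disadvantage: the accuracy of the result depends directly on the content.
Method 3: a hybrid of the two.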
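The trade-off between the first two methods can be sketched with the standard library alone. The HTML string, the example.com URLs, and the class name LinkCollector are all made up for this illustration; bs4 is not used here so that the contrast between "parse first" and "search directly" stays visible.

```python
import re
from html.parser import HTMLParser

html = (
    '<p>See <a href="http://example.com/a">A</a> and '
    '<a class="x" href="http://example.com/b">B</a>, '
    'not href="fake" in plain text.</p>'
)


# Method 1: parse the markup completely, then pick out what is needed.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)
parsed = collector.links

# Method 2: ignore the markup and search the raw text directly.
searched = re.findall(r'href="([^"]*)"', html)

print(parsed)    # ['http://example.com/a', 'http://example.com/b']
print(searched)  # ['http://example.com/a', 'http://example.com/b', 'fake']
```

The direct search is shorter, but it also picks up the href="fake" that appears in ordinary text, which is the "accuracy depends on the content" drawback of Method 2.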

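Example: extract all URL links from an HTML page
Approach: 1. Find every tag that carries URL information. 2. Extract the link stored in each tag's href attribute.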

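from bs4 import BeautifulSoup
import requests

# Fetch the demo page and parse it into a tag tree.
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

# Step 1: find every <a> tag. Step 2: print its href attribute.
for link in soup.find_all('a'):
    print(link.get("href"))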

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
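One detail worth knowing: link.get("href") returns None for an a tag that has no href attribute. A minimal sketch of the same two steps wrapped in a function that skips such tags follows; the function name extract_links and the sample string are assumptions for the example, and it runs on an inline string so no network access is needed.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]


sample = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001">Basic Python</a> '
    '<a name="top">anchor without href</a></p>'
)
links = extract_links(sample)
print(links)
```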

Searching HTML content with the bs4 library:
The method <>.find_all(name, attrs, recursive, string, **kwargs) searches within the soup variable and returns the matches as a list.

'''name: search string for tag names.'''
>>> soup.find_all("a")		# find all a tags

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup.find_all(['a','b'])	# find all a and b tags, returned together in one list

[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup.find_all(True)		# find every tag (output too long to show)



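'''attrs: search string for tag attribute values; a specific attribute can be named. In other words, it finds tags whose attributes match a given value. The common cases are shown below.'''

#1. Find p tags whose class attribute is course. The result is a list.
>>> soup.find_all('p','course')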
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]

#2. Constrain a named attribute: find elements whose id attribute equals link1.
>>> soup.find_all(id = 'link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

>>> soup.find_all(id = 'link')
[]		# no tag has id = link; a plain string must match exactly

>>> import re 	# use the re library to match every id that contains "link"
>>> soup.find_all(id = re.compile("link"))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
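The calls above can be written in several equivalent ways, which may help show what attrs does. The sketch below runs on a hand-written stand-in for the demo page (shortened, not the real file), so it needs no network access.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the demo page, written out for this example.
demo = """<html><body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(demo, "html.parser")

# A plain string in the attrs position is matched against the class attribute.
by_position = soup.find_all("p", "course")
# class_ has a trailing underscore because class is a Python keyword.
by_keyword = soup.find_all("p", class_="course")
# A dict names the attribute explicitly.
by_dict = soup.find_all("p", attrs={"class": "course"})
print(by_position == by_keyword == by_dict)

# Keyword form and dict form are also interchangeable for other attributes.
by_id = soup.find_all(attrs={"id": "link1"})
print(by_id == soup.find_all(id="link1"))
```

Both comparisons print True: the string, keyword, and dict forms are different spellings of the same attribute filter.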


'''recursive: whether to search all descendants; defaults to True.'''
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive = False)
[]	# soup has no a tags among its direct children; they only appear at grandchild level or deeper.
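A small sketch of the same effect on an inline string (a made-up one-line stand-in for the demo page): with recursive=False only direct children are examined, so the result depends on which node the search starts from.

```python
from bs4 import BeautifulSoup

demo = (
    '<html><body><p class="course">'
    '<a id="link1">Basic Python</a> <a id="link2">Advanced Python</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(demo, "html.parser")

all_links = soup.find_all("a")                         # every descendant
top_level = soup.find_all("a", recursive=False)        # direct children of soup only
inside_p = soup.p.find_all("a", recursive=False)       # direct children of <p>

print(len(all_links), len(top_level), len(inside_p))   # 2 0 2
```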


'''string: search string for the text between <>...</>.'''
>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> soup.find_all(string = re.compile("Python"))
['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
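As a hedged illustration on a made-up string: a plain string must equal the whole text node, a compiled regular expression only has to match somewhere inside it, and combining string with a tag name returns the tags rather than the text.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>Learn Python: <a>Basic Python</a> and <a>Advanced Java</a></p>",
    "html.parser",
)

exact = soup.find_all(string="Basic Python")              # whole text node must match
partial = soup.find_all(string=re.compile("Python"))      # regex may match anywhere
tags = soup.find_all("a", string=re.compile("Python"))    # returns tags, not strings

print(exact)    # ['Basic Python']
print(partial)  # ['Learn Python: ', 'Basic Python']
print(tags)     # [<a>Basic Python</a>]
```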

The .find_all method has a shorthand: <tag>(..) is equivalent to <tag>.find_all(..), and soup(..) is equivalent to soup.find_all(..)
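A quick check of the shorthand on a made-up fragment; calling a tag or the soup object like a function gives the same result as calling find_all on it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Basic Python</a><a id="link2">Advanced Python</a></p>',
    "html.parser",
)

same_on_soup = soup("a") == soup.find_all("a")
same_on_tag = soup.p("a", id="link2") == soup.p.find_all("a", id="link2")
print(same_on_soup, same_on_tag)
```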

Extended methods (they take the same parameters as find_all):

[Image not preserved: table of the extended methods]
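The table itself is not reproduced here. As a rough sketch, the following uses three relatives of find_all that bs4 provides (find, find_parent, find_next_sibling); which methods the original table listed is an assumption, and the HTML string is made up for the example.

```python
from bs4 import BeautifulSoup

demo = (
    '<html><body><p class="course">'
    '<a id="link1">Basic Python</a><a id="link2">Advanced Python</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(demo, "html.parser")

first = soup.find("a")                          # first match only: a Tag, not a list
parent = first.find_parent("p")                 # nearest enclosing <p>
sibling = first.find_next_sibling("a")          # next <a> at the same level
missing = soup.find("table")                    # None when nothing matches

print(first["id"], parent["class"], sibling["id"], missing)
```

Unlike find_all, which returns an empty list when nothing matches, find returns None, so its result should be checked before use.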

Question: what exactly is attrs? I am still a bit confused.
...
