II. Web Crawlers: Extraction (2)

Latest recommended article published on 2024-09-11 18:01:31

HolllllldOn

Category: Crawler Notes (MOOC: Python Web Crawlers and Information Extraction). Tags: python

Article link: https://blog.csdn.net/HolllllldOn/article/details/107365756

Information Organization and Extraction Methods

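1. Three Forms of Information Markup

Information markup

Marked-up information forms an organizational structure, which adds a dimension to the information
The structure of the markup is as valuable as the information itself
Marked-up information can be used for communication, storage, or display
Marked-up information is easier for programs to understand and use

Information markup in HTML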

HTML organizes different types of information with predefined tags of the form <>…</>


<html>
	<head>
		<title>This is a python demo page</title>
	</head>
	<body>
		<p class="title"><b>The demo python introduces several python courses.</b></p>
		<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
		<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
	</body>
</html>

The three forms of information markup

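XML: information is wrapped in paired tags of the form <name>…</name>

JSON: information is written as typed key: value pairs

YAML: information is written as untyped key: value pairs, with indentation expressing the hierarchy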

2. Comparing the Three Markup Forms

XML example

<person>
	<firstName>Tian</firstName>
	<lastName>Song</lastName>
	<address>
		<streetAddr>中关村南大街5号</streetAddr>
		<city>北京市</city>
		<zipcode>100081</zipcode>
	</address>
	<prof>Computer System</prof><prof>Security</prof>
</person>

JSON example

{
	"firstName": "Tian",
	"lastName": "Song",
	"address": {
		"streetAddr": "中关村南大街5号",
		"city": "北京市",
		"zipcode": "100081"
	},
	"prof": ["Computer System", "Security"]
}

YAML example

firstName: Tian
lastName: Song
address:
  streetAddr: 中关村南大街5号
  city: 北京市
  zipcode: 100081
prof:
  - Computer System
  - Security

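Comparison

XML: the earliest general-purpose markup language; highly extensible, but verbose
JSON: values are typed, well suited to processing by programs (e.g. JavaScript), and more concise than XML
YAML: values carry no explicit type markup; the highest ratio of content to markup, and very readable

XML: information exchange and transfer over the Internet
JSON: communication between the cloud side and client nodes of mobile applications; has no comment syntax
YAML: configuration files for all kinds of systems; supports comments and is easy to read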
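To make the comparison concrete, here is a minimal sketch that reads a shortened version of the person record from both XML and JSON using only the Python standard library. The variable names are illustrative and not part of the original notes. YAML is left out because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Shortened versions of the examples above (illustrative data only).
xml_text = """
<person>
    <firstName>Tian</firstName>
    <address>
        <city>北京市</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>
"""

json_text = """
{
    "firstName": "Tian",
    "address": {"city": "北京市", "zipcode": "100081"},
    "prof": ["Computer System", "Security"]
}
"""

# XML: every value comes back as text; a list is expressed by repeating a tag.
root = ET.fromstring(xml_text.strip())
xml_profs = [p.text for p in root.findall("prof")]
xml_zip = root.find("address/zipcode").text

# JSON: objects, arrays and strings map directly onto dict, list and str.
person = json.loads(json_text)
json_profs = person["prof"]
json_zip = person["address"]["zipcode"]

print(xml_profs, xml_zip)
print(json_profs, json_zip)
```

Both forms yield the same data, but the JSON result is already a typed Python structure, while the XML tree has to be walked tag by tag.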

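3. General Methods of Information Extraction

Extract the content of interest from marked-up information

Method 1: fully parse the markup form, then extract the key information
Applies to: XML, JSON, YAML
Requires a markup parser, e.g. tag-tree traversal with the bs4 library
Advantage: the information is parsed accurately
Drawback: the extraction process is tedious and slow
Method 2: ignore the markup form and search for the key information directly
A text search function over the information is enough
Advantage: the extraction process is simple and relatively fast
Drawback: the accuracy of the result depends on the content of the information
Combined method: combine markup parsing with searching to extract the key information
Applies to: XML, JSON, YAML, plus search
Requires both a markup parser and a text search function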
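The three approaches can be sketched on a small HTML fragment using only the standard library. This is a rough illustration, not the course's own code: the fragment, the `LinkCollector` class, and the filtering pattern are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment; the trailing text deliberately looks like an attribute.
html = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    ' See href="not-a-real-link" in the text.</p>'
)


class LinkCollector(HTMLParser):
    """Method 1: parse the markup and read href only from real <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)
parsed_links = collector.links

# Method 2: ignore the markup and search the raw text.
# This also picks up the href="..." that appears in plain text.
searched_links = re.findall(r'href="([^"]*)"', html)

# Combined method: parse first, then filter the parsed values with a search.
# Here we keep only course URLs that end in a six-digit ID.
combined_links = [u for u in parsed_links if re.search(r"/course/BIT-\d{6}$", u)]

print(parsed_links)
print(searched_links)
print(combined_links)
```

The plain-text search returns one extra, false match, which is the accuracy drawback named for method 2.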

Example

Extract all URL links from the HTML

Approach:
1) Find all <a> tags
2) Parse each <a> tag and extract the link that follows href

>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('https://python123.io/ws/demo.html')
>>> demo = r.text
>>> soup = BeautifulSoup(demo,'html.parser')
>>> for link in soup.find_all('a'):
	print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>>

4. Searching HTML Content with the bs4 Library

<>.find_all(name, attrs, recursive, string, **kwargs)

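Returns a list-type result containing the matches

name: search string for the tag name
attrs: search string for tag attribute values; a specific attribute can be named in the search
recursive: whether to search all descendants; defaults to True
string: search string for the text region between <>…</>

Find all <a> tags (passing a list of names matches any of them):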

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>

List the names of all tags

>>> for tag in soup.find_all(True):
	print(tag.name)

html
head
title
body
p
b
p
a
a
>>>

Find tags whose names contain b (regular-expression match)

>>> import re
>>> for tag in soup.find_all(re.compile('b')):
	print(tag.name)

body
b
>>>
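Note that bs4 applies a compiled pattern with a search, so re.compile('b') matches any tag name that contains b, not only names that begin with it. The demo page happens to have no tag that shows the difference. The sketch below uses a made-up fragment containing a <table> tag to show it, and anchors the pattern with ^ to require a true prefix match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: "table" contains a b but does not start with one.
html = "<html><body><table><tr><td><b>bold</b></td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any tag name containing "b".
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchored pattern: only tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_b)
print(starts_with_b)
```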

Find <p> tags whose class attribute is course (a string passed as the second argument is matched against the CSS class)

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>>
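The second positional argument is shorthand for a class search. A small sketch on a made-up fragment shows three equivalent spellings. It also shows that a tag with several classes matches if any one of them equals the search string.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second <p> carries two classes.
html = (
    '<p class="title">T</p>'
    '<p class="course intro">C</p>'
    '<div class="course">D</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for <p> tags with the CSS class "course".
positional = [t.get_text() for t in soup.find_all("p", "course")]
keyword = [t.get_text() for t in soup.find_all("p", class_="course")]
attrs_dict = [t.get_text() for t in soup.find_all("p", attrs={"class": "course"})]

# class is a multi-valued attribute, so bs4 stores it as a list.
classes = soup.find("p", "course")["class"]

print(positional, keyword, attrs_dict, classes)
```

The keyword is spelled class_ because class is a reserved word in Python.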

Find tags whose id attribute equals link1

>>> soup.find_all(id = 'link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>

Find tags whose id attribute equals link (a plain string must match exactly, so nothing is found)

>>> soup.find_all(id = 'link')
[]
>>>

Find tags whose id attribute contains link (regular-expression match)

>>> import re
>>> soup.find_all(id = re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>

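Search only the direct children of the current node (no <a> tag is a direct child of soup, so the result is empty)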

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive= False)
[]
>>>
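For contrast, the same search succeeds when it starts from a node that does have <a> tags as direct children. This is a minimal sketch on a shortened, made-up version of the demo page:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the demo page.
html = (
    '<html><body><p class="course">Courses: '
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# The only direct child of soup is <html>, so no <a> is found at this level.
top_level = soup.find_all("a", recursive=False)

# Starting from the <p> tag, both <a> tags are direct children.
p = soup.find("p", "course")
child_ids = [a["id"] for a in p.find_all("a", recursive=False)]

print(top_level)
print(child_ids)
```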

Search for the exact string "Basic Python"

>>> soup.find_all(string = 'Basic Python')
['Basic Python']
>>>

Search for strings containing "Python" (the result holds the matching strings, not tags)

>>> import re
>>> soup.find_all(string = re.compile('Python'))
['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
>>>
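Because a string search returns text nodes, getting back to the enclosing tags takes one more step. The sketch below shows two options on a made-up fragment: follow .parent from each string, or combine a tag name with the string argument so that find_all returns tags directly.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the demo page.
html = (
    '<p class="course">Learn Python here: '
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# string alone: returns the matching text nodes.
strings = soup.find_all(string=re.compile("Python"))
texts = [str(s) for s in strings]

# Each text node knows the tag that directly contains it.
parent_names = [s.parent.name for s in strings]

# name + string: returns <a> tags whose own text matches the pattern.
link_ids = [a["id"] for a in soup.find_all("a", string=re.compile("Python"))]

print(texts)
print(parent_names)
print(link_ids)
```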