Information Markup and Extraction Methods

不多余的星星

Article link: https://blog.csdn.net/CJX_up/article/details/77622791

I. Information Markup

[Image]

II. Three Forms of Information Markup

Information can be marked up in three forms: XML, JSON, and YAML.

1) XML:

[Image]
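A minimal sketch of XML markup: each piece of information is wrapped in a named pair of tags, and a tag may carry attributes. The `course` record and its field names are hypothetical, invented for illustration; parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
xml_text = """
<course id="link1">
    <name>Basic Python</name>
    <level>novice</level>
    <topic>syntax</topic>
    <topic>functions</topic>
</course>
"""

root = ET.fromstring(xml_text.strip())

# The tag name labels the information; attributes hold extra metadata.
print(root.tag, root.get("id"))

# Child elements are looked up by tag name.
print(root.find("name").text)

# A repeated tag expresses a list of values.
topics = [topic.text for topic in root.findall("topic")]
print(topics)
```

Repeating the `topic` tag is how this sketch expresses a list, since XML has no dedicated list syntax.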

2) JSON:

[Image]
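A minimal sketch of JSON markup: information is written as typed key-value pairs, with square brackets for lists and nested braces for nested records. The `course` record below is hypothetical, invented for illustration; parsing uses Python's standard `json` module.

```python
import json

# Hypothetical record, invented for illustration only.
json_text = """
{
    "name": "Basic Python",
    "level": "novice",
    "hours": 32,
    "open": true,
    "topics": ["syntax", "functions"],
    "provider": {"school": "BIT", "site": "icourse163"}
}
"""

course = json.loads(json_text)

# Values keep their types: numbers, booleans, strings, lists, objects.
print(type(course["hours"]).__name__, type(course["open"]).__name__)
print(course["topics"][1])
print(course["provider"]["school"])

# Serializing the dict again produces equivalent JSON text.
round_trip = json.dumps(course, ensure_ascii=False, indent=2)
print(round_trip)
```

Because the values are typed, `hours` comes back as an `int` and `open` as a `bool` with no extra conversion step.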

3) YAML:

[Image]
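A minimal sketch of YAML markup: key-value pairs carry no type annotations, indentation expresses nesting, and a leading `- ` marks a list item. Python's standard library has no YAML module, so the hypothetical helper `to_yaml_lines` below is a tiny hand-rolled emitter written only to show the layout; it does no quoting or escaping, and real code would normally use a third-party YAML library instead.

```python
def to_yaml_lines(mapping, indent=0):
    """Render a dict of scalars, lists of scalars, and nested dicts as YAML-style lines.

    Illustrative only: no quoting, no escaping, no multi-line strings.
    """
    pad = "  " * indent
    lines = []
    for key, value in mapping.items():
        if isinstance(value, dict):
            # Nested record: key on its own line, children indented one level.
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 1))
        elif isinstance(value, list):
            # List: one "- item" line per element, indented under the key.
            lines.append(f"{pad}{key}:")
            lines.extend(f"{pad}  - {item}" for item in value)
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Hypothetical record, invented for illustration only.
course = {
    "name": "Basic Python",
    "level": "novice",
    "topics": ["syntax", "functions"],
    "provider": {"school": "BIT", "site": "icourse163"},
}

yaml_text = "\n".join(to_yaml_lines(course))
print(yaml_text)
```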

III. Comparison of the Three Markup Forms

[Image]

IV. General Methods of Information Extraction

[Image]

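Example: extract every URL link from an HTML document. (The HTML text is given in the code.)
Approach: 1) Find all a tags.
2) Parse each a tag and extract the link stored in its href attribute.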

Code:

from bs4 import BeautifulSoup

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
soup = BeautifulSoup(demo, "html.parser")
# Step 1: find every a tag; step 2: read the value of its href attribute
for link in soup.find_all('a'):
    print(link.get('href'))

"""
Output:
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
"""
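The same two steps (find the a tags, then read href) can also be sketched with only the standard library's `html.parser.HTMLParser`, for cases where bs4 is not available. `LinkCollector` and `sample` are hypothetical names introduced here; the URLs are the two from the demo page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Step 1: react only to a tags (the parser lowercases tag names).
        if tag != "a":
            return
        # Step 2: attrs is a list of (name, value) pairs; pick out href.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


sample = (
    '<p>Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001">Advanced Python</a>.</p>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

This event-driven style never builds a document tree, which keeps it light, but it offers none of the search helpers that bs4 provides.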

V. Finding HTML Content with the bs4 Library

Use the find_all() function:

[Image]

Code:

from bs4 import BeautifulSoup
import re

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
soup = BeautifulSoup(demo, "html.parser")
# Find all a tags and print them
print('case 1:\n', soup.find_all('a'), '\n')
# Find all a and b tags and print them
print('case 2:\n', soup.find_all(['a', 'b']), '\n')
# Print the name of every tag in the document
print('case 3:')
for tag in soup.find_all(True):
    print(tag.name)
print('\ncase 4:')
# Print the names of all tags that start with the letter b (regular expression;
# the ^ anchor is required, because bs4 otherwise matches b anywhere in the name)
for tag in soup.find_all(re.compile(r'^b')):
    print(tag.name)
print('\ncase 5:')
# Find b tags whose class is py1; no such tag exists here, so the list is empty
print(soup.find_all('b', 'py1'), '\n')
# Matching on the id attribute is exact: the string must equal the whole
# attribute value; use a regular expression for partial matches
print('case 6:')
print(soup.find_all(id='link1'), '\n')
print(soup.find_all(id='link'), '\n')     # returns an empty list
print(soup.find_all(id=re.compile(r'link')))

"""
Output:
case 1:
 [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

case 2:
 [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

case 3:
html
head
title
body
p
b
p
a
a

case 4:
body
b

case 5:
[]

case 6:
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>]

[]

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
"""
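Beyond the tag name and id filters used above, find_all() accepts a few more arguments. The sketch below exercises `class_`, `attrs`, `string`, `limit`, and `recursive` on a small hypothetical HTML snippet (not the demo page). It assumes a reasonably recent bs4, since older releases spelled the `string` argument as `text`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet, invented for illustration only.
html = (
    '<div><p class="course">Courses: '
    '<a href="/basic" id="link1">Basic Python</a> and '
    '<a href="/advanced" id="link2">Advanced Python</a>.</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class.
by_class = soup.find_all("p", class_="course")

# attrs takes a dict of attribute names and required values.
by_attrs = soup.find_all("a", attrs={"id": "link2"})

# string searches the text nodes rather than the tags.
texts = soup.find_all(string=re.compile("Python"))

# limit stops the search after the given number of matches.
first_only = soup.find_all("a", limit=1)

# recursive=False looks only at direct children; the a tags sit deeper,
# inside div and p, so nothing is found at the top level.
top_level = soup.find_all("a", recursive=False)

# find() returns the first match itself, or None when nothing matches.
first_a = soup.find("a")

print(len(by_class), by_attrs[0]["href"], [str(t) for t in texts])
print(first_only[0]["id"], top_level, first_a["href"])
```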

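[Image]

Note: All images in this article come from the course slides of the Beijing Institute of Technology online open course 《Python网络爬虫与信息提取》 (Python Web Crawling and Information Extraction).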
