I. Information Markup
II. The Three Forms of Information Markup
Information can be marked up in three forms: XML, JSON, and YAML.
1) XML:
2) JSON:
3) YAML:
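To make the three forms concrete, the sketch below writes one small record in each notation and reads it back where the Python standard library allows. The record (a person named Alice in Beijing) and all variable names are invented for illustration and are not taken from the course material. YAML is shown only as text, because parsing it requires a third-party package.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
# XML marks information with paired tags.
xml_text = "<person><name>Alice</name><city>Beijing</city></person>"

# JSON marks information with typed key/value pairs.
json_text = '{"person": {"name": "Alice", "city": "Beijing"}}'

# YAML marks information with untyped key/value pairs and indentation.
yaml_text = """\
person:
  name: Alice
  city: Beijing
"""

# Read the name back from the XML form.
root = ET.fromstring(xml_text)
xml_name = root.find("name").text

# Read the name back from the JSON form.
json_name = json.loads(json_text)["person"]["name"]

print(xml_name, json_name)
print(yaml_text)
```

All three strings carry the same information; they differ only in how the markup delimits names and values.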
III. Comparison of the Three Forms of Information Markup
IV. General Methods of Information Extraction
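Example: extract all URL links from an HTML document. (The HTML text is given in the code.)
Approach: 1) Find all a tags.
2) Parse each a tag and extract the link stored in its href attribute.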
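The two steps do not depend on any particular library. As a rough sketch, here they are written with the standard library's html.parser module instead of bs4; the class name LinkCollector and the shortened sample string are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Step 1: only a tags are of interest.
        if tag == "a":
            # Step 2: attrs is a list of (name, value) pairs; keep href values.
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Shortened sample: one a tag with an href and one without.
sample = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" id="link1">'
    'Basic Python</a> <a id="no-href">x</a></p>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

The a tag without an href attribute is skipped, so only one link is collected.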
Code:
from bs4 import BeautifulSoup

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
soup = BeautifulSoup(demo, "html.parser")
# Find every a tag and print the value of its href attribute
for link in soup.find_all('a'):
    print(link.get('href'))
"""
Output:
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
"""
V. Finding HTML Content with the bs4 Library
Using the find_all() function:
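Besides a tag name, find_all() accepts several keyword filters. The sketch below exercises class_, string, limit, and recursive on a shortened version of the demo page; the trimmed HTML and the variable names are assumptions made for this illustration, and the string argument requires a reasonably recent bs4 release.

```python
import re
from bs4 import BeautifulSoup

# Shortened sample based on the demo page, trimmed for illustration.
html = '''
<p class="course">Courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class.
by_class = soup.find_all('a', class_='py2')

# string= matches text nodes instead of tags.
texts = soup.find_all(string=re.compile(r'Python'))

# limit= caps the number of results returned.
first_only = soup.find_all('a', limit=1)

# recursive=False inspects direct children only; the a tags sit inside p,
# so searching from the document root finds nothing.
direct_from_root = soup.find_all('a', recursive=False)
direct_from_p = soup.p.find_all('a', recursive=False)

print([a['id'] for a in by_class])
print([str(t) for t in texts])
print(len(first_only), len(direct_from_root), len(direct_from_p))
```

The recursive=False case shows that the starting tag matters: the same call returns nothing from the document root but both links from the enclosing p tag.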
Code:
from bs4 import BeautifulSoup
import re

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
soup = BeautifulSoup(demo, "html.parser")
# Case 1: find all a tags and print them
print('case 1:\n', soup.find_all('a'), '\n')
# Case 2: find all a and b tags and print them
print('case 2:\n', soup.find_all(['a', 'b']), '\n')
# Case 3: print the name of every tag (True matches any tag)
print('case 3:')
for tag in soup.find_all(True):
    print(tag.name)
print('\ncase 4:')
# Case 4: print the names of all tags that start with the letter b.
# bs4 applies the pattern with search(), so the ^ anchor is required;
# a bare r'b' would match any name that merely contains b.
for tag in soup.find_all(re.compile(r'^b')):
    print(tag.name)
print('\ncase 5:')
# Case 5: a string passed as the second argument filters by CSS class.
# No b tag has the class "py1", so the result is an empty list.
print(soup.find_all('b', 'py1'), '\n')
# Case 6: matching on the id attribute is exact; the string must be identical,
# otherwise a regular expression is needed
print('case 6:')
print(soup.find_all(id = 'link1'), '\n')
print(soup.find_all(id = 'link'), '\n') # returns an empty list
print(soup.find_all(id = re.compile(r'link')))
"""
Output:
case 1:
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
case 2:
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
case 3:
html
head
title
body
p
b
p
a
a
case 4:
body
b
case 5:
[]
case 6:
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>]
[]
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic
Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
"""
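[Note] All images in this article come from the courseware of the Beijing Institute of Technology online open course "Python Web Crawler and Information Extraction" (《Python网络爬虫与信息提取》).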