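Chapter 9: bs4 (Part 2) and the Weather Information Case Study
1. The select() method
We can also extract data with CSS selectors, which requires knowing some CSS syntax. For details, see the CSS selector reference manual.
Sample code (all later snippets build on this code):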
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
1.1 Selecting by tag name
# Extract and print all a tags
print(soup.select('a'))
The result is a list of all the a tags; iterate over the list to pull out the elements you need.
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
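Iterating over the returned list might look like the sketch below. It is a minimal, self-contained example: the HTML is a shortened copy of the sample document, the variable name `links` is made up for illustration, and the built-in `html.parser` is swapped in for `lxml` so that the snippet runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (only the three links are kept).
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
# html.parser is used here instead of lxml (an assumption for portability).
soup = BeautifulSoup(html_doc, "html.parser")

# select() returns a list of Tag objects, so a plain for loop works.
links = []
for a in soup.select("a"):
    # get_text() gives the link text, a["href"] gives the attribute value.
    links.append((a.get_text(), a["href"]))

for name, href in links:
    print(name, href)
```

Each item in the list is an ordinary Tag, so everything from the previous chapter (attribute access, get_text(), and so on) still applies.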
1.2 Selecting by class name
For example, suppose we want to find all the a tags whose class name is "sister".
# Extract and print all tags with class="sister"
print(soup.select('.sister'))
The result is returned as a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
1.3 Selecting by id
# Select by id, using the #id syntax
print(soup.select('#link1'))
Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
1.4 Combined selectors
Here we extract all the a tags that are direct children of a p tag.
# Combine tag selectors
print(soup.select('p>a'))
The result is returned as a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Extract the tag with id=link1 inside a p tag
print(soup.select('p #link1'))
Result:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Select all p elements and all a elements
# Select all p elements and a elements
print(soup.select('p,a'))
The result contains every p tag and every a tag, returned as a list:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
Select all a elements inside p elements
# Select all a elements nested anywhere inside a p element
print(soup.select('p a'))
The result is returned as a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
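In the sample document 'p>a' and 'p a' return the same list, because every a tag is a direct child of a p tag. The sketch below uses a small made-up HTML fragment (not part of the sample document) in which one link is nested one level deeper, to show where the child combinator and the descendant combinator differ. The built-in `html.parser` is used instead of `lxml` so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <a> is a direct child of <p>,
# the other sits inside a <span> within the same <p>.
html = (
    '<p>'
    '<a id="direct">direct</a>'
    '<span><a id="nested">nested</a></span>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# "p > a": only <a> tags whose parent is a <p>.
child_ids = [a["id"] for a in soup.select("p > a")]

# "p a": every <a> tag anywhere below a <p>.
descendant_ids = [a["id"] for a in soup.select("p a")]

print(child_ids)       # ['direct']
print(descendant_ids)  # ['direct', 'nested']
```

When the nesting depth of a page is not known in advance, the descendant form is usually the more forgiving choice; the child form is useful when deeper matches should be excluded.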
Select all elements that have an href attribute
# Select all elements that have an href attribute
print(soup.select('[href]'))
The result is returned as a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Select all elements whose class attribute equals "sister"
print(soup.select('[class=sister]'))
The result is returned as a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
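For the sample document '.sister' and '[class=sister]' give the same result, but the two are not interchangeable once an element carries several classes, as with the class="l square" cells in the case study later on. The sketch below is an illustration on a made-up two-cell table, parsed with the built-in `html.parser` rather than `lxml`; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first cell has two classes, the second has one.
html = '<table><tr><td class="l square">A</td><td class="l">B</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Return the text of every element matched by the selector."""
    return [tag.get_text() for tag in soup.select(selector)]

# .square matches any element that has "square" among its classes.
by_class = texts(".square")

# [class=square] compares against the whole attribute value ("l square"),
# so it does not match the first cell.
by_exact_attr = texts("[class=square]")

# [class~=square] matches one word of a space-separated attribute value.
by_word_attr = texts("[class~=square]")

# Quoting the full value matches the attribute exactly.
by_full_value = texts('[class="l square"]')

print(by_class, by_exact_attr, by_word_attr, by_full_value)
```

For class lookups the dot form is generally the safer default; the attribute form is more useful for attributes such as href or target.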
# Get the text from the tag whose class is "title"
print(soup.select('.title')[0].get_text())
Because soup.select('.title') returns a list, we first have to take the element out of the list and then call .get_text() on it.
The Dormouse's story
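When only the first match is needed, Beautiful Soup also provides select_one(), which returns a single tag (or None when nothing matches) instead of a list. The sketch below is a minimal illustration on a one-line copy of the title paragraph; the '.subtitle' selector is a made-up example of a selector that matches nothing, and `html.parser` is used instead of `lxml` to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# One-line copy of the title paragraph from the sample document.
html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# select_one() returns the first matching Tag, so no [0] indexing is needed.
title = soup.select_one(".title")
title_text = title.get_text() if title is not None else ""
print(title_text)

# A selector with no match yields None rather than raising an error,
# which is why the None check above is worth keeping.
missing = soup.select_one(".subtitle")
print(missing)
```

By contrast, soup.select('.subtitle')[0] would raise an IndexError on the empty list, so the choice between the two forms also decides how a missing element shows up.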
1.5 Case study
Let's rework the example from the previous lesson using the select() method.
from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href=