How to use CSS selectors:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
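soup.select('p')  # pass a CSS selector as a string, for example 'p'
Pass a CSS selector string to the .select() method of a BeautifulSoup object; the matching elements are returned as a list.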
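As a minimal sketch of the list behaviour described above, the snippet below runs .select() on a small made-up HTML string (the markup and variable names are assumptions for illustration, not part of the original example). It also uses select_one(), which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = '<div><p class="a">one</p><p>two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

matches = soup.select('p')                   # every <p>, in document order
texts = [tag.get_text() for tag in matches]
print(texts)                                 # ['one', 'two']

missing = soup.select('span')                # no match gives an empty list, not None
print(len(missing))                          # 0

first = soup.select_one('p')                 # first match only, or None if nothing matches
print(first.get_text())                      # one
```

Because the result is a list, it can be indexed, sliced, and looped over like any other list.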
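Basic CSS syntax
Element selector:
Selects document elements directly by tag name,
for example head or p.
Class selector:
Matches on an element's class attribute. In <h1 class="important">,
the class name is important.
.important selects every element that has this class.
It can be combined with an element selector, for example p.important.
ID selector:
Matches on an element's id attribute. In <h1 id="intro">,
the id is intro.
#intro selects the element whose id is intro.
It can be combined with an element selector, for example p#intro.
Attribute selector:
Selects elements that have a given attribute, whatever its value.
*[title] selects every element that has a title attribute.
a[href] selects every anchor element that has an href attribute.
Several attributes can be required at once, for example a[href][title]; an element must have all of them to match.
To require a specific value: a[href="www.so.com"]
Descendant (contains) selector:
Selects elements that are descendants of another element, at any depth.
To select em elements inside an h1 element: h1 em
Child selector:
Restricts the match to direct children.
To select strong elements that are direct children of an h1 element: h1 > strong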
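The selector types listed above can be tried out on a small document. The sketch below is only an illustration: the markup and the names() helper are made up for this purpose, and only the selectors themselves come from the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written for this illustration only.
html = '''
<h1 id="intro" class="important">Title <em>one</em> <span><strong>deep</strong></span></h1>
<p class="important">First</p>
<p id="note">Second <a href="www.so.com" title="so">link</a> <a href="/x">other</a></p>
'''
soup = BeautifulSoup(html, 'html.parser')

def names(selector):
    """Return the tag names matched by a selector, in document order."""
    return [tag.name for tag in soup.select(selector)]

print(names('.important'))            # class selector: ['h1', 'p']
print(names('p.important'))           # element + class: ['p']
print(names('#intro'))                # id selector: ['h1']
print(names('a[href]'))               # attribute present: ['a', 'a']
print(names('a[href][title]'))        # both attributes required: ['a']
print(names('a[href="www.so.com"]'))  # exact attribute value: ['a']
print(names('h1 em'))                 # descendant: ['em']
print(names('h1 strong'))             # descendant at any depth: ['strong']
print(names('h1 > strong'))           # strong is a grandchild of h1, so: []
```

The last two calls show the difference between the descendant and child selectors: the strong element sits inside a span, so it is a descendant of h1 but not a direct child.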
For details, see the following table:
Example
test.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>hjk</title>
</head><body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" title="12" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
Parsing the page
from bs4 import BeautifulSoup
with open('test.html', encoding='utf-8') as f:  # close the file once parsing is done
    soup = BeautifulSoup(f, 'html.parser')
1. Finding by element tag
1.
print(soup.select('title')) # select all title tags
print(soup.select('p')) # select all p tags
print(soup.select('p')[0]) # select the first p tag
# Output:
[<title>hjk</title>]
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
2.
print(soup.select('p a')) # find a tags that are descendants of a p tag
print(soup.select('body a')) # find a tags that are descendants of the body tag
# Output
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
3.
print(soup.select('body > a')) # find a tags that are direct children of body
print(soup.select('p > #link1')) # find the tag with id='link1' that is a direct child of a p tag
# Output
[]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>]
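body > a matches only direct children of body; every a tag in this page sits inside a p tag, so the first result is an empty list.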
4.
print(soup.select('#link1 ~ .sister')) # find all later siblings of the id='link1' tag that have class='sister'
print(soup.select('#link1 + .sister')) # find the immediately following sibling of the id='link1' tag, if it has class='sister'
# Output
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
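Both sibling combinators only look forward in the document, which is easy to miss when starting from the first link. The sketch below uses a trimmed-down copy of the three links (the ids() helper is a hypothetical convenience for this illustration) and starts from the second and third links instead.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the three sister links, for illustration only.
html = '''<p class="story">
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
</p>'''
soup = BeautifulSoup(html, 'html.parser')

def ids(selector):
    """Return the id of every tag matched by the selector."""
    return [tag['id'] for tag in soup.select(selector)]

print(ids('#link2 ~ .sister'))   # only siblings after link2: ['link3']
print(ids('#link2 + .sister'))   # the next sibling element: ['link3']
print(ids('#link3 ~ .sister'))   # nothing follows link3: []
print(ids('#link1 + #link3'))    # link3 is not directly after link1: []
```

Text between the tags (the comma and "and") does not count as a sibling here; only element nodes are considered.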
2. Finding by CSS class name
print(soup.select('.sister')) # get all tags whose class is sister
print(soup.select('p.title')) # get p tags whose class is title
# Output
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
3. Finding by a tag's id attribute
print(soup.select('#link1')) # find the tag with id='link1'
print(soup.select('#link1,#link2')) # find all tags whose id is link1 or link2
# Output
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
4. Finding by whether an attribute exists
print(soup.select('a[href]')) # find a tags that have an href attribute
# Output
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
5. Finding by attribute value
print(soup.select('a[href="http://example.com/elsie"]')) # find a tags whose href is exactly "http://example.com/elsie"
print(soup.select('a[href^="http://example.com/"]')) # find a tags whose href starts with "http://example.com/"
print(soup.select('a[href$="tillie"]')) # find a tags whose href ends with "tillie"
print(soup.select('a[href*=".com/el"]')) # find a tags whose href contains the string ".com/el"
# Output
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>]
6. Searching level by level
Atag = soup.select('p')[1] # the second p tag
Btag = Atag.select('[title="12"]') # search only inside that p tag
print(Btag)
# Output
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"><!-- Elsie --></a>]
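The level-by-level search above works because .select() is also available on individual tags, where it searches only that tag's descendants. As a rough sketch, assuming a shortened copy of the page (the markup and the link_ids dictionary are made up for illustration), the same idea can be applied in a loop:

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page, for illustration only.
html = '''
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a id="link1" title="12" href="http://example.com/elsie">Elsie</a>
<a id="link2" href="http://example.com/lacie">Lacie</a></p>
'''
soup = BeautifulSoup(html, 'html.parser')

# For each <p>, collect the ids of the links it contains.
link_ids = {}
for p in soup.select('p'):
    # p['class'] is a list of class names; take the first as the key.
    link_ids[p['class'][0]] = [a['id'] for a in p.select('a')]
print(link_ids)   # {'title': [], 'story': ['link1', 'link2']}

# The two-step search gives the same tags as one combined descendant selector.
two_step = soup.select('p')[1].select('[title="12"]')
one_step = soup.select('p.story [title="12"]')
print(two_step == one_step)   # True
```

Searching in two steps is mainly useful when the intermediate tag is needed for something else as well; otherwise a single descendant selector is shorter.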
7. Getting attributes
a = soup.select('p #link2')
print(a[0].attrs['href'])
# Output
http://example.com/lacie
8. Getting text
print(a[0].string)
# Output
Lacie
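Two details of reading attributes and text are worth a quick illustration: .get() returns None for a missing attribute instead of raising KeyError, and .string is None when a tag has more than one child, in which case get_text() is the usual alternative. The sketch below assumes a one-line variant of the story paragraph, made up for this purpose.

```python
from bs4 import BeautifulSoup

# One-line variant of the story paragraph, for illustration only.
html = '<p class="story">Names: <a href="http://example.com/lacie" id="link2">Lacie</a> and others.</p>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.select_one('#link2')
href = link.get('href')       # same value as link.attrs['href']
title = link.get('title')     # attribute is missing, so this is None
print(href)                   # http://example.com/lacie
print(title)                  # None

p = soup.select_one('p.story')
print(link.string)            # Lacie
print(p.string)               # None, because <p> has several children
print(p.get_text())           # Names: Lacie and others.
```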