CSS Selectors for Web Scraping

Latest recommended article published on 2024-05-11 15:57:16

CarisePem

Category: Web scraping. Tags: web scraping, selectors

Original link: https://blog.csdn.net/Yk_0311/article/details/82708488

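How to use CSS selectors:

from bs4 import BeautifulSoup

html = '<p>Hello</p>'  # placeholder: any HTML string
soup = BeautifulSoup(html, 'html.parser')
soup.select('p')  # pass the selector as a string

Pass a CSS selector string to the .select() method of a BeautifulSoup object; the matches are returned as a list.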
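
A minimal sketch of the call shape, assuming a small inline HTML string (the `html` value below is made up for illustration). It also uses `select_one()`, which the text above does not mention: it returns the first match, or None, instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = '<div><p class="intro">first</p><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

matches = soup.select('p')       # list of every matching tag
first = soup.select_one('p')     # first match only, or None
missing = soup.select('table')   # nothing matches, so the list is empty

print(len(matches))      # 2
print(first.get_text())  # first
print(len(missing))      # 0
```

Because an unmatched selector gives an empty list rather than an error, it is worth checking the length before indexing with `[0]`.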

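Basic CSS syntax

Element selector:
Selects document elements directly by tag name
For example: head, p
Class selector:
Matches on an element's class attribute, for example <h1 class="important">
Here the class name is important
.important selects every element that has this class
It can be combined with an element selector, for example p.important
ID selector:
Matches on an element's id attribute, for example <h1 id="intro">
Here the id is intro
#intro selects the element whose id is intro
It can be combined with an element selector, for example p#intro
Attribute selector:
Selects elements that have a given attribute, whatever its value.
*[title] selects every element that has a title attribute
a[href] selects every anchor element that has an href attribute
Several attributes can be required at once, for example a[href][title]; the element must have all of them.
To restrict the value: a[href="www.so.com"]
Descendant (contextual) selector:
Selects elements that are descendants of a given element, at any depth
To select em elements inside h1 elements: h1 em
Child selector:
Restricts the match to direct children
To select strong elements that are direct children of h1 elements: h1 > strong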
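
The selector types listed above can be tried side by side. The following is only a sketch: the HTML snippet is hypothetical and was written so that each selector matches a different number of elements.

```python
from bs4 import BeautifulSoup

# Hypothetical document, not taken from the article.
html = '''
<h1 id="intro" class="important">Title <em>one</em></h1>
<p class="important">text <span><strong>deep</strong></span></p>
<h1><strong>child</strong> <span><em>two</em></span></h1>
<a href="www.so.com" title="so">so</a>
<a href="www.example.com">no title</a>
'''
soup = BeautifulSoup(html, 'html.parser')

selectors = [
    'p',                     # element selector
    '.important',            # class selector
    'p.important',           # element + class
    '#intro',                # ID selector
    'a[href]',               # attribute is present
    'a[href][title]',        # both attributes are present
    'a[href="www.so.com"]',  # attribute has this exact value
    'h1 em',                 # descendant at any depth
    'h1 > strong',           # direct child only
]
counts = {sel: len(soup.select(sel)) for sel in selectors}

for sel, n in counts.items():
    print(f'{sel!r}: {n} match(es)')
```

Note that `h1 em` matches both em elements, including the one nested inside a span, while `h1 > strong` matches only the strong that sits directly under an h1.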

See the following table for details:

Example

test.html

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>hjk</title>
</head>

<body>
<p class="title">The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" title="12" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
</body>
</html>

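Parsing the page

from bs4 import BeautifulSoup

# Open the file explicitly so that it is closed after parsing
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')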

1. Finding by element tag

print(soup.select('title'))  # select all title tags
print(soup.select('p'))      # select all p tags
print(soup.select('p')[0])   # select the first p tag

# Output:
[<title>hjk</title>]
[<p class="title">The Dormouse's story</p>, <p>Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p>...</p>]
<p class="title">The Dormouse's story</p>

print(soup.select('p a'))     # a tags that are descendants of a p tag
print(soup.select('body a'))  # a tags that are descendants of the body tag

# Output:
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('body > a'))    # a tags that are direct children of body
print(soup.select('p > #link1'))  # the direct child of a p tag whose id is link1

# Output:
[]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>]

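body > a matches only direct children of body. In test.html every a tag sits inside a p tag, so the first result is an empty list.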
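
To make the difference between the two combinators concrete, here is a small sketch with a hypothetical snippet in which one a tag is a direct child of body and another is nested inside a p tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one link directly under body, one nested in a p tag.
html = '<body><a id="top">direct</a><p><a id="nested">inside p</a></p></body>'
soup = BeautifulSoup(html, 'html.parser')

# Descendant combinator (space): any depth below body.
descendants = [a['id'] for a in soup.select('body a')]
# Child combinator (>): direct children of body only.
children = [a['id'] for a in soup.select('body > a')]

print(descendants)  # ['top', 'nested']
print(children)     # ['top']
```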

print(soup.select('#link1 ~ .sister'))  # all later siblings of id='link1' that have class='sister'
print(soup.select('#link1 + .sister'))  # the sibling immediately after id='link1', if it has class='sister'

# Output:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
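
The two sibling combinators are easy to mix up, so here is a hedged sketch on a made-up snippet (the ids a1 to a3 and the class s are invented for this example). It shows that `~` only looks at siblings that come later, and that `+` fails when some other element sits in between.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: a b tag sits between the first and second link.
html = ('<p><a id="a1" class="s">1</a><b>x</b>'
        '<a id="a2" class="s">2</a><a id="a3" class="s">3</a></p>')
soup = BeautifulSoup(html, 'html.parser')

def ids(selector):
    """Return the id of every tag matched by the selector."""
    return [tag['id'] for tag in soup.select(selector)]

later_of_a1 = ids('#a1 ~ .s')  # every later sibling with class s
next_of_a1 = ids('#a1 + .s')   # next element is <b>, so nothing matches
later_of_a2 = ids('#a2 ~ .s')  # a1 comes earlier, so it is not included
next_of_a2 = ids('#a2 + .s')   # a3 follows a2 directly

print(later_of_a1, next_of_a1, later_of_a2, next_of_a2)
```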

2. Finding by CSS class name

print(soup.select('.sister'))  # all tags whose class is sister
print(soup.select('p.title'))  # p tags whose class is title

# Output:
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<p class="title">The Dormouse's story</p>]

3. Finding by id attribute

print(soup.select('#link1'))         # the tag whose id is link1
print(soup.select('#link1,#link2'))  # tags whose id is link1 or link2

# Output:
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

4. Finding by whether an attribute exists

print(soup.select('a[href]'))  # a tags that have an href attribute

# Output:
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

5. Finding by attribute value

print(soup.select('a[href="http://example.com/elsie"]'))  # a tags whose href is exactly "http://example.com/elsie"
print(soup.select('a[href^="http://example.com/"]'))      # a tags whose href starts with "http://example.com/"
print(soup.select('a[href$="tillie"]'))                   # a tags whose href ends with "tillie"
print(soup.select('a[href*=".com/el"]'))                  # a tags whose href contains the string ".com/el"

# Output:
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>]

6. Searching step by step through nested tags

Atag = soup.select('p')[1]           # the second p tag
Btag = Atag.select('[title="12"]')   # search only inside that p tag
print(Btag)

# Output:
[<a class="sister" href="http://example.com/elsie" id="link1" title="12"></a>]
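
Calling `.select()` on a tag rather than on the whole soup limits the search to that tag's descendants. The sketch below illustrates this on a hypothetical page with two containers (the ids news and ads are invented for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with the same class used in two separate containers.
html = '''
<div id="news"><a class="item">n1</a><a class="item">n2</a></div>
<div id="ads"><a class="item">ad1</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

news = soup.select_one('#news')  # first narrow down to one container
news_items = [a.get_text() for a in news.select('.item')]  # scoped search
all_items = [a.get_text() for a in soup.select('.item')]   # whole document

print(news_items)  # ['n1', 'n2']
print(all_items)   # ['n1', 'n2', 'ad1']
```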

7. Getting attributes

a = soup.select('p #link2')
print(a[0].attrs['href'])

# Output:
http://example.com/lacie
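
Indexing with `['href']` raises a KeyError when the attribute is absent, so `.get()` is often the safer choice when looping over many tags. A minimal sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: only the first link has a title attribute.
html = '<p><a href="/a" title="12" class="sister big">A</a><a href="/b">B</a></p>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.select('p a')

hrefs = [a['href'] for a in links]         # fine here: every link has href
titles = [a.get('title') for a in links]   # None where title is missing
classes = links[0]['class']                # class is multi-valued, so a list

print(hrefs)    # ['/a', '/b']
print(titles)   # ['12', None]
print(classes)  # ['sister', 'big']
```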

8. Getting text

print(a[0].string)

# Output:
Lacie
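
One caveat worth illustrating: `.string` returns None when a tag is empty or has more than one child, whereas `.get_text()` always returns a string. The snippet below is a hypothetical example, not part of test.html.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: an empty link, a plain-text link, and a link with nested markup.
html = '<p><a id="e"></a><a id="l">Lacie</a><a id="m">Hi <b>there</b></a></p>'
soup = BeautifulSoup(html, 'html.parser')

empty = soup.select_one('#e')
plain = soup.select_one('#l')
mixed = soup.select_one('#m')

print(empty.string, repr(empty.get_text()))  # None ''
print(plain.string, repr(plain.get_text()))  # Lacie 'Lacie'
print(mixed.string, repr(mixed.get_text()))  # None 'Hi there'
```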
