beautifulsoup
以下内容为转载
beautiful soup选择器之CSS选择器
BeautifulSoup支持大部分的CSS选择器,其语法为:向tag或soup对象的.select()方法中传入字符串参数,选择的结果以列表形式返回。
tag.select("string")
BeautifulSoup.select("string")
源代码示例:
html = """ <html> <head> <title>The Dormouse's story</title> </head> <body> <p class="title" name="dromouse"> <b>The Dormouse's story</b> </p> <p class="story"> Once upon a time there were three little sisters; and their names were <a class="mysis" href="http://example.com/elsie" id="link1"> <b>the first b tag<b> Elsie </a>, <a class="mysis" href="http://example.com/lacie" id="link2" myname="kong"> Lacie </a>and <a class="mysis" href="http://example.com/tillie" id="link3"> Tillie </a>;and they lived at the bottom of a well. </p> <p class="story"> myStory <a>the end a tag</a> </p> <a>the p tag sibling</a> </body> </html> """
soup = BeautifulSoup(html,'lxml')
1、通过标签选择
1
2
3
4
5
6
7
8
9
10
11
12
|
# 选择所有title标签
soup.
select
(
"title"
)
# 选择所有p标签中的第三个标签
soup.
select
(
"p:nth-of-type(3)"
)
# 选择body标签下的所有a标签
soup.
select
(
"body a"
)
# 选择body标签下的直接a子标签
soup.
select
(
"body > a"
)
# 选择id=link1后的所有兄弟节点标签
soup.
select
(
"#link1 ~ .mysis"
)
# 选择id=link1后的下一个兄弟节点标签
soup.
select
(
"#link1 + .mysis"
)
|
2、通过类名查找
1
2
|
# 选择a标签,其类属性为mysis的标签
soup.
select
(
"a.mysis"
)
|
3、通过id查找
1
2
|
# 选择a标签,其id属性为link1的标签
soup.
select
(
"a#link1"
)
|
4、通过【属性】查找,当然也适用于class
1
2
3
4
5
6
7
8
9
10
|
# 选择a标签,其属性中存在myname的所有标签
soup.
select
(
"a[myname]"
)
# 选择a标签,其属性href=http://example.com/lacie的所有标签
soup.
select
(
"a[href='http://example.com/lacie']"
)
# 选择a标签,其href属性以http开头
soup.
select
(
'a[href^="http"]'
)
# 选择a标签,其href属性以lacie结尾
soup.
select
(
'a[href$="lacie"]'
)
# 选择a标签,其href属性包含.com
soup.
select
(
'a[href*=".com"]'
)
|
5、tag.select
1
2
3
4
|
# 选择第一个a标签中的b标签的文本内容
atags = soup.
select
(
'a'
)[0]
atags = atags.
select
(
'b'
)[0].get_text()
print atags
|