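The instantiated soup object is essentially a data structure that holds the parsed HTML. The examples below show how to retrieve various kinds of tag objects:
1. Accessing tags by name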
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup.head)
print(soup.title)
print('first p:\n%s' % soup.p)
print('first a:\n%s' % soup.a)
print(soup.title.name)
print(soup.title.string)
Output:
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
first p:
<p class="title"><b>The Dormouse's story</b></p>
first a:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
title
The Dormouse's story
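name and string are two attributes of a tag. string gives the same result as get_text() only when the tag contains a single string, as <title> does here. When a tag has more than one child, string is None, while get_text() concatenates all the text inside the tag.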
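The relationship between string and get_text() can be checked with a minimal sketch. The HTML snippet and variable names below are made up for illustration and are not the html_doc used above.

```python
from bs4 import BeautifulSoup

# A small made-up snippet (an assumption, not the html_doc used above).
snippet = "<div><p>Hello <b>world</b>!</p><span>solo</span></div>"
soup = BeautifulSoup(snippet, "html.parser")

# <span> holds exactly one string, so .string and get_text() agree.
span_string = soup.span.string
span_text = soup.span.get_text()

# <p> holds a string, a <b> tag and another string, so .string is None,
# while get_text() joins every piece of text inside the tag.
p_string = soup.p.string
p_text = soup.p.get_text()

print(span_string, span_text)
print(p_string, p_text)
```

In practice get_text() is the safer choice when a tag may contain nested markup.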
2. Searching for tags
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('head'))
print(soup.find(id='link1'))
print(soup.find(class_='title'))
print(soup.find_all('a'))
Output:
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<p class="title"><b>The Dormouse's story</b></p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
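The fourth call, soup.find_all('a'), returns every <a> tag in the document as a list, whereas find() returns only the first match.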
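One practical consequence of the find()/find_all() difference is how each behaves when nothing matches. The sketch below uses a made-up two-link snippet rather than html_doc.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration (an assumption, not html_doc).
snippet = '<p><a id="link1" href="/one">One</a><a id="link2" href="/two">Two</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("a")              # the first matching tag
missing = soup.find("table")        # None, because there is no <table>
all_links = soup.find_all("a")      # list-like result with every match
no_tables = soup.find_all("table")  # empty result, not None

ids = [a["id"] for a in all_links]
print(first["id"], missing, ids, len(no_tables))
```

Because find() can return None, it is worth checking the result before indexing into it.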
3. Searching with additional attributes:
print(soup.find('a', id='link3'))
print(soup.find('a', 'sister'))  # a bare second argument is matched against the CSS class
print(soup.find('a', class_='sister'))
print(soup.find_all('a', class_='sister'))
Output:
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
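The three ways of passing attribute filters can be compared side by side. This is a minimal sketch on a made-up snippet; the data-kind attribute and the "brother" class are hypothetical and only serve the example.

```python
from bs4 import BeautifulSoup

# Made-up snippet; data-kind and the class names are illustrative assumptions.
snippet = (
    '<a class="sister" id="link1" data-kind="x">Elsie</a>'
    '<a class="brother" id="link2" data-kind="y">Bob</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

# A string passed as the second positional argument filters on CSS class.
by_position = soup.find("a", "brother")
# The same filter written with the class_ keyword.
by_keyword = soup.find("a", class_="brother")
# Attribute names that are not valid Python identifiers go in an attrs dict.
by_attrs = soup.find("a", attrs={"data-kind": "y"})

print(by_position["id"], by_keyword["id"], by_attrs["id"])
```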
4. Reading a tag's attribute values
The links can be extracted like this:
alist = soup.find_all('a')
for x in alist:
    print(x['href'])
Output:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
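x['href'] and x.get('href') return the same value when the attribute exists. They differ when it is missing: x['href'] raises KeyError, while x.get('href') returns None.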
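The difference between the two access styles shows up only when an attribute is absent. A minimal sketch, assuming a made-up snippet in which the second link has no href:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <a> deliberately has no href attribute.
snippet = '<a href="/one">One</a><a>No link</a>'
soup = BeautifulSoup(snippet, "html.parser")
links = soup.find_all("a")

# .get() returns None for a missing attribute instead of raising.
hrefs = [a.get("href") for a in links]

# Subscript access raises KeyError for a missing attribute.
try:
    links[1]["href"]
    raised = False
except KeyError:
    raised = True

print(hrefs, raised)
```

When scraping real pages, where some links may lack an href, get() avoids having to wrap every access in a try block.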
5. Finding tags by their text content
print(soup.find_all('a', string='Elsie'))  # string= was named text= before Beautiful Soup 4.4.0
Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
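Text matching is not limited to exact strings; a compiled regular expression can be passed as well. The sketch below assumes Beautiful Soup 4.4.0 or later (for the string argument) and uses a made-up snippet.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet for illustration (an assumption, not html_doc).
snippet = '<a href="/e">Elsie</a><a href="/l">Lacie</a><a href="/t">Tillie</a>'
soup = BeautifulSoup(snippet, "html.parser")

# Exact match on the tag's string.
exact = soup.find_all("a", string="Elsie")

# Regular-expression match: link texts that start with "L".
starts_with_l = soup.find_all("a", string=re.compile(r"^L"))

exact_names = [a.get_text() for a in exact]
regex_names = [a.get_text() for a in starts_with_l]
print(exact_names, regex_names)
```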
6. Selecting tags with CSS class selectors
print(soup.select('p.title'))
print(soup.select('a.sister'))
Output:
[<p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
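select() accepts other CSS selectors besides class selectors, and select_one() returns just the first match. A minimal sketch on a made-up snippet that loosely mirrors the structure of html_doc:

```python
from bs4 import BeautifulSoup

# Made-up snippet that loosely mirrors html_doc (an assumption).
snippet = (
    '<p class="title"><b>Story</b></p>'
    '<p class="story"><a class="sister" id="link1" href="/e">Elsie</a>'
    '<a class="sister" id="link2" href="/l">Lacie</a></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

by_id = soup.select("a#link2")            # id selector, still returns a list
nested = soup.select("p.story a.sister")  # descendant selector
first = soup.select_one("a.sister")       # first match only
nothing = soup.select_one("table")        # None when nothing matches

print([a["id"] for a in by_id], len(nested), first["id"], nothing)
```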