Note: the code in this article uses the page http://www.pythonscraping.com/pages/page3.html as its example.
1. Fetch the page's HTML and pass it to a BeautifulSoup object.
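import requests
from bs4 import BeautifulSoup
url = 'http://www.pythonscraping.com/pages/page3.html'
response = requests.get(url)
response.raise_for_status()  # stop early if the server returned an HTTP error
# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())  # prints the indented HTML shown below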
<html>
<head>
<style>
img{
width:75px;
}
table{
width:50%;
}
td{
margin:10px;
padding:10px;
}
.wrapper{
width:800px;
}
.excitingNote{
font-style:italic;
font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>
Totally Normal Gifts
</h1>
<div id="content">
Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:
<br/>
123 Main St.
<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.
</p>
</div>
<table id="giftList">
<tr>
<th>
Item Title
</th>
<th>
Description
</th>
<th>
Cost
</th>
<th>
Image
</th>
</tr>
<tr class="gift" id="gift1">
<td>
Vegetable Basket
</td>
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">
Now with super-colorful bell peppers!
</span>
</td>
<td>
$15.00
</td>
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
</tr>
<tr class="gift" id="gift2">
<td>
Russian Nesting Dolls
</td>
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
<span class="excitingNote">
8 entire dolls per set! Octuple the presents!
</span>
</td>
<td>
$10,000.52
</td>
<td>
<img src="../img/gifts/img2.jpg"/>
</td>
</tr>
<tr class="gift" id="gift3">
<td>
Fish Painting
</td>
<td>
If something seems fishy about this painting, it's because it's a fish!
<span class="excitingNote">
Also hand-painted by trained monkeys!
</span>
</td>
<td>
$10,005.00
</td>
<td>
<img src="../img/gifts/img3.jpg"/>
</td>
</tr>
<tr class="gift" id="gift4">
<td>
Dead Parrot
</td>
<td>
This is an ex-parrot!
<span class="excitingNote">
Or maybe he's only resting?
</span>
</td>
<td>
$0.50
</td>
<td>
<img src="../img/gifts/img4.jpg"/>
</td>
</tr>
<tr class="gift" id="gift5">
<td>
Mystery Box
</td>
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
<span class="excitingNote">
Keep your friends guessing!
</span>
</td>
<td>
$1.50
</td>
<td>
<img src="../img/gifts/img6.jpg"/>
</td>
</tr>
</table>
<div id="footer">
© Totally Normal Gifts, Inc.
<br/>
+234 (617) 863-0736
</div>
</div>
</body>
</html>
2. Extract a specific tag, such as h1
print(soup.h1)
<h1>Totally Normal Gifts</h1>
.get_text() strips out all the tags and returns a string containing only the text:
print(soup.h1.get_text())
Totally Normal Gifts
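As a rough illustration of what .get_text() does on a tag that contains nested markup, the sketch below parses one table cell copied from the example page as an inline string (so no network request is needed). The variable names are only illustrative, and the separator/strip arguments are optional extras not used elsewhere in this article.

```python
from bs4 import BeautifulSoup

# One <td> from the example page, pasted inline so the snippet runs offline
html = """<td>
This is an ex-parrot!
<span class="excitingNote">
Or maybe he's only resting?
</span>
</td>"""
soup = BeautifulSoup(html, 'html.parser')

# Default call: tags are removed, but the original line breaks remain
raw_text = soup.td.get_text()

# Optional arguments: strip each text fragment and join them with a space
clean_text = soup.td.get_text(' ', strip=True)

print(repr(raw_text))
print(clean_text)
```

Because get_text() throws the tag structure away, it is usually best called last, after find() or findAll() has narrowed things down to the tag you want.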
3. find() and findAll()
BeautifulSoup defines the two as follows:
findAll(tag,attributes,recursive,text,limit,keywords)
find(tag,attributes,recursive,text,keywords)
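---- Return a list of all h1 heading tags (the tag argument can hold several names if you want to match more than one heading level):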
print(soup.findAll({'h1'}))
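A small sketch of matching several tag names in one call. The HTML string is a made-up stand-in (the example page has no h2), used only so that the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <h2> is invented for this illustration
html = '<h1>Totally Normal Gifts</h1><h2>Gift List</h2><p>Not a heading</p>'
soup = BeautifulSoup(html, 'html.parser')

# A list of tag names matches any of them, in document order
headings = soup.findAll(['h1', 'h2'])
heading_names = [tag.name for tag in headings]

print(heading_names)
```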
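---- attributes is a Python dictionary that maps a tag's attribute names to the values they must have. For example, the following returns every tr tag whose class attribute is gift: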
print(soup.findAll('tr',{'class':'gift'}))
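To make the attribute filter concrete, here is a minimal sketch on a trimmed copy of the gift table (two rows only, pasted inline as an assumption so it runs offline). It reads the id and first cell of each matching row.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the table on the example page
html = """
<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift" id="gift4"><td>Dead Parrot</td><td>$0.50</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Only rows with class="gift" match; the header row has no class and is skipped
gift_rows = soup.findAll('tr', {'class': 'gift'})
gift_ids = [row['id'] for row in gift_rows]
gift_titles = [row.td.get_text() for row in gift_rows]

# The same dictionary form works with find() for a single row
parrot_row = soup.find('tr', {'id': 'gift4'})

print(gift_ids)
print(gift_titles)
print(parrot_row.td.get_text())
```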
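---- recursive is a Boolean. If True (the default), findAll searches all descendants, however deeply nested; if False, it looks only at direct children (for the soup object itself, that means the top-level tags of the document).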
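The difference is easiest to see by counting matches. This sketch uses a simplified, hand-written version of the page's div layout (an assumption made so it runs offline), not the full page.

```python
from bs4 import BeautifulSoup

# Simplified layout: one wrapper div containing two nested divs
html = ('<div id="wrapper"><h1>Totally Normal Gifts</h1>'
        '<div id="content"><p>Send us a check.</p></div>'
        '<div id="footer">Footer</div></div>')
soup = BeautifulSoup(html, 'html.parser')

# Default (recursive=True): every div at any depth
all_divs = soup.findAll('div')

# recursive=False: only divs that are direct children of the soup object
top_level_divs = soup.findAll('div', recursive=False)

# The same flag works when searching from inside a tag
wrapper = soup.find('div', {'id': 'wrapper'})
nested_p = wrapper.findAll('p')                    # p is a grandchild of wrapper
direct_p = wrapper.findAll('p', recursive=False)   # so it is not a direct child

print(len(all_divs), len(top_level_divs), len(nested_p), len(direct_p))
```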
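---- text matches against the text content of tags rather than the tags themselves (newer bs4 releases also accept the name string for this argument). For example, to count how many times the exact text 'Totally Normal Gifts' appears on the page: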
name = soup.findAll(text='Totally Normal Gifts')
print(name)
print(len(name))
['Totally Normal Gifts']
1
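Note that a plain string passed as text must match a whole text node exactly, which is why the footer line is not counted above. The sketch below contrasts that with a regular expression, which matches anywhere inside a text node; the inline HTML is a trimmed stand-in for the page, and the regular-expression variant is an addition not used elsewhere in this article.

```python
import re
from bs4 import BeautifulSoup

# Heading and footer text from the example page, pasted inline
html = ('<div><h1>Totally Normal Gifts</h1>'
        '<div id="footer">© Totally Normal Gifts, Inc.</div></div>')
soup = BeautifulSoup(html, 'html.parser')

# Exact match: only the heading's text node qualifies
exact = soup.findAll(text='Totally Normal Gifts')

# Regex match: any text node that contains the phrase
partial = soup.findAll(text=re.compile('Totally Normal Gifts'))

# Each result is a string; .parent gives the tag that holds it
exact_parent = exact[0].parent.name

print(len(exact), len(partial), exact_parent)
```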
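---- limit=x keeps only the first x matches, in document order. find is equivalent to findAll with limit=1, except that it returns the single match itself (or None if there is none) instead of a list.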
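A short sketch of limit, and of how find() relates to findAll(limit=1). The three rows below are a cut-down inline copy of the gift table, assumed here so that the example runs offline.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the gift rows from the example page
html = ('<table id="giftList">'
        '<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>'
        '<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>'
        '<tr class="gift" id="gift3"><td>Fish Painting</td></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')

# Stop after the first two matches
first_two = soup.findAll('tr', {'class': 'gift'}, limit=2)
first_two_ids = [row['id'] for row in first_two]

# findAll(limit=1) returns a one-element list; find() returns the tag itself
only_one = soup.findAll('tr', {'class': 'gift'}, limit=1)
first = soup.find('tr', {'class': 'gift'})

# find() returns None when nothing matches, so check before using the result
missing = soup.find('tr', {'class': 'no-such-class'})

print(first_two_ids, only_one[0]['id'], first['id'], missing)
```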
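---- keywords is largely redundant with the parameters above and is not covered here.
4. Working with children and other descendants (.children)
Find the children of the first tr tag: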
name = soup.find('tr').children
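.children is an iterator, so it has to be looped over (or passed to list()) before anything can be printed. The sketch below does that on a compact inline copy of the table and, for comparison, also counts tags through .descendants, which this article does not otherwise cover. On the real page the whitespace between tags shows up as extra text nodes, hence the isinstance filter.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact copy of the table (no whitespace between tags)
html = ('<table id="giftList">'
        '<tr><th>Item Title</th><th>Description</th><th>Cost</th><th>Image</th></tr>'
        '<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')

# Direct children of the first row: its four <th> cells
first_row = soup.find('tr')
header_cells = [child.get_text() for child in first_row.children
                if isinstance(child, Tag)]

# Children are one level deep; descendants go all the way down
table = soup.find('table', {'id': 'giftList'})
child_tags = [c for c in table.children if isinstance(c, Tag)]
descendant_tags = [d for d in table.descendants if isinstance(d, Tag)]

print(header_cells)
print(len(child_tags), len(descendant_tags))
```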
5. Working with sibling tags (.next_siblings) (.previous_siblings)
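A minimal sketch of both sibling iterators, again on a compact inline copy of the table rather than the live page. A tag is never its own sibling, so starting from the header row is a convenient way to get only the data rows.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact copy of the table: one header row and two gift rows
html = ('<table id="giftList">'
        '<tr><th>Item Title</th><th>Cost</th></tr>'
        '<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>'
        '</table>')
soup = BeautifulSoup(html, 'html.parser')

# Rows that come after the header row (the header itself is excluded)
header_row = soup.find('tr')
following_ids = [row['id'] for row in header_row.next_siblings
                 if isinstance(row, Tag)]

# Rows that come before the last row, walking backwards towards the header
last_row = soup.find('tr', {'id': 'gift2'})
preceding = [row.get('id', 'header') for row in last_row.previous_siblings
             if isinstance(row, Tag)]

print(following_ids)
print(preceding)
```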
6. Working with parent tags (.parents)
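As a sketch of moving upwards, the snippet below starts at an img tag and looks at its immediate .parent and then at the chain produced by .parents. The inline HTML is a single gift row copied from the page, with the price cell left out for brevity.

```python
from bs4 import BeautifulSoup

# One gift row from the example page (price cell omitted)
html = ('<table id="giftList">'
        '<tr class="gift" id="gift1">'
        '<td>Vegetable Basket</td>'
        '<td><img src="../img/gifts/img1.jpg"/></td>'
        '</tr></table>')
soup = BeautifulSoup(html, 'html.parser')

img = soup.find('img', {'src': '../img/gifts/img1.jpg'})

# .parent is the single enclosing tag
parent_name = img.parent.name

# .parents walks outwards from the nearest ancestor to the document itself
ancestor_names = [ancestor.name for ancestor in img.parents]

# Handy for finding which row an image belongs to
row_id = img.parent.parent['id']

print(parent_name)
print(ancestor_names)
print(row_id)
```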