The HTML used in this post is listed at the end. It has been simplified slightly; the original file is at the link.
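Regular expressions
For the basics of regular expressions, see the earlier post. Crawlers written with Python's built-in urllib and urllib2 libraries usually rely heavily on regular expressions. A compiled regular expression can also be passed to BeautifulSoup wherever a search accepts a string, such as a tag name or an attribute value, which makes locating target elements very flexible.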
import re
from bs4 import BeautifulSoup

# Paste the html string from the HTML section at the end of this post here
bsObj = BeautifulSoup(html, "html.parser")
# Raw string, so the backslashes reach the regex engine unchanged
images = bsObj.findAll("img", {"src": re.compile(r"\./page3_files/img.*\.jpg")})
for image in images:
    print(image["src"])
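This regular expression finds every image whose src starts with ./page3_files/img and ends with .jpg. Strictly speaking, BeautifulSoup tests the pattern with search(), so it matches any src that contains such a substring; add ^ and $ if the whole value must match.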
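To see which paths the pattern accepts without parsing any HTML, the sketch below applies it directly to the src values from the sample page. It uses only the standard re module and calls search(), mirroring how BeautifulSoup applies a compiled pattern to an attribute value. The names src_values and matched are illustrative only.

```python
import re

# The pattern from the example, written as a raw string
pattern = re.compile(r"\./page3_files/img.*\.jpg")

# src values taken from the sample HTML at the end of this post
src_values = [
    "./page3_files/logo.jpg",
    "./page3_files/img1.jpg",
    "./page3_files/img2.jpg",
    "./page3_files/img3.jpg",
    "./page3_files/img4.jpg",
    "./page3_files/img6.jpg",
]

# search() succeeds if the pattern occurs anywhere in the string
matched = [src for src in src_values if pattern.search(src)]
print(matched)
```

The logo is rejected because its file name does not begin with img, while the five product images pass.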
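Note that, as mentioned earlier,
bsObj.findAll(attrs={"class": "green"})
can be replaced by
bsObj.findAll(class_="green")
The trailing underscore exists only because class is a reserved word in Python. It is not a general convention for attribute keywords, so when a regular expression is written this way
images = bsObj.findAll("img", src_=re.compile(r"\./page3_files/img.*\.jpg"))
nothing is matched: BeautifulSoup looks for an attribute literally named src_.
The fix is to drop the underscore and pass the attribute name as an ordinary keyword argument:
images = bsObj.findAll("img", src=re.compile(r"\./page3_files/img.*\.jpg"))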
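The sketch below compares the dictionary form, the src_ keyword, and a plain src keyword on a shortened copy of the sample markup. It assumes bs4 is installed; sample_html and the result variable names are made up for illustration, and find_all is the current spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the sample markup
sample_html = """
<div>
  <img src="./page3_files/logo.jpg">
  <img src="./page3_files/img1.jpg">
  <img src="./page3_files/img2.jpg">
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
pattern = re.compile(r"\./page3_files/img.*\.jpg")

# Attribute filter passed as a dictionary
by_dict = soup.find_all("img", {"src": pattern})
# Keyword named src_: filters on an attribute literally called "src_"
by_src_underscore = soup.find_all("img", src_=pattern)
# Keyword named src: filters on the src attribute
by_src = soup.find_all("img", src=pattern)

print(len(by_dict), len(by_src_underscore), len(by_src))
```

Only class needs the underscore form; every other attribute name that is a valid Python identifier can be used as a keyword directly.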
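Lambda expressions
Lambda expressions are part of Python's functional-programming features. They are not hard to understand, so this post does not cover them in depth.
A lambda is essentially a function, but an anonymous one, and it can be passed to another function as an argument. In other words, a function does not have to take only plain values, as in f(x, y); it can also take other functions as inputs, in the form f(g(x), y) or f(g(x), h(x)).
lambda [arg1[, arg2, arg3, ..., argN]]: expression
In a lambda, the parameters come before the colon (there may be several, separated by commas), and the expression to the right of the colon is the return value.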
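A minimal illustration of that syntax, using only the standard library. The names add, prices, cheap, and by_price_desc are invented for the example; the prices are the ones listed in the sample HTML.

```python
# A lambda with two parameters; the expression after the colon is returned
add = lambda a, b: a + b
print(add(2, 3))

# A lambda passed to another function as an argument
prices = [15.00, 10000.52, 10005.00, 0.50, 1.50]
cheap = list(filter(lambda price: price < 20, prices))
print(cheap)

# The same style used as a sort key (highest price first)
by_price_desc = sorted(prices, key=lambda price: -price)
print(by_price_desc)
```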
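BeautifulSoup lets us pass certain kinds of functions as the argument to findAll. The only restriction is that the function must take a tag object as its argument and return a boolean. BeautifulSoup calls the function on each tag object it encounters, keeps the tags for which it returns True, and discards the rest.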
from bs4 import BeautifulSoup

# html is the same string as in the previous example
bsObj = BeautifulSoup(html, "html.parser")
# Keep only the tags that have exactly two attributes
tags = bsObj.findAll(lambda tag: len(tag.attrs) == 2)
for tag in tags:
    print(tag)
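The lambda here takes a tag as its argument and returns whether that tag has exactly two attributes. In the HTML below, this selects the logo img (src and style) and the five gift rows (id and class).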
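Any callable that takes a tag and returns a boolean works the same way, so a longer condition can be moved into a named function. The sketch below assumes bs4 is installed; sample_html and has_two_attrs are hypothetical names, and the markup is a trimmed version of the sample page.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the sample markup
sample_html = """
<div id="wrapper">
  <img src="./page3_files/logo.jpg" style="float:left;">
  <table id="giftList">
    <tr id="gift1" class="gift"><td>Vegetable Basket</td></tr>
    <tr id="gift2" class="gift"><td>Russian Nesting Dolls</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def has_two_attrs(tag):
    """Return True when the tag carries exactly two attributes."""
    return len(tag.attrs) == 2


# A named function and the equivalent lambda used as filters
with_function = soup.find_all(has_two_attrs)
with_lambda = soup.find_all(lambda tag: len(tag.attrs) == 2)

for tag in with_function:
    print(tag.name, sorted(tag.attrs))
```

A named function is easier to test and reuse; the lambda is convenient when the condition fits on one line.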
HTML
html = """
<html>
<head>
</head>
<body>
<div id="wrapper">
<img src="./page3_files/logo.jpg" style="float:left;">
<h1>Totally Normal Gifts</h1>
<table id="giftList">
<tbody>
<tr>
<th>
Item Title
</th>
<th>
Description
</th>
<th>
Cost
</th>
<th>
Image
</th>
</tr>
<tr id="gift1" class="gift">
<td>
Vegetable Basket
</td>
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
<td>
$15.00
</td>
<td>
<img src="./page3_files/img1.jpg">
</td>
</tr>
<tr id="gift2" class="gift">
<td>
Russian Nesting Dolls
</td>
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>
<td>
$10,000.52
</td>
<td>
<img src="./page3_files/img2.jpg">
</td>
</tr>
<tr id="gift3" class="gift">
<td>
Fish Painting
</td>
<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>
<td>
$10,005.00
</td>
<td>
<img src="./page3_files/img3.jpg">
</td>
</tr>
<tr id="gift4" class="gift">
<td>
Dead Parrot
</td>
<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>
<td>
$0.50
</td>
<td>
<img src="./page3_files/img4.jpg">
</td>
</tr>
<tr id="gift5" class="gift">
<td>
Mystery Box
</td>
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>
<td>
$1.50
</td>
<td>
<img src="./page3_files/img6.jpg">
</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
"""