对于简单的网页,正则表达式能够很好的工作,但是当网页稍微复杂,网页元素很多时,正则表达式工作起来可能很麻烦。这个时候如果利用BeautifulSoup这个库会得到意想不到的效果。
下载地址:http://www.crummy.com/software/BeautifulSoup/#Download/
参考文档:BS参考文档
下载下来之后解压,点击setup,然后就可以使用这个库了,如果下载的是最新的4话直接
<span style="font-size:14px;">from bs4 import BeautifulSoup</span>
如果还是不行,就在命令行中安装一次就可以了。
总体功能:实现抓取豆瓣新片榜(re)和北美票房版(bs4)
<span style="font-size:14px;"># -*- encoding:utf-8 -*-
from bs4 import BeautifulSoup
import re
import urllib2
import urllib
#抓取豆瓣排行榜电影
class Douban:
def __init__(self,page):
self.page = page
def getMovies(self,pattern):
mList = re.findall(pattern,self.page)
#print mList
return mList
Url = "http://movie.douban.com/chart"
page = urllib2.urlopen(Url).read().decode("utf-8")
douban = Douban(page)
#film name
pname = r'<img src=".*?" alt="(.*?)" class=""/>'
names = douban.getMovies(pname)
#film brief introduction
pbintro = r'<p class="pl">(.*?)</p>'
bintros = douban.getMovies(pbintro)
#score
pscore = r'<span class="rating_nums">(.*?)</span>'
scores = douban.getMovies(pscore)
#number of people's comments
pnumComment = r'<span class="pl">(.*?)</span>'
numcomments = douban.getMovies(pnumComment)
#新片榜
print r'豆瓣新片榜 · · · · · ·'
i = 0
for items in names:
print names[i],bintros[i],scores[i],numcomments[i]
i += 1
#下面用beautifulSoup来获取北美票房榜
soup = BeautifulSoup(page,"html.parser")
NAranking = soup.find('div',id="ranking").find('ul',id="listCont1")
print "\n\n\n北美票房榜---------------------------------------------------------------"
print soup.find('div',id="ranking").find('ul',id="listCont1").find('li').get_text()
i = 1
money = NAranking.find_all('span')
for item in NAranking.find_all('a'):
print i,item.get_text().strip(),money[i-1].get_text()
i += 1
</span>
效果截图:
哈哈,炫酷吧~不过由代码量可以看到bs4节省了很多的时间,方便易用!