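While browsing this page (http://blog.jobbole.com/29281/), I noticed that the e-books it lists look good.
I wanted to download them, and since I am learning web scraping anyway, below I use lxml and cssselect to fetch them, as a small exercise.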
1. The download function
import urllib2

import lxml.html


def download(url, user_agent='wswp', num_retries=2):
    """Return the HTML of url, or None if the download fails."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        # Retry only on 5xx server errors, which may be transient.
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html
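The retry rule in the function above (try again only when the server answers with a 5xx status, and give up once the retry budget is spent) can be checked without touching the network. The following is only a minimal sketch: `FakeHTTPError`, `make_flaky_fetch` and `download_with_retries` are hypothetical names invented for this illustration, not part of urllib2 or of the original script, and the code is written so that it also runs on Python 3.

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, code):
        Exception.__init__(self, 'HTTP %d' % code)
        self.code = code


def make_flaky_fetch(codes):
    """Return a fetch function that fails once per code in codes, then succeeds."""
    pending = list(codes)

    def fetch(url):
        if pending:
            raise FakeHTTPError(pending.pop(0))
        return '<html>ok</html>'

    return fetch


def download_with_retries(fetch, url, num_retries=2):
    """Call fetch(url); retry on 5xx errors while retries remain, else return None."""
    try:
        return fetch(url)
    except FakeHTTPError as e:
        if num_retries > 0 and 500 <= e.code < 600:
            return download_with_retries(fetch, url, num_retries - 1)
        return None


# Two server errors fit inside the default budget of two retries.
recovered = download_with_retries(make_flaky_fetch([503, 500]), 'http://example.com/')
# Three server errors exhaust the budget.
exhausted = download_with_retries(make_flaky_fetch([503, 503, 503]), 'http://example.com/')
# A 404 is a client error, so it is never retried.
not_found = download_with_retries(make_flaky_fetch([404]), 'http://example.com/')
```

With the default of two retries, a request is attempted at most three times. Client errors such as 404 return None immediately, because repeating the same request would not help.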
2. Scraping the data (note the use of cssselect)
if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once, then walk every link that sits directly
        # inside an item of an ordered list.
        tree = lxml.html.fromstring(html)
        for a in tree.cssselect('ol>li>a'):
            book = a.text_content()
            href = a.get('href')
            print book, href
That completes the scrape.