Scraping e-book titles and links in Python with lxml and cssselect

Latest recommended article published on 2023-04-06 22:27:20

weixin_30691871

Tags: python, web scraping

Original link: http://www.cnblogs.com/bigbrother/p/6545883.html

While browsing this page (http://blog.jobbole.com/29281/), I noticed that the e-books it lists look worth reading.

I wanted to grab them, and since I happen to be learning web scraping, I used lxml and cssselect to pull out the titles and links as a small exercise.

1. The download function

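import lxml.html

# urllib2 (Python 2) was split into urllib.request / urllib.error in Python 3
try:
    from urllib.request import Request, urlopen
    from urllib.error import URLError
except ImportError:
    from urllib2 import Request, urlopen, URLError


def download(url, user_agent='wswp', num_retries=2):
    """Fetch url and return the response body, or None if the download fails."""
    print('Downloading: %s' % url)
    headers = {'User-agent': user_agent}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read()
    except URLError as e:
        print('Download error: %s' % e.reason)
        html = None
        # Retry only on 5xx server errors, which are often temporary
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, user_agent, num_retries - 1)
    return html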
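
The retry rule in download can be tried without touching the network. The following is a minimal sketch, written for Python 3, that separates the retry logic from the actual HTTP call. The names fetch_with_retries and flaky_opener are hypothetical and exist only for this illustration. The stand-in opener fails twice with a 503 and then succeeds.

```python
from urllib.error import HTTPError, URLError


def fetch_with_retries(url, opener, num_retries=2):
    """Call opener(url); retry on HTTP 5xx, return None on any other URLError."""
    try:
        return opener(url)
    except URLError as e:
        # Plain URLError has no 'code'; HTTPError (a subclass) does
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return fetch_with_retries(url, opener, num_retries - 1)
        return None


# Hypothetical stand-in for urlopen: two 503 errors, then a successful body
calls = []


def flaky_opener(url):
    calls.append(url)
    if len(calls) < 3:
        raise HTTPError(url, 503, 'Service Unavailable', None, None)
    return b'<html>ok</html>'


result = fetch_with_retries('http://example.com/', flaky_opener)
print(result, len(calls))
```

With the default of two retries, the opener is called at most three times in total. A 4xx error such as 404 is not retried, since repeating the request would likely fail the same way.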

2. Extracting the data (note how cssselect is used)

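if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once instead of on every loop iteration
        tree = lxml.html.fromstring(html)
        # 'ol > li > a' matches links that are direct children of ordered-list items;
        # iterating the result also includes the first match, which indexing from 1 skipped
        for a in tree.cssselect('ol > li > a'):
            book = a.text_content()
            href = a.get('href')
            print('%s %s' % (book, href))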