Scraping E-book Titles and Links in Python with lxml and cssselect

Latest recommended article published on 2024-10-12 12:26:23

wuhuajun001

Categories: python, study notes. Tags: python, web scraping

Article link: https://blog.csdn.net/wuhuajun001/article/details/61952767

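While browsing this page (http://blog.jobbole.com/29281/), I noticed that the e-books it lists look good.

I wanted to download them, and since I happen to be learning web scraping, I fetch them below with lxml and cssselect as a small exercise.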

1. The download function

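import urllib2  # Python 2 only; Python 3 splits it into urllib.request and urllib.error

import lxml.html


def download(url, user_agent='wswp', num_retries=2):
    """Fetch url and return the page body, or None if the download fails."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        # Retry only on 5xx server errors, which may be transient.
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html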
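The function above targets Python 2, where `urllib2` exists. As a rough sketch of the same retry-on-5xx idea under Python 3, assuming the standard `urllib.request` and `urllib.error` modules, it might look like the following. The `opener` parameter is a hypothetical addition that is not in the original; it is there only so the retry logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2, opener=urllib.request.urlopen):
    """Return the body of url as bytes, or None if the download fails.

    opener is a hypothetical hook (not in the original) for swapping in a
    fake urlopen during testing.
    """
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-agent': user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        # Retry only on 5xx responses, which are often transient.
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

Note that this version returns `bytes`, so the caller may need to decode the result before treating it as text.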

2. Scraping the data (note how cssselect is used)

if __name__ == "__main__":
    url = 'http://blog.jobbole.com/29281/'
    html = download(url)
    if html is not None:
        # Parse the page once, then select every link that is a direct
        # child of a list item in an ordered list.
        tree = lxml.html.fromstring(html)
        for a in tree.cssselect('ol>li>a'):
            book = a.text_content()
            href = a.get('href')
            print book, href
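To make the meaning of the `ol>li>a` selector concrete, here is a small Python 3 sketch that does not use lxml at all: it swaps in the standard library's `html.parser` and tracks the open tags by hand. The class name `OrderedListLinks`, the `VOID_TAGS` set, the sample markup, and the `example.com` URLs are all made up for illustration, and the sketch assumes every `<li>` is explicitly closed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}


class OrderedListLinks(HTMLParser):
    """Collect (text, href) for <a> elements matching the selector ol>li>a."""

    def __init__(self, base_url=''):
        super().__init__()
        self.base_url = base_url
        self.stack = []     # currently open tags, outermost first
        self.links = []     # collected (text, absolute href) pairs
        self._href = None
        self._text = None   # text chunks while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # '>' is the child combinator: <a> directly in <li> directly in <ol>.
        if tag == 'a' and self.stack[-2:] == ['ol', 'li']:
            self._href = dict(attrs).get('href')
            self._text = []
        self.stack.append(tag)

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._text is not None:
            href = urljoin(self.base_url, self._href) if self._href else None
            self.links.append((''.join(self._text).strip(), href))
            self._text = None
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


sample = """
<ol>
  <li><a href="/books/1">Book One</a></li>
  <li><span><a href="/skip">nested, not a direct child</a></span></li>
  <li><a href="http://example.com/books/2">Book <b>Two</b></a></li>
</ol>
<ul><li><a href="/other">not in an ordered list</a></li></ul>
"""

parser = OrderedListLinks(base_url='http://example.com/')
parser.feed(sample)
for title, href in parser.links:
    print(title, href)
```

Only the first and third links are collected: the second sits inside a `<span>` rather than directly under the `<li>`, and the last belongs to a `<ul>`. `urljoin` turns the relative `/books/1` into an absolute URL, which is useful if the links are to be downloaded later.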

That completes the scrape.