Class notes
Basic workflow for using the Scrapy framework
- Create the project: scrapy startproject dushu
- Create the spider: cd dushu && scrapy genspider guoxue www.dushu.com
- Open guoxue.py and start writing the code:
```python
import scrapy

from dushu.items import DushuItem


class GuoxueSpider(scrapy.Spider):
    name = 'guoxue'
    allowed_domains = ['www.dushu.com']
    # Start URL; usually needs to be changed.
    start_urls = ['https://www.dushu.com/book/1617.html']

    def parse(self, response):
        # Find the links to the detail pages
        detail_url_list = response.xpath('//div[@class="book-info"]//h3/a/@href')
        for detail_url in detail_url_list.getall():
            detail = 'https://www.dushu.com' + detail_url
            yield scrapy.Request(url=detail, callback=self.detail_parse)

        # Next-page URLs (pages 2 through 10)
        for i in range(2, 11):
            next_page = 'https://www.dushu.com/book/1617_%d.html' % i
            yield scrapy.Request(url=next_page, callback=self.parse)

    # Parse the content of a detail page
    def detail_parse(self, response):
        book_title = response.xpath('string(//div[@class="book-title"])').extract_first()
        book_img = response.xpath('//div[@class="book-pic"]//img/@src').extract_first()
        price = response.xpath('//p[@class="price"]/span/text()').extract_first()
        author = response.xpath(
            'string(//div[@class="book-details-left"]//table/tbody/tr[1]/td[2])'
        ).extract_first()
        book_brief, author_brief = response.xpath(
            '//div[contains(@class, "txtsummary")]/text()'
        )[:2].extract()
        book_brief, author_brief = book_brief.strip(), author_brief.strip()

        item = DushuItem()
        item['book_title'] = book_title
        item['book_img'] = book_img
        item['price'] = price
        item['author'] = author
        item['book_brief'] = book_brief
        item['author_brief'] = author_brief
        yield item
```
- scrapy shell: an interactive shell (scrapy shell <url>) for debugging extraction code.
- Objects available inside the shell include crawler and spider.