Class notes
Basic workflow for using the Scrapy framework
- Create the project: scrapy startproject dushu
- Create the spider: cd dushu && scrapy genspider guoxue www.dushu.com
- Open guoxue.py and start writing the code:
```python
import scrapy

from dushu.items import DushuItem


class GuoxueSpider(scrapy.Spider):
    name = 'guoxue'
    allowed_domains = ['www.dushu.com']
    # Start URL; usually needs to be changed.
    start_urls = ['https://www.dushu.com/book/1617.html']

    def parse(self, response):
        # Find the links to the detail pages
        detail_url_list = response.xpath('//div[@class="book-info"]//h3/a/@href')
        for detail_url in detail_url_list.getall():
            detail = 'https://www.dushu.com' + detail_url
            yield scrapy.Request(url=detail, callback=self.detail_parse)

        # Next-page URLs (pages 2 through 10)
        for i in range(2, 11):
            next_page = 'https://www.dushu.com/book/1617_%d.html' % i
            yield scrapy.Request(url=next_page, callback=self.parse)

    # Parse the content of a detail page
    def detail_parse(self, response):
        book_title = response.xpath('string(//div[@class="book-title"])').extract_first()
        book_img = response.xpath('//div[@class="book-pic"]//img/@src').extract_first()
        price = response.xpath('//p[@class="price"]/span/text()').extract_first()
        author = response.xpath(
            'string(//div[@class="book-details-left"]//table/tbody/tr[1]/td[2])'
        ).extract_first()
        book_brief, author_brief = response.xpath(
            '//div[contains(@class, "txtsummary")]/text()'
        )[:2].extract()
        book_brief, author_brief = book_brief.strip(), author_brief.strip()

        item = DushuItem()
        item['book_title'] = book_title
        item['book_img'] = book_img
        item['price'] = price
        item['author'] = author
        item['book_brief'] = book_brief
        item['author_brief'] = author_brief
        yield item
```
- scrapy shell: an interactive shell (scrapy shell <url>) for debugging extraction code.
- Objects available inside the shell include crawler and spider.