Creating your first Scrapy spider project: scraping the Douban Books Top 250
I. In the previous post we covered installing and getting started with Scrapy.
II. Creating the project
Open cmd.
Type F: to switch to the F drive (then cd into the directory where you want the project, e.g. cd xxxxx\xxxx).
Type scrapy startproject douban_book to create a spider project in that directory.
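After this command finishes, Scrapy generates a project skeleton roughly like the one below (shown for a typical Scrapy 1.x release; minor files may vary by version):

```
douban_book/
    scrapy.cfg            # deploy/configuration file
    douban_book/
        __init__.py
        items.py          # item definitions (edited in the next step)
        pipelines.py
        settings.py
        spiders/          # spider code goes here
            __init__.py
```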
III. Scraping the Douban Books Top 250
1. In items.py, define the fields we want to scrape:
import scrapy

class DoubanBookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()          # book title
    price = scrapy.Field()
    publisher = scrapy.Field()
    ratings = scrapy.Field()       # rating score
    edition_year = scrapy.Field()  # year of publication
    author = scrapy.Field()
2. Create a file named bookspider.py and write our spider code in it:
# -*- coding: utf-8 -*-
import sys
# make the project package importable before importing from it
sys.path.append(r"F:\My_PyPro\ScrapyProject\douban_book")

import scrapy
from douban_book.items import DoubanBookItem


class BookSpider(scrapy.Spider):
    """Spider for the Douban Books Top 250 list."""
    name = 'douban-book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        # parse the first page itself, then follow the pagination links
        yield scrapy.Request(response.url, callback=self.parse_page)
        for page in response.xpath('//div[@class="paginator"]/a'):
            link = page.xpath('@href').extract()[0]
            yield scrapy.Request(link, callback=self.parse_page)

    def parse_page(self, response):
        for item in response.xpath('//tr[@class="item"]'):
            book = DoubanBookItem()
            book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
            book['ratings'] = item.xpath('td[2]/div[2]/span[@class="rating_nums"]/text()').extract()[0]
            # book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
            # the info line looks like "author / publisher / year / price"
            book_info = item.xpath('td[2]/p[1]/text()').extract()[0]
            book_info_contents = book_info.strip().split(' / ')
            book['author'] = book_info_contents[0]
            book['publisher'] = book_info_contents[1]
            book['edition_year'] = book_info_contents[2]
            book['price'] = book_info_contents[3]
            yield book
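The parse_page logic hinges on splitting the info line on " / ". Here is a standalone sketch of just that parsing step (the sample string is hypothetical, modeled on how Douban formats the line):

```python
# Hypothetical info line as it might appear on the Top 250 page,
# with the surrounding whitespace the real page includes:
book_info = "  [美] 卡勒德·胡赛尼 / 上海人民出版社 / 2006-5 / 29.00元\n"

# Same cleanup as in parse_page: strip whitespace, split on " / "
parts = book_info.strip().split(' / ')
author, publisher, edition_year, price = parts

print(author)        # [美] 卡勒德·胡赛尼
print(publisher)     # 上海人民出版社
print(edition_year)  # 2006-5
print(price)         # 29.00元
```

Note that some entries carry an extra " / "-separated field (e.g. a translator), so the split can yield more than four parts; the spider above only reads indices 0 through 3, so extra fields are silently ignored, while a missing field would raise IndexError.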
3. Create a main.py file and write the following code in it:
from scrapy.cmdline import execute
execute("scrapy crawl douban-book -o bookInfo.csv".split())
4. Run main.py and you can see the scraped results in bookInfo.csv.
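To sanity-check the export, the CSV can be read back with the standard library's csv module. The rows below are made-up placeholders standing in for the real file (and the exporter does not guarantee this exact column order):

```python
import csv
import io

# Stand-in for open("bookInfo.csv", encoding="utf-8"); the header row
# uses the field names declared in DoubanBookItem.
sample = io.StringIO(
    "name,ratings,author,publisher,edition_year,price\n"
    "追风筝的人,8.9,[美] 卡勒德·胡赛尼,上海人民出版社,2006-5,29.00元\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    print(row['name'], row['ratings'])
```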
Original author: 冯一川 (ifeng12358@163.com). Please do not repost without the author's permission.