Step 1: Create the project
Usage: scrapy startproject <project_name>
scrapy startproject douban
Step 2: Generate the spider
scrapy genspider spider_name domain
spider_name: the spider's name; set it to douban
domain: the domain the spider is allowed to crawl; set it to douban.com
cd douban
scrapy genspider douban douban.com
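After these two commands, the project layout typically looks like this (a sketch of Scrapy's standard project template; the exact files can vary slightly between Scrapy versions):

```text
douban/
├── scrapy.cfg              # deployment configuration
└── douban/
    ├── __init__.py
    ├── items.py            # item definitions (step 5)
    ├── middlewares.py
    ├── pipelines.py        # item pipelines (step 7)
    ├── settings.py         # project settings (step 3)
    └── spiders/
        ├── __init__.py
        └── douban.py       # the generated spider (step 6)
```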
Step 3: Edit settings.py
Ignore the robots protocol by setting ROBOTSTXT_OBEY to False:
ROBOTSTXT_OBEY = False
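While editing settings.py, the pipeline written in step 7 also has to be enabled here, or Scrapy never calls its process_item method. A minimal settings fragment (the priority value 300 is the conventional default):

```python
# settings.py

# Ignore the robots protocol
ROBOTSTXT_OBEY = False

# Enable the CSV pipeline from step 7 (lower number = higher priority)
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
```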
Step 4: Create a launcher script, run_douban.py
from scrapy.cmdline import execute
execute('scrapy crawl douban'.split())
Step 5: Edit items.py to declare the fields the spider returns
We crawl Douban's book Top 250 at https://book.douban.com/top250, collecting: title (name), author (author), price (price), publisher (poblisher), publication year (edition_year), rating (ratings), review count (comments), and the book's famous one-line quote (Famous_sentence).
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    poblisher = scrapy.Field()
    edition_year = scrapy.Field()
    ratings = scrapy.Field()
    comments = scrapy.Field()
    # The spider also fills this field, so it must be declared here too
    Famous_sentence = scrapy.Field()
Step 6: Edit douban.py
import re

import scrapy
from douban.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/top250?start=0']

    def parse(self, response):
        # Every <tr class="item"> row on the page is one book
        book_list = response.xpath('//div[@class="article"]/div[@class="indent"]/table/tr[@class="item"]')
        for href in book_list:
            # extract_first() returns a string (or None if nothing matched)
            # Book title
            name = href.xpath("td[2]/div[@class='pl2']/a/text()").extract_first().strip()
            # Author / publisher / publication year / price, joined by "/"
            book_inform = href.xpath("td[2]/p[@class='pl']/text()").extract_first()
            book_inform_list = book_inform.strip().split('/ ')
            author = book_inform_list[0]
            # Negative indices stay correct even when the author part
            # contains extra "/" segments (translators, co-authors)
            poblisher = book_inform_list[-3]
            edition_year = book_inform_list[-2]
            price = book_inform_list[-1]
            # Rating
            ratings = href.xpath(
                "td[2]/div[@class='star clearfix']/span[@class='rating_nums']/text()").extract_first()
            # Review count, e.g. "(12345人评价)"
            comments_info = href.xpath("td[2]/div[@class='star clearfix']/span[@class='pl']/text()").extract_first()
            # The book's famous one-line quote
            Famous_sentence = href.xpath("td[2]/p[@class='quote']/span[@class='inq']/text()").extract_first()
            # Pull the digits out of the review-count text
            comments = re.findall(r'\d+', comments_info)[0]
            # Pack the data into a fresh DoubanItem for each book
            book = DoubanItem()
            book['name'] = name
            book['author'] = author
            book['poblisher'] = poblisher
            book['edition_year'] = edition_year
            book['price'] = price
            book['ratings'] = ratings
            book['comments'] = comments
            book['Famous_sentence'] = Famous_sentence
            yield book
        # Queue the remaining pages (start=25, 50, ..., 225); Scrapy's
        # dupefilter drops requests for URLs it has already seen
        base_url = 'https://book.douban.com/top250?start={}'
        get_total_count = 10
        for i in range(1, get_total_count):
            url = base_url.format(i * 25)
            yield scrapy.Request(url, callback=self.parse)
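The string handling inside parse() can be exercised in isolation. The sample strings below are stand-ins for what the page markup returns, not actual scraped output; they show why the split uses negative indices and how the review count is extracted:

```python
import re

# Sample info line as it would appear in <p class="pl">
book_inform = "[清] 曹雪芹 著 / 人民文学出版社 / 1996-12 / 59.70元"
parts = book_inform.strip().split('/ ')
# Negative indices keep publisher/year/price correct even if the
# author segment itself contained extra "/" separators
author = parts[0].strip()
poblisher = parts[-3].strip()
edition_year = parts[-2].strip()
price = parts[-1].strip()
print(author, poblisher, edition_year, price)

# Sample review-count text as it would appear in <span class="pl">
comments_info = "(1234人评价)"
# findall returns every digit run; the first one is the count
comments = re.findall(r'\d+', comments_info)[0]
print(comments)
```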
Step 7: Edit pipelines.py
Store the results in CSV format.
import csv


class DoubanPipeline(object):
    def __init__(self):
        path = 'D://douban_book_Top250.csv'
        # newline='' keeps csv.writer from emitting blank lines on Windows
        self.file = open(path, 'a+', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow(
            (item['name'], item['author'], item['poblisher'], item['edition_year'],
             item['price'], item['ratings'], item['comments'], item['Famous_sentence'])
        )
        return item

    def close_spider(self, spider):
        self.file.close()
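The pipeline's writerow behavior can be sketched in isolation with a temporary file. Everything here, including the sample row, is illustrative rather than real scraped data; it shows the round-trip and why newline='' matters:

```python
import csv
import os
import tempfile

# Sample row in the same column order the pipeline uses
row = ('活着', '余华', '作家出版社', '2012-8', '20.00元', '9.4', '1234', '活着本身就是人生的意义')

path = os.path.join(tempfile.mkdtemp(), 'douban_demo.csv')
# newline='' prevents csv.writer from inserting blank lines on Windows
with open(path, 'a+', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(row)

# Read the file back to confirm the row round-trips intact
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])
```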
Result: the scraped Top 250 is written row by row to douban_book_Top250.csv.
Reference articles:
https://www.v2ex.com/t/503441
https://blog.csdn.net/qq_43391383/article/details/86808069