Step 1: Create the project
Usage: scrapy startproject <project_name>
scrapy startproject douban
Step 2: Generate the spider
scrapy genspider spider_name domain
spider_name: the spider's name; set it to douban
domain: the domain the spider is allowed to crawl; set it to douban.com
cd douban
scrapy genspider douban douban.com
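After these two commands, the project layout typically looks like this (a sketch of Scrapy's standard project template; the exact files can vary slightly between Scrapy versions):

```text
douban/
├── scrapy.cfg              # deployment configuration
└── douban/
    ├── __init__.py
    ├── items.py            # item definitions (step 5)
    ├── middlewares.py
    ├── pipelines.py        # item pipelines (step 7)
    ├── settings.py         # project settings (step 3)
    └── spiders/
        ├── __init__.py
        └── douban.py       # the generated spider (step 6)
```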
Step 3: Edit settings.py
Ignore the robots protocol by setting ROBOTSTXT_OBEY to False:
ROBOTSTXT_OBEY = False
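While editing settings.py, the pipeline written in step 7 also has to be enabled here, or Scrapy never calls its process_item method. A minimal settings fragment (the priority value 300 is the conventional default):

```python
# settings.py

# Ignore the robots protocol
ROBOTSTXT_OBEY = False

# Enable the CSV pipeline from step 7 (lower number = higher priority)
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
```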
Step 4: Create a launcher script, run_douban.py
from scrapy.cmdline import execute
execute('scrapy crawl douban'.split())
Step 5: Edit items.py to declare the fields the spider returns
We crawl Douban's book Top 250 at https://book.douban.com/top250, collecting: title (name), author (author), price (price), publisher (poblisher), publication year (edition_year), rating (ratings), review count (comments), and the book's famous one-line quote (Famous_sentence).
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    poblisher = scrapy.Field()
    edition_year = scrapy.Field()
    ratings = scrapy.Field()
    comments = scrapy.Field()
    # The spider also fills this field, so it must be declared here too
    Famous_sentence = scrapy.Field()
Step 6: Edit douban.py
import re

import scrapy
from douban.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/top250?start=0']

    def parse(self, response):
        # Every <tr class="item"> row on the page is one book
        book_list = response.xpath('//div[@class="article"]/div[@class="indent"]/table/tr[@class="item"]')
        for href in book_list:
            # extract_first() returns a string (or None if nothing matched)
            # Book title
            name = href.xpath("td[2]/div[@class='pl2']/a/text()").extract_first().strip()
            # Author / publisher / publication year / price, joined by "/"
            book_inform = href.xpath("td[2]/p[@class='pl']/text()").extract_first()
            book_inform_list = book_inform.strip().split('/ ')
            author = book_inform_list[0]
            # Negative indices stay correct even when the author part
            # contains extra "/" segments (translators, co-authors)
            poblisher = book_inform_list[-3]
            edition_year = book_inform_list[-2]
            price = book_inform_list[-1]
            # Rating
            ratings = href.xpath(
                "td[2]/div[@class='star clearfix']/span[@class='rating_nums']/text()").extract_first()
            # Review count, e.g. "(12345人评价)"
            comments_info = href.xpath("td[2]/div[@class='star clearfix']/span[@class='pl']/text()").extract_first()
            # The book's famous one-line quote
            Famous_sentence = href.xpath("td[2]/p[@class='quote']/span[@class='inq']/text()").extract_first()
            # Pull the digits out of the review-count text
            comments = re.findall(r'\d+', comments_info)[0]
            # Pack the data into a fresh DoubanItem for each book
            book = DoubanItem()
            book['name'] = name
            book['author'] = author
            book['poblisher'] = poblisher
            book['edition_year'] = edition_year
            book['price'] = price
            book['ratings'] = ratings
            book['comments'] = comments
            book['Famous_sentence'] = Famous_sentence
            yield book
        # Queue the remaining pages (start=25, 50, ..., 225); Scrapy's
        # dupefilter drops requests for URLs it has already seen
        base_url = 'https://book.douban.com/top250?start={}'
        get_total_count = 10
        for i in range(1, get_total_count):
            url = base_url.format(i * 25)
            yield scrapy.Request(url, callback=self.parse)
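The string handling inside parse() can be exercised in isolation. The sample strings below are stand-ins for what the page markup returns, not actual scraped output; they show why the split uses negative indices and how the review count is extracted:

```python
import re

# Sample info line as it would appear in <p class="pl">
book_inform = "[清] 曹雪芹 著 / 人民文学出版社 / 1996-12 / 59.70元"
parts = book_inform.strip().split('/ ')
# Negative indices keep publisher/year/price correct even if the
# author segment itself contained extra "/" separators
author = parts[0].strip()
poblisher = parts[-3].strip()
edition_year = parts[-2].strip()
price = parts[-1].strip()
print(author, poblisher, edition_year, price)

# Sample review-count text as it would appear in <span class="pl">
comments_info = "(1234人评价)"
# findall returns every digit run; the first one is the count
comments = re.findall(r'\d+', comments_info)[0]
print(comments)
```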
Step 7: Edit pipelines.py
Store the results in CSV format.
import csv


class DoubanPipeline(object):
    def __init__(self):
        path = 'D://douban_book_Top250.csv'
        # newline='' keeps csv.writer from emitting blank lines on Windows
        self.file = open(path, 'a+', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow(
            (item['name'], item['author'], item['poblisher'], item['edition_year'],
             item['price'], item['ratings'], item['comments'], item['Famous_sentence'])
        )
        return item

    def close_spider(self, spider):
        self.file.close()
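The pipeline's writerow behavior can be sketched in isolation with a temporary file. Everything here, including the sample row, is illustrative rather than real scraped data; it shows the round-trip and why newline='' matters:

```python
import csv
import os
import tempfile

# Sample row in the same column order the pipeline uses
row = ('活着', '余华', '作家出版社', '2012-8', '20.00元', '9.4', '1234', '活着本身就是人生的意义')

path = os.path.join(tempfile.mkdtemp(), 'douban_demo.csv')
# newline='' prevents csv.writer from inserting blank lines on Windows
with open(path, 'a+', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(row)

# Read the file back to confirm the row round-trips intact
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])
```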
Result: the scraped Top 250 is written row by row to douban_book_Top250.csv.
Reference articles:
https://www.v2ex.com/t/503441
https://blog.csdn.net/qq_43391383/article/details/86808069