Learning Python Web Scraping with Scrapy (Part 2): Crawling the Zongheng Monthly Ticket Ranking

This post covers essentially the same ground as Part 1 and is mainly a practice run; see Part 1 for the finer details.

Project resources

Project link

1. Create the Scrapy Project

scrapy startproject douban
cd douban
scrapy genspider book qidian.com
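After these two commands the project skeleton looks roughly like the following (the exact files can vary slightly between Scrapy versions); items.py, pipelines.py, settings.py and the generated spiders/book.py are the files edited in the steps below.

douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── book.py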

2. Set Up the Data Storage Template

–items.py

import scrapy


class DoubanItem(scrapy.Item):
    # One field per piece of book data to collect.
    title = scrapy.Field()        # book title
    img_addr = scrapy.Field()     # cover image URL
    writer = scrapy.Field()       # author name
    tag = scrapy.Field()          # category tag
    detail_addr = scrapy.Field()  # URL of the book's detail page
    intro = scrapy.Field()        # short description
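As a quick aside (not one of the project files), an item declared this way is filled and read like a dictionary, which is exactly how the spider below uses it:

from douban.items import DoubanItem

item = DoubanItem()
item['title'] = 'Example Book'     # hypothetical sample values
item['writer'] = 'Example Author'
print(item['title'])               # fields are accessed like dict keys
print(dict(item))                  # items convert to plain dicts for export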

3. Write the Spider

Target page: the Zongheng monthly ticket ranking

–book.py

import scrapy
from scrapy import Request
from douban.items import DoubanItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['qidian.com']
    start_urls = ['https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&page=1']
    url_sets = set()  # pagination URLs that have already been scheduled

    def parse(self, response):
        # Each <li> in the book list corresponds to one book.
        book_list = response.xpath("//div[@class='book-img-text']/ul[@class='all-img-list cf']/li")
        for obj in book_list:
            item = DoubanItem()
            item['title'] = obj.xpath("./div[@class='book-mid-info']/h4/a/text()").extract()[0]
            item['img_addr'] = obj.xpath("./div[@class='book-img-box']/a/img/@src").extract()[0]
            item['writer'] = obj.xpath("./div[@class='book-mid-info']/p[@class='author']/a[@class='name']/text()").extract()[0]
            item['tag'] = obj.xpath("./div[@class='book-mid-info']/p[@class='author']/a[2]/text()").extract()[0]
            item['detail_addr'] = obj.xpath("./div[@class='book-mid-info']/h4/a/@href").extract()[0]
            item['intro'] = obj.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract()[0]

            yield item

        # Follow the pagination links; only schedule qidian.com pages we have not seen yet.
        urls = response.xpath("//div[@class='lbf-pagination']/ul/li/a/@href").extract()
        for url in urls:
            url = 'https:' + url
            if url.startswith('https://www.qidian.com'):
                if url not in self.url_sets:
                    self.url_sets.add(url)
                    yield Request(url, callback=self.parse)
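Before relying on these XPath expressions, it is worth checking them interactively in Scrapy's shell. A minimal sketch, using the start URL and the same selectors as parse():

scrapy shell 'https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&page=1'

# Inside the shell, `response` already holds the fetched page.
books = response.xpath("//div[@class='book-img-text']/ul[@class='all-img-list cf']/li")
len(books)                                                    # number of books on this page
books[0].xpath("./div[@class='book-mid-info']/h4/a/text()").extract_first()  # first title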

4. Write the Data-Processing Script

The database uses two tables: one stores the novel information and the other stores the tags, and the two are linked. Once the crawl has finished, a small script tidies the data up so that some simple analysis can be done; that database-cleanup script is also included in the linked resources.
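The crawl itself only writes to the single table targeted by the pipeline below. As a rough sketch of that table (the table name shownovel_book and the column names come from the INSERT statement in pipelines.py; the column types and lengths are assumptions, and the separate tag table used in post-processing is not shown):

import pymysql

# Assumed schema for the table the pipeline inserts into.
connection = pymysql.connect(host='localhost', user='root', password='',
                             database='topnovel', charset='utf8')
try:
    with connection.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS shownovel_book (
                id INT AUTO_INCREMENT PRIMARY KEY,
                title VARCHAR(255),
                img_addr VARCHAR(255),
                writer VARCHAR(255),
                tag VARCHAR(255),
                detail_addr VARCHAR(255),
                intro TEXT
            ) DEFAULT CHARSET=utf8
        """)
    connection.commit()
finally:
    connection.close()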

–pipelines.py

import pymysql


class DoubanPipeline:
    def process_item(self, item, spider):
        # Save the cover image locally (optional, left disabled here).
        # import os, urllib.request
        # url = item['img_addr']
        # req = urllib.request.Request(url)
        # with urllib.request.urlopen(req) as pic:
        #     data = pic.read()
        #     file_name = os.path.join(r'D:\bookpic', item['title'] + '.jpg')
        #     with open(file_name, 'wb') as fp:
        #         fp.write(data)

        # Save the item to the database.
        info = [item['title'], item['img_addr'], item['writer'], item['tag'], item['detail_addr'], item['intro']]
        connection = pymysql.connect(host='localhost', user='root', password='', database='topnovel', charset='utf8')
        try:
            with connection.cursor() as cursor:
                sql = 'insert into shownovel_book (title, img_addr, writer, tag, detail_addr, intro) values (%s, %s, %s, %s, %s, %s)'
                affectedcount = cursor.execute(sql, info)
                print('Inserted {0} row(s)'.format(affectedcount))
                connection.commit()
        except pymysql.DatabaseError:
            connection.rollback()
        finally:
            connection.close()

        return item
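Opening a new MySQL connection for every item works, but it is slow. As a variation (a sketch, not part of the original project), the connection can be opened once per crawl using Scrapy's open_spider/close_spider pipeline hooks:

import pymysql


class DoubanPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open a single shared connection.
        self.connection = pymysql.connect(host='localhost', user='root', password='',
                                          database='topnovel', charset='utf8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()

    def process_item(self, item, spider):
        info = [item['title'], item['img_addr'], item['writer'], item['tag'],
                item['detail_addr'], item['intro']]
        sql = ('insert into shownovel_book (title, img_addr, writer, tag, detail_addr, intro) '
               'values (%s, %s, %s, %s, %s, %s)')
        try:
            with self.connection.cursor() as cursor:
                cursor.execute(sql, info)
            self.connection.commit()
        except pymysql.DatabaseError:
            self.connection.rollback()
        return item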

5. Edit the Configuration File

–Add the following to settings.py

ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 100,
}
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
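Two optional throttling settings are also worth knowing about when crawling a live site (these are standard Scrapy settings; the values below are example assumptions, not from the original post):

DOWNLOAD_DELAY = 1                  # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # limit parallel requests per domain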

6. Run the Spider

scrapy crawl book
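If you only want to inspect the scraped items without the MySQL pipeline, Scrapy's built-in feed export can also write them straight to a file:

scrapy crawl book -o books.json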