Learning Python Scraping with the Scrapy Framework (Part 2): Crawling the Zongheng Monthly Ticket Ranking
This post follows the same approach as the first one and is mainly for practice; see the first post for the finer details.
Project resource link
1. Create the Scrapy project
scrapy startproject douban
cd douban
scrapy genspider book qidian.com
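For reference, these commands produce a project skeleton roughly like the following (the exact set of files can vary slightly across Scrapy versions):

```text
douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── book.py        # created by `scrapy genspider book qidian.com`
```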
2. Define the data model
– items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    img_addr = scrapy.Field()
    writer = scrapy.Field()
    tag = scrapy.Field()
    detail_addr = scrapy.Field()
    intro = scrapy.Field()
3. Write the spider
Target URL: Zongheng monthly ticket ranking
– book.py
import scrapy
from scrapy import Request
from douban.items import DoubanItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['qidian.com']
    start_urls = ['https://www.qidian.com/finish?action=hidden&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=2&page=1']
    url_sets = set()  # pagination URLs already scheduled, to avoid re-crawling

    def parse(self, response):
        books = response.xpath("//div[@class='book-img-text']/ul[@class='all-img-list cf']/li")
        for obj in books:
            item = DoubanItem()
            item['title'] = obj.xpath("./div[@class='book-mid-info']/h4/a/text()").extract()[0]
            item['img_addr'] = obj.xpath("./div[@class='book-img-box']/a/img/@src").extract()[0]
            item['writer'] = obj.xpath("./div[@class='book-mid-info']/p[@class='author']/a[@class='name']/text()").extract()[0]
            item['tag'] = obj.xpath("./div[@class='book-mid-info']/p[@class='author']/a[2]/text()").extract()[0]
            item['detail_addr'] = obj.xpath("./div[@class='book-mid-info']/h4/a/@href").extract()[0]
            item['intro'] = obj.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract()[0]
            yield item

        # Follow pagination links; the hrefs are scheme-relative ("//www.qidian.com/...")
        urls = response.xpath("//div[@class='lbf-pagination']/ul/li/a/@href").extract()
        for url in urls:
            url = 'https:' + url
            if url.startswith('https://www.qidian.com') and url not in self.url_sets:
                self.url_sets.add(url)
                # make_requests_from_url was deprecated in Scrapy 1.4 and removed
                # in 2.0; build the Request explicitly instead
                yield Request(url, callback=self.parse)
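The pagination handling above completes scheme-relative hrefs with "https:", keeps only qidian.com links, and uses a set to skip pages it has already scheduled. That logic can be isolated and tested on its own; here is a minimal sketch (the helper name `new_page_urls` is made up for illustration, it is not part of the project):

```python
def new_page_urls(hrefs, seen):
    """Yield absolute qidian.com URLs that have not been seen before."""
    for href in hrefs:
        url = 'https:' + href          # hrefs are scheme-relative
        if url.startswith('https://www.qidian.com') and url not in seen:
            seen.add(url)              # remember it so duplicates are skipped
            yield url

seen = set()
hrefs = ['//www.qidian.com/finish?page=2',
         '//www.qidian.com/finish?page=2',   # duplicate, skipped
         '//example.com/other']              # foreign domain, skipped
print(list(new_page_urls(hrefs, seen)))
# ['https://www.qidian.com/finish?page=2']
```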
4. Write the data-processing script
The database uses two tables: one stores the novel info and the other stores the tags, with the two linked. Once the data has been scraped, a small program can do some simple cleanup, after which you can run some basic data analysis. The program that tidies up the database is also included in the linked resources.
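One possible shape for those two tables is sketched below. Only `shownovel_book` and its columns are taken from the pipeline code; the tag table and the linking column are assumptions based on the description above:

```sql
-- shownovel_book matches the insert in pipelines.py;
-- shownovel_tag and the foreign key are illustrative assumptions.
CREATE TABLE shownovel_book (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    img_addr VARCHAR(255),
    writer VARCHAR(255),
    tag VARCHAR(64),
    detail_addr VARCHAR(255),
    intro TEXT
);

CREATE TABLE shownovel_tag (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64) UNIQUE
);
-- A post-processing step could then replace shownovel_book.tag
-- with a foreign key referencing shownovel_tag.id.
```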
– pipelines.py
import pymysql


class DoubanPipeline:
    def process_item(self, item, spider):
        # Save the cover image (optional; requires `import os, urllib.request`)
        # url = item['img_addr']
        # req = urllib.request.Request(url)
        # with urllib.request.urlopen(req) as pic:
        #     data = pic.read()
        # file_name = os.path.join(r'D:\bookpic', item['title'] + '.jpg')
        # with open(file_name, 'wb') as fp:
        #     fp.write(data)

        # Save to the database
        info = [item['title'], item['img_addr'], item['writer'], item['tag'], item['detail_addr'], item['intro']]
        connection = pymysql.connect(host='localhost', user='root', password='', database='topnovel', charset='utf8')
        try:
            with connection.cursor() as cursor:
                sql = 'insert into shownovel_book (title, img_addr, writer, tag, detail_addr, intro) values (%s, %s, %s, %s, %s, %s)'
                affectedcount = cursor.execute(sql, info)
                print('Inserted {0} row(s)'.format(affectedcount))
            connection.commit()
        except pymysql.DatabaseError:
            connection.rollback()
        finally:
            connection.close()
        return item
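The insert uses driver-level parameter binding rather than string formatting, which avoids SQL injection and quoting problems. The same pattern can be tried without a MySQL server using the stdlib `sqlite3` module; this is only a self-contained sketch (the row data is made up, and note that sqlite3 uses `?` placeholders where pymysql uses `%s`):

```python
import sqlite3

# In-memory database mirroring the shownovel_book table from pipelines.py
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE shownovel_book
                (title TEXT, img_addr TEXT, writer TEXT,
                 tag TEXT, detail_addr TEXT, intro TEXT)''')

info = ['Example Title', '//img.example/x.jpg', 'Some Writer',
        'Fantasy', '//book.example/1', 'A short intro']

# Parameterized insert: values are bound by the driver, never interpolated
cur = conn.execute(
    'insert into shownovel_book (title, img_addr, writer, tag, detail_addr, intro) '
    'values (?, ?, ?, ?, ?, ?)', info)
conn.commit()
print(cur.rowcount)  # 1
```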
5. Edit the configuration
– add the following to settings.py
ITEM_PIPELINES = {
    # must match the project and class created above, not 'novel.pipelines.NovelPipeline'
    'douban.pipelines.DoubanPipeline': 100,
}
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
6. Run the spider
scrapy crawl book