Scraping Douban Books with Scrapy and Saving the Data to a MySQL Database

Latest recommended article published on 2021-04-23 18:19:10

不胖的羊


Article link: https://blog.csdn.net/weixin_30700095/article/details/113416138


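This article shows how to scrape Douban Books data with the Scrapy framework: installing Scrapy, creating a project, defining items, analyzing the page structure, writing the spider, and storing the scraped data in a MySQL database.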


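Scrapy is a crawling framework written in pure Python. In this article we use it to scrape book data from Douban Books.

Preparation

If you have not installed Scrapy yet, install it first. There are two ways:

If you use Anaconda, run conda install scrapy on the command line; the installation finishes shortly. This is the approach the Scrapy documentation recommends for Anaconda users.

If you use pip, run pip install scrapy. Depending on your operating system, you may need to install other dependencies first.

Once Scrapy is installed, change to your working directory on the command line and create a new Scrapy project:

scrapy startproject douban

Note: Scrapy ships with a set of command-line commands once it is installed.

Scrapy creates a douban directory in the current directory and generates some code templates in it.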

Figure: contents of the douban directory

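douban/douban is where the code lives; douban/scrapy.cfg holds the Scrapy configuration for the project.

Defining the items to scrape

First, take a look at the site we are going to scrape, Douban Books.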

Figure: Douban Books

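As you can see, each book has these attributes: title, author, category, rating, number of ratings, and description.

Change into the douban directory: cd douban

Then run scrapy genspider doubanspider https://read.douban.com/

Scrapy generates doubanspider.py in the spiders directory. This is the spider template; we will edit it later. First, let's define the items we want to scrape.

Edit items.py as follows: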

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    book_name = scrapy.Field()     # title
    author = scrapy.Field()        # author
    class_ = scrapy.Field()        # category
    grade = scrapy.Field()         # rating
    count = scrapy.Field()         # number of ratings
    introduction = scrapy.Field()  # description

Note: the class_ field has a trailing underscore to avoid clashing with the Python keyword class.
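As a quick check of why the underscore is needed, the sketch below uses only the standard library's keyword module. The helper name is_safe_field_name is made up for illustration and is not part of the project.

```python
import keyword


def is_safe_field_name(name: str) -> bool:
    """Return True if name is a valid identifier and not a reserved keyword.

    Hypothetical helper for illustration; Scrapy itself does not provide it.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


# "class" is reserved by the language, so it cannot be an attribute name in a class body.
print(is_safe_field_name("class"))
# "class_" is an ordinary identifier and is safe to use as a field name.
print(is_safe_field_name("class_"))
```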

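Analyzing the site structure

Every site is structured differently, so before scraping a site we need to analyze its structure. To inspect Douban Books, press F12 in the browser to open the developer tools.

Each book's information sits in a div with class="info"; the corresponding XPath is //div[@class="info"]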

Figure: structure of the Douban Books page

The title is in an <a> tag under the div with class="title"; the corresponding XPath is .//div[@class="title"]/a/text()

Figure: title

The author is in an <a> tag under the span with class="labeled-text"; the corresponding XPath is .//span[@class="labeled-text"]/a/text()

Figure: author

The category is in a span with itemprop="genre"; the corresponding XPath is .//span[@itemprop="genre"]/text()

Figure: category

The rating is in a span with class="rating-average"; the corresponding XPath is .//span[@class="rating-average"]/text()

Figure: rating

The number of ratings is in a <span> tag under the a with class="ratings-link"; the corresponding XPath is .//a[@class="ratings-link"]/span/text()

Figure: number of ratings

The description is in a div with class="article-desc-brief"; the corresponding XPath is .//div[@class="article-desc-brief"]/text()

Figure: description

The link to the next page is the href attribute of an <a> tag under the li with class="next"; the corresponding XPath is //li[@class="next"]/a/@href

Figure: next page

Note: XPath is a language for locating information in XML documents; see an XPath reference for the syntax.
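To see how these paths select nodes, here is a minimal sketch that runs outside Scrapy. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so text() and @href are replaced by .text and .get("href"). The markup in SAMPLE is a made-up, simplified stand-in for the structure described above, not the real Douban page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mirrors the structure described above.
SAMPLE = """
<ul>
  <li>
    <div class="info">
      <div class="title"><a href="/ebook/1/">Sample Book</a></div>
      <span class="labeled-text"><a href="/author/1/">Sample Author</a></span>
      <span class="rating-average">8.5</span>
    </div>
  </li>
  <li class="next"><a href="?start=20">Next</a></li>
</ul>
"""


def first_text(node, path):
    """Return the text of the first match for path, or None if nothing matches."""
    found = node.find(path)
    return found.text if found is not None else None


root = ET.fromstring(SAMPLE)
books = []
for info in root.findall(".//div[@class='info']"):
    # Paths starting with ".//" are evaluated relative to the current info node.
    books.append({
        "book_name": first_text(info, ".//div[@class='title']/a"),
        "author": first_text(info, ".//span[@class='labeled-text']/a"),
        "grade": first_text(info, ".//span[@class='rating-average']"),
        "class_": first_text(info, ".//span[@itemprop='genre']"),  # absent in SAMPLE
    })

next_link = root.find(".//li[@class='next']/a")
next_href = next_link.get("href") if next_link is not None else None
print(books, next_href)
```

A field whose node is absent comes back as None, which is also what extract_first() returns in Scrapy when a path matches nothing.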

Writing the spider

Edit doubanspider.py as follows:

# -*- coding: utf-8 -*-

import scrapy

from douban.items import DoubanItem  # the item definition from items.py


class DoubanspiderSpider(scrapy.Spider):
    name = 'doubanspider'
    allowed_domains = ['read.douban.com']
    # start_urls = ['http://read.douban.com/']

    def start_requests(self):  # build the initial request
        url = "https://read.douban.com/kind/114"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):  # called automatically once the response has been downloaded
        info_list = response.xpath('//div[@class="info"]')
        for info in info_list:
            item = DoubanItem()  # create a fresh item per book so yielded items do not share state
            item['book_name'] = info.xpath('.//div[@class="title"]/a/text()').extract_first()
            item['author'] = info.xpath('.//span[@class="labeled-text"]/a/text()').extract_first()
            item['class_'] = info.xpath('.//span[@itemprop="genre"]/text()').extract_first()
            item['grade'] = info.xpath('.//span[@class="rating-average"]/text()').extract_first()
            item['count'] = info.xpath('.//a[@class="ratings-link"]/span/text()').extract_first()
            item['introduction'] = info.xpath('.//div[@class="article-desc-brief"]/text()').extract_first()
            yield item

        # follow the "next page" link, if there is one
        next_temp_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_temp_url is not None:
            next_url = response.urljoin(next_temp_url)
            yield scrapy.Request(next_url, callback=self.parse)
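The pagination step relies on response.urljoin to turn the extracted href into an absolute URL. As a rough sketch of what that resolution does, the standard library's urllib.parse.urljoin behaves in broadly the same way. The href values below are hypothetical, since the real ones depend on the page markup, and next_page_url is an illustrative helper name.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a possibly relative href against the current page URL.

    Returns None when there is no next link, mirroring the None check in the spider.
    """
    if href is None:
        return None
    return urljoin(page_url, href)


page_url = "https://read.douban.com/kind/114"

# A query-only href keeps the path of the current page.
print(next_page_url(page_url, "?start=20&sort=hot"))
# A root-relative href replaces the path but keeps scheme and host.
print(next_page_url(page_url, "/kind/114?start=40"))
# No next link: the crawl stops here.
print(next_page_url(page_url, None))
```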

To keep the site from blocking the crawler, change a few settings in settings.py:

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'

The User-Agent above is copied from my Chrome browser; change it to your own as needed.

Figure: User-Agent

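Next, test whether the spider runs correctly. On the command line, run:

scrapy crawl doubanspider -o doubanread.csv

If nothing goes wrong, each scraped item is printed on the command line and then saved to doubanread.csv.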

Figure: doubanread.csv

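Note: opening the generated CSV file directly in Excel shows garbled text. I have not found a proper fix yet, so I open it in Notepad++ instead.

Update: as the reader 你的发圈 pointed out, opening the CSV file in Sublime Text and saving it as UTF-8 with BOM fixes the garbled text in Excel.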
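The same re-save can be scripted. The sketch below, using only the standard library, prepends the UTF-8 byte order mark to a file that does not already start with one; the function names and the output file name in the usage note are assumptions for illustration.

```python
import codecs
from pathlib import Path


def with_utf8_bom(data: bytes) -> bytes:
    """Return data prefixed with the UTF-8 BOM (EF BB BF), unless it already has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def add_utf8_bom(src: str, dst: str) -> None:
    """Copy src to dst, adding a UTF-8 BOM so Excel can detect the encoding."""
    Path(dst).write_bytes(with_utf8_bom(Path(src).read_bytes()))
```

For example, add_utf8_bom("doubanread.csv", "doubanread_bom.csv") would write a copy that Excel should open correctly. Scrapy also has a FEED_EXPORT_ENCODING setting; setting it to 'utf-8-sig' in settings.py may achieve the same result at export time, though that is not tested here.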

Saving the data to a MySQL database

First create the table. I created a doubanread table in my bistu database.

Figure: creating the database table

The corresponding SQL is:

/*
Navicat MySQL Data Transfer

Source Server         : localhost
Source Server Version : 50717
Source Host           : localhost:3306
Source Database       : bistu

Target Server Type    : MYSQL
Target Server Version : 50717
File Encoding         : 65001

Date: 2017-10-22 16:47:44
*/

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for doubanread
-- ----------------------------
DROP TABLE IF EXISTS `doubanread`;
CREATE TABLE `doubanread` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `book_name` varchar(255) DEFAULT NULL,
  `author` varchar(255) DEFAULT NULL,
  `class_` varchar(255) DEFAULT NULL,
  `grade` varchar(255) DEFAULT NULL,
  `count` int(11) DEFAULT NULL,
  `introduction` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1409 DEFAULT CHARSET=utf8;

Then edit pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql as pq  # MySQL client library


class DoubanPipeline(object):

    def __init__(self):
        # connection parameters match my local setup; change them to match yours
        self.conn = pq.connect(host='localhost', user='root',
                               password='123456', database='bistu', charset='utf8')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        book_name = item.get("book_name", "N/A")  # some books lack certain fields, so fall back to a default
        author = item.get("author", "N/A")
        class_ = item.get("class_", "N/A")
        grade = item.get("grade", "N/A")
        count = item.get("count", "N/A")
        introduction = item.get("introduction", "N/A")
        # parameterized query: the driver escapes the values
        sql = "insert into doubanread(book_name, author, class_, grade, count, introduction) VALUES (%s, %s, %s, %s, %s, %s)"
        self.cur.execute(sql, (book_name, author, class_, grade, count, introduction))
        self.conn.commit()
        return item  # pass the item on to later pipelines and feed exporters

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

Note: as you may have guessed, pipelines.py is where a Scrapy project talks to the database. Before running it, install pymysql the same way as Scrapy: conda install pymysql or pip install pymysql.
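The insert logic can be tried without a MySQL server. The sketch below uses the standard library's sqlite3 as a stand-in, so the placeholder syntax is ? rather than pymysql's %s, and the column types are simplified. The item values and the to_row helper are hypothetical. It also illustrates one detail of the fallback in the pipeline above: when the spider assigns None to a field, the key still exists, so item.get(key, "N/A") returns None and the column ends up as NULL.

```python
import sqlite3

FIELDS = ("book_name", "author", "class_", "grade", "count", "introduction")


def to_row(item: dict) -> tuple:
    """Build the parameter tuple in column order; absent or None fields become None."""
    return tuple(item.get(field) for field in FIELDS)


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL connection
conn.execute(
    "CREATE TABLE doubanread ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, book_name TEXT, author TEXT, "
    "class_ TEXT, grade TEXT, count INTEGER, introduction TEXT)"
)

# Parameterized insert: values are passed separately from the SQL text.
sql = (
    "INSERT INTO doubanread (book_name, author, class_, grade, count, introduction) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)

# Hypothetical item in which the author was not found on the page.
item = {"book_name": "Sample Book", "author": None, "grade": "8.5"}
default_used = item.get("author", "N/A")  # the key exists, so the default is ignored

conn.execute(sql, to_row(item))
conn.commit()

rows = conn.execute("SELECT book_name, author, grade FROM doubanread").fetchall()
print(default_used, rows)
conn.close()
```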

Editing pipelines.py alone is not enough. We also need to change settings.py again: find the following section and uncomment it:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}

Finally, run scrapy crawl doubanspider on the command line to start the project.

Check the database and you will see the scraped data; missing fields are stored as NULL.

Figure: the scraped data
