Data Collection and Storage Case Study: Crawling Web Data with the Python Scrapy Framework and Persisting It to MySQL

学习BigData

Last modified: 2024-03-27 12:32:35


First published: 2022-12-09 13:32:47

Article link: https://blog.csdn.net/weixin_52010459/article/details/128218755


This example requires pymysql to be installed beforehand.
Python 3.7.4
Scrapy 2.7.1

I. Install the Scrapy framework

1. Install Scrapy with pip

pip install scrapy

If the download is too slow, install from a mirror inside China instead,
for example:

pip install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple

Commonly used mirrors in China:

Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple
Douban: http://pypi.douban.com/simple
Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple
USTC: http://pypi.mirrors.ustc.edu.cn/simple
NetEase: https://mirrors.163.com/pypi/simple
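The install commands above differ only in the mirror URL and in whether --trusted-host is needed. A minimal sketch of that pattern follows; the helper name pip_install_command is hypothetical, and it assumes pip is invoked as a module of the current interpreter (sys.executable -m pip).

```python
import sys
from urllib.parse import urlparse


def pip_install_command(package, index_url=None):
    """Build a `python -m pip install` command as an argument list."""
    # sys.executable is the interpreter running this script, so pip is
    # invoked for that exact Python installation.
    cmd = [sys.executable, "-m", "pip", "install", package]
    if index_url:
        cmd += ["-i", index_url]
        parsed = urlparse(index_url)
        # Plain-http mirrors need --trusted-host, as in the Douban command above.
        if parsed.scheme == "http":
            cmd += ["--trusted-host", parsed.hostname]
    return cmd


if __name__ == "__main__":
    mirror = "https://pypi.tuna.tsinghua.edu.cn/simple"
    print(" ".join(pip_install_command("scrapy", mirror)))
```

The returned list can be passed to subprocess.run() as is, or printed and pasted into the terminal.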

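Error: Fatal error in launcher: Unable to create process using

Fix: run pip through the interpreter by prefixing the command with python -m

python -m pip install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Error analysis: judging from the warning printed after the download succeeded, the cause is probably an outdated pip, though this is only a guess. Invoking pip as python -m pip bypasses the pip.exe launcher that raised the error.
WARNING: You are using pip version 21.2.3; however, version 22.3.1 is available.
You should consider upgrading via the 'D:\Program Files (x86)\Python\Python39\python.exe -m pip install --upgrade pip' command.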
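After installing, it can help to confirm which versions actually ended up in the environment. The sketch below uses only the standard library; it assumes Python 3.8 or newer (importlib.metadata does not exist in 3.7), and the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The three distributions this tutorial depends on.
    for name in ("pip", "scrapy", "pymysql"):
        print(name, installed_version(name) or "not installed")
```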

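II. Create the Scrapy project

1. Create the crawler project

(1) Press Win+R and enter cmd to open a Windows terminal

(2) Use cd to move into the folder where the project should be created

(3) Create the project (scrapy_mobile is the project name)

scrapy startproject scrapy_mobile

(4) Use cd to enter the spiders folder

cd scrapy_mobile\scrapy_mobile\spiders

(5) Generate the spider (mobile is the spider name, followed by the URL of the page to crawl)

scrapy genspider mobile https://top.zol.com.cn/compositor/57/manu_1795.html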
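The two commands above leave a fixed set of files on disk. As a rough check, the sketch below lists the files that scrapy startproject normally creates and reports any that are missing; the helper name missing_project_files is hypothetical, and the layout is the standard project template rather than something quoted from this post.

```python
from pathlib import Path

# Files that `scrapy startproject scrapy_mobile` is expected to create,
# relative to the outer scrapy_mobile folder.
EXPECTED_FILES = [
    "scrapy.cfg",
    "scrapy_mobile/__init__.py",
    "scrapy_mobile/items.py",
    "scrapy_mobile/middlewares.py",
    "scrapy_mobile/pipelines.py",
    "scrapy_mobile/settings.py",
    "scrapy_mobile/spiders/__init__.py",
]


def missing_project_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    print(missing_project_files("scrapy_mobile") or "project layout looks complete")
```

After the genspider step, scrapy_mobile/spiders/mobile.py should exist as well.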

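III. Locate the data to crawl

Installing and using XPath
Use an XPath tool to find the elements that hold the data
While testing here, the log showed the message [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to; see the author's post on 301 redirect errors.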
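To see how row-relative XPath lookups behave before running a crawl, here is a small offline sketch. It assumes a simplified, hypothetical markup sample that reuses the class names from the spider later in this post, and it uses the standard library's ElementTree, which supports only a subset of XPath (no text() step, so .text is read instead). In the spider itself the same expressions go through response.xpath().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the ranking page markup.
SAMPLE = """
<body>
  <div class="rank-list__item clearfix">
    <div class="rank__name"><a>Phone A</a></div>
    <div class="rank__price">1999</div>
    <div class="score clearfix"><span>9.1</span><span>points</span></div>
  </div>
  <div class="rank-list__item clearfix">
    <div class="rank__name"><a>Phone B</a></div>
    <div class="rank__price">2999</div>
    <div class="score clearfix"><span>8.7</span><span>points</span></div>
  </div>
</body>
"""


def extract_rows(markup):
    """Return (name, price, score) for every ranking row in the markup."""
    root = ET.fromstring(markup)
    rows = []
    # First select each row, then search relative to that row with ".//".
    for div in root.findall(".//div[@class='rank-list__item clearfix']"):
        name = div.find(".//div[@class='rank__name']/a").text
        price = div.find(".//div[@class='rank__price']").text
        score = div.find(".//div[@class='score clearfix']/span[1]").text
        rows.append((name, price, score))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Searching relative to each row keeps the name, price, and score of one phone together, which three independent page-wide queries would not guarantee.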

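IV. Code

1. The spider file

This is the spider file created in the spiders folder in step (5) of section II.
For crawling multiple pages, the author refined this code in another post; see the one on using the CrawlSpider link extractor.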

import scrapy
from scrapy_mobile.items import ScrapyMobileItem

class MobileSpider(scrapy.Spider):
    name = 'mobile'
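    # For multi-page crawling, list only the domain here
    allowed_domains = ['top.zol.com.cn']
    start_urls = ['https://top.zol.com.cn/compositor/57/manu_1795.html']

    base_url = 'https://top.zol.com.cn/compositor/57/manu_'
    # Brand IDs of the follow-up pages; index 0 is a placeholder and is never requested
    page_list = [0, 1673, 613, 50840, 544, 55731, 34645, 55075, 98, 35579]
    i = 0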
    def parse(self, response):
        # name = '//div[@class="rank-list__item clearfix"]//div[@class="rank__name"]/a/text()'
        # price = '//div[@class="rank-list__item clearfix"]//div[@class="rank__price"]/text()'
        # score = '//div[@class="rank-list__item clearfix"]//div[@class="score clearfix"]/span/text()'
        # print(score)

        div_list = response.xpath('//div[@class="rank-list__item clearfix"]')

        for div in div_list:
            name = div.xpath('.//div[@class="rank__name"]/a/text()').extract_first()
            price = div.xpath('.//div[@class="rank__price"]/text()').extract_first()
            score = div.xpath('.//div[@class="score clearfix"]/span[1]/text()').extract_first()
            print(name,price,score)

            mobile = ScrapyMobileItem(name=name,price=price,score=score)
            # yield hands each mobile item to the pipelines as soon as it is built
            yield mobile

        # The parsing logic is identical for every page, so each further page
        # only needs a new request whose callback is parse() again.
        # OPPO     https://top.zol.com.cn/compositor/57/manu_1673.html
        # Huawei   https://top.zol.com.cn/compositor/57/manu_613.html
        # Honor    https://top.zol.com.cn/compositor/57/manu_50840.html
        # Apple    https://top.zol.com.cn/compositor/57/manu_544.html
        # Redmi    https://top.zol.com.cn/compositor/57/manu_55731.html
        # Xiaomi   https://top.zol.com.cn/compositor/57/manu_34645.html
        # iQOO     https://top.zol.com.cn/compositor/57/manu_55075.html
        # Samsung  https://top.zol.com.cn/compositor/57/manu_98.html
        # OnePlus  https://top.zol.com.cn/compositor/57/manu_35579.html

        if self.i < len(self.page_list) - 1:
            self.i = self.i + 1
            url = self.base_url + str(self.page_list[self.i]) + '.html'
            # scrapy.Request issues a GET request:
            # url is the address to fetch;
            # callback is the function to run on the response, passed without ()
            yield scrapy.Request(url=url, callback=self.parse)
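The counter-based paging above requests one brand page after another. As an alternative sketch, not the author's method, the brand URLs could be built up front; BASE_URL and BRAND_IDS copy the spider's values, while the helper name brand_urls is hypothetical.

```python
BASE_URL = "https://top.zol.com.cn/compositor/57/manu_"
# Brand IDs from the spider's page_list, without the leading 0 placeholder.
BRAND_IDS = [1673, 613, 50840, 544, 55731, 34645, 55075, 98, 35579]


def brand_urls(base_url=BASE_URL, brand_ids=BRAND_IDS):
    """Return one ranking-page URL per brand ID."""
    return [f"{base_url}{brand_id}.html" for brand_id in brand_ids]


if __name__ == "__main__":
    for url in brand_urls():
        print(url)
```

If these URLs were added to start_urls, Scrapy would schedule them itself and parse() would not need the counter; the pages would then be fetched concurrently rather than strictly one after another.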

2. items.py

import scrapy


class ScrapyMobileItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Phone name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()
    # Rating score
    score = scrapy.Field()

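3. settings.py

First, disable robots.txt compliance by commenting out the following two lines; with the setting commented out, Scrapy falls back to its default of False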

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

MySQL connection settings
These can be added anywhere in settings.py

DB_HOST = '192.168.64.133'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWROD = 'passwd'  # the key is misspelled, but pipelines.py reads this exact name, so the two must match
DB_NAME = 'spider02'
DB_CHARSET = 'utf8'

Enable the pipelines

ITEM_PIPELINES = {
   # Several pipelines can be enabled; the lower the value, the higher the priority
   'scrapy_mobile.pipelines.ScrapyMobilePipeline': 300,
   # MysqlPipeline
   'scrapy_mobile.pipelines.MysqlPipeline': 301
}
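The numbers in ITEM_PIPELINES decide the order in which each item passes through the pipelines. A minimal sketch of that ordering rule, using a plain dictionary with the two entries from this post (the helper name pipeline_order is hypothetical):

```python
ITEM_PIPELINES = {
    "scrapy_mobile.pipelines.ScrapyMobilePipeline": 300,
    "scrapy_mobile.pipelines.MysqlPipeline": 301,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted from the lowest value to the highest."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for position, path in enumerate(pipeline_order(ITEM_PIPELINES), start=1):
        print(position, path)
```

With these values each item reaches the JSON file pipeline first and the MySQL pipeline second, which is why every process_item must return the item for the next pipeline to receive it.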

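4. pipelines.py

import json

from itemadapter import ItemAdapter

# Remember to enable the pipelines in settings.py
class ScrapyMobilePipeline:
    # Called once when the spider opens, before any item is processed
    def open_spider(self, spider):
        self.fp = open('mobile.json', 'w', encoding='utf-8')

    # item is the mobile object yielded by the spider
    def process_item(self, item, spider):
        # Write one JSON object per line; str(item) would not produce valid JSON
        line = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.fp.write(line + '\n')
        return item

    # Called once when the spider closes
    def close_spider(self, spider):
        self.fp.close()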


# Load the project settings
from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline:

    def open_spider(self,spider):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWROD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']

        self.connect()

    def connect(self):
        self.conn = pymysql.connect(
                            host=self.host,
                            port=self.port,
                            user=self.user,
                            password=self.password,
                            db=self.name,
                            charset=self.charset
        )

        self.cursor = self.conn.cursor()


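    def process_item(self, item, spider):
        # Let the driver bind the values instead of formatting them into the SQL
        # string, so quotes inside a value cannot break the statement
        sql = 'insert into mobile(name,price,score) values(%s,%s,%s)'
        # Execute the SQL statement
        self.cursor.execute(sql, (item['name'], item['price'], item['score']))
        # Commit the transaction
        self.conn.commit()

        return item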


    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
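The insert step is the part of the pipeline most sensitive to the scraped text itself. The sketch below shows parameter binding on a throwaway database; it assumes the standard library's sqlite3 as a stand-in so that it runs without a MySQL server, and the phone name is made up. sqlite3 writes placeholders as ?, whereas PyMySQL uses %s with the same execute(sql, params) call shape.

```python
import sqlite3


def insert_mobile(conn, item):
    """Insert one item, letting the driver bind and quote the values."""
    conn.execute(
        "insert into mobile(name, price, score) values (?, ?, ?)",
        (item["name"], item["price"], item["score"]),
    )
    conn.commit()


def demo():
    """Insert one row into an in-memory table and return everything stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mobile(id integer primary key autoincrement, "
        "name varchar(128), price varchar(128), score varchar(128))"
    )
    # A double quote inside the name would break a statement assembled with
    # values("{}") string formatting; bound parameters store it unchanged.
    insert_mobile(conn, {"name": 'Phone 6.7" Pro', "price": "3999", "score": "9.0"})
    rows = conn.execute("select name, price, score from mobile").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```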

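V. Create the table in MySQL

In this example MySQL runs inside a virtual machine. If you have not used MySQL before, see the author's earlier posts on installing and configuring MySQL on CentOS and on installing and configuring MySQL and Hive.

1. Log in to MySQL

mysql -uroot -p

2. Create a database named spider02

create database spider02 charset=utf8;

3. Switch to the database

use spider02;

4. Create the table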

create table mobile(
id int primary key auto_increment,
name varchar(128),
price varchar(128),
score varchar(128));

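VI. Run the spider and check the results

Run the following in a terminal from the spiders folder (mobile is the spider's name attribute, not the file name)

scrapy crawl mobile

The results can be checked in two ways:

Connect to the database with Navicat and browse the table

Or query the table from the command line inside the virtual machine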

select * from mobile;
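The same check can be scripted. The sketch below reads the rows back through any DB-API connection; the helper name fetch_mobiles is hypothetical, and the demo assumes an in-memory sqlite3 database with made-up rows so that it runs without the virtual machine.

```python
import sqlite3


def fetch_mobiles(conn):
    """Return every (name, price, score) row from the mobile table, oldest first."""
    cursor = conn.cursor()
    try:
        cursor.execute("select name, price, score from mobile order by id")
        return list(cursor.fetchall())
    finally:
        cursor.close()


def demo():
    """Build a throwaway table with two sample rows and read them back."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mobile(id integer primary key autoincrement, "
        "name varchar(128), price varchar(128), score varchar(128))"
    )
    conn.executemany(
        "insert into mobile(name, price, score) values (?, ?, ?)",
        [("Phone A", "1999", "9.1"), ("Phone B", "2999", "8.7")],
    )
    conn.commit()
    rows = fetch_mobiles(conn)
    conn.close()
    return rows


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Against the real database, the connection would come from pymysql.connect() with the host, port, user, password, database, and charset values from settings.py.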


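Summary: this post walked through a Python crawler built on the Scrapy framework and the persistence of the crawled data to a JSON file and to MySQL.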
