OS: Windows 10
Browser: Chrome 75.0.3770.80 (Official Build) (64-bit)
Development environment: Anaconda 2019.03 (Python 3.7)
IDE: PyCharm 2019.1.1 (Professional Edition)
Crawler framework: Scrapy
Database: MySQL Community 8.0.16.0
Crawl target: all job postings on Tencent Recruitment (腾讯招聘)
Step 1: Set up the development environment
1. Download and install Anaconda
Download: https://www.anaconda.com/distribution/
Once the download finishes, double-click "Anaconda3-2019.03-Windows-x86_64" to install it.
2. Download and install PyCharm
Official site: http://www.jetbrains.com/
Go to the official site and find the Download page. Because I cracked my copy of PyCharm and blocked http://www.jetbrains.com/ in my hosts file, I can't open the official site myself, so I won't walk through the download with screenshots.
Besides the official site, you can also download it from Ruanjian Xuetang (软件学堂).
Download: http://www.xue51.com/search.asp?wd=pycharm
Pick whichever version you like there; the installation and cracking instructions on that page are quite detailed, so I won't repeat them here.
3. Download and install MySQL
Download: https://cdn.mysql.com//Downloads/MySQLInstaller/mysql-installer-community-8.0.16.0.msi
After downloading, double-click "mysql-installer-community-8.0.16.0.msi" and click Next through the defaults, setting your own root password when prompted.
4. Install Scrapy
In the Windows Start menu, find "Anaconda3 (64-bit)" and click "Anaconda Prompt" to open the Anaconda command line.
At the Anaconda Prompt, enter: pip install scrapy
5. Install Twisted
Same as above: at the Anaconda Prompt, enter: pip install twisted (this will often just confirm it is already installed, since Scrapy depends on Twisted).
6. Install pymysql
Same as above: at the Anaconda Prompt, enter: pip install pymysql
At this point the development environment is ready.
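Before moving on, an optional sanity check from the same Anaconda Prompt confirms that everything is importable (these are standard commands, nothing project-specific):

scrapy version
python -c "import twisted, pymysql; print('environment OK')"

If both commands finish without errors, the tools are in place.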
Step 2: Create the crawler project
1. For easier project management, first create a folder under "C:\Users\<username>\"; mine is called "MyProjects".
2. Open the Anaconda Prompt and enter: cd MyProjects to move into the "MyProjects" folder.
3. Create a crawler project named tencent by entering: scrapy startproject tencent
4. Enter: cd tencent to move into the project folder.
5. Create a spider named tencentRecruit by entering: scrapy genspider tencentRecruit tencent.com
The crawler project is now created; the skeleton spider that genspider generated is shown below for reference.
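Here is roughly what scrapy genspider puts into spiders/tencentRecruit.py (the exact skeleton can differ slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class TencentrecruitSpider(scrapy.Spider):
    name = 'tencentRecruit'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        # Parsing logic for each downloaded page will go here
        pass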
Step 3: Open the crawler project in PyCharm and start developing the spider
1. Open PyCharm and choose "File" -> "Open..." from the menu bar, then select the tencent project folder.
Once the project is open, the Scrapy project structure is as follows.
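A sketch of the layout (everything below is generated by scrapy startproject / scrapy genspider, except run.py, which I assume sits next to scrapy.cfg so that Scrapy can still find the project configuration):

tencent/
    scrapy.cfg
    run.py                  (hand-written, see step 2 below)
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tencentRecruit.py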
run.py is not generated by Scrapy; I wrote it myself so the spider can be launched directly from inside PyCharm.
scrapy.cfg is the crawler project's configuration file and normally doesn't need to be changed.
The spiders folder holds the spiders' main programs; it already contains tencentRecruit.py, the spider we created from the command line earlier.
items.py is the project's target-data file: it defines the Item fields the spider will extract, as sketched just after this list.
pipelines.py is the project's pipeline file, used to process the scraped data (for example formatting it, writing it to a file, or writing it to a database).
settings.py is the project's settings file, used to configure the crawler.
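To make items.py concrete, here is a minimal sketch of what it could hold for this project; the field names (position_name and friends) are placeholders of my own, to be adjusted once we see exactly what the job pages expose:

import scrapy


class TencentItem(scrapy.Item):
    # Placeholder fields for one job posting; rename to match the real page data
    position_name = scrapy.Field()  # job title
    position_type = scrapy.Field()  # job category
    location = scrapy.Field()       # work location
    publish_date = scrapy.Field()   # date the posting was published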
2. Create run.py and write the entry-point program:

from scrapy import cmdline

if __name__ == '__main__':
    # Equivalent to running "scrapy crawl tencentRecruit" on the command line
    cmdline.execute('scrapy crawl tencentRecruit'.split())
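Assuming run.py sits next to scrapy.cfg (so Scrapy can locate the project settings), you can now right-click run.py in PyCharm and run it to start the spider without leaving the IDE.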
3. Modify settings.py to configure the crawler
Scrapy ships settings.py with almost every setting you are likely to need already written out as comments,
so we only have to uncomment (and adjust) the ones we want. The modified settings.py is shown below:
# -*- coding: utf-8 -*-
# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tencent'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Uncommented, with the default replaced by a real browser User-Agent string; sending a browser UA helps avoid basic anti-scraping checks
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# Obey robots.txt rules
# Uncommented and set to False so the crawler ignores the target site's robots.txt rules; robots.txt specifies which parts of a site crawlers may and may not fetch
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Uncommented to allow more concurrent requests and speed up crawling
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# Uncommented to disable cookies, which also helps avoid anti-scraping measures. Note that cookies
# can only be disabled when the target content requires no login; our target pages need no login, so we disable them here
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.TencentSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.TencentDownloaderMiddleware': 543,
# }
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
# 'tencent.pipelines.TencentPipeline': 300,
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
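Looking ahead: ITEM_PIPELINES above is still commented out, but once the spider produces data it is the natural place to plug in a MySQL writer built on the pymysql package we installed earlier. The following is only a minimal sketch under assumed names (the recruit table, its columns, and the credentials are placeholders, not the final implementation):

import pymysql


class TencentPipeline:
    def open_spider(self, spider):
        # Open one connection when the spider starts; credentials are placeholders
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='your_password', database='tencent',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one scraped job posting per item
        sql = 'INSERT INTO recruit (position_name, location) VALUES (%s, %s)'
        self.cursor.execute(sql, (item.get('position_name'), item.get('location')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the connection when the spider finishes
        self.cursor.close()
        self.conn.close()

To activate such a pipeline, you would uncomment the ITEM_PIPELINES block above ('tencent.pipelines.TencentPipeline': 300).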