OS: Windows 10
Browser: Chrome 75.0.3770.80 (Official Build) (64-bit)
Development environment: Anaconda 2019.03 (Python 3.7)
IDE: PyCharm 2019.1.1 (Professional Edition)
Crawler framework: Scrapy
Database: MySQL Community 8.0.16.0
Crawl target: all job postings on Tencent Recruitment (腾讯招聘)
Step 1: Set up the development environment
1. Download and install Anaconda
Download: https://www.anaconda.com/distribution/
Once the download finishes, double-click "Anaconda3-2019.03-Windows-x86_64" to install it.
2. Download and install PyCharm
Official site: http://www.jetbrains.com/
Go to the official site and find the Download page. Because I cracked my copy of PyCharm and blocked http://www.jetbrains.com/ in my hosts file, I can't open the official site myself, so I won't walk through the download with screenshots.
Besides the official site, you can also download it from Ruanjian Xuetang (软件学堂).
Download: http://www.xue51.com/search.asp?wd=pycharm
Pick whichever version you like there; the installation and cracking instructions on that page are quite detailed, so I won't repeat them here.
3. Download and install MySQL
Download: https://cdn.mysql.com//Downloads/MySQLInstaller/mysql-installer-community-8.0.16.0.msi
After downloading, double-click "mysql-installer-community-8.0.16.0.msi" and click Next through the defaults, setting your own root password when prompted.
4. Install Scrapy
In the Windows Start menu, find "Anaconda3 (64-bit)" and click "Anaconda Prompt" to open the Anaconda command line.
At the Anaconda Prompt, enter: pip install scrapy
5. Install Twisted
Same as above: at the Anaconda Prompt, enter: pip install twisted (this will often just confirm it is already installed, since Scrapy depends on Twisted).
6. Install pymysql
Same as above: at the Anaconda Prompt, enter: pip install pymysql
At this point the development environment is ready.
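Before moving on, an optional sanity check from the same Anaconda Prompt confirms that everything is importable (these are standard commands, nothing project-specific):

scrapy version
python -c "import twisted, pymysql; print('environment OK')"

If both commands finish without errors, the tools are in place.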
Step 2: Create the crawler project
1. For easier project management, first create a folder under "C:\Users\<username>\"; mine is called "MyProjects".
2. Open the Anaconda Prompt and enter: cd MyProjects to move into the "MyProjects" folder.
3. Create a crawler project named tencent by entering: scrapy startproject tencent
4. Enter: cd tencent to move into the project folder.
5. Create a spider named tencentRecruit by entering: scrapy genspider tencentRecruit tencent.com
The crawler project is now created; the skeleton spider that genspider generated is shown below for reference.
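Here is roughly what scrapy genspider puts into spiders/tencentRecruit.py (the exact skeleton can differ slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class TencentrecruitSpider(scrapy.Spider):
    name = 'tencentRecruit'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        # Parsing logic for each downloaded page will go here
        pass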
Step 3: Open the crawler project in PyCharm and start developing the spider
1. Open PyCharm and choose "File" -> "Open..." from the menu bar, then select the tencent project folder.
Once the project is open, the Scrapy project structure is as follows.
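A sketch of the layout (everything below is generated by scrapy startproject / scrapy genspider, except run.py, which I assume sits next to scrapy.cfg so that Scrapy can still find the project configuration):

tencent/
    scrapy.cfg
    run.py                  (hand-written, see step 2 below)
    tencent/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            tencentRecruit.py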
run.py is not generated by Scrapy; I wrote it myself so the spider can be launched directly from inside PyCharm.
scrapy.cfg is the crawler project's configuration file and normally doesn't need to be changed.
The spiders folder holds the spiders' main programs; it already contains tencentRecruit.py, the spider we created from the command line earlier.
items.py is the project's target-data file: it defines the Item fields the spider will extract, as sketched just after this list.
pipelines.py is the project's pipeline file, used to process the scraped data (for example formatting it, writing it to a file, or writing it to a database).
settings.py is the project's settings file, used to configure the crawler.
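To make items.py concrete, here is a minimal sketch of what it could hold for this project; the field names (position_name and friends) are placeholders of my own, to be adjusted once we see exactly what the job pages expose:

import scrapy


class TencentItem(scrapy.Item):
    # Placeholder fields for one job posting; rename to match the real page data
    position_name = scrapy.Field()  # job title
    position_type = scrapy.Field()  # job category
    location = scrapy.Field()       # work location
    publish_date = scrapy.Field()   # date the posting was published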
2. Create run.py and write the entry-point program:

from scrapy import cmdline

if __name__ == '__main__':
    # Equivalent to running "scrapy crawl tencentRecruit" on the command line
    cmdline.execute('scrapy crawl tencentRecruit'.split())
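Assuming run.py sits next to scrapy.cfg (so Scrapy can locate the project settings), you can now right-click run.py in PyCharm and run it to start the spider without leaving the IDE.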
3. Modify settings.py to configure the crawler
Scrapy ships settings.py with almost every setting you are likely to need already written out as comments,
so we only have to uncomment (and adjust) the ones we want. The modified settings.py is shown below:
# -*- coding: utf-8 -*-
# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tencent'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Uncommented, with the default replaced by a real browser User-Agent string; sending a browser UA helps avoid basic anti-scraping checks
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# Obey robots.txt rules
# Uncommented and set to False so the crawler ignores the target site's robots.txt rules; robots.txt specifies which parts of a site crawlers may and may not fetch
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Uncommented to allow more concurrent requests and speed up crawling
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# Uncommented to disable cookies, which also helps avoid anti-scraping measures. Note that cookies
# can only be disabled when the target content requires no login; our target pages need no login, so we disable them here
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.TencentSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.TencentDownloaderMiddleware': 543,
# }
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
# 'tencent.pipelines.TencentPipeline': 300,
# }
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
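Looking ahead: ITEM_PIPELINES above is still commented out, but once the spider produces data it is the natural place to plug in a MySQL writer built on the pymysql package we installed earlier. The following is only a minimal sketch under assumed names (the recruit table, its columns, and the credentials are placeholders, not the final implementation):

import pymysql


class TencentPipeline:
    def open_spider(self, spider):
        # Open one connection when the spider starts; credentials are placeholders
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='your_password', database='tencent',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one scraped job posting per item
        sql = 'INSERT INTO recruit (position_name, location) VALUES (%s, %s)'
        self.cursor.execute(sql, (item.get('position_name'), item.get('location')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the connection when the spider finishes
        self.cursor.close()
        self.conn.close()

To activate such a pipeline, you would uncomment the ITEM_PIPELINES block above ('tencent.pipelines.TencentPipeline': 300).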