Building a Search Engine with a Distributed Python Crawler
I. Site-wide Crawling of a Job Site with CrawlSpider
1. Creating the Lagou spider project - using CrawlSpider
Recommended tool: cmder, download: http://cmder.net/ → get the full version, so that some Linux commands can also be used on Windows.
In the terminal/cmder, go into the project and run scrapy genspider --list to see the available spider templates:
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

# Generate from a template (-t selects the template):
scrapy genspider -t crawl lagou www.lagou.com
# Without -t, the default basic template is used:
scrapy genspider lagou www.lagou.com
Create the spider from the crawl template:
scrapy genspider -t crawl lagou www.lagou.com
This generates lagou.py. Inside it,
LagouSpider(CrawlSpider) inherits from CrawlSpider, no longer from scrapy.Spider as in the basic template. (Note that CrawlSpider itself subclasses Scrapy's Spider.)
lagou.py:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
For more background on site-wide crawling with CrawlSpider, see:
https://www.cnblogs.com/Eric15/p/9941197.html
Crawling Lagou with CrawlSpider - a first test
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (  # three rules
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=True),                          # follow every url under zhaopin/
        Rule(LinkExtractor(allow=r'gongsi/\d+.html/'), follow=True),                     # follow every url under gongsi/
        Rule(LinkExtractor(allow=r'jobs/\d+.html/'), callback='parse_job', follow=True), # parse every job page under jobs/
    )

    def parse_job(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
Set a breakpoint and run in debug mode.
Crawl results:
link:
Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False)  # each Link carries the url and its anchor text
links:
# the extracted urls: everything under https://www.lagou.com/zhaopin/
<class 'list'>: [Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False),
Link(url='https://www.lagou.com/zhaopin/PHP/', text='PHP', fragment='', nofollow=False),
Link(url='https://www.lagou.com/zhaopin/C++/', text='C++', fragment='', nofollow=False),
Link(url='https://www.lagou.com/zhaopin/qukuailian/', text='区块链', fragment='', nofollow=False), ......
response:
# the page just crawled: https://www.lagou.com/
<200 https://www.lagou.com/>
seen:
# every extracted url that matches a rule is added to the seen set
{Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False)}
On the crawl order when rules contains several Rule objects: Scrapy works asynchronously, so requests produced by different rules are not processed in any particular order.
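The link, links and seen values shown above come from CrawlSpider._requests_to_follow, which runs every rule's LinkExtractor over each response and de-duplicates the results. Below is a paraphrased sketch based on the Scrapy 1.x source (not this project's code; newer versions differ in detail):

# paraphrased from CrawlSpider._requests_to_follow in Scrapy 1.x
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()                               # links already scheduled for this response
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]           # links matching this rule, minus duplicates
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)   # Request whose callback is _response_downloaded
            yield rule.process_request(r)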
Program flow when crawling Lagou site-wide with CrawlSpider:
1. First, the home page (https://www.lagou.com — call this the first layer, a single URL) is crawled; its response contains all the links on that page.
2. Each of those links (the second layer: every URL found on the first-layer page) is checked against the user-defined rules. A URL matching any rule is requested (the rules have no priority order; matching any one of them is enough), and once the response comes back the rule's callback, if any, is invoked.
3. If the matching rule has follow=True, crawling continues one layer deeper (the links found on each second-layer page); with follow=False no further links are extracted from that response.
4. Finally, the items produced by the callbacks are processed and written to the database, completing the Lagou crawl.
One point about follow that my earlier CrawlSpider source-code write-up (linked above) did not mention: if a Rule sets follow=False, crawling stops after the second-layer pages have been fetched, not immediately after the first (home) page. The source code explains why:
When the spider starts, start_requests is called and the responses are handled by parse. parse calls _parse_response with follow=True, and whether to go one layer deeper is decided by this line inside _parse_response:
if follow and self._follow_links:
This check decides whether the next layer is crawled. On the first call, the follow argument is the hard-coded follow=True passed by parse, so the check is True regardless of what the Rule says and the second layer is always crawled. Only on later calls to _parse_response does the user-defined follow value from the Rule arrive, and only then can follow=False stop the crawl from going deeper. A paraphrased sketch of the two methods is shown below.
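For reference, here is a paraphrased sketch based on the Scrapy 1.x source (not this project's code; newer versions differ in detail):

# paraphrased from CrawlSpider in Scrapy 1.x
def parse(self, response):
    # the entry callback hard-codes follow=True, which is why the second layer is always crawled
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for request_or_item in iterate_spider_output(cb_res):
            yield request_or_item
    if follow and self._follow_links:   # only later calls receive the Rule's own follow value
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item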
One last problem
While crawling Lagou URLs, requests get 302-redirected to the login page. We can deal with this through custom_settings, which lets one spider override the default settings. For Lagou we either log in to obtain cookies, or paste the cookies into custom_settings by hand.
Testing shows that two headers are required when crawling Lagou: Cookie and User-Agent. Putting them into custom_settings is enough, and there are several ways to do it:
1. Add the cookies to custom_settings by hand: open the site in a browser (a plain GET is enough), inspect the request, and copy the Cookie and User-Agent values into custom_settings:
# Cookie and User-Agent are required
custom_settings = {
    "COOKIES_ENABLED": False,
    # "DOWNLOAD_DELAY": 1,  # delay in seconds
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': 'JSESSIONID=ABAAABAABEEAAJAF08A698E7D4CC5B1B474ED6DDA70F780; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902; _ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGSID=20181112003151-4b03aeb8-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=index_navigation; SEARCH_ID=6edac9dbb3714a8780795d56ccdc7f78; LGRID=20181112013543-36e69b49-e5d8-11e8-888e-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541957735',
        'Host': 'www.lagou.com',
        'Origin': 'https://www.lagou.com',
        'Referer': 'https://www.lagou.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
}
2. Use selenium to log in, grab the cookies, write them to a file and then feed them into custom_settings. This is the automated approach and the recommended one. Readers will need to try it themselves; since this article only needs a quick test, I did not go through it here. Here is some code from the web for reference:
from selenium import webdriver
from scrapy.selector import Selector
import time


def login_lagou():
    browser = webdriver.Chrome(executable_path="D:/chromedriver.exe")
    browser.get("https://passport.lagou.com/login/login.html")
    # fill in account and password
    browser\
        .find_element_by_css_selector("body > section > div.left_area.fl > div:nth-child(2) > form > div:nth-child(1) > input")\
        .send_keys("username")
    browser\
        .find_element_by_css_selector("body > section > div.left_area.fl > div:nth-child(2) > form > div:nth-child(2) > input")\
        .send_keys("password")
    # click the login button
    browser\
        .find_element_by_css_selector("body > section > div.left_area.fl > div:nth-child(2) > form > div.input_item.btn_group.clearfix > input")\
        .click()
    cookie_dict = {}
    time.sleep(3)
    Cookies = browser.get_cookies()
    for cookie in Cookies:
        cookie_dict[cookie['name']] = cookie['value']
    # browser.quit()
    return cookie_dict
Logging in with the requests library, or simulating the login with Scrapy's own Request/FormRequest, also works. There are several options; try them if you are interested. A rough sketch of the Scrapy-native route follows.
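As an illustration only (a minimal sketch of my own, not the article's code): the login endpoint and the form field names below are placeholders that must be checked against the real login request in the browser's network panel before use.

import scrapy


class LagouLoginSpider(scrapy.Spider):
    name = 'lagou_login_demo'

    def start_requests(self):
        # hypothetical login endpoint and field names -- replace with the real ones
        return [scrapy.FormRequest(
            'https://passport.lagou.com/login/login.html',
            formdata={'username': 'your_account', 'password': 'your_password'},
            callback=self.after_login,
        )]

    def after_login(self, response):
        # Scrapy's cookie middleware keeps the session cookies from the login response,
        # so this only works if COOKIES_ENABLED is not switched off for this spider
        yield scrapy.Request('https://www.lagou.com/', callback=self.parse)

    def parse(self, response):
        pass  # normal parsing would continue here

Once logged in, the cookies could also be dumped to a file and pasted into custom_settings, as in option 1.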
Test code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=False),
        # Rule(LinkExtractor(allow=r'gongsi/\d+.html/'), follow=False),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    custom_settings = {
        "COOKIES_ENABLED": False,
        # "DOWNLOAD_DELAY": 1,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': 'JSESSIONID=ABAAABAABEEAAJAF08A698E7D4CC5B1B474ED6DDA70F780; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902; _ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGSID=20181112003151-4b03aeb8-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=index_navigation; SEARCH_ID=6edac9dbb3714a8780795d56ccdc7f78; LGRID=20181112013543-36e69b49-e5d8-11e8-888e-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541957735',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
    }

    # headers = {
    #     "HOST": "www.lagou.com",
    #     'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    # }
    #
    # def _build_request(self, rule, link):
    #     r = Request(url=link.url, callback=self._response_downloaded, headers=self.headers)
    #     r.meta.update(rule=rule, link_text=link.text)
    #     return r

    def parse_job(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
II. Crawling Lagou with CrawlSpider in Practice
With the lagou.py spider created, the next step is to pin down what to extract. For every job posting on Lagou we want the following fields:
class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()            # job title
    url = scrapy.Field()
    url_object_id = scrapy.Field()    # md5 of the url
    salary = scrapy.Field()           # salary
    job_city = scrapy.Field()         # city
    work_years = scrapy.Field()       # years of experience
    degree_need = scrapy.Field()      # education required
    job_type = scrapy.Field()         # job type (full-time/part-time)
    publish_time = scrapy.Field()     # publish time
    job_advantage = scrapy.Field()    # perks
    job_desc = scrapy.Field()         # job description
    job_addr = scrapy.Field()         # work address
    company_name = scrapy.Field()     # company name
    company_url = scrapy.Field()      # company url
    tags = scrapy.Field()             # job tags
    crawl_time = scrapy.Field()       # crawl time
1. Define a custom ItemLoader in items.py:
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader for Lagou
    default_output_processor = TakeFirst()
2. Extract the data in lagou.py:
def parse_job(self, response):
    # parse a Lagou job posting
    item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
    item_loader.add_css("title", ".job-name::attr(title)")
    item_loader.add_value("url", response.url)
    item_loader.add_value("url_object_id", get_md5(response.url))
    item_loader.add_css("salary", ".job_request .salary::text")
    item_loader.add_css("job_city", ".job_request span:nth-child(2)::text")  # the second span under .job_request
    item_loader.add_css("work_years", ".job_request span:nth-child(3)::text")
    item_loader.add_css("degree_need", ".job_request span:nth-child(4)::text")
    item_loader.add_css("job_type", ".job_request span:nth-child(5)::text")
    item_loader.add_css("tags", ".position-label li::text")
    item_loader.add_css("publish_time", ".publish_time::text")
    item_loader.add_css("job_advantage", ".job-advantage p::text")
    item_loader.add_css("job_desc", ".job_bt div")
    item_loader.add_css("job_addr", ".work_addr")
    item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
    item_loader.add_css("company_url", "#job_company dt a::attr(href)")
    item_loader.add_value("crawl_time", datetime.now())

    lagou_job_item = item_loader.load_item()
    return lagou_job_item
Then debug a crawl; the resulting item looks like this:
3. Clean up the data further in items.py:
def remove_splash(value):
    # strip the "/" characters
    return value.replace("/", "")


def time_split(value):
    # split on whitespace and keep the time, e.g. publish_time: "13:55 发布于拉勾网"
    value_list = value.split(" ")
    return value_list[0]


def get_word_year(value):
    # extract the years-of-experience range
    match_re = re.match(".*?((\d+)-?(\d*)).*", value)
    if match_re:
        word_year = match_re.group(1)
    else:
        word_year = "经验不限"
    return word_year


def get_job_addr(value):
    # join the address lines and drop the useless parts
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != '查看地图']
    return "".join(addr_list)


def get_job_desc(value):
    # join the job-description lines
    desc_list = value.split("\n")
    desc_list = [item.strip() for item in desc_list]
    return "".join(desc_list)
Apply them in the LagouJobItem class in items.py:
# Lagou items
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader for Lagou
    default_output_processor = TakeFirst()


class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()            # job title
    url = scrapy.Field()
    url_object_id = scrapy.Field()    # md5 of the url
    salary = scrapy.Field()           # salary
    job_city = scrapy.Field(          # city
        input_processor=MapCompose(remove_splash)
    )
    work_years = scrapy.Field(        # years of experience
        input_processor=MapCompose(get_word_year)
    )
    degree_need = scrapy.Field(       # education required
        input_processor=MapCompose(remove_splash)
    )
    job_type = scrapy.Field()         # job type (full-time/part-time)
    publish_time = scrapy.Field(      # publish time
        input_processor=MapCompose(time_split)
    )
    job_advantage = scrapy.Field()    # perks
    job_desc = scrapy.Field(          # job description
        input_processor=MapCompose(remove_tags, get_job_desc)
    )
    job_addr = scrapy.Field(          # work address
        input_processor=MapCompose(remove_tags, get_job_addr)
    )
    company_name = scrapy.Field()     # company name
    company_url = scrapy.Field()      # company url
    tags = scrapy.Field(              # job tags
        input_processor=Join("-")
    )
    crawl_time = scrapy.Field()       # crawl time
Debug again; the crawled values now look the way we want:
4. Next, store the crawled data in the database.
1) First create the table and choose column types; a possible schema is sketched below.
Table name: lagou_job
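A possible table definition, inferred from the item fields above (a sketch only: the column types and lengths are my assumptions, not the article's original schema; the primary key on url_object_id is what makes the ON DUPLICATE KEY UPDATE clause used later work):

import MySQLdb

ddl = """
CREATE TABLE IF NOT EXISTS lagou_job (
    title          VARCHAR(255) NOT NULL,
    url            VARCHAR(300) NOT NULL,
    url_object_id  VARCHAR(50)  NOT NULL,
    salary         VARCHAR(30),
    job_city       VARCHAR(20),
    work_years     VARCHAR(100),
    degree_need    VARCHAR(30),
    job_type       VARCHAR(20),
    publish_time   VARCHAR(20),
    job_advantage  VARCHAR(1000),
    job_desc       LONGTEXT,
    job_addr       VARCHAR(100),
    company_name   VARCHAR(100),
    company_url    VARCHAR(300),
    tags           VARCHAR(100),
    crawl_time     DATETIME,
    PRIMARY KEY (url_object_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
"""

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="your_password",
                       db="article_spider", charset="utf8")
cursor = conn.cursor()
cursor.execute(ddl)   # create the table once; running it again is a no-op thanks to IF NOT EXISTS
conn.commit()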
2) Writing the data to the database
Previously, in pipelines.py, the insert/update/delete logic lived directly inside the MysqlTwistedPipeline class, which effectively hard-codes it. In a project that crawls several sites, each site's data needs its own persistence logic, so that approach no longer works.
The fix is simple: the asynchronous database setup stays the same, and only the SQL differs per site, so we move that part into each spider's Item. Concretely:
Old version:
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # reads the settings; called before process_item
        dbparm = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,  # dict-style cursor (a json-style variant also exists)
            use_unicode=True,
        )
        # adbapi.ConnectionPool: a connection pool provided by Twisted for asynchronous DB access;
        # pass in the DB driver name and the connection parameters to connect to MySQL
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparm)
        return cls(dbpool)  # instantiate MysqlTwistedPipeline

    def process_item(self, item, spider):
        # called for every item
        query = self.dbpool.runInteraction(self.sql_insert, item)  # run the SQL asynchronously
        query.addErrback(self.handle_error, item, spider)          # error handling

    def handle_error(self, failure, item, spider):
        # handle exceptions
        print("exception:", failure)

    def sql_insert(self, cursor, item):
        # insert one row
        insert_sql = """
            insert into article_spider(title, url, create_date, fav_nums, url_object_id)
            VALUES (%s, %s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"], item["url_object_id"]))
New version:
First, move the SQL for this item into the item class itself:
# items.py / LagouJobItem (method of the item class)
def get_insert_sql(self):
    insert_sql = """
        insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
        job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
        tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
    """
    params = (
        self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
        self["work_years"], self["degree_need"], self["job_type"], self["publish_time"],
        self["job_advantage"], self["job_desc"], self["job_addr"], self["company_name"],
        self["company_url"], self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
    )
    return insert_sql, params
ON DUPLICATE KEY UPDATE: if the row's primary key already exists (a conflict), the listed columns are updated instead of a new row being inserted.
Next, change the do_insert method of MysqlTwistedPipeline:
def do_insert(self, cursor, item):
    # perform the actual insert:
    # each item builds its own SQL, so different items map to different tables
    insert_sql, params = item.get_insert_sql()
    cursor.execute(insert_sql, params)
In settings.py:
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 4,  # option 3: save items to the database asynchronously
}
Debug again and watch the lagou_job table: rows keep arriving.
Complete code for crawling Lagou with CrawlSpider
1. settings.py
import os

BOT_NAME = 'ArticleSpider'

SPIDER_MODULES = ['ArticleSpider.spiders']
NEWSPIDER_MODULE = 'ArticleSpider.spiders'

ROBOTSTXT_OBEY = False

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ArticleSpider.middlewares.ArticlespiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    # 'ArticleSpider.pipelines.ArticleImagePipeline': 2,       # used for image downloads
    # 'ArticleSpider.pipelines.JsonWithEncodingPipeline': 3,   # option 1: save item data to json, runs after the image download
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 4,         # option 3: save items to the database asynchronously
    # 'ArticleSpider.pipelines.JsonExporterPipleline': 3,      # option 2: save item data with Scrapy's JsonItemExporter
    # 'scrapy.pipelines.images.ImagesPipeline': 1,             # Scrapy's built-in ImagesPipeline for image/media downloads
}

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(BASE_DIR, 'images')       # fixed setting name; where downloaded images are stored
IMAGES_URLS_FIELD = "acticle_image_url"               # fixed setting name; the item field holding image urls to download
ITEM_DATA_DIR = os.path.join(BASE_DIR, "item_data")   # item data is saved to the local item_data folder

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MySQL configuration
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "0315"

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
2. common.py
# md5 hashing
import hashlib


def get_md5(url):
    if isinstance(url, str):  # in Python 3, str is already unicode
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()
3. items.py
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from w3lib.html import remove_tags  # strips HTML tags

from ArticleSpider.settings import SQL_DATETIME_FORMAT, SQL_DATE_FORMAT


def return_value(value):
    return value


def remove_splash(value):
    return value.replace("/", "")


def time_split(value):
    # split on whitespace and keep the time, e.g. publish_time: "13:55 发布于拉勾网"
    value_list = value.split(" ")
    return value_list[0]


def get_word_year(value):
    # extract the years-of-experience range
    match_re = re.match(".*?((\d+)-?(\d*)).*", value)
    if match_re:
        word_year = match_re.group(1)
    else:
        word_year = "经验不限"
    return word_year


def get_job_addr(value):
    # join the address lines and drop the useless parts
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != '查看地图']
    return "".join(addr_list)


def get_job_desc(value):
    # join the job-description lines
    desc_list = value.split("\n")
    desc_list = [item.strip() for item in desc_list]
    return "".join(desc_list)


# Lagou items
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader for Lagou
    default_output_processor = TakeFirst()


class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()            # job title
    url = scrapy.Field()
    url_object_id = scrapy.Field()    # md5 of the url
    salary = scrapy.Field()           # salary
    job_city = scrapy.Field(          # city
        input_processor=MapCompose(remove_splash)
    )
    work_years = scrapy.Field(        # years of experience
        input_processor=MapCompose(get_word_year)
    )
    degree_need = scrapy.Field(       # education required
        input_processor=MapCompose(remove_splash)
    )
    job_type = scrapy.Field()         # job type (full-time/part-time)
    publish_time = scrapy.Field(      # publish time
        input_processor=MapCompose(time_split)
    )
    job_advantage = scrapy.Field()    # perks
    job_desc = scrapy.Field(          # job description
        input_processor=MapCompose(remove_tags, get_job_desc)
    )
    job_addr = scrapy.Field(          # work address
        input_processor=MapCompose(remove_tags, get_job_addr)
    )
    company_name = scrapy.Field()     # company name
    company_url = scrapy.Field()      # company url
    tags = scrapy.Field(              # job tags
        input_processor=Join("-")
    )
    crawl_time = scrapy.Field()       # crawl time

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
            tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
        """
        params = (
            self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
            self["work_years"], self["degree_need"], self["job_type"], self["publish_time"],
            self["job_advantage"], self["job_desc"], self["job_addr"], self["company_name"],
            self["company_url"], self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )
        return insert_sql, params
4. pipelines.py
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # reads the settings; called before process_item
        dbparm = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,  # dict-style cursor (a json-style variant also exists)
            use_unicode=True,
        )
        # adbapi.ConnectionPool: a connection pool provided by Twisted for asynchronous DB access;
        # pass in the DB driver name and the connection parameters to connect to MySQL
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparm)
        return cls(dbpool)  # instantiate MysqlTwistedPipeline

    def process_item(self, item, spider):
        # called for every item
        query = self.dbpool.runInteraction(self.do_insert, item)  # run the SQL asynchronously
        query.addErrback(self.handle_error, item, spider)         # error handling

    def handle_error(self, failure, item, spider):
        # handle errors raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # perform the actual insert:
        # each item builds its own SQL, so different items map to different tables
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
5. lagou.py
from datetime import datetime

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ArticleSpider.items import LagouJobItem, LagouJobItemLoader
from ArticleSpider.util.common import get_md5


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html/'), follow=True),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    custom_settings = {
        "COOKIES_ENABLED": False,
        "DOWNLOAD_DELAY": 3,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': '_ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; JSESSIONID=ABAAABAAAGGABCB641A801FD52253622370040445465BDC; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902,1541989806; TG-TRACK-CODE=index_navigation; SEARCH_ID=0ee1c4af2c2d47dc84be450da8c8c8fc; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541992192; LGRID=20181112111000-70d55352-e628-11e8-9b85-525400f775ce',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
    }

    def parse_job(self, response):
        # parse a Lagou job posting
        item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
        item_loader.add_css("title", ".job-name::attr(title)")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("salary", ".job_request .salary::text")
        item_loader.add_css("job_city", ".job_request span:nth-child(2)::text")  # the second span under .job_request
        item_loader.add_css("work_years", ".job_request span:nth-child(3)::text")
        item_loader.add_css("degree_need", ".job_request span:nth-child(4)::text")
        item_loader.add_css("job_type", ".job_request span:nth-child(5)::text")
        item_loader.add_css("tags", ".position-label li::text")
        item_loader.add_css("publish_time", ".publish_time::text")
        item_loader.add_css("job_advantage", ".job-advantage p::text")
        item_loader.add_css("job_desc", ".job_bt div")
        item_loader.add_css("job_addr", ".work_addr")
        item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
        item_loader.add_css("company_url", "#job_company dt a::attr(href)")
        item_loader.add_value("crawl_time", datetime.now())

        lagou_job_item = item_loader.load_item()
        return lagou_job_item
6. main.py
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # add the project root to sys.path
execute(['scrapy', 'crawl', 'lagou'])  # equivalent to running: scrapy crawl lagou ('lagou' is the name attribute of LagouSpider in lagou.py)
III. Getting Past Anti-Crawler Measures with Scrapy
1. Basics
Crawler: a program that fetches data automatically; the point is fetching it in bulk.
Anti-crawler: technical measures intended to block crawler programs.
False positives: when anti-crawler measures classify ordinary users as crawlers; a measure that does this is unusable no matter how effective it is.
Cost: the human and machine cost of running anti-crawler measures.
Interception: the higher the interception rate, the higher the false-positive rate.
Why sites deploy anti-crawler measures:
How the crawler vs. anti-crawler arms race plays out:
2. Rotating the user-agent with a downloader middleware
Scrapy already sets a default user-agent for us (it identifies itself as Scrapy); to rotate user-agents at random we have to write our own UserAgentMiddleware. First, enable the relevant
DOWNLOADER_MIDDLEWARES configuration in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
}
Next, disable Scrapy's built-in UserAgentMiddleware by setting it to None:
DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default user-agent middleware
}
Then create a new class, RandomUserAgentMiddlware, in middlewares.py.
Before that, grab the fake-useragent package from GitHub; it maintains a long list of user-agent strings (see its README for details).
Install it with pip install fake-useragent, then import it in the project.
The RandomUserAgentMiddlware class:
from fake_useragent import UserAgent  # UserAgent from fake-useragent


class RandomUserAgentMiddlware(object):
    # rotate the user-agent at random
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()  # instantiate UserAgent
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")  # user-agent family from settings (firefox, chrome, ie or random)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # map the configured type to the matching UserAgent attribute

        request.headers.setdefault('User-Agent', get_ua())  # add it to the request headers
Configure it in settings.py:
1) The user-agent type:
RANDOM_UA_TYPE = "random"
2) Register it in DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # 'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
    'ArticleSpider.middlewares.RandomUserAgentMiddlware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
With that, random user-agent switching is in place.
If debugging or running now fails with fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached, try the following in turn:
- ua = UserAgent(use_cache_server=False) (disable the hosted cache server if you do not want to depend on it)
- ua = UserAgent(cache=False) (skip the local cache database, e.g. when there is no writable filesystem)
If neither works, use:
ua = UserAgent(verify_ssl=False)
The user-agent list that fake-useragent relies on is hosted online, so the list page that an outdated version points at may return 404.
Update the cached list at any time with:
ua.update()
List all cached user-agents:
ua.data_browsers
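As a quick illustration (my own snippet, not from the original article), this is how the RANDOM_UA_TYPE value in settings.py ends up selecting a string, mirroring the middleware's get_ua():

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.chrome)             # a random Chrome user-agent string
print(ua.random)             # a random string from any browser family

ua_type = "random"           # what RANDOM_UA_TYPE holds in settings.py
print(getattr(ua, ua_type))  # the same lookup the middleware's get_ua() performs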
Re-run in debug mode and you can watch a randomly chosen user-agent being added to the request headers:
3. Building an IP proxy pool from Xici
Xici IP proxies: http://www.xicidaili.com
Dynamic IPs: restarting the router (or similar) gets you a new address.
How IP proxying works: instead of sending requests from our real IP we go through an intermediary (a proxy server), so the target server never sees our IP and cannot ban it.
Test: using a proxy is simple; add a single line to the process_request method of the RandomUserAgentMiddlware class defined above:
request.meta["proxy"] = "http://118.190.95.35:9001" # 使用的是西刺代理ip:118.190.95.35 ,端口:9001 ,类型:HTTP
With that, every request the spider sends out goes through the proxy.
This is only the simplest form of proxying; a single proxy IP is still easy to detect, so just as with the random user-agent we should rotate proxies at random, which greatly lowers the chance of being caught.
First we need a script that scrapes the proxy data from Xici into a file or database on our side.
Create a tools package to hold such scripts.
Inside tools, create crawl_xici_ip.py: it scrapes the Xici proxy data (IP, port, protocol type, response time, and so on), stores it in the database and hands out proxies on demand (our IP proxy pool).
1) Scraping the data
import requests
from scrapy.selector import Selector

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

for i in range(200):
    rep = requests.get("http://www.xicidaili.com/nn/{0}".format(i), headers=headers)
    # print(rep)
    selector = Selector(text=rep.text)  # hand the response text to a Selector
    all_trs = selector.css("#ip_list tr")

    ip_list = []
    for tr in all_trs[1:]:
        # pull the ip, port and related columns from the Xici table
        ip = tr.css("td:nth-child(2)::text").extract_first('')
        port = tr.css("td:nth-child(3)::text").extract_first('')
        anony_type = tr.css("td:nth-child(5)::text").extract_first('')
        proxy_type = tr.css("td:nth-child(6)::text").extract_first('')
        speed_str = tr.css("td:nth-child(8) div::attr(title)").extract_first('')
        if speed_str:
            speed = float(speed_str.split("秒")[0])
        else:
            speed = 9999.0
        ip_list.append((ip, port, anony_type, proxy_type, speed))  # collect into a list
2) Storing it in the database
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="*******", db="article_spider", charset="utf8")
cursor = conn.cursor()

for ip_info in ip_list:
    cursor.execute(
        "insert into proxy_ip (ip,port,anony_type,proxy_type,speed) values('{0}','{1}','{2}','{3}',{4})".format(
            ip_info[0], ip_info[1], ip_info[2], ip_info[3], ip_info[4]
        )
    )
    conn.commit()
3) Fetching a proxy from the database: take a random (IP, port) row, test it, and return it if it works; if it does not, delete the row and pick another, looping until a working proxy is found.
class GetIP(object):
    def delete_ip(self, ip):
        # delete an invalid ip from the database
        delete_sql = """
            delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # request Baidu through the proxy to check whether the ip still works
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)  # the proxy address
        try:
            proxy_dict = {
                "http": proxy_url,
            }
            response = requests.get(http_url, proxies=proxy_dict)  # proxies takes a dict such as {"http": "http://ip:port"}
        except Exception as e:
            print("Invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print("Effective ip")
                return True
            else:
                print("Invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pick a random ip/port row from MySQL
        random_sql = """
            select ip, port from proxy_ip
            where proxy_type='http'
            order by RAND()
            limit 1
        """
        result = cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            judge_re = self.judge_ip(ip, port)
            if judge_re:
                # the ip/port passed the test, return it
                return "http://{0}:{1}".format(ip, port)
            else:
                return self.get_random_ip()  # the ip is dead, pick another one
Full crawl_xici_ip.py:
import requests
from scrapy.selector import Selector
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="0315", db="article_spider", charset="utf8")
cursor = conn.cursor()


def crawl_ips():
    # scrape the Xici proxy data and store it in the database
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    for i in range(200):
        rep = requests.get("http://www.xicidaili.com/nn/{0}".format(i), headers=headers)
        # print(rep)
        selector = Selector(text=rep.text)  # hand the response text to a Selector
        all_trs = selector.css("#ip_list tr")

        ip_list = []
        for tr in all_trs[1:]:
            # pull the ip, port and related columns from the Xici table
            ip = tr.css("td:nth-child(2)::text").extract_first('')
            port = tr.css("td:nth-child(3)::text").extract_first('')
            anony_type = tr.css("td:nth-child(5)::text").extract_first('')
            proxy_type = tr.css("td:nth-child(6)::text").extract_first('')
            speed_str = tr.css("td:nth-child(8) div::attr(title)").extract_first('')
            if speed_str:
                speed = float(speed_str.split("秒")[0])
            else:
                speed = 9999.0
            ip_list.append((ip, port, anony_type, proxy_type, speed))  # collect into a list

        for ip_info in ip_list:
            cursor.execute(
                "insert into proxy_ip (ip,port,anony_type,proxy_type,speed) values('{0}','{1}','{2}','{3}',{4})".format(
                    ip_info[0], ip_info[1], ip_info[2], ip_info[3], ip_info[4]
                )
            )
            conn.commit()


class GetIP(object):
    def delete_ip(self, ip):
        # delete an invalid ip from the database
        delete_sql = """
            delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # request Baidu through the proxy to check whether the ip still works
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)  # the proxy address
        try:
            proxy_dict = {
                "http": proxy_url,
            }
            response = requests.get(http_url, proxies=proxy_dict)  # proxies takes a dict such as {"http": "http://ip:port"}
        except Exception as e:
            print("Invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print("Effective ip")
                return True
            else:
                print("Invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pick a random ip/port row from MySQL
        random_sql = """
            select ip, port from proxy_ip
            where proxy_type='http'
            order by RAND()
            limit 1
        """
        result = cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            judge_re = self.judge_ip(ip, port)
            if judge_re:
                # the ip/port passed the test, return it
                return "http://{0}:{1}".format(ip, port)
            else:
                return self.get_random_ip()  # the ip is dead, pick another one
Use it directly in RandomUserAgentMiddlware to pick a random proxy:
from ArticleSpider.tools.crawl_xici_ip import GetIP  # the helper script defined in tools/crawl_xici_ip.py


# inside RandomUserAgentMiddlware:
    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # map the configured type to the matching UserAgent attribute

        request.headers.setdefault('User-Agent', get_ua())    # add it to the request headers
        request.meta["proxy"] = self.get_ip.get_random_ip()   # pick a random working proxy from the pool
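Note that self.get_ip has to exist before process_request can use it; a minimal way to wire it up (my own addition, assuming the GetIP class above) is to create it in the middleware's __init__:

class RandomUserAgentMiddlware(object):
    # only __init__ shown; the rest of the class stays as before
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")
        self.get_ip = GetIP()  # the proxy-pool helper used by process_request above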
That completes the Xici-based IP proxying. In practice Xici's free proxies are unstable; if it matters, a paid proxy service is the better option.
A recommended paid option: Crawlera (scrapy-crawlera)
Setup:
1) Install:
pip install scrapy-crawlera
2) Configure it (settings.py):
DOWNLOADER_MIDDLEWARES = {
    ...
    'scrapy_crawlera.CrawleraMiddleware': 610
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'apikey'  # register on the official site to get an API key (the service is paid now)
2.1) Instead of the settings-based configuration, the same can be set per spider:
class MySpider:
    crawlera_enabled = True
    crawlera_apikey = 'apikey'
3) Use it in requests:
scrapy.Request(
    'http://example.com',
    headers={
        'X-Crawlera-Max-Retries': 1,  # this header enables the per-request option
        ...
    },
)
4. Tor (onion routing)
Tor wraps our traffic in multiple layers so the server cannot trace our real IP. Downloading and using it takes some extra access; search online for tutorials if you are interested.