Building a Search Engine with a Distributed Python Crawler (Part 3)



 

I. Site-wide crawling of a recruitment site with CrawlSpider

1. Creating the Lagou spider project - using CrawlSpider

Recommended tool: cmder (download: http://cmder.net/). Get the full version so that a number of Linux commands also work on Windows.

In a terminal/cmder, cd into the project and run scrapy genspider --list to see the available spider templates:

Available templates:

basic       # plain scrapy.Spider
crawl       # CrawlSpider with link-extraction rules
csvfeed     # CSVFeedSpider for csv feeds
xmlfeed     # XMLFeedSpider for xml feeds

# -t generates the spider from the given template
scrapy genspider -t crawl lagou www.lagou.com

# without -t, the default basic template is used
scrapy genspider lagou www.lagou.com

 

Create the spider from the crawl template:

scrapy genspider -t crawl lagou www.lagou.com

 

This generates lagou.py, whose LagouSpider class inherits from CrawlSpider rather than the scrapy.Spider used by the basic template (note that CrawlSpider is itself a subclass of Scrapy's Spider).

 lagou.py:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

 

For background on site-wide crawling with CrawlSpider, see:

 

https://www.cnblogs.com/Eric15/p/9941197.html

 

Testing CrawlSpider against Lagou

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        # three rules
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=True),              # follow every url under zhaopin/
        Rule(LinkExtractor(allow=r'gongsi/\d+.html/'), follow=True),         # follow every url under gongsi/
        Rule(LinkExtractor(allow=r'jobs/\d+.html/'), callback='parse_job', follow=True),  # parse every url under jobs/
    )

    def parse_job(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

 

Set a breakpoint and run under the debugger.
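To set breakpoints you need to launch the spider from inside the IDE; a small launcher script that calls Scrapy's command line works well. A minimal sketch (essentially the main.py shown in full at the end of part II):

# main.py - run the spider under the IDE debugger
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # make the project package importable

execute(["scrapy", "crawl", "lagou"])  # same as running: scrapy crawl lagou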

 

Values seen in the debugger during the crawl:

 link:

Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False) # each extracted url plus its anchor text

 links:

# extracted urls - everything under https://www.lagou.com/zhaopin/
<class 'list'>: [Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False),
          Link(url='https://www.lagou.com/zhaopin/PHP/', text='PHP', fragment='', nofollow=False),
          Link(url='https://www.lagou.com/zhaopin/C++/', text='C++', fragment='', nofollow=False),
          Link(url='https://www.lagou.com/zhaopin/qukuailian/', text='区块链', fragment='', nofollow=False),           ......

 

  response:

# the page currently being parsed: https://www.lagou.com/
<200 https://www.lagou.com/>

 

 seen:

# every url that matched a rule is added to the seen set
{Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False)}

 

  

On the ordering of multiple rules: Scrapy is asynchronous, so requests produced by different rules are not crawled in any particular order.

  

How a site-wide CrawlSpider crawl of Lagou proceeds:

1. First the home page (https://www.lagou.com - call it level one, a single url) is downloaded; its response contains every link on that page.

2. Each url found there (level two) is matched against the user-defined Rules. A url that matches any rule (rules have no precedence) is requested according to that rule, and once its response arrives the rule's callback, if any, is called.

3. Then follow is checked: if True, the crawl goes one level deeper into the links found on each level-two page (level three); if False, no further links are extracted from those pages.

4. Finally the items are processed and written to the database, which completes the Lagou crawl.

 

One subtlety about follow that my earlier post on the CrawlSpider source did not cover: if a Rule sets follow=False, the crawl stops going deeper only after the level-two pages have been fetched, not immediately after the first (home) page. The source code shows why:

The crawl starts in start_requests, and the start-url responses are handled by parse, which calls _parse_response with follow=True. Whether to keep following links is decided by this line inside _parse_response:

if follow and self._follow_links:

On the first call, follow is the hard-coded True passed by parse, so the check succeeds no matter what the rules say and the level-two requests are generated. Only on later calls - for responses produced by a Rule - does _parse_response receive that rule's own follow value, and only then can follow=False stop the crawl from going deeper.
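For reference, the relevant part of CrawlSpider looked roughly like this at the time (a heavily abridged paraphrase of the Scrapy source, shown only to make the follow logic concrete; rule compilation, _requests_to_follow() and the other methods are omitted):

from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output


class CrawlSpider(Spider):
    # ... rule compilation, _requests_to_follow(), parse_start_url(), etc. omitted ...

    def parse(self, response):
        # responses for start_urls always arrive here with follow=True
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            for request_or_item in iterate_spider_output(cb_res):
                yield request_or_item
        # the check discussed above: only responses produced by a Rule
        # carry that rule's own follow value
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item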


 

One last issue

When crawling Lagou urls we hit 302 redirects (the site demands a login). We can handle this with custom_settings, which lets a single spider override some default settings. For Lagou, either log in and capture the cookies, or paste the cookies into custom_settings by hand.

Testing shows two things are required for Lagou: the Cookie and the User-Agent headers. Configure them in custom_settings. There are several ways to do this:

1. Paste the cookies in by hand: visit the site in a browser (a plain GET is enough), open the developer tools, and copy the Cookie and User-Agent values into custom_settings:

# Cookie and User-Agent are required
custom_settings = {
        "COOKIES_ENABLED": False, 
        # "DOWNLOAD_DELAY": 1,   # 延时/秒
        'DEFAULT_REQUEST_HEADERS': {
                                       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                                       'Accept-Encoding': 'gzip, deflate, br',
                                       'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
                                       'Connection': 'keep-alive',
                                       'Cookie': 'JSESSIONID=ABAAABAABEEAAJAF08A698E7D4CC5B1B474ED6DDA70F780; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902; _ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGSID=20181112003151-4b03aeb8-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=index_navigation; SEARCH_ID=6edac9dbb3714a8780795d56ccdc7f78; LGRID=20181112013543-36e69b49-e5d8-11e8-888e-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541957735',
                                       'Host': 'www.lagou.com',
                                       'Origin': 'https://www.lagou.com',
                                       'Referer': 'https://www.lagou.com/',
                                       'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
                                   }

    }

 

2. Use selenium to log in, grab the cookies, write them to a file, and load them into custom_settings. This is the automated approach and the one I would recommend, but you will have to try it yourself; since this post only needed a quick test I did not bother. The snippet below is adapted from code found online:

from selenium import webdriver
from scrapy.selector import Selector
import time
 
def login_lagou():
    browser = webdriver.Chrome(executable_path="D:/chromedriver.exe")
    browser.get("https://passport.lagou.com/login/login.html")
    # fill in the account and password
    browser\
        .find_element_by_css_selector("body > section > div.left_area.fl > div:nth-child(2) > form > div:nth-child(1) > input")\
        .send_keys("username")
    browser\
        .find_element_by_css_selector("body > section > div.left_area.fl > div:nth-child(2) > form > div:nth-child(2) > input")\
        .send_keys("password")
 
    # click the login button
    browser\
        .find_element_by_css_selector("body > section > div.left_area.fl > div:nth-child(2) > form > div.input_item.btn_group.clearfix > input")\
        .click()
    cookie_dict={}
    time.sleep(3)
    Cookies = browser.get_cookies()
    for cookie in Cookies:
        cookie_dict[cookie['name']] = cookie['value']
    # browser.quit()
 
    return cookie_dict

 

Logging in with the third-party requests library, or simulating the login with Scrapy's own Request, also works; try whichever you like. As a rough idea, the cookies returned by the selenium helper above can be handed straight to Scrapy, as in the sketch below.
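A minimal sketch of that second idea, assuming the login_lagou() helper above lives in an importable module (the import path below is a placeholder, not part of the original project) and that cookies are not disabled via COOKIES_ENABLED:

import scrapy

from ArticleSpider.tools.login_lagou import login_lagou  # hypothetical location of the selenium helper above


class LagouCookieSpider(scrapy.Spider):
    name = "lagou_cookie_demo"
    allowed_domains = ["www.lagou.com"]

    def start_requests(self):
        cookie_dict = login_lagou()  # {name: value} cookies collected by selenium
        # Scrapy accepts a plain dict for the cookies argument
        yield scrapy.Request("https://www.lagou.com/", cookies=cookie_dict, callback=self.parse)

    def parse(self, response):
        self.logger.info("logged-in page fetched: %s", response.url)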


Test code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=False),
        # Rule(LinkExtractor(allow=r'gongsi/\d+.html/'), follow=False),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    custom_settings = {
        "COOKIES_ENABLED": False,
        # "DOWNLOAD_DELAY": 1,
        'DEFAULT_REQUEST_HEADERS': {
                                       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                                       'Accept-Encoding': 'gzip, deflate, br',
                                       'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
                                       'Connection': 'keep-alive',
                                       'Cookie': 'JSESSIONID=ABAAABAABEEAAJAF08A698E7D4CC5B1B474ED6DDA70F780; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902; _ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGSID=20181112003151-4b03aeb8-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=index_navigation; SEARCH_ID=6edac9dbb3714a8780795d56ccdc7f78; LGRID=20181112013543-36e69b49-e5d8-11e8-888e-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541957735',
                                       'Host': 'www.lagou.com',
                                       'Origin': 'https://www.lagou.com',
                                       'Referer': 'https://www.lagou.com/',
                                       'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
                                   }

    }
    # headers = {
    #     "HOST": "www.lagou.com",
    #     'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    # }
    #
    # def _build_request(self, rule, link):
    #     r = Request(url=link.url, callback=self._response_downloaded,headers=self.headers)
    #     r.meta.update(rule=rule, link_text=link.text)
    #     return r


    def parse_job(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

II. Crawling Lagou with CrawlSpider in practice

With lagou.py created, the next step is deciding what to extract. For every job posting on Lagou we want the following fields:

class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()             # job title
    url = scrapy.Field()
    url_object_id = scrapy.Field()     # md5 of the url
    salary = scrapy.Field()            # salary
    job_city = scrapy.Field()          # city
    work_years = scrapy.Field()        # years of experience required
    degree_need = scrapy.Field()       # education required
    job_type = scrapy.Field()          # job type (full-time/part-time)
    publish_time = scrapy.Field()      # publish time
    job_advantage = scrapy.Field()     # job perks (职位诱惑)
    job_desc = scrapy.Field()          # job description
    job_addr = scrapy.Field()          # work address
    company_name = scrapy.Field()      # company name
    company_url = scrapy.Field()       # company url
    tags = scrapy.Field()              # tags
    crawl_time = scrapy.Field()        # crawl time

 

1. Define a custom ItemLoader in items.py:

class LagouJobItemLoader(ItemLoader):
    # custom Lagou ItemLoader: TakeFirst makes every field a scalar
    # instead of the list that add_css/add_value would otherwise collect
    default_output_processor = TakeFirst()

 

2. Extract the fields in lagou.py:

    def parse_job(self, response):
        # parse a Lagou job posting page
        item_loader = LagouJobItemLoader(item=LagouJobItem(),response=response)

        item_loader.add_css("title",".job-name::attr(title)")
        item_loader.add_value("url",response.url)
        item_loader.add_value("url_object_id",get_md5(response.url))
        item_loader.add_css("salary",".job_request .salary::text")
        item_loader.add_css("job_city",".job_request span:nth-child(2)::text")  # 取到span标签的第二个(span标签)
        item_loader.add_css("work_years",".job_request span:nth-child(3)::text")
        item_loader.add_css("degree_need",".job_request span:nth-child(4)::text")
        item_loader.add_css("job_type",".job_request span:nth-child(5)::text")

        item_loader.add_css("tags",".position-label li::text")
        item_loader.add_css("publish_time",".publish_time::text")
        item_loader.add_css("job_advantage",".job-advantage p::text")
        item_loader.add_css("job_desc",".job_bt div")
        item_loader.add_css("job_addr",".work_addr")
        item_loader.add_css("company_name","#job_company dt a img::attr(alt)")
        item_loader.add_css("company_url","#job_company dt a::attr(href)")
        item_loader.add_value("crawl_time",datetime.now())

        lagou_job_item = item_loader.load_item()


        return lagou_job_item

Then run the crawl under the debugger and inspect the loaded item.

 

3. Clean the data further in items.py:

def remove_splash(value):
    # strip the "/" character
    return value.replace("/","")

def time_split(value):
    # split on the space and keep the time, e.g. publish_time: "13:55  发布于拉勾网"
    value_list = value.split(" ")
    return value_list[0]

def get_word_year(value):
    # extract the years-of-experience range
    match_re = re.match(".*?((\d+)-?(\d*)).*", value)
    if match_re:
        word_year = match_re.group(1)
    else:
        word_year = "经验不限"
    return word_year

def get_job_addr(value):
    # join the address lines and drop the useless "查看地图" (view map) entries
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != '查看地图']
    return "".join(addr_list)

def get_job_desc(value):
    # join the job-description lines
    desc_list = value.split("\n")
    desc_list = [item.strip() for item in desc_list]
    return "".join(desc_list)

 

Applying them to the LagouJobItem class in items.py:

# Lagou items
class LagouJobItemLoader(ItemLoader):
    # custom Lagou ItemLoader
    default_output_processor = TakeFirst()

class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()             # job title
    url = scrapy.Field()
    url_object_id = scrapy.Field()     # md5 of the url
    salary = scrapy.Field()            # salary
    job_city = scrapy.Field(           # city
        input_processor=MapCompose(remove_splash)
    )
    work_years = scrapy.Field(         # years of experience required
        input_processor=MapCompose(get_word_year)
    )
    degree_need = scrapy.Field(       # education required
        input_processor=MapCompose(remove_splash)
    )
    job_type = scrapy.Field()          # job type (full-time/part-time)
    publish_time = scrapy.Field(      # publish time
        input_processor=MapCompose(time_split)
    )
    job_advantage = scrapy.Field()     # job perks (职位诱惑)
    job_desc = scrapy.Field(          # job description
        input_processor=MapCompose(remove_tags,get_job_desc)
    )
    job_addr = scrapy.Field(          # work address
        input_processor = MapCompose(remove_tags,get_job_addr)
    )
    company_name = scrapy.Field()      # company name
    company_url = scrapy.Field()       # company url
    tags = scrapy.Field(               # tags
        input_processor=Join("-")
    )
    crawl_time = scrapy.Field(        # crawl time
    )

 

Debug again and the crawled values now come out in the shape we want.

 


 

4. Finally, store the crawled data in the database.

1) First create the table and pick the column types (a sketch of the DDL follows below).

Table name: lagou_job
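The original post shows the table design only as a screenshot, which is not reproduced here. Judging from the item fields and the insert statement used later, the table presumably looks roughly like this (column sizes are guesses; url_object_id is the primary key so that the ON DUPLICATE KEY UPDATE used below has something to conflict with):

import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="0315", db="article_spider", charset="utf8")
cursor = conn.cursor()

# guessed DDL - adjust column types/sizes to taste
cursor.execute("""
    CREATE TABLE IF NOT EXISTS lagou_job (
        title         VARCHAR(255) NOT NULL,
        url           VARCHAR(300) NOT NULL,
        url_object_id VARCHAR(50)  NOT NULL PRIMARY KEY,
        salary        VARCHAR(20),
        job_city      VARCHAR(20),
        work_years    VARCHAR(20),
        degree_need   VARCHAR(20),
        job_type      VARCHAR(20),
        publish_time  VARCHAR(20),
        job_advantage TEXT,
        job_desc      LONGTEXT,
        job_addr      VARCHAR(100),
        company_name  VARCHAR(100),
        company_url   VARCHAR(300),
        tags          VARCHAR(255),
        crawl_time    DATETIME
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8
""")
conn.commit()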

 

2) Wire up the insert.

In the earlier parts we did the inserts directly inside the MysqlTwistedPipeline class in pipelines.py, which hard-codes the insert logic into the pipeline. In a project that crawls several sites, each site's data needs its own insert statement, so that approach no longer scales.

The fix is simple: the asynchronous plumbing stays identical and only the SQL differs per site, so we move the SQL-building part into each spider's Item class. Concretely:

The old insert code:

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self,dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls,settings):   # reads the settings; called before process_item
        dbparm = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,  # return rows as dicts instead of tuples
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb",**dbparm)  # adbapi.ConnectionPool: Twisted's async connection pool; pass the DB-API module name plus the connection parameters
        return cls(dbpool)  # instantiate MysqlTwistedPipeline

    def process_item(self, item, spider):
        # called for every item
        query = self.dbpool.runInteraction(self.sql_insert, item)  # run the insert asynchronously
        query.addErrback(self.handle_error, item, spider)  # error handling

    def handle_error(self, failure, item, spider):
        # handle the failure
        print("insert error:", failure)

    def sql_insert(self, cursor, item):
        # perform the insert
        insert_sql = """
                            insert into article_spider(title, url, create_date, fav_nums,url_object_id)
                            VALUES (%s, %s, %s, %s,%s)
                        """
        cursor.execute(insert_sql,
                       (item["title"], item["url"], item["create_date"], item["fav_nums"], item["url_object_id"]))

 

The new version:

First, move the SQL construction into the corresponding item class:

# items.py/LagouJobItem

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
            tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
        """
        params = (
            self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
            self["work_years"], self["degree_need"], self["job_type"],
            self["publish_time"], self["job_advantage"], self["job_desc"],
            self["job_addr"], self["company_name"], self["company_url"],
            self["job_addr"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params

 

ON DUPLICATE KEY UPDATE: if the inserted row's primary key already exists (a conflict), the listed columns are updated instead of a new row being inserted. A quick illustration follows.
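A minimal illustration of the upsert behaviour on a throwaway table (it reuses the MySQL settings from this project; the demo table and values are not part of the original code):

import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="0315", db="article_spider", charset="utf8")
cursor = conn.cursor()

cursor.execute("CREATE TABLE IF NOT EXISTS upsert_demo (url_object_id VARCHAR(50) PRIMARY KEY, salary VARCHAR(20))")

upsert = """
    INSERT INTO upsert_demo (url_object_id, salary) VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE salary=VALUES(salary)
"""
cursor.execute(upsert, ("abc123", "15k-25k"))   # first call: plain insert
cursor.execute(upsert, ("abc123", "20k-30k"))   # duplicate key: only salary is updated
conn.commit()

cursor.execute("SELECT salary FROM upsert_demo WHERE url_object_id=%s", ("abc123",))
print(cursor.fetchone())   # ('20k-30k',)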

Next, change the do_insert method of MysqlTwistedPipeline:

    def do_insert(self, cursor, item):
        # perform the actual insert
        # each item type builds its own SQL, so the pipeline stays generic
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

 

In settings.py:

ITEM_PIPELINES = {

   'ArticleSpider.pipelines.MysqlTwistedPipeline': 4,  # option 3: save items to MySQL asynchronously
}

Debug-run again and watch the lagou_job table: rows keep arriving.

 


 

Full code for the Lagou CrawlSpider

1. settings.py

import os

BOT_NAME = 'ArticleSpider'

SPIDER_MODULES = ['ArticleSpider.spiders']
NEWSPIDER_MODULE = 'ArticleSpider.spiders'


ROBOTSTXT_OBEY = False


# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ArticleSpider.middlewares.ArticlespiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
   # 'ArticleSpider.pipelines.ArticleImagePipeline': 2,  # used for image downloads
   # 'ArticleSpider.pipelines.JsonWithEncodingPipeline': 3,  # option 1: save items to a json file; runs after image download
   'ArticleSpider.pipelines.MysqlTwistedPipeline': 4,  # option 3: save items to MySQL asynchronously
   # 'ArticleSpider.pipelines.JsonExporterPipleline': 3,  # option 2: save items with scrapy's JsonItemExporter
    # 'scrapy.pipelines.images.ImagesPipeline':1  # scrapy's built-in pipeline for image/media downloads

}
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(BASE_DIR,'images')  # fixed setting name: where downloaded images are stored
IMAGES_URLS_FIELD = "acticle_image_url" # fixed setting name: the item field that holds the image urls

ITEM_DATA_DIR = os.path.join(BASE_DIR,"item_data")  # items are saved to the local item_data folder

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MySQL settings
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "0315"

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"

 

2. common.py

# md5 hashing
import hashlib

def get_md5(url):
    if isinstance(url,str):  # in Python 3, str is unicode; encode it to bytes first
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()
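A quick check of the helper (not in the original file): the digest is a stable 32-character hex string, which is what gets stored as url_object_id.

if __name__ == "__main__":
    digest = get_md5("https://www.lagou.com/jobs/12345.html")
    print(digest)        # 32-character hex string, identical for identical urls
    print(len(digest))   # 32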

 

3. items.py

import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from w3lib.html import remove_tags  # used to strip HTML tags
from ArticleSpider.settings import SQL_DATETIME_FORMAT, SQL_DATE_FORMAT

def return_value(value):
    return value

def remove_splash(value):
    return value.replace("/","")

def time_split(value):
    # split on the space and keep the time, e.g. publish_time: "13:55  发布于拉勾网"
    value_list = value.split(" ")
    return value_list[0]

def get_word_year(value):
    # extract the years-of-experience range
    match_re = re.match(".*?((\d+)-?(\d*)).*", value)
    if match_re:
        word_year = match_re.group(1)
    else:
        word_year = "经验不限"
    return word_year

def get_job_addr(value):
    # join the address lines and drop the useless "查看地图" (view map) entries
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != '查看地图']
    return "".join(addr_list)

def get_job_desc(value):
    # join the job-description lines
    desc_list = value.split("\n")
    desc_list = [item.strip() for item in desc_list]
    return "".join(desc_list)

# Lagou items
class LagouJobItemLoader(ItemLoader):
    # custom Lagou ItemLoader
    default_output_processor = TakeFirst()

class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()             # job title
    url = scrapy.Field()
    url_object_id = scrapy.Field()     # md5 of the url
    salary = scrapy.Field()            # salary
    job_city = scrapy.Field(           # city
        input_processor=MapCompose(remove_splash)
    )
    work_years = scrapy.Field(         # years of experience required
        input_processor=MapCompose(get_word_year)
    )
    degree_need = scrapy.Field(       # education required
        input_processor=MapCompose(remove_splash)
    )
    job_type = scrapy.Field()          # job type (full-time/part-time)
    publish_time = scrapy.Field(      # publish time
        input_processor=MapCompose(time_split)
    )
    job_advantage = scrapy.Field()     # job perks (职位诱惑)
    job_desc = scrapy.Field(          # job description
        input_processor=MapCompose(remove_tags,get_job_desc)
    )
    job_addr = scrapy.Field(          # work address
        input_processor = MapCompose(remove_tags,get_job_addr)
    )
    company_name = scrapy.Field()      # company name
    company_url = scrapy.Field()       # company url
    tags = scrapy.Field(               # tags
        input_processor=Join("-")
    )
    crawl_time = scrapy.Field(        # crawl time
    )

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
            tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
        """
        params = (
            self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
            self["work_years"], self["degree_need"], self["job_type"],
            self["publish_time"], self["job_advantage"], self["job_desc"],
            self["job_addr"], self["company_name"], self["company_url"],
            self["job_addr"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params

 

4. pipelines.py

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class MysqlTwistedPipeline(object):
    def __init__(self,dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls,settings):   # reads the settings; called before process_item
        dbparm = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,  # return rows as dicts instead of tuples
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb",**dbparm)  # adbapi.ConnectionPool: Twisted's async connection pool; pass the DB-API module name plus the connection parameters
        return cls(dbpool)  # instantiate MysqlTwistedPipeline

    def process_item(self, item, spider):
        # called for every item
        query = self.dbpool.runInteraction(self.do_insert, item)  # run the insert asynchronously
        query.addErrback(self.handle_error, item, spider)  # error handling
        return item  # pass the item on to any later pipelines

    def handle_error(self, failure, item, spider):
        # handle errors raised by the asynchronous insert
        print (failure)

    def do_insert(self, cursor, item):
        # perform the actual insert
        # each item type builds its own SQL, so the pipeline stays generic
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

 

5. lagou.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ArticleSpider.items import LagouJobItem,LagouJobItemLoader
from ArticleSpider.util.common import get_md5

from datetime import datetime


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html/'), follow=True),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    custom_settings = {
        "COOKIES_ENABLED": False,
        "DOWNLOAD_DELAY": 3,
        'DEFAULT_REQUEST_HEADERS': {
                                       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                                       'Accept-Encoding': 'gzip, deflate, br',
                                       'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
                                       'Connection': 'keep-alive',
                                       'Cookie': '_ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; JSESSIONID=ABAAABAAAGGABCB641A801FD52253622370040445465BDC; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902,1541989806; TG-TRACK-CODE=index_navigation; SEARCH_ID=0ee1c4af2c2d47dc84be450da8c8c8fc; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541992192; LGRID=20181112111000-70d55352-e628-11e8-9b85-525400f775ce',
                                       'Host': 'www.lagou.com',
                                       'Origin': 'https://www.lagou.com',
                                       'Referer': 'https://www.lagou.com/',
                                       'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
                                   }

    }


    def parse_job(self, response):
        # parse a Lagou job posting page
        item_loader = LagouJobItemLoader(item=LagouJobItem(),response=response)

        item_loader.add_css("title",".job-name::attr(title)")
        item_loader.add_value("url",response.url)
        item_loader.add_value("url_object_id",get_md5(response.url))
        item_loader.add_css("salary",".job_request .salary::text")
        item_loader.add_css("job_city",".job_request span:nth-child(2)::text")  # 取到span标签的第二个(span标签)
        item_loader.add_css("work_years",".job_request span:nth-child(3)::text")
        item_loader.add_css("degree_need",".job_request span:nth-child(4)::text")
        item_loader.add_css("job_type",".job_request span:nth-child(5)::text")

        item_loader.add_css("tags",".position-label li::text")
        item_loader.add_css("publish_time",".publish_time::text")
        item_loader.add_css("job_advantage",".job-advantage p::text")
        item_loader.add_css("job_desc",".job_bt div")
        item_loader.add_css("job_addr",".work_addr")
        item_loader.add_css("company_name","#job_company dt a img::attr(alt)")
        item_loader.add_css("company_url","#job_company dt a::attr(href)")
        item_loader.add_value("crawl_time",datetime.now())

        lagou_job_item = item_loader.load_item()


        return lagou_job_item

 

6. main.py

import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # add the project directory to sys.path

execute(['scrapy','crawl','lagou',]) # same as running: scrapy crawl lagou ('lagou' is the name attribute of LagouSpider in lagou.py)


 

III. Getting past anti-crawler limits with Scrapy

 

1. Basics

Crawler: a program that fetches data automatically; the point is fetching it in bulk.

Anti-crawling: technical measures used to keep crawlers out.

False positives: an anti-crawling measure that misidentifies ordinary users as crawlers is unusable, however effective it is.

Cost: the human and machine cost of anti-crawling.

Blocking: the higher the block rate, the higher the false-positive rate.

The goals of anti-crawling (figure not reproduced in this reprint).

 

How the crawler/anti-crawler arms race plays out (figure not reproduced in this reprint).

 


 

2. Rotating the User-Agent with a downloader middleware

Scrapy ships with a default UserAgentMiddleware (whose default User-Agent is simply "Scrapy"). To rotate user agents at random we need our own middleware. First, enable DOWNLOADER_MIDDLEWARES in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
}

 

Next, disable Scrapy's built-in UserAgentMiddleware by setting it to None:

DOWNLOADER_MIDDLEWARES = {
   'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None  # disable the default user-agent middleware
}

 

Then add a new class to middlewares.py: RandomUserAgentMiddlware.

Before that, install the fake-useragent package (on GitHub); it maintains a large pool of user-agent strings - see its README for details.

Install it with pip install fake-useragent, then import it in the project.

The RandomUserAgentMiddlware class:

from fake_useragent import UserAgent   # UserAgent from fake-useragent
class RandomUserAgentMiddlware(object):
    # rotate the user-agent at random
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()  # instantiate UserAgent
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")  # ua type from settings (chrome, firefox, ie or random)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # map the configured type to the matching UserAgent attribute

        request.headers.setdefault('User-Agent', get_ua())  # set the User-Agent header

 

Settings:

1) The user-agent type:

RANDOM_UA_TYPE = "random"

 

2) Register the middleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
   # 'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
   'ArticleSpider.middlewares.RandomUserAgentMiddlware': 543,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

 

With that, random user-agent rotation is in place.
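A quick standalone way to see what fake-useragent hands out (not part of the project code):

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # a random browser string; changes between calls
print(ua.chrome)   # a random Chrome user-agent
print(ua.firefox)  # a random Firefox user-agent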


 

If a debug or normal run now fails with fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached, try the following, one at a time:

  • ua = UserAgent(use_cache_server=False)  (do not fall back to fake-useragent's hosted cache server)
  • ua = UserAgent(cache=False)  (skip the local cache database, e.g. when there is no writable filesystem)

If neither helps, try:

 ua = UserAgent(verify_ssl=False)


 

Because fake-useragent keeps its user-agent list on an online page, very old versions of the library may get a 404 when fetching that page.

Refresh the cached list with:

ua.update()

Inspect all cached user agents with:

ua.data_browsers

 

Run again under the debugger and you can watch a randomly chosen user-agent being added to the request headers.

 


3. Building an IP proxy pool from Xici proxies

Xici free proxies: http://www.xicidaili.com

Dynamic IP: your own IP can be changed, e.g. by restarting the router.

How an IP proxy helps: requests go out through an intermediate proxy server instead of from our real IP, so the target server never sees our IP and cannot ban it.

Test: using a proxy is a one-liner added to process_request in the RandomUserAgentMiddlware class defined above:

request.meta["proxy"] = "http://118.190.95.35:9001"   # 使用的是西刺代理ip:118.190.95.35 ,端口:9001 ,类型:HTTP

With this in place, every request the spider sends goes out through the proxy.


 

That is only the simplest form of proxying. A single proxy IP is still easy to detect, so, just as with the random user-agent, we should rotate proxies at random; this greatly lowers the chance of being caught by anti-crawling measures.

First we need a script that scrapes the Xici proxy listings into a file or database on our server.

 

Create a tools package to hold such scripts.

Inside tools, create crawl_xici_ip.py. It scrapes the Xici proxy data (ip, port, protocol type, response time, etc.), stores it in the database, and hands out random proxies on request - our proxy pool.

1) Scrape the data

import requests
from scrapy.selector import Selector

def crawl_ips():
    # scrape the Xici proxy pages
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    for i in range(200):
        rep = requests.get("http://www.xicidaili.com/nn/{0}".format(i),headers=headers)
        # print(rep)
        selector = Selector(text=rep.text)  # feed the response text to a Selector
        all_trs = selector.css("#ip_list tr")

        ip_list = []
        for tr in all_trs[1:]:  # skip the header row; grab ip, port and the other columns
            ip = tr.css("td:nth-child(2)::text").extract_first('')
            port = tr.css("td:nth-child(3)::text").extract_first('')
            anony_type = tr.css("td:nth-child(5)::text").extract_first('')
            proxy_type = tr.css("td:nth-child(6)::text").extract_first('')
            speed_str = tr.css("td:nth-child(8) div::attr(title)").extract_first('')
            if speed_str:
                speed = float(speed_str.split("秒")[0])  # the title attribute looks like "0.123秒"
            else:
                speed = 9999.0

            ip_list.append((ip,port,anony_type,proxy_type,speed))  # collect into a list

 2) Store it in the database

import MySQLdb

conn = MySQLdb.connect(host = "127.0.0.1",user = "root",passwd = "*******",db = "article_spider",charset = "utf8")
cursor = conn.cursor()

        # this loop sits inside crawl_ips(), after ip_list has been built:
        for ip_info in ip_list:
            cursor.execute(
                "insert into proxy_ip (ip,port,anony_type,proxy_type,speed) values('{0}','{1}','{2}','{3}',{4})".format(
                    ip_info[0],ip_info[1],ip_info[2],ip_info[3],ip_info[4]
                )
            )
            conn.commit()

3) Read a proxy back from the database and test the (ip, port) pair: return it if it works, otherwise delete the row and pick another, looping until a usable proxy is found.

class GetIP(object):

    def delete_ip(self,ip):
        # delete an invalid ip from the database
        delete_sql = """
        delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self,ip,port):
        # fetch Baidu through the proxy to check whether it works
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip,port)  # the proxy url to use
        try:
            proxy_dict = {
                "http":proxy_url,
            }
            response = requests.get(http_url,proxies = proxy_dict)  # proxies must be a dict, e.g. {"http": "http://ip:port"}
        except Exception as e:
            print("Invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >=200 and code <300:
                print("Effective ip")
                return True
            else:
                print("Invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pick a random ip/port row from MySQL
        random_sql = """
        select ip, port from proxy_ip where proxy_type='http'
        order by RAND()
        limit 1
        """
        result = cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]

            judge_re = self.judge_ip(ip,port)
            if judge_re:            # the proxy works - return it
                return "http://{0}:{1}".format(ip,port)
            else:
                return self.get_random_ip()  # invalid proxy - recurse and try another

 

The complete crawl_xici_ip.py:

import requests
from scrapy.selector import Selector
import MySQLdb

conn = MySQLdb.connect(host = "127.0.0.1",user = "root",passwd = "0315",db = "article_spider",charset = "utf8")
cursor = conn.cursor()

def crawl_ips():
    # scrape the Xici proxy list and store it in the database
    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    for i in range(200):
        rep = requests.get("http://www.xicidaili.com/nn/{0}".format(i),headers=headers)
        # print(rep)
        selector = Selector(text=rep.text)  # feed the response text to a Selector
        all_trs = selector.css("#ip_list tr")

        ip_list = []
        for tr in all_trs[1:]:  # skip the header row; grab ip, port and the other columns
            ip = tr.css("td:nth-child(2)::text").extract_first('')
            port = tr.css("td:nth-child(3)::text").extract_first('')
            anony_type = tr.css("td:nth-child(5)::text").extract_first('')
            proxy_type = tr.css("td:nth-child(6)::text").extract_first('')
            speed_str = tr.css("td:nth-child(8) div::attr(title)").extract_first('')
            if speed_str:
                speed = float(speed_str.split("秒")[0])  # the title attribute looks like "0.123秒"
            else:
                speed = 9999.0

            ip_list.append((ip,port,anony_type,proxy_type,speed))  # collect into a list


        for ip_info in ip_list:
            cursor.execute(
                "insert into proxy_ip (ip,port,anony_type,proxy_type,speed) values('{0}','{1}','{2}','{3}',{4})".format(
                    ip_info[0],ip_info[1],ip_info[2],ip_info[3],ip_info[4]
                )
            )
            conn.commit()


class GetIP(object):

    def delete_ip(self,ip):
        # delete an invalid ip from the database
        delete_sql = """
        delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self,ip,port):
        # fetch Baidu through the proxy to check whether it works
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip,port)  # the proxy url to use
        try:
            proxy_dict = {
                "http":proxy_url,
            }
            response = requests.get(http_url,proxies = proxy_dict)  # proxies must be a dict, e.g. {"http": "http://ip:port"}
        except Exception as e:
            print("Invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >=200 and code <300:
                print("Effective ip")
                return True
            else:
                print("Invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pick a random ip/port row from MySQL
        random_sql = """
        select ip, port from proxy_ip where proxy_type='http'
        order by RAND()
        limit 1
        """
        result = cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]

            judge_re = self.judge_ip(ip,port)
            if judge_re:            # the proxy works - return it
                return "http://{0}:{1}".format(ip,port)
            else:
                return self.get_random_ip()  # invalid proxy - recurse and try another
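To exercise the pool by hand, something like this can be appended to crawl_xici_ip.py (the __main__ guard is not in the original):

if __name__ == "__main__":
    # crawl_ips()  # run once to (re)fill the proxy_ip table
    get_ip = GetIP()
    print(get_ip.get_random_ip())  # e.g. http://118.190.95.35:9001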

 

Use the pool directly in RandomUserAgentMiddlware to get a random proxy:

from ArticleSpider.tools.crawl_xici_ip import GetIP  # the helper defined in tools/crawl_xici_ip.py

    # in RandomUserAgentMiddlware.__init__, also create the pool helper:
    #     self.get_ip = GetIP()

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # map the configured type to the matching UserAgent attribute
        request.headers.setdefault('User-Agent', get_ua())  # set the User-Agent header
        request.meta["proxy"] = self.get_ip.get_random_ip()  # pick a random proxy from the pool

 

 

That completes IP rotation based on the Xici data. In practice the free Xici proxies are unstable; if you need reliability, a paid proxy service is the better choice.


 

Recommended paid proxy options

1. scrapy-crawlera

Configuration:

1) Install:

pip install scrapy-crawlera

2) Settings (settings.py):

DOWNLOADER_MIDDLEWARES = {
    ...
    'scrapy_crawlera.CrawleraMiddleware': 610
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'apikey'  # apikey: obtained by registering on the official site (the service is now paid)

 

 

2.1) Instead of configuring it in settings, you can also enable it per spider:

class MySpider:
    crawlera_enabled = True
    crawlera_apikey = 'apikey'

 

3) Using it in a request:

scrapy.Request(
    'http://example.com',
    headers={
        'X-Crawlera-Max-Retries': 1,  # per-request Crawlera option
        ...
    },
)

 


 

2. Tor (the onion network)

Tor routes our traffic through multiple layers of relays so the server cannot trace our real IP. Downloading and using it from here requires a VPN; if you are interested, search online for tutorials.


 

Reposted from: https://www.cnblogs.com/Eric15/articles/9937614.html
