scrapy爬虫之反反爬虫措施

1.禁用Cookie

部分网站会通过用户的Cookie信息对用户进行识别与分析,所以要防止目标网站识别我们的会话信息。

在Scrapy中,默认是打开cookie的 (#COOKIES_ENABLED = False

设置为:COOKIES_ENABLED = False (cookie启用:no),对于需要cookie的可以在请求头中headers加入cookie

class LagouspiderSpider(scrapy.Spider):
    name = "lagouspider"
    allowed_domains = ["www.lagou.com"]

    url = 'https://www.lagou.com/jobs/positionAjax.json?'#city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false'
    page = 1
    allpage =0

    cookie = 'JSESSIONID=ABAAABAAAFCAAEG34858C57541C1F9DF75AED18C3065736; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1524281748;  04797acf-4515-11e8-90b5- LGSID=20180421130026-e7e614d7-4520-PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3Fcity%3D%25E6%25B7%25B1%25E5%259C%25B3%26cl%3Dfalse%26fromSearch%3Dtrue%26labelWords%3D%26suginput%3D; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F4302345.html; LGRID=20180421130208-24b73966-4521-11e8-90f2-525400f775ce; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1524286956'
    headers = {'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
               'Referer': 'https://www.lagou.com/jobs/list_python?city=%E6%B7%B1%E5%9C%B3&cl=false&fromSearch=true&labelWords=&suginput=',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
               'cookie': cookie }

    def start_requests(self):
       yield scrapy.FormRequest(self.url, headers=self.headers, formdata={
            'first': 'true','pn': str(self.page),'kd': 'python','city': '深圳'}, callback=self.parse)


2.设置下载延时

在Scrapy中,默认是关闭请求下载延时的(#DOWNLOAD_DELAY = 3)

去掉#,或者在spider的请求间中加入 time.sleep(random.randint(5, 10))

    def parse(self, response):
        #print(response.text)
        item = LagouItem()
        data = json.loads(response.text)

        totalCount = data['content']['positionResult']['totalCount']#总共多少条信息
        resultSize = data['content']['positionResult']['resultSize']#每页多少条信息

        result = data['content']['positionResult']['result']#得到一个包含15个信息的列表
        for each in result:
            for field in item.fields:
                if field in each.keys():
                    item[field] = each.get(field)
            yield item

        time.sleep(random.randint(5, 10))

        if int(resultSize):
            self.allpage = int(totalCount) // int(resultSize) + 1
            if self.page < self.allpage:
                self.page += 1
                yield scrapy.FormRequest(self.url, headers=self.headers, formdata={
            'first': 'false','pn': str(self.page),'kd': 'python','city': '深圳'}, callback=self.parse)

3.设置USER-AGENT和代理ip

settings中:

DOWNLOADER_MIDDLEWARES = {
    'doubanMongo.middlewares.RandomUserAgent': 300,
    'doubanMongo.middlewares.RandomProxy':400
}

USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
    'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
    'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; zh-cn; M032 Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13'
]

PROXIES=[{'ip_port':'117.48.214.249:16817','user_passwd':'632345244:4tf9pcpw'}
        #{'ip_port':'117.48.214.249:16817','user_passwd':''},
        #{'ip_port':'117.48.214.249:16817','user_passwd':''},
        #{'ip_port':'117.48.214.249:16817','user_passwd':''}
       ]

middlewares中:

from scrapy.conf import settings
import base64
import random


class RandomProxy(object):

    def process_request(self, request, spider):
        proxy = random.choice(settings["PROXIES"])
        if proxy['user_passwd'] is None:
            request.meta['proxy'] = 'http://'+ proxy['ip_port']
        else:
            # 对账户密码进行base64编码转换
            b_pw =  bytes(proxy['user_passwd'], encoding = "utf-8")#string转为bytes
            base64_userpasswd = base64.encodestring(b_pw)#需要的参数是bytes对象
            # 对应到代理服务器的信令格式里
            s_base64_userpasswd = str(base64_userpasswd, encoding="utf-8") #bytes转为string
            request.headers['Proxy-Authorization'] = 'Basic ' + s_base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']


class RandomUserAgent(object):

    def process_request(self, request, spider):
        useragent = random.choice(settings["USER_AGENTS"])
        #print(useragent)
        request.headers.setdefault('User-Agent',useragent)



基于Python Scrapy实现的蜂鸟数据采集爬虫系统 含代理、日志处理和全部源代码等 import scrapy from fengniao.items import FengniaoItem from scrapy.spidermiddlewares.httperror import HttpError from twisted.internet.error import TimeoutError, TCPTimedOutError, DNSLookupError, ConnectionRefusedError class FengniaoclawerSpider(scrapy.Spider): name = 'fengniaoClawer' allowed_domains = ['fengniao.com'] # 爬虫自定义设置,会覆盖 settings.py 文件中的设置 custom_settings = { 'LOG_LEVEL': 'DEBUG', # 定义log等级 'DOWNLOAD_DELAY': 0, # 下载延时 'COOKIES_ENABLED': False, # enabled by default 'DEFAULT_REQUEST_HEADERS': { # 'Host': 'www.fengniao.com', 'Referer': 'https://www.fengniao.com', }, # 管道文件,优先级按照由小到大依次进入 'ITEM_PIPELINES': { 'fengniao.pipelines.ImagePipeline': 100, 'fengniao.pipelines.FengniaoPipeline': 300, }, # 关于下载图片部分 'IMAGES_STORE': 'fengniaoPhoto', # 没有则新建 'IMAGES_EXPIRES': 90, # 图片有效期,已经存在的图片在这个时间段内不会再下载 'IMAGES_MIN_HEIGHT': 100, # 图片最小尺寸(高度),低于这个高度的图片不会下载 'IMAGES_MIN_WIDTH': 100, # 图片最小尺寸(宽度),低于这个宽度的图片不会下载 # 下载中间件,优先级按照由小到大依次进入 'DOWNLOADER_MIDDLEWARES': { 'fengniao.middlewares.ProxiesMiddleware': 400, 'fengniao.middlewares.HeadersMiddleware': 543, 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, }, 'DEPTH_PRIORITY': 1, # BFS,是以starts_url为准,局部BFS,受CONCURRENT_REQUESTS影响 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'REDIRECT_PRIORITY_ADJUST': 2, # Default: +2 'RETRY_PRIORITY_ADJUST': -1, # Default: -1 'RETRY_TIMES': 8, # 重试次数 # Default: 2, can also be specified per-request using max_retry_times attribute of Request.meta 'DOWNLOAD_TIMEOUT': 30, # This timeout can be set per spider using download_timeout spider attribute and per-request using download_timeout Request.meta key # 'DUPEFILTER_CLASS': "scrapy_redis.dupefilter.RFPDupeFilter", # 'SCHEDULER': "scrapy_redis.scheduler.Scheduler", # 'SCHEDULER_PERSIST': False, # Don't cleanup red
在面对反爬虫技术时,Scrapy可以采取一些措施进行反反爬虫。其中一种方法是通过降低请求频率来模仿人类用户的行为。在Scrapy的配置文件settings.py中,可以设置DOWNLOAD_DELAY参数来指定请求的时间间隔。通过延迟请求,使得爬虫的行为更接近真实用户的访问频率。例如,设置DOWNLOAD_DELAY = 3,表示两次请求之间的间隔为3秒。此外,还可以使用随机延迟时间来避免请求过于规律,进一步增加爬虫被识别的难度。 另一种反反爬虫措施是修改Scrapy的User-Agent请求头。通过模拟不同的浏览器或设备类型,使得爬虫程序看起来更像是真实的用户在访问网站。这样可以绕过一些简单的反爬虫技术,如基于User-Agent的验证。可以在Scrapy的中间件中设置User-Agent的随机切换,或者使用代理IP来发送请求,增加请求的多样性,提高反爬虫的成功率。 此外,如果网站使用了robots.txt文件来限制爬虫访问,Scrapy可以通过在配置文件settings.py中取消ROBOTSTXT_OBEY的注释来忽略对robots.txt文件的遵守。这样可以强行爬取站点信息,绕过对爬虫的限制。例如,取消如下代码的注释:ROBOTSTXT_OBEY = False。 综上所述,通过降低请求频率、修改User-Agent请求头以及忽略robots.txt文件,Scrapy可以采取一些反反爬虫措施来应对网站的反爬虫技术。<span class="em">1</span><span class="em">2</span><span class="em">3</span> #### 引用[.reference_title] - *1* *2* [Python Scrapy反爬虫常见解决方案(包含5种方法)](https://blog.csdn.net/qq_30235073/article/details/96073042)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_1"}}] [.reference_item style="max-width: 50%"] - *3* [Python scrapy 爬虫入门(七)突破反爬虫技术](https://download.csdn.net/download/weixin_38670707/13749305)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_1"}}] [.reference_item style="max-width: 50%"] [ .reference_list ]
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值