Five strategies to keep a Scrapy crawler from being blocked by anti-crawling measures

This article introduces five strategies for avoiding anti-crawling blocks when using the Scrapy framework: setting a delay time, disabling cookies, using a User-Agent pool (changing a single value, setting a list in settings.py, or setting a list through a downloader middleware), using an IP pool, and distributed crawling. It explains in detail how to configure these in settings.py and middleware.py.

Preventing your crawler from being banned with the Scrapy framework

1. Delay time

import time

# Option 1: pause explicitly in the spider code
time.sleep(2)  # delay in seconds; the value here is only illustrative

# Option 2: let Scrapy throttle requests for you
# in settings.py:            DOWNLOAD_DELAY = 2
# or as a spider attribute:  download_delay = 2
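As a rough sketch (the 2-second value is only an example), the delay can be combined with Scrapy's built-in randomization setting so requests are not sent at a perfectly regular interval:

# settings.py
DOWNLOAD_DELAY = 2               # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x DOWNLOAD_DELAY (Scrapy's default behaviour)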

2. Disable cookies

# Disable cookies (enabled by default) in settings.py
COOKIES_ENABLED = False
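If you only want to drop cookies for one spider rather than the whole project, here is a minimal sketch using Scrapy's spider-level custom_settings (the spider name is illustrative):

import scrapy

class NoCookieSpider(scrapy.Spider):
    name = 'no_cookie_spider'  # illustrative name
    custom_settings = {
        'COOKIES_ENABLED': False,  # overrides the project-wide value for this spider only
    }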

3. User Agent pool

3.1 Changing a single User-Agent

In Scrapy you can check the current User-Agent through request.headers, for example in the Scrapy shell:

scrapy shell example.com
request.headers

To change it, set the USER_AGENT option in settings.py:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'xxxxxxxxxx'
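The header can also be overridden for a single request instead of project-wide; a minimal sketch, where the spider name and URL are only placeholders:

import scrapy

class SingleUASpider(scrapy.Spider):
    name = 'single_ua_spider'  # illustrative name

    def start_requests(self):
        yield scrapy.Request(
            'http://example.com',  # illustrative URL
            headers={'User-Agent': 'Mozilla/5.0 (compatible; my-crawler)'},  # per-request User-Agent
        )

    def parse(self, response):
        # log the User-Agent that was actually sent, to verify the override
        self.logger.info(response.request.headers.get('User-Agent'))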
3.2 Using a User-Agent list
3.2.1 Set the User-Agent list in settings.py

The default setting in settings.py is:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'txposition (+http://www.yourdomain.com)'

Change it to …

import random

USER_AGENT_LIST = [
    # fill in real browser User-Agent strings; the two entries below are only placeholders
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]
USER_AGENT = random.choice(USER_AGENT_LIST)  # one entry is picked when the settings are loaded
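Picking the value in settings.py chooses one User-Agent for the whole crawl. To rotate it per request, the list can instead be consumed by a downloader middleware (the approach the summary refers to); the following is only a sketch, with the class and module names assumed rather than taken from the article:

# middlewares.py
import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # read USER_AGENT_LIST from settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # set a random User-Agent on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

# settings.py: enable the middleware (the project path 'myproject' is illustrative)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}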