Preventing bans when scraping with Scrapy
1. Delay time
import time
# first option: sleep explicitly inside a spider callback
time.sleep(2)
# second option: a per-request delay, set as DOWNLOAD_DELAY in settings.py
# or as the download_delay attribute on a spider (value in seconds)
DOWNLOAD_DELAY = 2
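A perfectly regular gap between requests is itself easy to fingerprint. As a minimal sketch, a randomized sleep helper (the name and bounds here are made up for illustration; Scrapy's built-in RANDOMIZE_DOWNLOAD_DELAY setting applies a similar jitter to DOWNLOAD_DELAY):

```python
import random
import time

def polite_sleep(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep for a random interval so request timing is less regular
    than a fixed delay. Returns the delay actually used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling polite_sleep() between requests spreads them out unevenly instead of on a fixed beat.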
2. Disable cookies
# Disable cookies (enabled by default) in settings.py,
# so requests do not carry a session the site can track
COOKIES_ENABLED = False
3. User Agent pool
3.1 Changing a single User-Agent
In the Scrapy shell you can inspect the headers that were sent, including the User-Agent, via request.headers:
scrapy shell example.com
request.headers
To change it, enable the setting in settings.py:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'xxxxxxxxxx'
3.2 Using a User-Agent list
3.2.1 Set the User-Agent list in settings.py
The default in settings.py is:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'txposition (+http://www.yourdomain.com)'
Change it to:
import random
USER_AGENT_LIST = [