settings.py explained
Why do we need a settings.py file?
It stores shared variables (e.g. the host of the SQL connection, the User-Agent string).
settings.py makes shared configuration easy to maintain: when a value changes, you only edit it once here (variable names in settings should be uppercase by convention, for readability).
Project name
BOT_NAME = 'tencent'
Where spider modules are located
SPIDER_MODULES = ['tencent.spiders']
Where newly generated spiders will be created
NEWSPIDER_MODULE = 'tencent.spiders'
User-Agent string
USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
Obey the robots exclusion protocol by default (can be set to False)
ROBOTSTXT_OBEY = True
In detail: when left at the default True, Scrapy first requests the site's robots.txt and respects its rules before crawling
Maximum number of concurrent requests (higher values mean faster requests, but also a greater chance of being detected as a crawler)
CONCURRENT_REQUESTS = 32
Download delay (wait three seconds between requests to the same site, to throttle the crawl rate; similar in effect to time.sleep())
DOWNLOAD_DELAY = 3
Maximum number of concurrent requests per domain (used together with DOWNLOAD_DELAY)
CONCURRENT_REQUESTS_PER_DOMAIN = 16
Maximum number of concurrent requests per IP (used together with DOWNLOAD_DELAY; if non-zero, it is used instead of the per-domain limit)
CONCURRENT_REQUESTS_PER_IP = 16
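As a rough rule of thumb (an illustrative back-of-envelope calculation, not an exact model of Scrapy's scheduler), DOWNLOAD_DELAY caps the per-domain request rate regardless of the concurrency limits above:

```python
# With DOWNLOAD_DELAY = 3, Scrapy waits about 3 seconds between requests
# to the same site, so the per-domain rate is capped at roughly
# 1 / DOWNLOAD_DELAY requests per second.
DOWNLOAD_DELAY = 3
max_requests_per_second = 1 / DOWNLOAD_DELAY   # about 0.33
max_requests_per_minute = 60 / DOWNLOAD_DELAY  # about 20
```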
Whether cookies are enabled (enabled by default)
#COOKIES_ENABLED = False
Telnet console (enabled by default)
#TELNETCONSOLE_ENABLED = False
Default request headers (the UA and cookies do not go here; they have their own dedicated settings)
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
Spider middlewares (Scrapy's built-in ones are enabled by default; uncomment to register your own)
#SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.TencentSpiderMiddleware': 543,
#}
Downloader middlewares (Scrapy's built-in ones are enabled by default; uncomment to register your own)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
Extensions (enabled by default)
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
Item pipelines (must be enabled manually)
ITEM_PIPELINES = {
'tencent.pipelines.TencentPipeline': 300,
}
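A minimal sketch of what TencentPipeline might contain (the class body is an assumption; Scrapy only requires the process_item hook). The number 300 is the pipeline's priority: lower numbers run earlier, and values are conventionally kept in the 0-1000 range:

```python
class TencentPipeline:
    """Hypothetical pipeline body; only process_item is required by Scrapy."""

    def process_item(self, item, spider):
        # Transform or persist the item, then return it so that any
        # lower-priority pipelines registered after this one can run next.
        item["processed"] = True  # illustrative transformation
        return item
```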
AutoThrottle (disabled by default)
#AUTOTHROTTLE_ENABLED = True
HTTP cache (disabled by default)
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Importing variables from settings (e.g. the SQL host) into spiders, pipelines, and other modules:
from tencent.settings import MYSQL_HOST
Accessing settings variables inside a spider (pick either form):
self.settings["MYSQL_HOST"]
self.settings.get("MYSQL_HOST","")
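Scrapy's settings object supports dict-style access, so .get() with a default avoids a KeyError when a key is missing. A plain-dict stand-in that shows the difference between the two forms (MYSQL_PORT is a made-up key for illustration):

```python
# Stand-in for self.settings, which supports the same dict-style access.
settings = {"MYSQL_HOST": "127.0.0.1"}

host1 = settings["MYSQL_HOST"]            # raises KeyError if the key is missing
host2 = settings.get("MYSQL_HOST", "")    # returns "" if the key is missing
port = settings.get("MYSQL_PORT", 3306)   # hypothetical key: falls back to 3306
```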
Accessing settings inside a pipeline:
spider.settings.get("MYSQL_HOST")
Pipelines in depth
open_spider and close_spider each run exactly once, when the spider starts and when it finishes
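A sketch of why this matters (illustrative class and attribute names, with a list standing in for a real database connection): open the resource once in open_spider, reuse it for every item, and release it once in close_spider, instead of reconnecting on every item:

```python
class MySQLPipeline:
    """Hypothetical pipeline: acquire once, reuse per item, release once."""

    def open_spider(self, spider):
        # Runs once when the spider starts; a good place to open connections
        # or files (spider.settings.get("MYSQL_HOST") would supply the host).
        self.items = []      # stand-in for a real DB connection
        self.opened = True

    def process_item(self, item, spider):
        # Runs once per item; reuses the resource opened above.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes; release resources here.
        self.opened = False
```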
Extracting the segment of a string after a particular character (here, the part after the first "/"):
item["com"] = item["tag"].split("/")[1]
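For a concrete value of item["tag"] (the string below is made up for illustration), split("/") breaks the string on every slash, and index [1] selects the piece after the first one:

```python
tag = "tencent/hr/2021"     # made-up example value for item["tag"]
parts = tag.split("/")      # ["tencent", "hr", "2021"]
com = parts[1]              # "hr" -- the segment after the first "/"
```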