Common debug messages
What happens if the URLs we crawl fall outside the domain set in allowed_domains, that is, the domain of the site being crawled?
allowed_domains = ['sun0769debug.com']
start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
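To make the mismatch concrete, here is a minimal sketch of the kind of host comparison that an allowed_domains check implies. It is a simplified illustration, not Scrapy's actual implementation: the helper is_offsite and the page=2 follow-up URL are hypothetical names introduced only for this example.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host is not covered by allowed_domains.

    Hypothetical helper: a host counts as allowed when it equals an
    allowed domain or is a subdomain of one.
    """
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return False
    return True


# Values from the spider above: the domain was changed on purpose.
allowed_domains = ['sun0769debug.com']

# Hypothetical follow-up request to the next page of the same site.
follow_up_url = 'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2'

# The host wz.sun0769.com is not under sun0769debug.com, so it counts as offsite.
print(is_offsite(follow_up_url, allowed_domains))
```

With the original domain sun0769.com in the list, the same URL would count as on-site, because wz.sun0769.com is a subdomain of it.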
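Note: to be able to observe the behavior, the following preparations are required:
1. Change the domain so that it no longer matches the site's URLs.
2. In settings.py, either leave LOG_LEVEL unset or set it to the lowest level, DEBUG (which is also the default).
With everything in place, start the spider: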
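For reference, a settings.py fragment that keeps debug messages visible might look like the following. LOG_LEVEL is a real Scrapy setting whose default is DEBUG; the commented-out WARNING line is only an illustration of a value that would hide the messages.

```python
# settings.py (fragment)

# DEBUG is the most verbose level and Scrapy's default,
# so this line can also be left out entirely.
LOG_LEVEL = 'DEBUG'

# A higher level such as WARNING would suppress DEBUG messages,
# so it should not be used for this experiment:
# LOG_LEVEL = 'WARNING'
```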
scrapy crawl sun
The complete output follows:
2020-10-11 20:58:59 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: Sun)
2020-10-11 20:58:59 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.17134-SP0
2020-10-11 20:58:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-11 20:58:59 [scrapy.crawler] INFO: Overridden settings:
{
'BOT_NAME': 'Sun',
'NEWSPIDER_MODULE': 'Sun.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['Sun.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
2020-10-11 20:58:59 [scrapy.extensions.telnet] INFO: Telnet Password: effafb3a8e54b45d
2020-10-11 20:58:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-11 20:59:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddl