Remotely retrieving data from Scrapy's stats collector

Scrapy's stats collector records crawler state data in real time and, by default, dumps it when the spider finishes:

C:\Anaconda2\Lib\site-packages\scrapy\statscollectors.py
class StatsCollector(object):

    def __init__(self, crawler):
        self._dump = crawler.settings.getbool('STATS_DUMP')
        self._stats = {}

    # ... (other methods omitted)

    def close_spider(self, spider, reason):
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)

    def _persist_stats(self, stats, spider):
        pass

The code above is the stats collector's source. You can see that close_spider dumps self._stats; the information collected by default looks like this.

On finish:
{'downloader/request_bytes': 20646,
 'downloader/request_count': 47,
 'downloader/request_method_count/POST': 47,
 'downloader/response_bytes': 673679,
 'downloader/response_count': 47,
 'downloader/response_status_count/200': 47,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 7, 24, 6, 31, 1, 84791),
 'item_scraped_count': 460,
 'log_count/CRITICAL': 4,
 'log_count/DEBUG': 510,
 'log_count/ERROR': 1,
 'log_count/INFO': 74,
 'login_faild': False,
 'request_depth_max': 46,
 'response_received_count': 47,
 'scheduler/dequeued': 47,
 'scheduler/dequeued/memory': 47,
 'scheduler/enqueued': 48,
 'scheduler/enqueued/memory': 48,
 'spider_exceptions/KeyError': 1,
 'start_time': datetime.datetime(2018, 7, 24, 6, 12, 27, 74073)}

finish_reason and finish_time only appear once the spider has finished; they are not present while it is still running.
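
The empty _persist_stats hook shown in the source above is also a natural extension point: a StatsCollector subclass can write the final stats wherever you like and be enabled through the STATS_CLASS setting. A minimal sketch, assuming a hypothetical myproject/statscollectors.py module:

# myproject/statscollectors.py  (module and class names are just examples)
import json

from scrapy.statscollectors import StatsCollector


class JsonFileStatsCollector(StatsCollector):
    def _persist_stats(self, stats, spider):
        # Write the final stats to a JSON file named after the spider;
        # default=str takes care of the datetime values in the dict.
        with open('%s_stats.json' % spider.name, 'w') as f:
            json.dump(stats, f, default=str)

# settings.py
# STATS_CLASS = 'myproject.statscollectors.JsonFileStatsCollector'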

The stats collector also lets developers record custom values:

In a spider:

self.crawler.stats.set_value("login_faild", False)

In a middleware or pipeline:

spider.crawler.stats.set_value("login_faild", False)
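
Besides set_value, the collector exposes inc_value, max_value, min_value and get_value, which are convenient for counting things as items flow through a pipeline. A small illustrative sketch (the pipeline class and stat names are made up):

class StatsTrackingPipeline(object):
    def process_item(self, item, spider):
        stats = spider.crawler.stats
        # Count every item that passes through this pipeline.
        stats.inc_value('custom/items_seen')
        # Keep track of the longest title seen so far.
        stats.max_value('custom/max_title_length', len(item.get('title', '')))
        # Read a value back; the second argument is the default.
        if stats.get_value('login_faild', False):
            spider.logger.warning('login appears to have failed')
        return item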

Now to the main topic: how to fetch the collector's data remotely.

Method 1: persist, then fetch. A downloader middleware writes the stats into Redis, and they are read back from there:

import json
import time

import redis


class StatCollectorMiddleware(object):
    """Snapshot the crawler stats into Redis on every outgoing request."""

    def __init__(self):
        self.r = redis.Redis(host='localhost', port=6379, db=0)
        self.time = lambda: time.strftime('%Y-%m-%d %H:%M:%S')

    def process_request(self, request, spider):
        stats = spider.crawler.stats.get_stats()
        for key, value in stats.items():
            # Store a timestamped snapshot of each stat under its own key.
            snapshot = {"value": [self.time(), value]}
            self.insert2redis(key, snapshot)

    def insert2redis(self, key, value):
        # Redis only stores strings/bytes, so serialize the snapshot;
        # default=str handles the datetime objects in the stats.
        self.r.rpush(key, json.dumps(value, default=str))
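
For the middleware to run it has to be registered in DOWNLOADER_MIDDLEWARES; the snapshots can then be read back from Redis by any machine that can reach the server. A sketch assuming the middleware lives in a hypothetical myproject/middlewares.py (module path, priority and key are illustrative):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.StatCollectorMiddleware': 543,
}

# reader.py -- run anywhere that can reach the Redis server
import json

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
# Each stats key holds a list of timestamped snapshots; grab the newest one.
latest = r.lrange('item_scraped_count', -1, -1)
if latest:
    print(json.loads(latest[0]))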

Method 2: use the Telnet Console. First configure the settings so the console is reachable from other machines, and remember to open port 6023 in the firewall:

TELNETCONSOLE_HOST = '0.0.0.0'
TELNETCONSOLE_PORT = [6023, 6073]

The defaults are:

TELNETCONSOLE_HOST = '127.0.0.1'
TELNETCONSOLE_PORT = [6023, 6073]

Then connect with:

import telnetlib
tn = telnetlib.Telnet('192.168.2.89', port=6023, timeout=10)
tn.write('stats.get_stats()'+'\n')
tn.read_very_eager() 

Result:

In [1]: import telnetlib

In [2]: tn = telnetlib.Telnet('192.168.2.89', port=6023, timeout=10)

In [3]: tn.write('stats.get_stats()'+'\n')

In [4]: stat = tn.read_very_eager() 

In [5]: print stat

>>> stats.get_stats() 
{'log_count/INFO': 45, 'start_time': datetime.datetime(2018, 7, 24, 6, 49, 26, 572021), 'log_count/DEBUG': 394, 'login_faild': False, 'scheduler/enqueued/memory': 37, 'scheduler/enqueued': 37, 'scheduler/dequeued/memory': 37, 'scheduler/dequeued': 37, 'downloader/request_count': 37, 'downloader/request_method_count/POST': 37, 'downloader/request_bytes': 16286, 'downloader/response_count': 37, 'downloader/response_status_count/200': 37, 'downloader/response_bytes': 531739, 'response_received_count': 37, 'item_scraped_count': 354, 'request_depth_max': 35, 'log_count/ERROR': 1, 'spider_exceptions/KeyError': 1, 'log_count/CRITICAL': 4}
>>>  
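
For use inside another script, the interaction above can be wrapped in a small helper that sends the command and turns the reply back into a dict. A rough sketch; the prompt handling and parsing are assumptions based on the output format shown above, and newer Scrapy releases protect the telnet console with a username/password (see TELNETCONSOLE_USERNAME and TELNETCONSOLE_PASSWORD), so a login step may have to be added:

import datetime
import re
import telnetlib


def fetch_scrapy_stats(host, port=6023, timeout=10):
    tn = telnetlib.Telnet(host, port=port, timeout=timeout)
    tn.read_until(b'>>>', timeout)            # consume the initial prompt
    tn.write(b'stats.get_stats()\n')
    reply = tn.read_until(b'>>>', timeout)    # echoed command, dict, next prompt
    tn.close()
    match = re.search(r'\{.*\}', reply.decode('utf-8', 'replace'), re.S)
    if not match:
        return {}
    # The repr contains datetime.datetime(...) calls, so evaluate it with
    # datetime in scope; only do this against a console you trust.
    return eval(match.group(0), {'datetime': datetime})


# print(fetch_scrapy_stats('192.168.2.89', 6023))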

When several spiders run at the same time, the port can be set per spider in custom_settings to avoid conflicts:

custom_settings = {
    "TELNETCONSOLE_PORT": [6029, ]
}

With that in place the stats can be fetched remotely. The test above was done on a LAN, but it works the same way over the public internet.
