python – Scrapy run from a script doesn't work

I'm trying to run a Scrapy spider that works perfectly with scrapy crawl single, but I can't get it to run from a Python script.
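(The working command-line run is of the form scrapy crawl single -a domain=scrapinghub.com, with -a being Scrapy's flag for passing keyword arguments through to the spider constructor.)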

The main problem is that SingleBlogSpider.parse is never executed, while start_requests is.

Here is the code and the output from running the script. I also tried moving the execution into a separate file, but the same thing happens.

from urlparse import urlparse

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SingleBlogSpider(BaseSpider):
    name = 'single'

    def __init__(self, **kwargs):
        super(SingleBlogSpider, self).__init__(**kwargs)
        url = kwargs.get('url') or kwargs.get('domain') or 'seaofshoes.com'
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.link_extractor = SgmlLinkExtractor()
        self.cookies_seen = set()
        print 0, self.url

    def start_requests(self):
        print '1', self.url
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        print '2'
        # Actual scraper code, that is never executed

if __name__ == '__main__':
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals

    spider = SingleBlogSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()

Output:

0 http://scrapinghub.com/

1 http://scrapinghub.com/

2013-09-13 14:21:46-0500 [single] INFO: Closing spider (finished)

2013-09-13 14:21:46-0500 [single] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 221,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 9403,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 563184),
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 328961)}

2013-09-13 14:21:46-0500 [single] INFO: Spider closed (finished)

The program never reaches SingleBlogSpider.parse and never prints '2', so it doesn't scrape anything. But you can see in the output that it does make a request and receives a response, so I'm not sure what's going on.
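To narrow this down, I could hook extra crawler signals and log whether the downloaded response is actually handed to the spider, and whether the callback dies silently. A minimal sketch (assuming Scrapy 0.18's response_received and spider_error signals), dropped into the __main__ block before crawler.start():

from scrapy import signals

def on_response_received(response, request, spider):
    # Fires when the engine receives a downloaded response for the spider.
    print 'response_received:', response.status, response.url

def on_spider_error(failure, response, spider):
    # Fires when a spider callback raises an exception.
    print 'spider_error:', failure.getErrorMessage()

crawler.signals.connect(on_response_received, signal=signals.response_received)
crawler.signals.connect(on_spider_error, signal=signals.spider_error)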

Scrapy version == 0.18.2
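For comparison, the "Run Scrapy from a script" example in the Scrapy 0.18 docs is nearly identical to my runner; the main difference is that it builds the Crawler from get_project_settings() instead of a bare Settings() object. A minimal sketch of that pattern (assuming the script is run from inside a Scrapy project, so the project's settings.py can be found):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

spider = SingleBlogSpider(domain='scrapinghub.com')
settings = get_project_settings()  # loads the project's settings.py
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()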

I really can't spot the error and would greatly appreciate any help.

Thanks!
