I want to keep fetching the URLs to crawl from a database. So far I have managed to fetch URLs from the database once, but I would like my spider to keep reading from it, because the table will be populated by another thread.
I have a pipeline that removes a URL from the table once it has been crawled (this part works). In other words, I want to use my database as a queue. I have tried different approaches with no luck.
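To make the intent concrete, here is a minimal sketch of the "database as a queue" pattern I am after, using sqlite3 from the standard library instead of MySQL purely for illustration (the table name `urls` and the example URLs are made up): one thread inserts URLs, and a consumer polls the table, claims a row, and deletes it.

```python
# Sketch of a database used as a work queue: a producer thread inserts
# URLs while a consumer polls, reads one row at a time, and deletes it.
# sqlite3 stands in for MySQL here; the schema and URLs are illustrative.
import sqlite3
import threading
import time

db = sqlite3.connect(':memory:', check_same_thread=False)
lock = threading.Lock()  # serialize access to the shared connection
db.execute('CREATE TABLE urls (url TEXT)')

def producer():
    # Simulates the other thread that fills the table over time.
    for i in range(3):
        with lock:
            db.execute('INSERT INTO urls VALUES (?)', (f'http://example.com/{i}',))
            db.commit()
        time.sleep(0.01)

def consume():
    # Polls the table, removing each URL as it is "crawled".
    crawled = []
    deadline = time.time() + 2
    while time.time() < deadline and len(crawled) < 3:
        with lock:
            row = db.execute('SELECT rowid, url FROM urls LIMIT 1').fetchone()
            if row:
                db.execute('DELETE FROM urls WHERE rowid = ?', (row[0],))
                db.commit()
                crawled.append(row[1])
        time.sleep(0.005)
    return crawled

t = threading.Thread(target=producer)
t.start()
urls = consume()
t.join()
print(urls)
```

The point is only the shape of the loop: select, process, delete, repeat until told to stop. The question is how to fit that loop into Scrapy's request scheduling.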
Here is my spider.py:
    import logging

    import MySQLdb
    import scrapy
    from scrapy import signals


    class MySpider(scrapy.Spider):
        MAX_RETRY = 10
        logger = logging.getLogger(__name__)
        name = 'myspider'
        start_urls = []

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_closed, signals.spider_closed)
            return spider

        def __init__(self):
            db = MySQLdb.connect(
                user='myuser',
                passwd='mypassword',
                db='mydatabase',
                host='myhost',
                charset='utf8',
                use_unicode=True
            )
            self.db = db
            self.logger.info('Database connection opened')

        def spider_closed(self, spider):
            # close the database connection when the spider finishes
            self.db.close()