Scrapy Crawler Notes (Unfinished)

Latest recommended article published on 2022-03-13 16:06:24

sniper24

Category: Python Programming

Article link: https://blog.csdn.net/sniper24/article/details/50617462

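This article covers how to start a Scrapy crawler, both from the command line and through the API in a script. It discusses XPath locating techniques and mentions helper tools for Firefox and Chrome. It also touches on strategies for crawling behind a login and on setting a random User-Agent through a downloader middleware. Finally, it explains how to store data in a MySQL database, including the use of SQLStorePipeline.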

Summary automatically generated by CSDN.

Starting a Scrapy Crawler
Besides the usual scrapy crawl command, you can also start Scrapy from a script through its API.
XPath Locating
Firebug (Firefox add-on)
Chrome's XPath Helper extension can also be used
Several other Firefox add-ons
About Crawling After Login
http://outofmemory.cn/code-snippet/16528/scrapy-again-to-code
Random User-Agent
Set up a downloader middleware (Downloader Middleware)
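
A downloader middleware for this can be quite small. The sketch below picks a random User-Agent for each outgoing request. The class name, the `USER_AGENT_LIST` setting, and the `myproject` module path are assumptions made for illustration; `USER_AGENT_LIST` is not a built-in Scrapy setting.

```python
import random


class RandomUserAgentMiddleware:
    """Set a randomly chosen User-Agent header on every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a custom (hypothetical) setting that
        # holds a list of User-Agent strings in settings.py.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None


# In settings.py (module path is an assumption):
#
# USER_AGENT_LIST = [
#     "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
#     "Mozilla/5.0 (X11; Linux x86_64) ...",
# ]
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```

The order value 400 places this middleware before Scrapy's built-in `UserAgentMiddleware`, which only sets the header when it is not already present, so the random value is kept.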
About Database Storage (MySQL as an Example)

# This pipeline cannot create the table; the table must already exist.

import datetime
import logging

import MySQLdb.cursors
from twisted.enterprise import adbapi

logger = logging.getLogger(__name__)


class SQLStorePipeline(object):

    def __init__(self):
        # adbapi wraps the blocking MySQLdb driver in a thread pool
        self.dbpool = adbapi.ConnectionPool('MySQLdb', db='mydb',
                user='myuser', passwd='mypass', cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        # run db query in thread pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)

        return item

    def _conditional_insert(self, tx, item):
        # create the record if it doesn't exist;
        # this whole block runs on its own thread
        tx.execute("select * from websites where link = %s", (item['link'][0], ))
        result = tx.fetchone()
        if result:
            logger.debug("Item already stored in db: %s", item)
        else:
            tx.execute(
                "insert into websites (link, created) "
                "values (%s, %s)",
                (item['link'][0],
                 datetime.datetime.now())
            )
            logger.debug("Item stored in db: %s", item)

    def handle_error(self, e):
        # e is a twisted Failure describing the database error
        logger.error("Database error: %s", e.getTraceback())
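
Since the pipeline expects the `websites` table to exist already, the table has to be created once beforehand. The following is a rough sketch of how that could be done from Python. The column types and the primary key are assumptions inferred from the two columns the pipeline writes (`link`, `created`); the real schema may differ.

```python
# Assumed schema: only the column names come from the pipeline code.
CREATE_WEBSITES_TABLE = """
CREATE TABLE IF NOT EXISTS websites (
    link VARCHAR(255) NOT NULL,
    created DATETIME NOT NULL,
    PRIMARY KEY (link)
)
"""


def create_table(conn):
    """Create the websites table on an open DB-API connection."""
    cur = conn.cursor()
    try:
        cur.execute(CREATE_WEBSITES_TABLE)
        conn.commit()
    finally:
        cur.close()
```

With MySQLdb, the connection could be opened as `MySQLdb.connect(db='mydb', user='myuser', passwd='mypass', charset='utf8')` and passed to `create_table` once before the first crawl. The pipeline itself also has to be enabled through Scrapy's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.SQLStorePipeline": 300}`, where the module path is hypothetical.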