Scrapy: login + a brief analysis of rules

Copyright notice: this is the author's original post; do not repost without permission. https://blog.csdn.net/istend/article/details/46460753

After many days of crawling, today I started on simulated login.
The idea behind login crawling is actually simple: first register an account, then replace the browser's login process with a login request you build by hand; once logged in, keep the session state and crawl the links you need.
As I understand it, the flow is roughly:
send a request (containing the login credentials) -> the site verifies it and returns a response -> receive the response; on success keep crawling, on failure go debug.
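That flow can be sketched with the standard library alone, without Scrapy. The sketch below spins up a throwaway local server standing in for the site; the URL paths, field names, cookie value, and credentials are all invented for illustration:

```python
import threading
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeSite(BaseHTTPRequestHandler):
    """Tiny stand-in site: POST /login sets a session cookie,
    GET /data only answers if that cookie comes back."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        creds = urllib.parse.parse_qs(body.decode())
        if creds.get("username") == ["abc"] and creds.get("password") == ["abc"]:
            self.send_response(200)
            self.send_header("Set-Cookie", "session=secret123")
            self.end_headers()
            self.wfile.write(b"welcome")
        else:
            self.send_response(403)
            self.end_headers()

    def do_GET(self):
        if "session=secret123" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"useful data")
        else:
            self.send_response(403)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# An opener with a cookie jar is what "keeping the login state" means here.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(CookieJar()))

# Step 1: send the login request containing the credentials.
login_body = opener.open(
    base + "/login",
    data=urllib.parse.urlencode({"username": "abc",
                                 "password": "abc"}).encode()).read()

# Step 2: the jar replays the session cookie on the next request.
data_body = opener.open(base + "/data").read()

print(login_body, data_body)  # b'welcome' b'useful data'
server.shutdown()
```

Scrapy does the same cookie handling for you via its cookies middleware; the sketch only shows what "keep the state" amounts to.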

This was all I could find online, but when I tried it, it still didn't work:

# Note: these scrapy.contrib paths are from old Scrapy and were later
# removed; InitSpider now lives at scrapy.spiders.init, Rule at
# scrapy.spiders, and SgmlLinkExtractor was replaced by
# scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    login_page = 'http://www.domain.com/login'
    start_urls = ['http://www.domain.com/useful_page/',
                  'http://www.domain.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:  # response.body is bytes on Python 3; compare b"Hi Herman" there
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass

As an aside, this code was lifted from:
http://outofmemory.cn/code-snippet/16528/scrapy-again-to-code
http://www.sharejs.com/codes/python/8544
Identical, word for word. Speechless. Fine, everyone borrows, but at least credit the source.

Then I tried writing my own version; after a long debugging session it finally worked:

from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class testSpier(CrawlSpider):
    name = 'abc'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (
        # Follow listing pages but stay off detail pages (no callback).
        Rule(LinkExtractor(allow='abc', deny='detail')),
        # Pages whose URL contains 'id=' are parsed and followed further.
        Rule(LinkExtractor(allow='id='), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Fetch the login page first instead of start_urls.
        return [Request("http://www.abc.com", callback=self.post_login)]

    def post_login(self, response):
        # Build the login POST from the page so hidden fields come along.
        return [FormRequest.from_response(response,
                                          formdata={'username': 'abc',
                                                    'password': 'abc'},
                                          callback=self.after_login,
                                          dont_filter=True)]

    def after_login(self, response):
        # Logged in: now kick off the normal crawl.
        for url in self.start_urls:
            # make_requests_from_url is deprecated in newer Scrapy;
            # use scrapy.Request(url, dont_filter=True) there instead.
            yield self.make_requests_from_url(url)

    def parse_item(self, response):
        pass  # ... omitted ...
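The two Rules in this spider filter links by regexes matched against the absolute URL. A quick re-based sketch of that filtering logic (the link list is invented; note that allow='abc' matches the domain www.abc.com too, so it matches every link here and the deny pattern does the real filtering):

```python
import re

# Hypothetical links extracted from a listing page.
links = [
    "http://www.abc.com/abc/list?page=2",
    "http://www.abc.com/abc/detail/42",
    "http://www.abc.com/item?id=42",
    "http://www.abc.com/about",
]

# Rule 1: follow links matching 'abc' but not 'detail' (no callback).
follow_only = [u for u in links
               if re.search("abc", u) and not re.search("detail", u)]

# Rule 2: links matching 'id=' go to parse_item (and are followed too).
parse_links = [u for u in links if re.search("id=", u)]

print(follow_only)
print(parse_links)  # ['http://www.abc.com/item?id=42']
```

This is only the matching logic; Scrapy's LinkExtractor additionally deduplicates, canonicalizes, and restricts to allowed_domains.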

A few things to note:
one is that the request data needs more than the username and password: there is often a token, session key, or verification ID as well, so find the hidden values on the login page (for example an input named _token) and add them to the request data;
one is headers: sometimes you have to fake the request headers;
one is cookies: sometimes you need cookies to keep the session alive.
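The hidden-token point can be illustrated with the standard library's html.parser. The login-page HTML and the _token value below are invented; the merge at the end mimics what FormRequest.from_response does for you automatically:

```python
from html.parser import HTMLParser

# Invented login page carrying a hidden CSRF-style token.
LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="_token" value="a1b2c3d4">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of hidden <input> fields."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.hidden[a.get("name")] = a.get("value")

parser = HiddenInputCollector()
parser.feed(LOGIN_PAGE)

# Merge the hidden fields into the login form data.
formdata = {**parser.hidden, "username": "abc", "password": "abc"}
print(formdata)  # {'_token': 'a1b2c3d4', 'username': 'abc', 'password': 'abc'}
```

If a site rejects your login despite correct credentials, a missing hidden field like this is the first thing to check.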


One more tip:
when testing, you can log in with a bare FormRequest directly,
but in the real spider you should request the login URL first and build the form request from its response; that is what keeps the login state. The likely reason: FormRequest.from_response copies the page's hidden form fields (tokens) into the POST, and the initial GET picks up the session cookies the site expects.
A nasty pitfall.

References:
http://ju.outofmemory.cn/entry/105646
http://my.oschina.net/chengye/blog/124162?p=2#comments
http://www.tuicool.com/articles/3y6ba2

PS: this post is just a personal record for discussion, nothing more.
