xu_xuekai's blog

A blog space focused on Python web crawlers

Scrapy crawler: logging in with selenium and passing the cookies to scrapy

1. Background

  • The previous post covered integrating selenium into scrapy to crawl particularly complex pages (link: https://blog.csdn.net/zwq912318834/article/details/79773870). In practice, the login step is often the most complicated part of a crawl, while the remaining pages are comparatively simple. Spending too much time cracking the login flow is rarely worth it.
  • A better approach: first log in with selenium, save the resulting cookies, and hand them to scrapy; scrapy then uses those cookies for all subsequent crawling.

2. Environment

  • python 3.6.1
  • OS: win7
  • IDE: pycharm
  • Chrome browser installed
  • chromedriver configured (on the PATH)
  • selenium 3.7.0
  • scrapy 1.4.0

3. Cookie formats in selenium vs. scrapy

  • Using the baiduyun club site as the example

3.1. Cookies in scrapy

  • After a successful login, request the site's home page; then, in the parse() method, read the cookie carried by the outgoing request:
# check the login result
    def parseLoginResPage(self, response):
        # inspect the login response
        print(f"parseLoginResPage: statusCode = {response.status}, url = {response.url}")
        print(f"text = {response.text}")
        # after logging in, visit the site's home page
        yield scrapy.Request(
            url="http://www.51baiduyun.com/",
            headers=self.headerData,
            callback=self.parse,
            dont_filter=True,  # keep the page from being dropped as a duplicate
        )

    # regular page-parsing callback
    def parse(self, response):
        print(f"parse: url = {response.url}, meta = {response.meta}")
        # read the Cookie header of the request, i.e. the cookie sent to the site
        Cookie = response.request.headers.getlist('Cookie')
        print(f'parse:  After login CookieReq = {Cookie}')
  • A cookie in scrapy looks like this:
# E:\Miniconda\Lib\site-packages\scrapy\downloadermiddlewares\cookies.py
# note that it uses: from scrapy.http.cookies import CookieJar
parse:  After login CookieReq = [b'L3em_2132_saltkey=gSZXPVeG; L3em_2132_lastvisit=1523251829; L3em_2132_sid=Ir7Mht; L3em_2132_lastact=1523255437%09member.php%09logging; L3em_2132_seccode=85931.874f29c987d4fb59f3; L3em_2132_ulastactivity=d666YmbdpNj9Iz%2FNUEi%2BjDvc4WOWgYaPjSlfz9WctVMX7egl2vDA; L3em_2132_auth=8d0cjiUMrZ3s55Jt%2B4ypshxBHoUuNN4Z5e3ExPbzViN5lFcOjNWxZ8sz8vaBOTzYEIK7AHENUH%2F%2Fcw6VVnzLC%2BFvOa12; L3em_2132_lastcheckfeed=1315026%7C1523255437; L3em_2132_checkfollow=1; L3em_2132_lip=183.12.51.62%2C1523255149; L3em_2132_security_cookiereport=c346vrKUIfmjcBwk7YP92rl%2FHJROM1lF0Y2knuvE1PvPfZOvnxad']
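As the dump above shows, scrapy carries the cookie as a single `Cookie` request header of `name=value` pairs joined by `; `. A small helper (my own sketch, not part of the original post) turns such a header back into a dict for inspection:

```python
def parse_cookie_header(raw: bytes) -> dict:
    """Split a raw b'name1=v1; name2=v2' Cookie header into a dict."""
    cookies = {}
    for pair in raw.decode("utf-8").split("; "):
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies

# e.g. on a fragment of the header shown above:
print(parse_cookie_header(b"L3em_2132_saltkey=gSZXPVeG; L3em_2132_sid=Ir7Mht"))
# → {'L3em_2132_saltkey': 'gSZXPVeG', 'L3em_2132_sid': 'Ir7Mht'}
```

Note that the values stay percent-encoded (`%2C`, `%09`, …); they should be sent back to the site unchanged.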

3.2. Cookies in selenium

  • After logging in with selenium, fetch its cookies like this:
seleniumCookies = spider.browser.get_cookies()
print(f"seleniumCookies = {seleniumCookies}")
  • A cookie in selenium looks like this:
seleniumCookies = [{'domain': 'www.51baiduyun.com', 'expiry': 1538989361, 'httpOnly': False, 'name': 'CNZZDATA1253365484', 'path': '/', 'secure': False, 'value': '964419069-1523259525-%7C1523259525'}, {'domain': 'www.51baiduyun.com', 'expiry': 1525856539.733429, 'httpOnly': True, 'name': 'L3em_2132_saltkey', 'path': '/', 'secure': False, 'value': 'uL0UL77j'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523307758.631004, 'httpOnly': False, 'name': 'L3em_2132_security_cookiereport', 'path': '/', 'secure': False, 'value': '6bd1%2FSD%2F0OzhXwpZ5fhpBFDHH1WGRAslxA8eGAjOvYKJjvJkwLkc'}, {'domain': 'www.51baiduyun.com', 'expiry': 1525856539.733484, 'httpOnly': False, 'name': 'L3em_2132_lastvisit', 'path': '/', 'secure': False, 'value': '1523261207'}, {'domain': 'www.51baiduyun.com', 'httpOnly': False, 'name': 'L3em_2132_seccode', 'path': '/', 'secure': False, 'value': '120125.68ba4641e97556392b'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523350961.711943, 'httpOnly': False, 'name': 'L3em_2132_sid', 'path': '/', 'secure': False, 'value': 'mBP4sb'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523264840.028978, 'httpOnly': False, 'name': 'L3em_2132_sendmail', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.51baiduyun.com', 'expiry': 1538989340, 'httpOnly': False, 'name': 'UM_distinctid', 'path': '/', 'secure': False, 'value': '162a9a44e0823b-098677e48fe2be-454c092b-1fa400-162a9a44e094bf'}, {'domain': '.www.51baiduyun.com', 'expiry': 1554800561, 'httpOnly': False, 'name': 'Hm_lvt_79316e5471828e6e10f2df47721ce150', 'path': '/', 'secure': False, 'value': '1523264541'}, {'domain': 'www.51baiduyun.com', 'expiry': 1538989361, 'httpOnly': False, 'name': 'CNZZDATA1253863031', 'path': '/', 'secure': False, 'value': '1393313043-1523261609-%7C1523261609'}, {'domain': '.www.51baiduyun.com', 'expiry': 1554800561, 'httpOnly': False, 'name': 'Hm_lvt_eaefab1768d285abfc718a706c1164f3', 'path': '/', 'secure': False, 'value': '1523264541'}, {'domain': 'www.51baiduyun.com', 'expiry': 
1554800558.630797, 'httpOnly': False, 'name': 'L3em_2132_ulastactivity', 'path': '/', 'secure': False, 'value': 'e52eGQjsi80DLGLXvdzm1z0xQ7lmIKuBlBUK8mQlJmAMXr7Ep8D8'}, {'domain': 'www.51baiduyun.com', 'httpOnly': True, 'name': 'L3em_2132_auth', 'path': '/', 'secure': False, 'value': 'be395ZoslCjexHStJKSaOCgvl9krhLvGLWmNm4hRKMH1qZ65gGUlWA5q9KV7veHBRF6hrQxqUiINkF844oiL5hukCNMg'}, {'domain': 'www.51baiduyun.com', 'expiry': 1554800558.630948, 'httpOnly': False, 'name': 'L3em_2132_lastcheckfeed', 'path': '/', 'secure': False, 'value': '2533730%7C1523264825'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523264588.630963, 'httpOnly': False, 'name': 'L3em_2132_checkfollow', 'path': '/', 'secure': False, 'value': '1'}, {'domain': 'www.51baiduyun.com', 'httpOnly': False, 'name': 'L3em_2132_lip', 'path': '/', 'secure': False, 'value': '183.12.51.62%2C1523264610'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523264591.846338, 'httpOnly': False, 'name': 'L3em_2132_checkpm', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.www.51baiduyun.com', 'httpOnly': False, 'name': 'Hm_lpvt_79316e5471828e6e10f2df47721ce150', 'path': '/', 'secure': False, 'value': '1523264562'}, {'domain': '.www.51baiduyun.com', 'httpOnly': False, 'name': 'Hm_lpvt_eaefab1768d285abfc718a706c1164f3', 'path': '/', 'secure': False, 'value': '1523264562'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523350961.982766, 'httpOnly': False, 'name': 'L3em_2132_lastact', 'path': '/', 'secure': False, 'value': '1523264829%09misc.php%09patch'}]
  • Conclusion: the two formats are clearly different. The selenium cookies must be converted into scrapy's format before scrapy can use them.
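Since `scrapy.Request` accepts its `cookies` argument as a plain name-to-value dict, the conversion boils down to keeping just those two fields from each selenium cookie record (a minimal sketch of the idea; the middleware later in this post does the same thing inline):

```python
def selenium_to_scrapy_cookies(selenium_cookies):
    """Convert selenium's list-of-dicts cookies into scrapy's flat dict.

    Fields such as domain, path, expiry and httpOnly are dropped:
    scrapy.Request only needs name -> value pairs.
    """
    return {c["name"]: c["value"] for c in selenium_cookies}

# e.g. for two of the records shown above:
sample = [
    {"domain": "www.51baiduyun.com", "name": "L3em_2132_saltkey",
     "path": "/", "secure": False, "value": "uL0UL77j"},
    {"domain": "www.51baiduyun.com", "name": "L3em_2132_sid",
     "path": "/", "secure": False, "value": "mBP4sb"},
]
print(selenium_to_scrapy_cookies(sample))
# → {'L3em_2132_saltkey': 'uL0UL77j', 'L3em_2132_sid': 'mBP4sb'}
```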

4. Status before and after login

4.1. Before login

    # entry point of the crawler
    def start_requests(self):
        print("start baiduyun crawler")
        # put a flag into meta marking whether this request should be downloaded via selenium
        yield scrapy.Request(
            url="http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive",
            # forbid redirects, so the login state shows up in the status code
            meta={'usedSelenium': False, 'dont_redirect': True},
            callback=self.parseStatusRes,
            errback=self.errorHandle
        )

    def parseStatusRes(self, response):
        print(f"parseStatusRes: statusCode = {response.status}, url = {response.url}")
        print(f"text = {response.text}")
  • Run result:
start baiduyun crawler
2018-04-09 16:30:57 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive> (referer: None)
request error: <302 http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive>

4.2. After login

2018-04-09 18:30:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive>
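The contrast above gives a simple programmatic login check: with `dont_redirect` set, the members-only page answers 302 for an anonymous session and 200 for a logged-in one. As a tiny helper (my own sketch, not from the original code):

```python
def login_ok(status_code: int) -> bool:
    """With dont_redirect=True, the members-only page returns 302
    (a redirect to the login form) when logged out and 200 when the
    session cookie is valid."""
    return status_code == 200

print(login_ok(302))  # → False, the before-login run
print(login_ok(200))  # → True, the after-login run
```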

5. Walking through the code

  • In settings.py, configure the selenium parameters:
# in settings.py

# ----------- selenium configuration -------------
SELENIUM_TIMEOUT = 25           # selenium browser timeout, in seconds
LOAD_IMAGE = True               # whether to download images
WINDOW_HEIGHT = 900             # browser window size
WINDOW_WIDTH = 900
  • In the spider, mark which requests should be downloaded via selenium when generating them:
# in mySpider.py

import scrapy
import datetime
import re
import random
from PIL import Image

# selenium imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# scrapy signal-related imports
from scrapy.spiders import CrawlSpider
from scrapy.utils.project import get_project_settings
from scrapy import signals

# the following import is deprecated, so it is not used
# from scrapy.xlib.pydispatch import dispatcher
# the approach scrapy now recommends
from pydispatch import dispatcher

class mySpider(CrawlSpider):
    name = 'baiduyun'
    allowed_domains = ['51baiduyun.com']
    host = "http://www.51baiduyun.com/"

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 1,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # proxy middleware
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # SeleniumMiddleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            # disable scrapy's default user-agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
    }

    # initialize chrome inside the spider, as a member of the spider
    def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
        # read the parameters configured in settings.py
        self.mySetting = get_project_settings()
        self.timeout = self.mySetting['SELENIUM_TIMEOUT']
        self.isLoadImage = self.mySetting['LOAD_IMAGE']
        self.windowHeight = self.mySetting['WINDOW_HEIGHT']
        self.windowWidth = self.mySetting['WINDOW_WIDTH']
        # initialize the chrome driver
        self.browser = webdriver.Chrome()
        if self.windowHeight and self.windowWidth:
            self.browser.set_window_size(self.windowWidth, self.windowHeight)
        self.browser.set_page_load_timeout(self.timeout)  # page-load timeout
        self.wait = WebDriverWait(self.browser, self.timeout)  # element-wait timeout
        super(mySpider, self).__init__()
        # connect the spider_closed signal to mySpiderCloseHandle, which quits chrome
        dispatcher.connect(receiver=self.mySpiderCloseHandle,
                           signal=signals.spider_closed
                           )

    # signal handler: quit the chrome browser
    def mySpiderCloseHandle(self, spider):
        print(f"mySpiderCloseHandle: enter ")
        self.browser.quit()

    # entry point of the crawler
    def start_requests(self):
        print("start baiduyun crawler")
        # put a flag into meta marking whether this request should be downloaded via selenium
        yield scrapy.Request(
            # the notifications/interaction page
            url="http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive",
            meta={'usedSelenium': True, 'pageType': 'login'},
            callback=self.parseLoginRes,
            errback=self.errorHandle
        )


    # receives the login result
    def parseLoginRes(self, response):
        print(f"parseLoginRes: statusCode = {response.status}, url = {response.url}")
        print(f"parseLoginRes: cookies1 = {response.request.cookies}")
        print(f"parseLoginRes: cookies2 = {response.request.headers.getlist('Cookie')}")
        # after logging in, use the "profile" page below to test whether the login succeeded
        yield scrapy.Request(
            # the profile page
            url="http://www.51baiduyun.com/home.php?mod=spacecp&ac=profile",
            # forbid redirects, so the login state shows up in the status code
            meta={'usedSelenium': False, 'dont_redirect': True},
            callback=self.parseLoginStatusRes,
            errback=self.errorHandle,
            dont_filter=True,
        )


    # analyses the login result
    def parseLoginStatusRes(self, response):
        print(f"parseLoginStatusRes: statusCode = {response.status}, url = {response.url}")
        print(f"parseLoginStatusRes: cookies1 = {response.request.cookies}")
        print(f"parseLoginStatusRes: cookies2 = {response.request.headers.getlist('Cookie')}")
        # read the Set-Cookie headers, i.e. the cookies the site sends back to the user
        responseCookie = response.headers.getlist('Set-Cookie')
        print(f"parseLoginStatusRes: response.cookie = {responseCookie}")
        print(f"############################################")
        print(f"text = {response.text}")


    # request error handler: print, write to a file, or store in a database
    def errorHandle(self, failure):
        print(f"request error: {failure.value.response}")
  • In the downloader middleware middlewares.py, log in with selenium, capture the cookies, and pass them on to scrapy:
# in middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import random
from scrapyFengniao import headDefine

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
import time

class SeleniumMiddleware():
    # the middleware receives the spider object, giving access to the chrome members created in __init__
    def process_request(self, request, spider):
        '''
        Fetch the page with chrome
        :param request: the Request object
        :param spider: the Spider object
        :return: an HtmlResponse
        '''
        print(f"now is using chrome to get page ...")
        # the flag in meta decides whether this request goes through selenium
        usedSelenium = request.meta.get('usedSelenium', False)
        if usedSelenium:
            if request.meta.get('pageType', '') == 'login':
                # remember the original url
                originalUrl = request.url
                try:
                    # the site redirects to the login page automatically
                    spider.browser.get(originalUrl)
                    # wait for the username input to appear
                    usernameInput = spider.wait.until(
                        EC.presence_of_element_located((By.XPATH, "//div[@id='messagelogin']//input[@name='username']"))
                    )
                    time.sleep(2)
                    usernameInput.clear()
                    usernameInput.send_keys("ancoxxxxxxx")   # type the username

                    passWordElem = spider.browser.find_element_by_xpath("//div[@id='messagelogin']//input[@name='password']")
                    time.sleep(2)
                    passWordElem.clear()
                    passWordElem.send_keys("anco00000000")        # type the password

                    captchaElem = spider.browser.find_element_by_xpath("//div[@id='messagelogin']//input[@name='seccodeverify']")
                    time.sleep(2)
                    captchaElem.clear()
                    # the captcha is entered manually here
                    # for automated captcha solving, see this earlier post:
                    # https://blog.csdn.net/zwq912318834/article/details/78616462
                    captcha = input("Enter captcha\n>").strip()
                    captchaElem.send_keys(captcha)          # type the captcha

                    # click the login button
                    loginButtonElem = spider.browser.find_element_by_xpath("//div[@id='messagelogin']//button[@name='loginsubmit']")
                    time.sleep(2)
                    loginButtonElem.click()
                    time.sleep(1)
                    seleniumCookies = spider.browser.get_cookies()
                    print(f"seleniumCookies = {seleniumCookies}")
                except Exception as e:
                    print(f"chrome user login handle error, Exception = {e}")
                    return HtmlResponse(url=request.url, status=500, request=request)
                else:
                    time.sleep(3)
                    # after a successful login, convert the selenium cookies
                    # into scrapy's flat name -> value dict
                    cookMap = {item['name']: item['value'] for item in seleniumCookies}
                    print(f"cookMap = {cookMap}")
                    # the middleware now rewrites the Request; for the Request constructor, see
                    # E:\Miniconda\Lib\site-packages\scrapy\http\request\__init__.py
                    request.cookies = cookMap  # let the Request carry the logged-in cookies and continue crawling
                    request.meta['usedSelenium'] = False  # otherwise a 302 redirect would send the request back through this branch
  • Run result:
2018-04-09 18:30:20 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62535/session/6498d461224b10330c9c9b8de2e1d36f/element/0.7256473494790106-3/clear {"id": "0.7256473494790106-3", "sessionId": "6498d461224b10330c9c9b8de2e1d36f"}
2018-04-09 18:30:20 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Enter captcha
>CJV6
2018-04-09 18:30:30 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62535/session/6498d461224b10330c9c9b8de2e1d36f/element/0.7256473494790106-3/value {"text": "CJV6", "value": ["C", "J", "V", "6"], "id": "0.7256473494790106-3", "sessionId": "6498d461224b10330c9c9b8de2e1d36f"}
2018-04-09 18:30:30 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2018-04-09 18:30:30 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62535/session/6498d461224b10330c9c9b8de2e1d36f/element {"using": "xpath", "value": "//div[@id='messagelogin']//button[@name='loginsubmit']", "sessionId": "6498d461224b10330c9c9b8de2e1d36f"}
2018-04-09 18:30:30 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2018-04-09 18:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:62535/session/6498d461224b10330c9c9b8de2e1d36f/element/0.7256473494790106-4/click {"id": "0.7256473494790106-4", "sessionId": "6498d461224b10330c9c9b8de2e1d36f"}
2018-04-09 18:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2018-04-09 18:30:33 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:62535/session/6498d461224b10330c9c9b8de2e1d36f/cookie {"sessionId": "6498d461224b10330c9c9b8de2e1d36f"}
2018-04-09 18:30:33 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
seleniumCookies = [{'domain': 'www.51baiduyun.com', 'expiry': 1538994613, 'httpOnly': False, 'name': 'CNZZDATA1253365484', 'path': '/', 'secure': False, 'value': '1182144396-1523264925-%7C1523264925'}, {'domain': 'www.51baiduyun.com', 'expiry': 1525861813.314328, 'httpOnly': True, 'name': 'L3em_2132_saltkey', 'path': '/', 'secure': False, 'value': 'EIPRiyLR'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523313033.124556, 'httpOnly': False, 'name': 'L3em_2132_security_cookiereport', 'path': '/', 'secure': False, 'value': 'f6798nBoSK9dYFYBDvIC%2FBvfcUL8e1yO9Ca4IsGCwP%2BI8S7QQqwU'}, {'domain': 'www.51baiduyun.com', 'expiry': 1525861813.314366, 'httpOnly': False, 'name': 'L3em_2132_lastvisit', 'path': '/', 'secure': False, 'value': '1523266481'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523356233.124391, 'httpOnly': False, 'name': 'L3em_2132_sid', 'path': '/', 'secure': False, 'value': 'Vq8I3Y'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523270113.613524, 'httpOnly': False, 'name': 'L3em_2132_sendmail', 'path': '/', 'secure': False, 'value': '1'}, {'domain': '.51baiduyun.com', 'expiry': 1538994613, 'httpOnly': False, 'name': 'UM_distinctid', 'path': '/', 'secure': False, 'value': '162a9f4c56f26e-077863ee37fb44-454c092b-1fa400-162a9f4c570746'}, {'domain': '.www.51baiduyun.com', 'expiry': 1554805813, 'httpOnly': False, 'name': 'Hm_lvt_79316e5471828e6e10f2df47721ce150', 'path': '/', 'secure': False, 'value': '1523269814'}, {'domain': 'www.51baiduyun.com', 'expiry': 1538994613, 'httpOnly': False, 'name': 'CNZZDATA1253863031', 'path': '/', 'secure': False, 'value': '709519930-1523267010-%7C1523267010'}, {'domain': '.www.51baiduyun.com', 'httpOnly': False, 'name': 'Hm_lpvt_79316e5471828e6e10f2df47721ce150', 'path': '/', 'secure': False, 'value': '1523269814'}, {'domain': '.www.51baiduyun.com', 'expiry': 1554805814, 'httpOnly': False, 'name': 'Hm_lvt_eaefab1768d285abfc718a706c1164f3', 'path': '/', 'secure': False, 'value': '1523269814'}, {'domain': '.www.51baiduyun.com', 
'httpOnly': False, 'name': 'Hm_lpvt_eaefab1768d285abfc718a706c1164f3', 'path': '/', 'secure': False, 'value': '1523269814'}, {'domain': 'www.51baiduyun.com', 'httpOnly': False, 'name': 'L3em_2132_seccode', 'path': '/', 'secure': False, 'value': '134130.4414113946898638ef'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523356233.124305, 'httpOnly': False, 'name': 'L3em_2132_lastact', 'path': '/', 'secure': False, 'value': '1523270100%09member.php%09logging'}, {'domain': 'www.51baiduyun.com', 'expiry': 1554805833.124362, 'httpOnly': False, 'name': 'L3em_2132_ulastactivity', 'path': '/', 'secure': False, 'value': '5a0fvYvq4G%2FGIxWgyuIUKNOIje7On8j5d%2BKVSIvQ3qvuJJQT8Enc'}, {'domain': 'www.51baiduyun.com', 'httpOnly': True, 'name': 'L3em_2132_auth', 'path': '/', 'secure': False, 'value': 'a975vSsvXdujotA8VvK1mFvDDnG5vVouJnZXWQYlMoaBEdK0ztntNB4o%2BqEtN73MzFGTusQgmv9kry81x%2BP4ilzwZ4F7'}, {'domain': 'www.51baiduyun.com', 'expiry': 1554805833.124509, 'httpOnly': False, 'name': 'L3em_2132_lastcheckfeed', 'path': '/', 'secure': False, 'value': '2533730%7C1523270100'}, {'domain': 'www.51baiduyun.com', 'expiry': 1523269863.124527, 'httpOnly': False, 'name': 'L3em_2132_checkfollow', 'path': '/', 'secure': False, 'value': '1'}, {'domain': 'www.51baiduyun.com', 'httpOnly': False, 'name': 'L3em_2132_lip', 'path': '/', 'secure': False, 'value': '183.12.51.62%2C1523268432'}]
cookMap = {'CNZZDATA1253365484': '1182144396-1523264925-%7C1523264925', 'L3em_2132_saltkey': 'EIPRiyLR', 'L3em_2132_security_cookiereport': 'f6798nBoSK9dYFYBDvIC%2FBvfcUL8e1yO9Ca4IsGCwP%2BI8S7QQqwU', 'L3em_2132_lastvisit': '1523266481', 'L3em_2132_sid': 'Vq8I3Y', 'L3em_2132_sendmail': '1', 'UM_distinctid': '162a9f4c56f26e-077863ee37fb44-454c092b-1fa400-162a9f4c570746', 'Hm_lvt_79316e5471828e6e10f2df47721ce150': '1523269814', 'CNZZDATA1253863031': '709519930-1523267010-%7C1523267010', 'Hm_lpvt_79316e5471828e6e10f2df47721ce150': '1523269814', 'Hm_lvt_eaefab1768d285abfc718a706c1164f3': '1523269814', 'Hm_lpvt_eaefab1768d285abfc718a706c1164f3': '1523269814', 'L3em_2132_seccode': '134130.4414113946898638ef', 'L3em_2132_lastact': '1523270100%09member.php%09logging', 'L3em_2132_ulastactivity': '5a0fvYvq4G%2FGIxWgyuIUKNOIje7On8j5d%2BKVSIvQ3qvuJJQT8Enc', 'L3em_2132_auth': 'a975vSsvXdujotA8VvK1mFvDDnG5vVouJnZXWQYlMoaBEdK0ztntNB4o%2BqEtN73MzFGTusQgmv9kry81x%2BP4ilzwZ4F7', 'L3em_2132_lastcheckfeed': '2533730%7C1523270100', 'L3em_2132_checkfollow': '1', 'L3em_2132_lip': '183.12.51.62%2C1523268432'}
2018-04-09 18:30:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive> (referer: None)
parseLoginRes: statusCode = 200, url = http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive
parseLoginRes: cookies1 = {'CNZZDATA1253365484': '1182144396-1523264925-%7C1523264925', 'L3em_2132_saltkey': 'EIPRiyLR', 'L3em_2132_security_cookiereport': 'f6798nBoSK9dYFYBDvIC%2FBvfcUL8e1yO9Ca4IsGCwP%2BI8S7QQqwU', 'L3em_2132_lastvisit': '1523266481', 'L3em_2132_sid': 'Vq8I3Y', 'L3em_2132_sendmail': '1', 'UM_distinctid': '162a9f4c56f26e-077863ee37fb44-454c092b-1fa400-162a9f4c570746', 'Hm_lvt_79316e5471828e6e10f2df47721ce150': '1523269814', 'CNZZDATA1253863031': '709519930-1523267010-%7C1523267010', 'Hm_lpvt_79316e5471828e6e10f2df47721ce150': '1523269814', 'Hm_lvt_eaefab1768d285abfc718a706c1164f3': '1523269814', 'Hm_lpvt_eaefab1768d285abfc718a706c1164f3': '1523269814', 'L3em_2132_seccode': '134130.4414113946898638ef', 'L3em_2132_lastact': '1523270100%09member.php%09logging', 'L3em_2132_ulastactivity': '5a0fvYvq4G%2FGIxWgyuIUKNOIje7On8j5d%2BKVSIvQ3qvuJJQT8Enc', 'L3em_2132_auth': 'a975vSsvXdujotA8VvK1mFvDDnG5vVouJnZXWQYlMoaBEdK0ztntNB4o%2BqEtN73MzFGTusQgmv9kry81x%2BP4ilzwZ4F7', 'L3em_2132_lastcheckfeed': '2533730%7C1523270100', 'L3em_2132_checkfollow': '1', 'L3em_2132_lip': '183.12.51.62%2C1523268432'}
parseLoginRes: cookies2 = [b'CNZZDATA1253365484=1182144396-1523264925-%7C1523264925; L3em_2132_saltkey=EIPRiyLR; L3em_2132_security_cookiereport=f6798nBoSK9dYFYBDvIC%2FBvfcUL8e1yO9Ca4IsGCwP%2BI8S7QQqwU; L3em_2132_lastvisit=1523266481; L3em_2132_sid=Vq8I3Y; L3em_2132_sendmail=1; UM_distinctid=162a9f4c56f26e-077863ee37fb44-454c092b-1fa400-162a9f4c570746; Hm_lvt_79316e5471828e6e10f2df47721ce150=1523269814; CNZZDATA1253863031=709519930-1523267010-%7C1523267010; Hm_lpvt_79316e5471828e6e10f2df47721ce150=1523269814; Hm_lvt_eaefab1768d285abfc718a706c1164f3=1523269814; Hm_lpvt_eaefab1768d285abfc718a706c1164f3=1523269814; L3em_2132_seccode=134130.4414113946898638ef; L3em_2132_lastact=1523270100%09member.php%09logging; L3em_2132_ulastactivity=5a0fvYvq4G%2FGIxWgyuIUKNOIje7On8j5d%2BKVSIvQ3qvuJJQT8Enc; L3em_2132_auth=a975vSsvXdujotA8VvK1mFvDDnG5vVouJnZXWQYlMoaBEdK0ztntNB4o%2BqEtN73MzFGTusQgmv9kry81x%2BP4ilzwZ4F7; L3em_2132_lastcheckfeed=2533730%7C1523270100; L3em_2132_checkfollow=1; L3em_2132_lip=183.12.51.62%2C1523268432']
now is using chrome to get page ...
2018-04-09 18:30:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.51baiduyun.com/home.php?mod=spacecp&ac=profile> (referer: http://www.51baiduyun.com/home.php?mod=space&do=notice&view=interactive)
parseLoginStatusRes: statusCode = 200, url = http://www.51baiduyun.com/home.php?mod=spacecp&ac=profile
parseLoginStatusRes: cookies1 = {}
parseLoginStatusRes: cookies2 = [b'CNZZDATA1253365484=1182144396-1523264925-%7C1523264925; L3em_2132_saltkey=EIPRiyLR; L3em_2132_security_cookiereport=f6798nBoSK9dYFYBDvIC%2FBvfcUL8e1yO9Ca4IsGCwP%2BI8S7QQqwU; L3em_2132_lastvisit=1523266481; L3em_2132_sid=Vq8I3Y; L3em_2132_sendmail=1; UM_distinctid=162a9f4c56f26e-077863ee37fb44-454c092b-1fa400-162a9f4c570746; Hm_lvt_79316e5471828e6e10f2df47721ce150=1523269814; CNZZDATA1253863031=709519930-1523267010-%7C1523267010; Hm_lpvt_79316e5471828e6e10f2df47721ce150=1523269814; Hm_lvt_eaefab1768d285abfc718a706c1164f3=1523269814; Hm_lpvt_eaefab1768d285abfc718a706c1164f3=1523269814; L3em_2132_seccode=134130.4414113946898638ef; L3em_2132_lastact=1523270104%09home.php%09space; L3em_2132_ulastactivity=5a0fvYvq4G%2FGIxWgyuIUKNOIje7On8j5d%2BKVSIvQ3qvuJJQT8Enc; L3em_2132_auth=a975vSsvXdujotA8VvK1mFvDDnG5vVouJnZXWQYlMoaBEdK0ztntNB4o%2BqEtN73MzFGTusQgmv9kry81x%2BP4ilzwZ4F7; L3em_2132_lastcheckfeed=2533730%7C1523270100; L3em_2132_checkfollow=1; L3em_2132_lip=183.12.51.62%2C1523268432']
parseLoginStatusRes: response.cookie = [b'L3em_2132_lastact=1523270104%09home.php%09spacecp; expires=Tue, 10-Apr-2018 10:35:04 GMT; Max-Age=86400; path=/', b'L3em_2132_sid=Vq8I3Y; expires=Tue, 10-Apr-2018 10:35:04 GMT; Max-Age=86400; path=/', b'L3em_2132_dismobilemessage=1; expires=Mon, 09-Apr-2018 11:35:04 GMT; Max-Age=3600; path=/']
############################################
text = <!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Cache-control" content="no-cache" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no, minimum-scale=1.0, maximum-scale=1.0">
<meta name="format-detection" content="telephone=no" />
<meta name="keywords" content="" />
<meta name="description" content=",百度云俱乐部" />
<title>提示信息 -  百度云俱乐部 -  手机版 - Powered by Discuz!</title>
<link rel="stylesheet" href="static/image/mobile/style.css" type="text/css" media="all">
<script src="static/js/mobile/jquery-1.8.3.min.js?M41" type="text/javascript"></script>

<script type="text/javascript">var STYLEID = '1', STATICURL = 'static/', IMGDIR = 'static/image/common', VERHASH = 'M41', charset = 'utf-8', discuz_uid = '2533730', cookiepre = 'L3em_2132_', cookiedomain = '', cookiepath = '/', showusercard = '1', attackevasive = '0', disallowfloat = 'newthread', creditnotice = '1|积分|', defaultstyle = '', REPORTURL = 'aHR0cDovL3d3dy41MWJhaWR1eXVuLmNvbS9ob21lLnBocD9tb2Q9c3BhY2VjcCZhYz1wcm9maWxl', SITEURL = 'http://www.51baiduyun.com/', JSPATH = 'static/js/';</script>

<script src="static/js/mobile/common.js?M41" type="text/javascript" charset="utf-8"></script>
</head>

<body class="bg">

<!-- header start -->
<header class="header">
<div class="hdc cl">
<h2><a title="百度云俱乐部" href="forum.php"><img src="static/image/mobile/images/logo.png" /></a></h2>
<ul class="user_fun">
<li><a href="search.php?mod=forum" class="icon_search">搜索</a></li>
<li><a href="forum.php?forumlist=1" class="icon_threadlist">版块列表</a></li>
<li id="usermsg"><a href="home.php?mod=space&uid=2533730&do=profile&mycenter=1" class="icon_userinfo">用户信息</a></li>
</ul>
</div>
</header>
<!-- header end -->
<!-- main jump start -->
<div class="jump_c">
<p><style>body {background:#000000;height:1000px;width:auto;}</style><meta http-equiv="refresh" content="0;url=/home.php?mod=spacecp&ac=plugin&op=credit&id=hux_credit:hux_credit"></p>
    <p>
            <a href="http://www.51baiduyun.com/home.php?mod=spacecp&ac=profile&mobile=no" class="mtn">继续访问</a><br />
            <a href="javascript:history.back();">返回上一页</a>
        </p>
    <p><a class="grey" href="javascript:history.back();">[ 点击这里返回上一页 ]</a></p>
</div>
<!-- main jump end -->


<div id="mask" style="display:none;"></div>
<div class="footer">
<div>
<a href="home.php?mod=space&amp;uid=2533730&amp;do=profile&amp;mycenter=1">ancode2017</a> , <a href="member.php?mod=logging&amp;action=logout&amp;formhash=4f1a511e" title="退出" class="dialog">退出</a>
</div>
    <div>
<a href="http://www.51baiduyun.com/home.php?mod=spacecp&ac=profile&mobile=1&simpletype=no">标准版</a> |  
<a href="javascript:;" style="color:#D7D7D7;">触屏版</a> | 
<a href="http://www.51baiduyun.com/home.php?mod=spacecp&ac=profile&mobile=no">电脑版</a> | 
<a href="http://www.discuz.net/mobile.php?platform=android">客户端</a>    </div>
<p>&copy; Comsenz Inc.</p>
</div>

<span style="display:none"><script src="http://s95.cnzz.com/z_stat.php?id=1253863031&web_id=1253863031" type="text/javascript" language="JavaScript"></script></span>

<script>
var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "//hm.baidu.com/hm.js?eaefab1768d285abfc718a706c1164f3";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
</script>

</body>
</html>
mySpiderCloseHandle: enter 
2018-04-09 18:30:36 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-09 18:30:36 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:62535/session/6498d461224b10330c9c9b8de2e1d36f {"sessionId": "6498d461224b10330c9c9b8de2e1d36f"}
2018-04-09 18:30:36 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2018-04-09 18:30:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2734,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8194,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 9, 10, 30, 36, 893602),
 'log_count/DEBUG': 37,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 4, 9, 10, 30, 12, 856602)}
2018-04-09 18:30:39 [scrapy.core.engine] INFO: Spider closed (finished)
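One practical extension (my own sketch, not part of the original post): since the interactive captcha makes each login expensive, the selenium cookies can be persisted to disk after a successful login and reloaded on later runs until the session expires:

```python
import json

def save_cookies(selenium_cookies, path):
    """Persist the output of browser.get_cookies() so later runs can skip the login."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(selenium_cookies, f)

def load_cookies(path):
    """Reload saved cookies as the flat name -> value dict scrapy.Request expects."""
    with open(path, "r", encoding="utf-8") as f:
        return {c["name"]: c["value"] for c in json.load(f)}
```

On startup the spider could try load_cookies() first and fall back to the full selenium login only when the saved session turns out to be stale (e.g. the test request comes back 302).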

