Scrapy: Using a Hard-Coded Cookie to Crawl Pages That Require Login

Latest recommended article published on 2024-07-07 21:04:35

瓦力冫

Category: scrapy

Article link: https://blog.csdn.net/fox64194167/article/details/79775301

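1. Workflow

1.1 Open the target URL in Chrome and log in manually with your username and password. Once logged in, inspect the cookie and attach it to your requests.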

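2. Pros and cons

Pros:

1. You can bypass the CAPTCHA.

2. You don't need to write any login code.

Cons:

1. On some sites the cookie expires quickly.

2. Sending the cookie with every request uses slightly more bandwidth.

So this approach suits small sites and short-lived crawls.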
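Because an expired cookie is the main risk here, it can help to check whether a response is really the page you asked for. Below is a minimal sketch, assuming the site either redirects logged-out visitors to a login page or answers with 401/403. The helper name `looks_logged_out` and the path hints are hypothetical and need adjusting for the target site.

```python
from urllib.parse import urlparse

# Hypothetical markers: replace with whatever the target site uses for its login page.
LOGIN_PATH_HINTS = ("/login", "/signin", "/account/login")


def looks_logged_out(final_url, status=200):
    """Return True if the response looks like a login page or an auth error."""
    # 401/403 usually mean the cookie was rejected.
    if status in (401, 403):
        return True
    # Otherwise check whether we ended up on a login-like path after redirects.
    path = urlparse(final_url).path.lower()
    return any(path.startswith(hint) for hint in LOGIN_PATH_HINTS)
```

In a Scrapy callback this could be called as `looks_logged_out(response.url, response.status)`. Note that non-2xx responses only reach a callback if you allow them, for example with the spider's `handle_httpstatus_list` attribute.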

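3. Getting the cookie

Open the target URL in Chrome; in my case it is https://mp.csdn.net/postlist/list/all, the CSDN blog admin page. Open Chrome's developer tools and switch to the Network tab, then reload the page so that the requests are recorded. Select the page request and look at its Cookie request header: that long string is the cookie, in the form key1=value1; key2=value2.

Scrapy, however, expects the cookies as a dictionary:

{'key1': 'value1', 'key2': 'value2'}

So use the following helper code to convert the string.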

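# -*- coding: utf-8 -*-


class transCookie:
    """Convert a browser cookie string ("k1=v1; k2=v2") into a dict."""

    def __init__(self, cookie):
        self.cookie = cookie

    def stringToDict(self):
        itemDict = {}
        for item in self.cookie.split(';'):
            # Skip empty fragments (e.g. after a trailing ';') and malformed pairs.
            if '=' not in item:
                continue
            # Split on the first '=' only, because cookie values may contain '='.
            key, value = item.split('=', 1)
            itemDict[key.strip()] = value.strip()
        return itemDict


if __name__ == "__main__":
    cookie = "your own cookie string"
    trans = transCookie(cookie)
    print(trans.stringToDict())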

Paste the long cookie string into this script, run it, and copy the dictionary from the output.
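As an alternative to splitting the string by hand, the standard library can do the parsing. The sketch below is not the author's method; it uses `http.cookies.SimpleCookie`, and the function name `cookie_header_to_dict` and the sample cookie values are made up for illustration. `SimpleCookie` is stricter than a manual split and may skip cookies containing characters it considers illegal, so compare the result with what the browser shows.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(raw):
    """Parse a 'k1=v1; k2=v2' Cookie header string into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw)  # parses the raw header string into Morsel objects
    return {name: morsel.value for name, morsel in jar.items()}


if __name__ == "__main__":
    # Sample values only; paste your own cookie string here.
    print(cookie_header_to_dict("uid=42; session=abc123"))
```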

4. Writing the spider with the cookie

import logging

import scrapy

from tutorial.items import CSDNItem


class CSDNSpider(scrapy.Spider):
    name = "csdn"

    def start_requests(self):
        start_url = 'https://mp.csdn.net/postlist/list/all/'
        # Paste the dictionary produced in step 3 here.
        cookie = {}
        headers = {
            'Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'
        }
        # No callback is given, so Scrapy passes the response to parse().
        yield scrapy.Request(url=start_url, headers=headers, cookies=cookie)
        #yield scrapy.Request(url=start_url, headers=headers)

    def parseDetail(self, response):
        # Extract the title and body of a single article page.
        item = CSDNItem()
        item['title'] = response.css('.csdn_top::text').extract_first()
        item['body'] = response.css('#article_content .htmledit_views').extract_first()
        yield item

    def parse(self, response):
        # Follow every article link on the list page.
        for article in response.css('.list-item-title .article-list-item-txt'):
            articleId = article.css('a::attr("href")').extract_first()
            if articleId is not None:
                # The article ID is the last path segment of the href.
                articleId = articleId[articleId.rfind("/") + 1:]
                next_page = 'https://blog.csdn.net/fox64194167/article/details/%s' % articleId
                yield response.follow(next_page, self.parseDetail)

        # The active pagination item holds the current page number.
        bottomNavNum = response.css('.page-item.active a::text').extract_first()

        # Check for None before converting, otherwise int(None) raises TypeError.
        if bottomNavNum is not None:
            logging.info(int(bottomNavNum))
            next_page = ('https://mp.csdn.net/postlist/list/all/%d' % (int(bottomNavNum) + 1))
            logging.info('next_page:' + next_page)
            yield response.follow(next_page, self.parse)

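Note that the cookie in this code is currently empty; fill in the dictionary produced in the previous step. Also, my article detail URLs look like https://blog.csdn.net/fox64194167/article/details/3333, so change that URL pattern to match your own.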
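To make that URL pattern easier to change, the ID extraction can be pulled into a small function with the template as a parameter. This is only a sketch: the function name `build_detail_url` and the sample href `/postedit/3333` are assumptions, and unlike the spider code it also strips a trailing slash before taking the last path segment.

```python
# Template taken from the article; swap in your own blog's detail URL pattern.
DETAIL_URL_TEMPLATE = "https://blog.csdn.net/fox64194167/article/details/%s"


def build_detail_url(href, template=DETAIL_URL_TEMPLATE):
    """Build an article detail URL from the last path segment of a list-page href."""
    # Drop any trailing slash, then keep everything after the final '/'.
    article_id = href.rstrip("/").rsplit("/", 1)[-1]
    return template % article_id


if __name__ == "__main__":
    # The href format below is an assumed example, not taken from the real page.
    print(build_detail_url("/postedit/3333"))
```

Inside `parse`, the two lines that compute `articleId` and `next_page` could then be replaced by a single call to this function.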