A General-Purpose Module for Logging In to Websites with Cookies

大尾巴鱼_root

Last modified on 2022-08-31 10:30:39

First published on 2022-07-21 11:35:06

Article link: https://blog.csdn.net/weixin_41936572/article/details/125908416

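Environment

Operating system: Windows 10 x64
IDE: Visual Studio Code
Python version: v3.10.5, 64-bit
– Decryption requires the os, sqlite3, win32crypt (from pywin32), base64, cryptography, and json libraries
– Accessing web pages requires the urllib and http libraries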

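I. Overview

This article is a companion to another post, 基于Python通过Chrome的Cookie登录百度账户 (logging in to a Baidu account from Python with Chrome's cookies). After wrapping that code up, I ended up with a general-purpose module that can produce cookie files. For the reasoning behind some of the steps, refer to that post.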

II. Project file structure

Project directory

[dir] log
[dir] modules
[dir] resource
[file] main.py

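III. Building the resources

1. Create the configuration file config in the resource folder

resource

[file] config

Its contents are as follows. Each key is a cookie host, and its two values are written into the second and fourth columns of the generated cookie file (the subdomain flag and the secure flag):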

{
    ".csdn.net": ["TRUE","FALSE"],
    ".www.csdn.net": ["TRUE","FALSE"]
}
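
As a minimal sketch of how this file is consumed, the snippet below parses the same JSON and looks up the two flags for a host. The helper name flags_for and its fallback for unlisted hosts are assumptions made for illustration; cookie.py performs its own lookup.

```python
import json

# Same content as resource/config, inlined so the sketch runs on its own
CONFIG_TEXT = '''
{
    ".csdn.net": ["TRUE", "FALSE"],
    ".www.csdn.net": ["TRUE", "FALSE"]
}
'''


def flags_for(host, config):
    """Return [subdomain flag, secure flag] for host (hypothetical helper).

    In this sketch, hosts missing from the config fall back to FALSE for both.
    """
    return config.get(host, ["FALSE", "FALSE"])


config = json.loads(CONFIG_TEXT)
print(flags_for(".csdn.net", config))    # a host listed in the config
print(flags_for("example.com", config))  # an unlisted host
```

Keeping the flags in a separate file means a new site can be supported by adding an entry rather than editing code.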

2. Create the supporting modules conn.py, cookie.py, decode.py, download.py, and __init__.py in the modules folder:

modules

[file] conn.py
[file] cookie.py
[file] decode.py
[file] download.py
[file] __init__.py

Their contents are as follows:

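2.1 conn.py: inherits from sqlite3.Connection and adds context management so it can be used in a with statement

The Conn class is called as Conn(path: str), where path is the path to the database file. It is meant specifically for opening SQLite databases.
Usage example:

with Conn(DBfiles) as f:
    {your scripts}

The with statement yields a cursor object that can be used directly.

Source code: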

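import sqlite3


class Conn(sqlite3.Connection):
    '''sqlite3 connection whose with block yields a cursor and closes everything on exit'''

    def __init__(self, path: str) -> None:
        super().__init__(path)
        self._path = path
        self._cursor = None

    def __enter__(self) -> sqlite3.Cursor:
        # Keep a reference so that __exit__ closes the cursor that was handed out
        self._cursor = self.cursor()
        return self._cursor

    def __exit__(self, exc_type, exc_value, exc_traceback) -> None:
        if self._cursor is not None:
            self._cursor.close()
        self.close()
        # Returns None, so an exception raised inside the with block propagates

    def __repr__(self) -> str:
        return 'Conn(path = %s)' % self._path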
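
The pattern can be tried without Chrome's database. The following is a minimal sketch, assuming an in-memory SQLite database and a hypothetical helper open_cursor built with contextlib rather than the Conn class itself; it only illustrates a with block that hands out a cursor and closes everything on exit.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def open_cursor(path):
    """Yield a cursor for the SQLite database at path, then close cursor and connection."""
    connection = sqlite3.connect(path)
    cursor = connection.cursor()
    try:
        yield cursor
    finally:
        # Runs whether or not the with block raised
        cursor.close()
        connection.close()


with open_cursor(":memory:") as cur:
    cur.execute("create table demo (name text, value text)")
    cur.execute("insert into demo values (?, ?)", ("sid", "abc"))
    cur.execute("select name, value from demo")
    rows = cur.fetchall()

print(rows)
```

Nothing is committed on exit, which is enough for read-only use such as reading cookies.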

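2.2 cookie.py: turns the raw material it is given into a usable cookie file

The Cookie class is called as Cookie(list(tuples,...)). The argument is a list of tuples, and each tuple holds the fields of one cookie in the order (creation_utc, host_key, name, value, path). The main method is getOutFile, which writes out a cookie file.

Cookie.getOutFile(path: str)
Writes the cookie file. path is the output directory prefix (default './log/'); the file itself is named Cookie<Unix timestamp>.log.
For the meaning and usage of the other methods, see the comments in the source code.

Source code: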

import json
import time
from typing import Iterator, List, Tuple


# Per-host flags from the config file: [subdomain flag, secure flag]
with open('./resource/config', 'r', encoding='utf-8') as f:
    BAIDU_BOOL = json.load(f)

# Cookie file header, needed in the last step when the cookie file is generated by hand
COOKIE_HEADER = """# Netscape HTTP Cookie File\n# http://curl.haxx.se/rfc/cookie_spec.html\n# This is a generated file!  Do not edit.\n\n"""


class Cookie:
    '''This class can be used to generate a cookie file'''
    def __init__(self, cookieList: List[Tuple[str, str, str, str, str]]) -> None:
        self.cookieList = cookieList

    def __repr__(self) -> str:
        return 'Cookie(count=%d)' % len(self.cookieList)

    @staticmethod
    def _formatLine(row) -> str:
        '''Formats one row as a tab-separated cookie line without a trailing newline'''
        creation_utc, host_key, name, value, cookie_path = row
        # For hosts missing from the config, the subdomain flag follows the leading dot,
        # because MozillaCookieJar.load() expects the two to agree; secure defaults to FALSE
        default = ['TRUE' if str(host_key).startswith('.') else 'FALSE', 'FALSE']
        subdomains, secure = BAIDU_BOOL.get(host_key, default)
        # creation_utc is the value this module writes into the expiry column
        return '\t'.join(str(x) for x in (host_key, subdomains, cookie_path, secure, creation_utc, name, value))

    def getOne(self) -> Iterator[str]:
        '''This method wraps the cookies in a generator object, consumed with next()'''
        for row in self.cookieList:
            yield self._formatLine(row)

    def getAll(self) -> None:
        '''This method prints every cookie line to the console'''
        for line in self.getOne():
            print(line)

    def getOutFile(self, path: str = './log/') -> None:
        '''This method writes the cookies to a file'''
        with open(path + 'Cookie' + str(int(time.time())) + '.log', 'w') as f:
            # Write the header
            f.write(COOKIE_HEADER)
            # Write the body
            for line in self.getOne():
                f.write(line + '\n')
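
To see what one line of such a file looks like and to confirm that the standard library can read it back, here is a minimal sketch. The helper cookie_line and the cookie values are made up for the demonstration; only http.cookiejar.MozillaCookieJar comes from the standard library.

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar

# The loader checks that the first line looks like this
HEADER = "# Netscape HTTP Cookie File\n\n"


def cookie_line(domain, path, secure, expires, name, value):
    """Build one tab-separated cookie line (hypothetical helper)."""
    # The subdomain flag follows the leading dot of the domain
    include_subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    fields = (domain, include_subdomains, path,
              "TRUE" if secure else "FALSE", str(expires), name, value)
    return "\t".join(fields)


# Made-up cookie for the demonstration
line = cookie_line(".example.com", "/", False, 2000000000, "sid", "abc123")

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "cookies.txt")
    with open(filename, "w") as f:
        f.write(HEADER + line + "\n")
    jar = MozillaCookieJar(filename)
    jar.load(ignore_discard=True, ignore_expires=True)
    loaded = [(c.domain, c.name, c.value) for c in jar]

print(loaded)
```

The seven columns are domain, subdomain flag, path, secure flag, expiry, name, and value, which is the order the Cookie class writes them in.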

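2.3 decode.py: decrypts the cookies of the Chrome browser

It provides two functions: one obtains the key, and the other uses the key to decrypt the ciphertext.

DecodeKey(path: str)
Obtains the key. path is the path to the file that holds the key; the exact path is given in the main.py source below.
DecodeValue(bytes: bytes, key: bytes)
Decrypts the ciphertext. bytes is the encrypted byte string, and key is the key obtained above.

Source code: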

from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import json
import base64
import win32crypt


def DecodeKey(path: str) -> bytes:
    '''Decrypts the key stored in the Local State file'''
    with open(path, 'r', encoding='utf-8') as f:
        jsonStr = json.load(f)
        encryptedKey = jsonStr['os_crypt']['encrypted_key']
    encrypted_key_with_header = base64.b64decode(encryptedKey)
    # Drop the 5-byte header in front of the protected key
    encrypted_key = encrypted_key_with_header[5:]
    # CryptUnprotectData returns (description, data); the key is the data part
    return win32crypt.CryptUnprotectData(encrypted_key, None, None, None, 0)[1]


def DecodeValue(bytes: bytes, key: bytes) -> str:
    '''Decrypts the encrypted value of a cookie'''
    # According to what a Baidu search turned up, valEncrypted has three parts:
    # a 3-byte prefix, a 12-byte nonce, and the ciphertext that fills the rest
    nonce = bytes[3:15]
    cipherbytes = bytes[15:]

    aesgcm = AESGCM(key)
    plainbytes = aesgcm.decrypt(nonce, cipherbytes, None)

    return plainbytes.decode('utf-8')
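
The slicing in both functions is easier to follow on a made-up blob. The sketch below uses only the standard library and decrypts nothing; the helper names are hypothetical, and the v10 and DPAPI markers are an assumption about what the skipped prefixes usually contain. AES-GCM appends a 16-byte authentication tag to the ciphertext, which DecodeValue passes along as part of cipherbytes.

```python
import base64


def split_encrypted_value(blob):
    """Split a cookie blob into (prefix, nonce, ciphertext, tag)."""
    prefix = blob[:3]          # skipped by DecodeValue
    nonce = blob[3:15]         # 12-byte nonce
    ciphertext = blob[15:-16]  # encrypted cookie value
    tag = blob[-16:]           # 16-byte GCM authentication tag
    return prefix, nonce, ciphertext, tag


def strip_key_header(encoded_key):
    """Base64-decode the stored key and drop its 5-byte header."""
    return base64.b64decode(encoded_key)[5:]


# Made-up bytes that only imitate the layout
sample = b"v10" + bytes(range(12)) + b"ciphertext-bytes" + b"T" * 16
prefix, nonce, ciphertext, tag = split_encrypted_value(sample)
print(prefix, len(nonce), ciphertext, len(tag))

fake_key = base64.b64encode(b"DPAPI" + b"protected-key").decode("ascii")
print(strip_key_header(fake_key))
```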

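2.4 download.py: downloads resources

The Downunit class is called as Downunit(url: str, path: str, threadnum: int = 3). url is the address of the resource, path is the path and name of the file to save, and threadnum is the number of threads to use (default 3). The main method is Download.

Downunit.Download()
Starts the download.

Source code: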

import threading
from urllib.request import Request, urlopen


class Downunit():
    def __init__(self, url: str, path: str, threadnum: int = 3) -> None:
        self.url = url
        self.path = path
        self.threadnum = threadnum
        self.CONTENT_LENGTH = 0
        self.headers = {}

    def Download(self):
        # Ask the server for the total size first
        request = Request(url=self.url, headers=self.headers)
        with urlopen(request) as response:
            self.CONTENT_LENGTH = int(response.headers.get('Content-Length', 0))

        # Create (or empty) the target file once, before any thread writes to it
        open(self.path, 'wb').close()

        unitBox = self.CONTENT_LENGTH // self.threadnum + 1  # bytes each thread is responsible for

        threads = []
        for i in range(self.threadnum):
            os_Start = i * unitBox  # start offset of this thread's part

            f = open(self.path, 'rb+')  # 'rb+' does not truncate what other threads wrote
            f.seek(os_Start, 0)
            td = DownThread(self.url, self.headers, f, os_Start, unitBox)
            td.start()
            threads.append(td)

        # Wait until every part has been written
        for td in threads:
            td.join()


class DownThread(threading.Thread):
    def __init__(self, url, headers, file, os_Start, unitBox) -> None:
        super().__init__()
        self.url = url
        self.headers = headers
        self.file = file
        self.os_Start = os_Start
        self.unitBox = unitBox

    def run(self):
        request = Request(url=self.url, headers=self.headers)
        with urlopen(request) as f, self.file:
            # Move to the start offset by reading and discarding the bytes before it
            remaining = self.os_Start
            while remaining > 0:
                skipped = f.read(min(65536, remaining))
                if not skipped:
                    return
                remaining -= len(skipped)

            lenth = 0
            while lenth < self.unitBox:
                # Never read past the end of this thread's part
                page = f.read(min(1024, self.unitBox - lenth))
                if not page:
                    break
                self.file.write(page)
                lenth += len(page)
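
The way the file is divided among threads can be checked in isolation. The sketch below reuses the unitBox formula from Downunit.Download; the function name split_ranges is hypothetical, and no network access is involved.

```python
def split_ranges(total, parts):
    """Return (start, end) byte offsets, end exclusive, one pair per worker."""
    unit = total // parts + 1  # same formula as unitBox in Downunit.Download
    ranges = []
    for i in range(parts):
        start = i * unit
        end = min(start + unit, total)  # the last part is usually shorter
        if start < end:                 # skip parts that would be empty
            ranges.append((start, end))
    return ranges


print(split_ranges(10, 3))  # [(0, 4), (4, 8), (8, 10)]
```

Because of the + 1, the parts never fall short of the total length; the final part simply stops at the end of the file.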

2.5 __init__.py: the package initializer

from .conn import *
from .cookie import *
from .decode import *
from .download import *

IV. Building the main program main.py, whose goal is to obtain the cookies

Source code:

from modules import *
import os


# Paths of the files that store the cookies and the key
CHROME_COOKIE_PATH = r'\Google\Chrome\User Data\Default\network\Cookies'
CHROME_LOCALSTATE_PATH = r'\Google\Chrome\User Data\Local State'

localAppData = os.environ['LOCALAPPDATA']

# The cookie file, which is really an SQLite database
dbCookies = localAppData + CHROME_COOKIE_PATH

# The file that stores the key, which is really a JSON file
fileLocalState = localAppData + CHROME_LOCALSTATE_PATH

# Find the cookies with a fuzzy search for csdn
KEY_WORD = r'%csdn%'  #  <---- Important: set the keyword as needed. To get Baidu cookies, for example, change it to %baidu% -----

with Conn(dbCookies) as cur:
    # The keyword is passed as a bound parameter rather than formatted into the SQL text
    sql = """select creation_utc,host_key,name,encrypted_value,path from cookies where host_key like ?"""
    cur.execute(sql, (KEY_WORD,))
    valCookiesWithEncode = cur.fetchall()

# Get the key from the Local State file
key = DecodeKey(fileLocalState)

valCookiesWithDecode = []
for i in valCookiesWithEncode:
    creation_utc, host_key, name, valEncrypted, path = i

    value = DecodeValue(valEncrypted, key)
    valCookiesWithDecode.append((creation_utc, host_key, name, value, path))

cookies = Cookie(valCookiesWithDecode)
cookies.getOutFile()
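
The fuzzy search can be tried without touching the real Chrome database. This is a minimal sketch, assuming an in-memory table that has only the five columns main.py selects and a few made-up rows; the helper find_hosts is hypothetical. It passes the keyword as a bound parameter.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cur = connection.cursor()
cur.execute(
    "create table cookies (creation_utc integer, host_key text, "
    "name text, encrypted_value blob, path text)"
)
# Made-up rows standing in for Chrome's cookie table
rows = [
    (1, ".csdn.net", "a", b"x", "/"),
    (2, "blog.csdn.net", "b", b"y", "/"),
    (3, ".example.com", "c", b"z", "/"),
]
cur.executemany("insert into cookies values (?, ?, ?, ?, ?)", rows)


def find_hosts(cursor, keyword):
    """Return (host_key, name) pairs whose host matches the LIKE pattern."""
    cursor.execute(
        "select host_key, name from cookies "
        "where host_key like ? order by creation_utc",
        (keyword,),
    )
    return cursor.fetchall()


matches = find_hosts(cur, "%csdn%")
print(matches)
connection.close()
```

The % wildcards are part of the bound value, so changing the site only means changing the keyword.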

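V. Application

Obtaining the forged cookie file is the main purpose of this article, so the description of the general-purpose module ends here. Readers who already know how to write crawlers can take the cookies and try writing their own crawlers for various sites. For those who do not, I have included below an example that accesses CSDN with the cookie file produced by this module, for reference.

1. Introduction

Cookies are the key to crawling some websites. Once you have the cookie file, you can use it to access certain sites, especially those that require the user to log in. The following uses CSDN as an example to explain how to crawl a site with cookies.

2. Build app.py to access CSDN

The key to this step is constructing the request headers, so it is fairly flexible: build them to fit your own situation and make good use of the browser's F12 developer tools (this article does not cover how to obtain the headers; consult other material for that). If the crawl fails, check whether the content you want is actually delivered through an XHR request.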
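
When headers are copied out of the developer tools as plain text, a small helper can turn them into a dict. The sketch below assumes one "name: value" pair per line, which is an assumption about the copied format; parse_raw_headers is a hypothetical helper, and it drops the leading colon from names that have one.

```python
def parse_raw_headers(raw):
    """Turn 'name: value' lines into a dict, dropping a leading colon from names."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Names such as ':authority' start with a colon; drop that first colon
        if line.startswith(":"):
            line = line[1:]
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a header line
        headers[name.strip().lower()] = value.strip()
    return headers


RAW = """
:authority: bizapi.csdn.net
accept: application/json, text/plain, */*
referer: https://i.csdn.net/
"""

headers = parse_raw_headers(RAW)
print(headers)
```

Splitting at the first colon only keeps values that contain colons, such as URLs, intact.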

Source code:

from urllib.request import Request, HTTPCookieProcessor, build_opener
from http.cookiejar import MozillaCookieJar
import time
import ssl
import gzip

# Turns off HTTPS certificate verification for urllib requests in this process
ssl._create_default_https_context = ssl._create_unverified_context

# The page to crawl. Note that CSDN content is not necessarily at the address-bar URL; it may be delivered through XHR
url = 'https://bizapi.csdn.net/community-personal/v1/get-work?username=weixin_41936572'

# It is best to carry over every header of the page being crawled (for names that start with a colon, drop the colon)
headers = {  # <---------- These headers cannot be used as-is; I removed some things, so please build your own ------
    'authority': 'bizapi.csdn.net',
    'method': 'GET',
    'path': '/community-personal/v1/get-work?username=weixin_41936572',
    'scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip',  # only gzip, because that is the only encoding decoded below
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    # No 'cookie' entry here: if the request already carries a Cookie header,
    # the cookie jar does not add the cookies loaded from the file
    'origin': 'https://i.csdn.net',
    'pragma': 'no-cache',
    'referer': 'https://i.csdn.net/',
    'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'x-ca-key': '203796071',
    'x-ca-nonce': '',
    'x-ca-signature': '',
    'x-ca-signature-headers': 'x-ca-key,x-ca-nonce'
}

cookie_jar = MozillaCookieJar('./log/Cookie1658368460.log') # <----- Change this to the file you captured yourself with main.py -----
cookie_jar.load(ignore_discard=True, ignore_expires=True)

cookie_processor = HTTPCookieProcessor(cookie_jar)
opener = build_opener(cookie_processor)

request = Request(url, headers=headers, method='GET')

response = opener.open(request)
data = response.read()
# Decompress only when the server actually gzip-compressed the body
if response.headers.get('Content-Encoding') == 'gzip':
    data = gzip.decompress(data)
print(data) # <---- This is the content of the page

cookie_jar.save('./log/Cookie'+str(int(time.time())) + '.log',  # <-- A new visit produces new cookies; this is how to save them
                ignore_discard=True, ignore_expires=True)
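
Whether the body needs decompressing depends on the Content-Encoding response header, not on what the request asked for. The following is a minimal sketch of that decision using only the standard library; decode_body is a hypothetical helper, and treating deflate as zlib-wrapped data is an assumption that some servers do not follow.

```python
import gzip
import zlib


def decode_body(data, content_encoding):
    """Return the decoded body for the given Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form of deflate
        return zlib.decompress(data)
    if encoding in ("", "identity"):
        return data  # body was sent uncompressed
    # For example br, which the standard library cannot decode
    raise ValueError("unsupported Content-Encoding: " + encoding)


body = b'{"ok": true}'
print(decode_body(gzip.compress(body), "gzip"))
print(decode_body(body, None))
```

In app.py the header value would come from response.headers.get('Content-Encoding').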