Web Crawler Tutorial Series (10.2): Simulated Login Scraping in Practice with Session and Cookie

Article link: https://blog.csdn.net/qq_51749909/article/details/143089371

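Preface

Simulated login means using a program to complete the user login process automatically, so that the program can reach data or pages that are only available after logging in.

This section covers the basic principle of simulated login and its two main modes: login based on Session and Cookie, and login based on JWT (JSON Web Token). The hands-on example here uses the Session and Cookie mode.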

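1. Preparation

Before attempting a simulated login, have the following tools and libraries ready:

requests: sends HTTP requests.
Selenium: drives a real browser.
Redis: stores account and Cookie information.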

2. Case Introduction

The example site is https://login2.scrape.center/, which only shows its data after you log in.

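3. Simulated Login

Simulated login consists of sending a login request, obtaining the Cookie, and then using that Cookie for subsequent requests.

3.1 Analyzing the login process

Inspecting the login request with the browser's developer tools shows that the login URL is https://login2.scrape.center/login and that the username and password are submitted with the POST method.

Send the login request: imitate the user's login by sending a POST request that contains the username and password.
Get the response: once the server has verified the credentials, it returns a response whose Set-Cookie header carries a Cookie containing the Session ID.
Send requests with the Cookie: subsequent requests carry that Cookie to keep the login state.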
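
The three steps can be tried out without touching the real site. The following is a minimal sketch, not the site's actual implementation: ToyLoginHandler is a made-up local server that accepts admin/admin, and the cookie name sessionid, the 403 reply, and all function names are assumptions chosen for illustration. It uses only the standard library (urllib rather than requests) so that the Set-Cookie and Cookie headers are handled by hand and stay visible.

```python
import http.cookies
import secrets
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SESSIONS = set()  # session IDs the toy server treats as logged in
# Opener that ignores any proxy settings, since the server runs locally
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class ToyLoginHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a Session/Cookie site (illustration only)."""

    def log_message(self, *args):  # keep the demo output quiet
        pass

    def _reply(self, status, body, extra_headers=()):
        self.send_response(status)
        for name, value in extra_headers:
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        valid = form.get("username") == ["admin"] and form.get("password") == ["admin"]
        if self.path == "/login" and valid:
            session_id = secrets.token_hex(8)
            SESSIONS.add(session_id)
            # Step 2: the server hands the session ID back via Set-Cookie
            self._reply(200, b"ok", [("Set-Cookie", f"sessionid={session_id}; Path=/")])
        else:
            self._reply(401, b"bad credentials")

    def do_GET(self):
        cookie = http.cookies.SimpleCookie(self.headers.get("Cookie", ""))
        morsel = cookie.get("sessionid")
        if morsel is not None and morsel.value in SESSIONS:
            self._reply(200, b"secret page")
        else:
            self._reply(403, b"login required")


def fetch_status(url, cookie_header=None):
    """GET url, optionally with a Cookie header, and return the status code."""
    request = urllib.request.Request(url)
    if cookie_header:
        request.add_header("Cookie", cookie_header)
    try:
        with OPENER.open(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_demo():
    """Return the status of the protected page before and after logging in."""
    server = HTTPServer(("127.0.0.1", 0), ToyLoginHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        before = fetch_status(base + "/page/1")
        # Step 1: POST the credentials as form data
        form = urllib.parse.urlencode({"username": "admin", "password": "admin"}).encode()
        with OPENER.open(base + "/login", data=form) as response:
            set_cookie = response.headers["Set-Cookie"]
        # Step 3: send the name=value pair back in the Cookie header
        jar = http.cookies.SimpleCookie(set_cookie)
        cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
        after = fetch_status(base + "/page/1", cookie_header)
        return before, after
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The same protected URL is refused without the Cookie and served with it, which is all that "keeping the login state" amounts to in this mode.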

3.2 Simulating login with requests

import requests
from urllib.parse import urljoin

BASE_URL = "https://login2.scrape.center/"
LOGIN_URL = urljoin(BASE_URL, "/login")
INDEX_URL = urljoin(BASE_URL, "/page/1")
USERNAME = 'admin'
PASSWORD = 'admin'

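# Send the login request. allow_redirects=False keeps the login response itself:
# response.cookies only holds cookies set by the final response, so if the login
# answers with a redirect, following it would lose the Set-Cookie header.
response_login = requests.post(LOGIN_URL, data={
    "username": USERNAME,
    "password": PASSWORD
}, allow_redirects=False)
# Read the Cookie set by the login response
cookies = response_login.cookies
print('Cookies', cookies)

# Reuse the Cookie for the follow-up request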
response_index = requests.get(INDEX_URL, cookies=cookies)
print('Response Status', response_index.status_code)
print('Response URL', response_index.url)

3.3 Simplifying the process with a Session object

import requests
from urllib.parse import urljoin

BASE_URL = "https://login2.scrape.center/"
LOGIN_URL = urljoin(BASE_URL, "/login")
INDEX_URL = urljoin(BASE_URL, "/page/1")
USERNAME = 'admin'
PASSWORD = 'admin'

# A Session object stores the cookies it receives and sends them on later requests
session = requests.Session()
response_login = session.post(LOGIN_URL, data={
    "username": USERNAME,
    "password": PASSWORD
})
cookies = session.cookies
print('Cookies', cookies)

response_index = session.get(INDEX_URL)
print('Response Status', response_index.status_code)
print('Response URL', response_index.url)
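
Both scripts print the status code and the final URL but leave the interpretation to the reader. One way to turn the final URL into a yes/no answer is sketched here. It assumes the site sends unauthenticated visitors to a path ending in /login; that behavior, the helper name, and the sample URLs are assumptions for illustration, not something the scripts above confirm.

```python
from urllib.parse import urlsplit


def landed_on_login_page(final_url, login_path="/login"):
    """Return True if the final URL's path is the login path (query string ignored)."""
    path = urlsplit(final_url).path.rstrip("/")
    return path == login_path


# Sample URLs of the kind response_index.url might contain
print(landed_on_login_page("https://login2.scrape.center/page/1"))
print(landed_on_login_page("https://login2.scrape.center/login?next=/page/1"))
```

With requests, the value to pass in would be response_index.url; a True result suggests the Cookie was missing or rejected.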

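4. Getting the Cookie with Selenium

When the login process is too complex to reproduce with plain HTTP requests, Selenium can drive a real browser through the login and hand over the resulting Cookie.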

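from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import urljoin
import requests
import time

BASE_URL = "https://login2.scrape.center/"
LOGIN_URL = urljoin(BASE_URL, "/login")
INDEX_URL = urljoin(BASE_URL, "/page/1")
USERNAME = 'admin'
PASSWORD = 'admin'

# Log in through a real browser driven by Selenium
browser = webdriver.Chrome()
browser.get(LOGIN_URL)
# find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
# which has been removed from recent Selenium 4 releases
browser.find_element(By.CSS_SELECTOR, 'input[name="username"]').send_keys(USERNAME)
browser.find_element(By.CSS_SELECTOR, 'input[name="password"]').send_keys(PASSWORD)
browser.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()
time.sleep(10)  # crude fixed wait for the login to complete

# Read the cookies from the browser
cookies = browser.get_cookies()
print('Cookies', cookies)
browser.quit()  # end the whole browser session, not just the current window

# Copy the cookies into a requests session and reuse them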
session = requests.Session()
for cookie in cookies:
    session.cookies.set(cookie['name'], cookie['value'])

response_index = session.get(INDEX_URL)
print('Response Status', response_index.status_code)
print('Response URL', response_index.url)
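
browser.get_cookies() returns a list of dicts, each with name and value keys plus extra fields such as domain and path. For plain HTTP clients only the name and value matter. The helpers below are an illustrative sketch (the function names and the sample cookies are made up): one builds a Cookie header string, the other a plain dict of the kind requests accepts through its cookies= argument.

```python
def cookies_to_header(selenium_cookies):
    """Join Selenium-style cookie dicts into a 'name=value; ...' Cookie header."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def cookies_to_dict(selenium_cookies):
    """Reduce Selenium-style cookie dicts to a plain {name: value} mapping."""
    return {c["name"]: c["value"] for c in selenium_cookies}


# Made-up sample in the shape returned by browser.get_cookies()
sample = [
    {"name": "sessionid", "value": "abc123", "domain": "login2.scrape.center", "path": "/"},
    {"name": "csrftoken", "value": "xyz789", "domain": "login2.scrape.center", "path": "/"},
]
print(cookies_to_header(sample))  # sessionid=abc123; csrftoken=xyz789
print(cookies_to_dict(sample))    # {'sessionid': 'abc123', 'csrftoken': 'xyz789'}
```

The header form is convenient when the Cookie has to be stored as a single string, for example as one value in a Redis hash.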

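5. Account Pool

An account pool is a technique for managing a large number of accounts. By spreading requests across different accounts, it lowers the risk of any single account being banned and raises the concurrency the crawler can sustain.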

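5.1 Architecture of the account pool

The account pool consists of the following modules:

Storage module: uses Redis to store account information and Cookies.
Getter module: generates new Cookies and writes them to the storage module.
Tester module: checks the validity of the Cookies on a schedule.
API module: exposes an API endpoint that returns a random Cookie.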

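5.2 Workflow

Build the pool: maintain multiple accounts together with their Cookies.
Test on a schedule: check the Cookies periodically and replace the ones that have expired.
Use at random: each request uses a randomly chosen account from the pool.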
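
The workflow can be sketched end to end in memory before any Redis or network code is involved. In this illustration a dict stands in for Redis, and the login and validity checks are injected as plain functions; the class name, method names, and fake tokens are all assumptions made for the example.

```python
import itertools
import random


class InMemoryAccountPool:
    """Toy account pool: dicts stand in for Redis (illustration only)."""

    def __init__(self, accounts, login, is_valid):
        self.accounts = dict(accounts)  # username -> password
        self.credentials = {}           # username -> Cookie string
        self.login = login              # callable(username, password) -> Cookie string
        self.is_valid = is_valid        # callable(cookie) -> bool

    def generate_missing(self):
        """Getter step: log in every account that has no stored Cookie."""
        for username, password in self.accounts.items():
            if username not in self.credentials:
                self.credentials[username] = self.login(username, password)

    def drop_invalid(self):
        """Tester step: remove Cookies that no longer work."""
        for username, cookie in list(self.credentials.items()):
            if not self.is_valid(cookie):
                del self.credentials[username]

    def random_credential(self):
        """API step: hand out the Cookie of a randomly chosen account."""
        return random.choice(list(self.credentials.values()))


# Fake login and validity functions so the sketch runs without a network
counter = itertools.count()
expired = set()
pool = InMemoryAccountPool(
    {"user1": "pw1", "user2": "pw2"},
    login=lambda username, password: f"sessionid={username}-{next(counter)}",
    is_valid=lambda cookie: cookie not in expired,
)
pool.generate_missing()             # user1 gets token 0, user2 gets token 1
expired.add("sessionid=user1-0")    # pretend user1's session has expired
pool.drop_invalid()                 # user1's Cookie is removed
pool.generate_missing()             # user1 logs in again and gets token 2
print(pool.credentials)
print(pool.random_credential())
```

In a real pool the getter and tester steps would run repeatedly on a timer, and the random pick would be served over HTTP.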

5.3 Implementing the storage module

import redis
import random

class RedisClient:
    """Thin wrapper around one Redis hash that maps username -> Cookie string."""

    KEY = 'account:antispider6'  # name of the Redis hash

    def __init__(self, host='localhost', port=6379, password='', decode_responses=True):
        # decode_responses=True makes Redis return str instead of bytes
        self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=decode_responses)

    def set(self, username, value):
        return self.db.hset(self.KEY, username, value)

    def get(self, username):
        return self.db.hget(self.KEY, username)

    def delete(self, username):
        return self.db.hdel(self.KEY, username)

    def count(self):
        return self.db.hlen(self.KEY)

    def random(self):
        # random.choice raises IndexError if the hash is empty
        return random.choice(list(self.db.hvals(self.KEY)))

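5.4 Implementing the getter module

import requests
from accountpool.storages.redis import RedisClient

# BaseGenerator, LOGIN_URL, and self.credential_operator are not defined in this
# article; they are expected to come from the surrounding accountpool project.
class Antispider6Generator(BaseGenerator):
    def generate(self, username, password):
        # Log in with a Session so every cookie set along the way is collected
        session = requests.Session()
        response = session.post(LOGIN_URL, data={
            "username": username,
            "password": password
        })
        # Serialize the cookies as a single 'name=value; name=value' string
        cookies = '; '.join([f'{cookie.name}={cookie.value}' for cookie in session.cookies])
        # Store the Cookie string under the account's username
        self.credential_operator.set(username, cookies)
        return cookies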

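5.5 Implementing the tester module

import requests
from accountpool.storages.redis import RedisClient

# BaseTester and TEST_URL are not defined in this article; they are expected
# to come from the surrounding accountpool project.
class Antispider6Tester(BaseTester):
    def test(self, username, credential):
        try:
            # allow_redirects=False: if an invalid Cookie gets redirected to the
            # login page, following the redirect would end in a 200 and make
            # the Cookie look valid.
            response = requests.get(TEST_URL, headers={
                'Cookie': credential
            }, timeout=10, allow_redirects=False)
            return response.status_code == 200
        except requests.RequestException:
            # Connection errors and timeouts count as a failed check
            return False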
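
A status-code check of this kind is only meaningful when redirects are not followed. The sketch below illustrates why with a made-up local server that redirects every protected page to /login, the way a site might treat an expired Cookie (that behavior is an assumption here). It uses urllib from the standard library rather than requests so it runs without extra packages; in requests the equivalent switch is the allow_redirects argument. All class and function names are illustrative.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Toy site: every page except /login redirects to /login."""

    def log_message(self, *args):  # keep the demo output quiet
        pass

    def do_GET(self):
        if self.path == "/login":
            body = b"login form"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(302)
            self.send_header("Location", "/login")
            self.send_header("Content-Length", "0")
            self.end_headers()


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # None tells urllib not to follow; it raises HTTPError instead


def status_of(url, follow_redirects):
    """Return the status code seen with or without following redirects."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore proxy settings for localhost
    if not follow_redirects:
        handlers.append(NoRedirect())
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_redirect_demo():
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/page/1"
    try:
        return status_of(url, follow_redirects=True), status_of(url, follow_redirects=False)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_redirect_demo())  # (200, 302)
```

With redirects followed, the rejected request ends on the login page with a 200, so an expired Cookie would pass a plain status check; with redirects disabled, the 302 exposes it.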

5.6 Implementing the API module

from flask import Flask, g
from accountpool.storages.redis import RedisClient

app = Flask(__name__)

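@app.route('/antispider/random')
def get_random_credential():
    # Create the Redis client once per request context and keep it on g
    if not hasattr(g, 'credential_operator'):
        g.credential_operator = RedisClient()
    # Return one randomly chosen Cookie string as the response body
    return g.credential_operator.random()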
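
On the crawler side, using the pool comes down to one GET against this endpoint followed by placing the returned string in the Cookie header. The sketch below shows that under stated assumptions: FakePoolAPI is a made-up stand-in for the Flask app that always returns the same invented Cookie, the helper names are hypothetical, and urllib is used instead of requests so the example runs without a Redis or Flask setup.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakePoolAPI(BaseHTTPRequestHandler):
    """Stand-in for the Flask endpoint; always returns the same made-up Cookie."""

    def log_message(self, *args):  # keep the demo output quiet
        pass

    def do_GET(self):
        found = self.path == "/antispider/random"
        body = b"sessionid=abc123" if found else b"not found"
        self.send_response(200 if found else 404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def fetch_credential(api_base):
    """Ask the pool API for one random Cookie string."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(api_base + "/antispider/random") as response:
        return response.read().decode()


def headers_for_crawl(api_base):
    """Build request headers that carry a Cookie borrowed from the pool."""
    return {"Cookie": fetch_credential(api_base)}


def run_client_demo():
    server = HTTPServer(("127.0.0.1", 0), FakePoolAPI)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return headers_for_crawl(f"http://127.0.0.1:{server.server_port}")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_client_demo())  # {'Cookie': 'sessionid=abc123'}
```

Fetching a fresh credential per request, rather than once per run, is what spreads the load across accounts.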