Python Native Crawler Tutorial: HTTP Requests and Responses

Json19970108018

Published 2025-05-14 08:49:39

Category: Python Native Crawler Tutorial. Tags: python, crawler, http

Article link: https://blog.csdn.net/2510_91865210/article/details/147940188

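HTTP Request and Response Basics

HTTP is the foundation of web crawling. The client (a browser or a Python program) sends a request to the server, and the server returns a response. A complete HTTP request consists of:

Request line (e.g. GET /index.html HTTP/1.1)
Request headers (e.g. User-Agent, Content-Type)
Request body (optional; e.g. the data of a POST request)

A response consists of:

Status line (e.g. HTTP/1.1 200 OK)
Response headers (e.g. Content-Type, Set-Cookie)
Response body (HTML, JSON, or other content)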
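
The three-part layout above can be made concrete by splitting a raw message by hand. The following is a minimal sketch, not a full HTTP parser: the helper name `split_http_message` and both sample messages are made up for illustration, and it assumes simple HTTP/1.1 text with CRLF line endings and no repeated or folded headers.

```python
def split_http_message(raw: str):
    """Split a raw HTTP/1.1 message into (start_line, headers, body)."""
    # A blank line (CRLF CRLF) separates the header section from the body
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header line has the form "Name: value"
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Hypothetical sample request: request line, headers, blank line, body
raw_request = (
    "POST /login HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: Mozilla/5.0\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "\r\n"
    "user=alice&page=1"
)

# Hypothetical sample response: status line, headers, blank line, body
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Set-Cookie: sessionid=abc123\r\n"
    "\r\n"
    "<html>...</html>"
)

request_line, request_headers, request_body = split_http_message(raw_request)
method, path, version = request_line.split(" ")

status_line, response_headers, response_body = split_http_message(raw_response)
http_version, status_code, reason = status_line.split(" ", 2)

print(method, path, version)
print(status_code, reason)
print(response_headers["Content-Type"])
```

In practice a library such as urllib does this parsing for you; the sketch only shows where the request line, headers, and body sit in the raw text.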

Sending HTTP Requests with urllib

Python's built-in urllib library provides basic HTTP request functionality:


from urllib import request, parse
from urllib.error import HTTPError, URLError
import json

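# 1. Send a GET request
def send_get_request(url, params=None):
    try:
        # Append query parameters to the URL
        if params:
            query_string = parse.urlencode(params)
            url = f"{url}?{query_string}"

        # Create the request object
        req = request.Request(url)

        # Add a request header (mimic a browser)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')

        # Send the request and read the response
        with request.urlopen(req, timeout=10) as response:
            # Response status code
            status_code = response.status
            # Response headers
            headers = dict(response.headers.items())
            # Response body (bytes)
            content = response.read()
            # Decode with the charset declared in Content-Type, falling back
            # to UTF-8 (some Chinese sites use GBK instead)
            charset = response.headers.get_content_charset() or 'utf-8'
            html = content.decode(charset, errors='replace')

            print(f"Status code: {status_code}")
            print(f"Content type: {headers.get('Content-Type')}")
            return html

    except HTTPError as e:
        print(f"HTTP error: {e.code} - {e.reason}")
    except URLError as e:
        print(f"URL error: {e.reason}")
    except Exception as e:
        print(f"Other error: {e}")
    # Reached only when the request failed
    return None

# Example: fetch the Baidu home page
baidu_html = send_get_request('https://www.baidu.com')
if baidu_html is not None:
    print(baidu_html[:200])  # print the first 200 characters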

# 2. Send a POST request
def send_post_request(url, data):
    try:
        # URL-encode the dict and convert it to bytes
        post_data = parse.urlencode(data).encode('utf-8')

        # Create the POST request object
        req = request.Request(url, data=post_data, method='POST')

        # Add request headers
        req.add_header('Content-Type', 'application/x-www-form-urlencoded')
        req.add_header('User-Agent', 'Mozilla/5.0')

        # Send the request (with a timeout so it cannot hang forever)
        with request.urlopen(req, timeout=10) as response:
            content = response.read().decode('utf-8')
            return content

    except Exception as e:
        print(f"Request failed: {e}")
    # Reached only when the request failed
    return None

# Example: send a POST request to the JSONPlaceholder API
api_url = 'https://jsonplaceholder.typicode.com/posts'
post_data = {
    'title': 'Crawler test',
    'body': 'This is test content',
    'userId': 1
}

response_data = send_post_request(api_url, post_data)
if response_data is not None:
    print(json.dumps(json.loads(response_data), indent=2))  # pretty-print the JSON
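
The POST example above sends form-encoded data. Many APIs expect a JSON body instead, and with urllib that only changes how the bytes are produced and which Content-Type is declared. The sketch below builds such a request without sending it; the helper name `build_json_post` is hypothetical, and whether a given server accepts JSON is an assumption you would need to check against its documentation.

```python
import json
from urllib import request


def build_json_post(url, payload):
    """Build (but do not send) a POST request whose body is JSON."""
    # Serialize the payload to JSON text, then to bytes
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req = build_json_post(
    "https://jsonplaceholder.typicode.com/posts",
    {"title": "Crawler test", "userId": 1},
)

print(req.get_method())
# urllib normalizes stored header names with str.capitalize(),
# so the lookup key is "Content-type"
print(req.get_header("Content-type"))
print(req.data)
```

To actually send it, pass the object to `request.urlopen(req, timeout=10)` inside a `with` block, as in the functions above.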

Handling Response Content

Once you have a response, common ways to process it include:


# 1. Parse a JSON response
def parse_json_response(url):
    # Use a context manager so the connection is closed, and set a timeout
    with request.urlopen(url, timeout=10) as response:
        data = json.loads(response.read().decode('utf-8'))
    return data

# Example: fetch data from the GitHub API
github_data = parse_json_response('https://api.github.com/users/octocat')
print(f"GitHub user: {github_data['login']}, followers: {github_data['followers']}")

# 2. Handle a binary response (such as an image or file)
def download_file(url, filename):
    try:
        with request.urlopen(url, timeout=10) as response, open(filename, 'wb') as out_file:
            # Read and write in chunks so large files are never fully in memory
            while True:
                chunk = response.read(1024)
                if not chunk:
                    break
                out_file.write(chunk)
        print(f"File {filename} downloaded")
    except Exception as e:
        print(f"Download failed: {e}")

# Example: download an image
download_file('https://picsum.photos/200/300', 'example.jpg')
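
Text responses are not always UTF-8; Chinese sites in particular may serve GBK, and some servers declare no charset at all. One possible approach is to try the charset declared in Content-Type first and then fall back to a short list of likely encodings. The helper `decode_body` below is a hypothetical sketch of that idea, and the fallback order (UTF-8, then GBK) is an assumption that suits Chinese-language pages rather than a general rule.

```python
from email.message import Message


def decode_body(content: bytes, content_type: str = "") -> str:
    """Decode response bytes using the declared charset, then common fallbacks."""
    candidates = []
    if content_type:
        # Reuse the standard library's header parser to extract "charset=..."
        msg = Message()
        msg["Content-Type"] = content_type
        declared = msg.get_content_charset()
        if declared:
            candidates.append(declared)
    # Assumed fallback order; adjust for the sites you crawl
    candidates += ["utf-8", "gbk"]

    for encoding in candidates:
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown encoding name: try the next one
            continue
    # Last resort: keep what can be decoded and mark the rest
    return content.decode("utf-8", errors="replace")


gbk_bytes = "中文页面".encode("gbk")
print(decode_body(gbk_bytes, "text/html; charset=GBK"))
print(decode_body(gbk_bytes))  # no declared charset: falls back to GBK
print(decode_body("你好".encode("utf-8")))
```

With a live response you would call it as `decode_body(response.read(), response.headers.get('Content-Type', ''))`. A fallback can still guess wrong on short or ambiguous byte sequences, so a declared charset should be preferred whenever it is present.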

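Handling Request Headers and Cookies

Request headers and cookies matter a great deal in crawling: headers let a program mimic browser behavior, and cookies keep session state across requests: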


from http.cookiejar import CookieJar

# Create an opener that stores cookies and sends them back automatically
cj = CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cj))

# Add common request headers
# (no Connection header here: urllib always sends "Connection: close")
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'zh-CN,zh;q=0.9,en;q=0.8'),
]

# Send a request through the opener
try:
    with opener.open('https://www.example.com', timeout=10) as response:
        html = response.read().decode('utf-8')
    # List the cookies the server set (prints nothing if it set none)
    for cookie in cj:
        print(f"Cookie: {cookie.name}={cookie.value}")
except Exception as e:
    print(f"Request failed: {e}")
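
To see what the cookie jar is working with, it can help to parse a Set-Cookie header by hand. The sketch below uses the standard library's `http.cookies.SimpleCookie`; the header value is a made-up example, and building the `Cookie` request header by joining name=value pairs is a simplification that ignores domain, path, and expiry matching, which `CookieJar` handles for you.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value as a server might send it
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

# Each cookie is stored as a Morsel: a value plus its attributes
morsel = cookie["sessionid"]
print(morsel.value)
print(morsel["path"])
print(bool(morsel["httponly"]))

# What the client sends back on later requests: only name=value pairs
cookie_header = "; ".join(f"{name}={m.value}" for name, m in cookie.items())
print(cookie_header)
```

The resulting string could be attached manually with `req.add_header('Cookie', cookie_header)`, although the opener with `HTTPCookieProcessor` shown above is usually the simpler route.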