- mitmproxy can capture requests that are not initiated by a browser (many other capture tools can too, but for convenience I normally only use the browser's developer tools)
- mitmproxy can feed captured HTTP requests directly into Python for editing. For example, you can write a script that turns captured packets into requests or scrapy.Request objects, or that saves the packets to a queue in request order and wraps them with the requests library later, which is convenient for crawler development (see the replay sketch at the end of this post).
Installing mitmproxy
I installed it directly with pip:
pip install mitmproxy
During installation I ran into a problem with ruamel-yaml:
Pip subprocess error:
ERROR: Cannot uninstall 'ruamel-yaml'. It is a distutils installed project and thus we cannot
accurately determine which files belong to it which would lead to only a partial uninstall.
After some Googling, I found a fix: simply delete the ruamel-related files inside the anaconda directory:
Windows path: anaconda/Lib/site-packages/ruamel*
Linux path: anaconda3/lib/python3.8/site-packages/ruamel*
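On Linux, for example, the deletion boils down to something like this (adjust the prefix to wherever your anaconda lives):
rm -rf anaconda3/lib/python3.8/site-packages/ruamel*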
Using mitmproxy
Installing mitmproxy provides three executables: mitmproxy, mitmdump, and mitmweb; just type the name in the console to run one.
mitmproxy provides an interactive shell-style capture interface, but it can only be used on Linux.
mitmdump captures traffic in the background, without an interface.
mitmweb opens a visual capture interface in the default browser.
The options I use most often:
-w specifies the output file
-s specifies a script to execute while capturing
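For example, a background capture that both writes flows to a file and runs a script might look like this (flows.out is a placeholder file name):
mitmdump -w flows.out -s gid.py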
With the capture proxy running, open http://mitm.it/ in the browser and install the SSL certificate, so that HTTPS traffic can also be captured.
When capturing with mitmproxy, I found that many requests came back with 413 errors; the fix I found was to add the --set http2=false option when capturing, i.e.:
mitmweb.exe -s .\gid.py --set http2=false
mitmproxy scripts
The hooks below are basically all I need. Each function corresponds to a different stage of an HTTP request; the script executes the matching function when a request reaches that stage, and you can modify the request (or response) from inside these functions:
import mitmproxy.http
import pickle
import os
import json

class GetSeq:
    def __init__(self, domains=None, url_pattern=None):
        self.num = 1
        self.dirpath = "./flows/"
        if not os.path.exists(self.dirpath):
            os.mkdir(self.dirpath)
        # avoid a mutable default argument for the domain filter
        self.domains = domains if domains is not None else []
        self.url_pattern = url_pattern

    def http_connect(self, flow: mitmproxy.http.HTTPFlow):
        """
        An HTTP CONNECT request was received. Setting a non 2xx response on
        the flow will return the response to the client and abort the
        connection. CONNECT requests and responses do not generate the usual
        HTTP handler events. CONNECT requests are only valid in regular and
        upstream proxy modes.
        """

    def requestheaders(self, flow: mitmproxy.http.HTTPFlow):
        """
        HTTP request headers were successfully read. At this point, the body
        is empty.
        """

    def request(self, flow: mitmproxy.http.HTTPFlow):
        """
        The full HTTP request has been read.
        """

    def responseheaders(self, flow: mitmproxy.http.HTTPFlow):
        """
        HTTP response headers were successfully read. At this point, the body
        is empty.
        """

    def response(self, flow: mitmproxy.http.HTTPFlow):
        """
        The full HTTP response has been read.
        """
        # Adapt the saving code here to your own needs; it is only a reference.
        def save_flow():
            fname = "{}flow-{:0>3d}-{}.pkl".format(self.dirpath, self.num, flow.request.host)
            # persist the request/response pair, numbered in capture order
            with open(fname, "wb") as pkl:
                pickle.dump({
                    "num": self.num,
                    "request": flow.request,
                    "response": flow.response
                }, pkl)
            log_data = dict(
                num=self.num,
                url=flow.request.url,
                fname=fname
            )
            # append one JSON line per saved flow to an index log
            with open("flow_que.log", "a+", encoding="utf8") as f:
                f.write(json.dumps(log_data) + "\n")
            self.num += 1

        # Add your own filtering here; each matching flow is saved exactly once.
        if flow.request.headers.get("content-type", None) == "application/json":
            save_flow()
        elif len(self.domains) == 0:
            save_flow()
        elif any(domain in flow.request.url for domain in self.domains):
            save_flow()

    def error(self, flow: mitmproxy.http.HTTPFlow):
        """
        An HTTP error has occurred, e.g. invalid server responses, or
        interrupted connections. This is distinct from a valid server HTTP
        error response, which is simply a response with an HTTP error code.
        """

addons = [
    GetSeq(
        domains=[
            "baidu.com",
        ],
        url_pattern=None,
    )
]
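As a complement, here is a minimal sketch of how the saved flows could later be replayed in capture order with the requests library. It assumes the flow_que.log index and the ./flows/ pickles produced by the script above, and that mitmproxy is importable wherever the pickles are loaded (unpickling the saved objects needs it); adjust the request reconstruction to your own needs.

import json
import pickle
import requests

# Walk the JSON-lines index written by GetSeq and re-issue each request.
with open("flow_que.log", encoding="utf8") as f:
    for line in f:
        entry = json.loads(line)
        with open(entry["fname"], "rb") as pkl:
            saved = pickle.load(pkl)  # {"num": ..., "request": ..., "response": ...}
        req = saved["request"]  # a mitmproxy Request object
        # rebuild the call with requests; headers and body are reused verbatim
        resp = requests.request(
            method=req.method,
            url=req.url,
            headers=dict(req.headers),
            data=req.content,
        )
        print(entry["num"], req.url, resp.status_code)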