Python爬取MidJourney历史图片【仅供参考学习使用】

YoungSoulwt

已于 2023-07-28 13:41:02 修改

阅读量653

点赞数 1

分类专栏：爬虫文章标签： python midjourney 学习爬虫

于 2023-07-21 16:24:26 首次发布

本文链接：https://blog.csdn.net/YoungSoulwt/article/details/131854646

版权

爬虫专栏收录该内容

1 篇文章 0 订阅

订阅专栏

1、需求概要

使用MidJourney时，

在https://www.midjourney.com/app/这里有接口https://www.midjourney.com/api/app/recent-jobs/?amount=35&dedupe=true&jobStatus=completed&jobType=upscale&orderBy=new&page=3&prompt=undefined&refreshApi=0&searchType=advanced&service=null&toDate=2023-06-16+09%3A50%3A17.379092&type=all&userId=b12e169c-f609-4fd6-b917-11c2deaa8cff&user_id_ranked_score=null&_ql=todo&_qurl=https%3A%2F%2F www.midjourney.com%2Fapp%2F，返回的json串中有图片的相关信息，要把整个json串都存下来。

2、实现思路

（1）在翻页的时候，接口中的page值会发生变化，所以可以通过循环遍历page的值来实现爬取页面信息，目前发现page最大的值为50，所以可以 for page in range(50)

（2）接口中还有很多属性，如下：

属性名	值	猜测作用
amount	35	一页包含的图片数量（目测最大50）
dedupe	true	去重（是）
jobStatus	completed	图片状态（完成）
jobType	upscale	图片类型（高级），对应页面上的All，Grids（网格状），Upscales
orderBy	new	排序（按最新），对应页面上的Hot，New，Top，Favorited
page	3	当前页码（目测最大为50）
prompt	undefined	提示（未定义）
refreshApi	0
searchType	advanced
service	null
toDate	2023-06-16 09:50:17.379092	这个等会得细看下，这个属性貌似只有orderby为new的时候会有
type	all
userId	b12e169c-f609-4fd6-b917-11c2deaa8cff
user_id_ranked_score	null
_ql	todo
_qurl	https://www.midjourney.com/app/
isPublished	true	这个对应页面上的ispublished选项（感觉不用特殊设置）

通过改变某些属性的值也许能找到相对正确的爬取策略

（3）可以简单地将爬取到的大量json串保存在txt中，等待后续的处理

3、具体实现

通过研究发现，orderby使用new或者enqueue_time的时候，再加上toDate参数，就会返回固定的json串，page最大值为50，amount最大值为50，也就是说，每次循环50次page最多爬取到50*50=2500张图，然后再取最后得到的图片的enqueue_time为新的toDate，进行下一轮的50次循环爬取，如此反复，就可以实现将某日前的所有历史图片数据全部爬取下来。注意每次爬取间睡个几秒，免得被封掉。

4、代码

import json
import requests
import random
import time


#
#                               _ooOoo_
#                              o8888888o
#                              88" . "88
#                              (| -_- |)
#                              O\  =  /O
#                           ____/`---'\____
#                         .'  \\|     |//  `.
#                        /  \\|||  :  |||//  \
#                       /  _||||| -:- |||||-  \
#                       |   | \\\  -  /// |   |
#                       | \_|  ''\---/''  |   |
#                       \  .-\__  `-`  ___/-. /
#                     ___`. .'  /--.--\  `. . __
#                  ."" '<  `.___\_<|>_/___.'  >'"".
#                 | | :  `- \`.;`\ _ /`;.`/ - ` : | |
#                 \  \ `-.   \_ __\ /__ _/   .-` /  /
#            ======`-.____`-.___\_____/___.-`____.-'======
#                               `=---='
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#                       佛祖保佑        永无bug
#                       成功爬取        不会被ban


# 构建请求头：
headers = {
                # 这里需要自己构建请求头
}
# 切分url
big_pre_url = f"https://www.midjourney.com/api/app/recent-jobs/?amount=50&dedupe=true&jobStatus=completed&orderBy=enqueue_time&page="
# 维护一个数组，用于循环
# 需要填入开始爬取的日期
# 可以使用datatime操作时间参数，这里简单点就用字符串就可以
toDate = ['']
# 创建一个标记符，用于控制循环结束
tag = True
# 开始循环爬取
print("开始爬取")
while tag:
    try:
        # 拆分url
        small_pre_url = f"&prompt=undefined&refreshApi=0&searchType=advanced&service=null&toDate=2023-"
        small_after_url = f"+07%3A21%3A03.591348&type=all&userId=b12e169c-f609-4fd6-b917-11c2deaa8cff&user_id_ranked_score=null&_ql=todo&_qurl=https%3A%2F%2Fwww.midjourney.com%2Fapp%2F"
        # 从维护的数组中更新url，进行循环爬取
        big_after_url = small_pre_url + toDate[-1] + small_after_url
        # 内部循环page从0到50
        for page in range(1, 51):
            print("正在爬取toDate为：" + str(toDate[-1]) + "，page为：" + str(page))
            # 睡一小会，免得号没了
            time.sleep(random.randint(5, 10))
            # 拼接url
            url = big_pre_url + str(page) + big_after_url
            # 发出请求得到响应
            response = requests.get(url=url, headers=headers).json()
            # json入库
            with open("images_json.txt", "a+", encoding="utf-8") as f:
                f.write(str(response) + '\n')
            # 维护下一个循环
            new_toDate = response[49]["enqueue_time"][5:10:]
        # 将最新的数据插到toDate中用于下一轮循环
        toDate.append(new_toDate)
    except Exception as e:
        tag = False
        print("-----------------出现异常，终止循环-----------------")
        print("异常信息为：" + e)
        print("当前 toDate：" + str(toDate[-1]) + "当前page：" + str(page))