Background
Recently I noticed my wallpaper stash was running seriously low, so a need was born: I want lots of images, and good-looking ones.
So I turned to the all-powerful Baidu... Images.
That's it, that's the one, our good helper: Baidu Images!
Now, down to business. Below I'll walk through scraping images from Baidu Images, show how the results are loaded dynamically, and attach the source code.
First, enter what you want to search for. I'll use orange cats as my example (because I really love orange cats!!!):
- Adorable, aren't they 😍😍😍😍😍
- With this many cute orange cats around, I'm taking them all home, hehe (●ˇ∀ˇ●)
Let's get to work
Same old routine:
- Press F12 to open the developer tools, then switch to the Network tab.
- You can see that the response to this request is the image data, including each image's URL, the caption displayed with it, and so on.
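For orientation, the response body looks roughly like the sketch below. This is made-up illustrative data, not a real response: the field names are exactly the ones the downloader code later in this article reads, and every other field is omitted.

```python
# Illustrative (made-up) response shape; field names match what the
# downloader reads, all other fields are omitted.
sample_response = {
    "data": [
        {
            "thumbURL": "https://img1.baidu.com/it/u=12345",  # image URL
            "fromPageTitleEnc": "可爱的橘猫",                  # caption shown with the image
            "type": "jpg",                                     # image format
        },
        # ...one dict per image, 30 per page
    ]
}

# The downloader pulls out these fields per entry:
for pic in sample_response["data"]:
    print(pic.get("thumbURL", ""), pic.get("fromPageTitleEnc", ""))
```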
Inspect that request:
- The request URL is:
major_url = 'https://image.baidu.com/search/index?'
- The query string with the parameters follows (passed via GET).
- Comparing several requests, only the queryWord, word, pn, and gsm values change between them; the full parameter set is:
"tn": "resultjson_com",
"logid": "11587207680030063767",
"ipn": "rj",
"ct": "201326592",
"is": "",
"fp": "result",
"queryWord": '<the search keyword>',
"cl": "2",
"lm": "-1",
"ie": "utf-8",
"oe": "utf-8",
"adpicid": "",
"st": "-1",
"z": "",
"ic": "0",
"hd": "",
"latest": "",
"copyright": "",
"word": '<the search keyword>',
"s": "",
"se": "",
"tab": "",
"width": "",
"height": "",
"face": "0",
"istype": "2",
"qc": "",
"nc": "1",
"fr": "",
"expermode": "",
"force": "",
"pn": <offset of the first image on the page; 30 images per page>,
"rn": "30",
"gsm": <the pn offset written in hexadecimal>,
"1602481599433": ""
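As a sketch of how the varying parameters combine into a request URL, here is a minimal helper. Only the changing fields are shown (the fixed ones from the list above would be merged into the same dict), and it assumes `gsm` is simply `pn` rendered in hexadecimal:

```python
import urllib.parse as up

major_url = 'https://image.baidu.com/search/index?'

def page_url(kw, page):
    """Build the request URL for one page (30 images) of results.

    Only the parameters that vary between requests are included here;
    the fixed ones from the list above would be merged the same way.
    """
    pn = page * 30                  # offset of the first image on this page
    params = {
        "queryWord": kw,
        "word": kw,
        "pn": pn,
        "rn": "30",
        "gsm": format(pn, 'x'),     # pn in hexadecimal, e.g. 30 -> '1e'
    }
    return major_url + up.urlencode(params)

print(page_url('橘猫', 1))
```

`urlencode` also takes care of percent-encoding the Chinese keyword for us.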
- With the request URL in hand, we just tweak the parameters: the search keyword can be swapped out and the number of pages chosen. Once the response comes back, parsing it as JSON gives us the image data we're after.
Implementation
- The main course:
# -*- coding=utf-8 -*-
# @Time : 2020/12/18 19:24
# @Author : lhys
# @FileName: baidu.py
import requests
import urllib.parse as up
import json
import time
import os
major_url = 'https://image.baidu.com/search/index?'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
def pic_spider(kw, page=10, file_path=os.getcwd()):
    path = os.path.join(file_path, kw)
    os.makedirs(path, exist_ok=True)
    if kw == '':
        return
    for num in range(page):
        data = {
            "tn": "resultjson_com",
            "logid": "11587207680030063767",
            "ipn": "rj",
            "ct": "201326592",
            "is": "",
            "fp": "result",
            "queryWord": kw,
            "cl": "2",
            "lm": "-1",
            "ie": "utf-8",
            "oe": "utf-8",
            "adpicid": "",
            "st": "-1",
            "z": "",
            "ic": "0",
            "hd": "",
            "latest": "",
            "copyright": "",
            "word": kw,
            "s": "",
            "se": "",
            "tab": "",
            "width": "",
            "height": "",
            "face": "0",
            "istype": "2",
            "qc": "",
            "nc": "1",
            "fr": "",
            "expermode": "",
            "force": "",
            "pn": num * 30,                 # offset of the first image on this page
            "rn": "30",                     # 30 images per page
            "gsm": format(num * 30, 'x'),   # the same offset, in hexadecimal
            "1602481599433": ""
        }
        url = major_url + up.urlencode(data)
        pic_list = []
        for attempt in range(5):            # retry up to 5 times on a bad connection
            try:
                pic_list = requests.get(url=url, headers=headers).json().get('data') or []
                break
            except (requests.RequestException, ValueError):
                print('Bad connection, retrying...')
                time.sleep(1.3)
        for pic in pic_list:
            pic_url = pic.get('thumbURL', '')   # some entries carry no image URL; skip them
            if pic_url == '':
                continue
            name = pic.get('fromPageTitleEnc', '')
            for char in ['?', '\\', '/', '*', '"', '|', ':', '<', '>']:
                name = name.replace(char, '')   # strip characters not allowed in file names
            pic_type = pic.get('type', 'jpg')   # image format; default to jpg when missing
            pic_path = os.path.join(path, '%s.%s' % (name, pic_type))
            if not os.path.exists(pic_path):
                with open(pic_path, 'wb') as f:
                    f.write(requests.get(url=pic_url, headers=headers).content)
            print(name, 'downloaded')

pic_spider('刀剑神域')
- Let's see the results:
- Looks pretty good!
Note!!!
- Here the saved file names come from the titles Baidu provides; you can change this however you like, for example by using a running index as the file name.
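Naming by index could look like this minimal sketch. `numbered_name` is a hypothetical helper, and the running counter `count` would live in the download loop of the code above:

```python
# Hypothetical helper: name saved images by a running index instead of
# the Baidu-provided title.
def numbered_name(index, pic_type='jpg'):
    """Zero-padded sequential file name, e.g. 7 -> '00007.jpg'."""
    return '%05d.%s' % (index, pic_type)

# In the download loop, the title-based name would be replaced with:
#     pic_path = os.path.join(path, numbered_name(count))
#     count += 1
print(numbered_name(7))    # 00007.jpg
```

Zero-padding keeps the files sorted in download order in the file browser.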
Closing
That's everything I wanted to share. My knowledge is still limited, so there are bound to be shortcomings; corrections from the experts are very welcome.
If you have questions, feel free to leave them in the comments.