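Today, out of curiosity, I scraped Baidu Images, hoping to fetch only the pictures that match chosen criteria such as size and HD quality. My analysis follows.
The example below searches for 迪丽热巴 (Dilraba Dilmurat) with the HD and extra-large size [1600*2860 pixels] filters selected.
If the images are fetched directly, the size is wrong: the result is only 500 x 444 pixels.
Analysis of the Baidu Images site shows that it loads image data dynamically with Ajax: when the page is scrolled down, the browser sends another request to the server, which returns data in JSON format for the front end to load and render. To fetch many images, you therefore need to construct URLs for multiple pages.
For a programmer, Google's Chrome browser is a handy development tool: it provides basic 'packet capture', showing both the requests the browser sends to the server and the responses the server sends back. Data transferred via Ajax can be observed as follows:
Pick a few of these requests and look for a pattern: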
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=9&ic=0&hd=1&latest=0&copyright=0&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&s=&se=&tab=&width=0&height=0&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=30&rn=30&gsm=1e&1565249894421=
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=9&ic=0&hd=1&latest=0&copyright=0&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&s=&se=&tab=&width=0&height=0&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=60&rn=30&gsm=3c&1565249897307=
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=9&ic=0&hd=1&latest=0&copyright=0&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&s=&se=&tab=&width=0&height=0&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn=90&rn=30&gsm=5a&1565249897413=
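Comparing these requests:
queryWord: the search keyword; Chinese text is URL-encoded.
word: also the search keyword; Chinese text is URL-encoded as well.
hd: a value of 1 means HD.
pn: selects the page, expressed as the offset of its first result. Page 2 is (2-1)*30 = 30 and page 3 is (3-1)*30 = 60.
rn: the number of images shown per page.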
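The paging rule above can be sketched in a few lines. This is a minimal sketch, not the site's documented API: `build_page_url` is a hypothetical helper, and it sends only a subset of the observed parameters, so whether the server accepts the reduced set is an assumption that should be checked against a real request.

```python
from urllib.parse import urlencode

BASE_URL = 'https://image.baidu.com/search/acjson'


def build_page_url(keyword, page, per_page=30):
    """Return the acjson URL for a 1-based page number (hypothetical helper)."""
    params = {
        'tn': 'resultjson_com',
        'ipn': 'rj',
        'queryWord': keyword,             # urlencode percent-encodes Chinese text as UTF-8
        'word': keyword,
        'ie': 'utf-8',
        'oe': 'utf-8',
        'z': 9,                           # value seen in the observed requests
        'hd': 1,                          # 1 means HD
        'pn': (page - 1) * per_page,      # offset of the first result on the page
        'rn': per_page,                   # images per page
    }
    return BASE_URL + '?' + urlencode(params)


for page in (2, 3, 4):
    print(build_page_url('迪丽热巴', page))
```

Building the query with `urlencode` avoids hand-editing the long URL string and keeps the keyword encoding consistent between `queryWord` and `word`.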
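Pick one of these URLs and look at its data. Because the site uses Ajax, the data is in JSON format. For a more readable view, open https://www.json.cn/ and paste in the JSON data; the image fields are then displayed clearly:
The data above contains several image addresses:
Download the URLs stored in thumbURL, middleURL and hoverURL in turn and check the resulting images: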
thumbURL | middleURL | hoverURL |
500 x 890 pixels | 500 x 890 pixels | 500 x 890 pixels |
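The same comparison can be done without a JSON viewer. The sketch below is only illustrative: the sample response is hypothetical and heavily trimmed (the `example.com` URLs are placeholders), and `list_url_fields` is a helper name introduced here. It assumes each record in `data` is a flat object and that empty records can be skipped.

```python
import json

# Hypothetical, trimmed response. Real responses carry many more fields per record.
sample = json.dumps({
    'data': [
        {
            'thumbURL': 'https://example.com/thumb.jpg',
            'middleURL': 'https://example.com/middle.jpg',
            'hoverURL': 'https://example.com/hover.jpg',
            'objURL': 'ippr_z2C$qAzdH3FAzdH3Fk-ffs_z&e3B17tpwg2_z&e3Bv54',
        },
        {},  # assumed: an empty trailing record that should be ignored
    ]
})


def list_url_fields(text, fields=('thumbURL', 'middleURL', 'hoverURL', 'objURL')):
    """Return one dict per non-empty record, keeping only the URL fields."""
    rows = []
    for record in json.loads(text)['data']:
        if not record:
            continue
        rows.append({name: record.get(name) for name in fields})
    return rows


for row in list_url_fields(sample):
    for name, value in row.items():
        print(name, '->', value)
```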
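None of these three URLs points to the extra-large [1600*2860 pixels] image. The image downloaded after clicking through to the detail page does have the correct size, so the next step is to analyze the detail-page URLs: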
https://image.baidu.com/search/detail?ct=503316480&z=9&ipn=d&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&step_word=&hs=0&pn=1&spn=0&di=171930&pi=0&rn=1&tn=baiduimagedetail&is=0%2C0&istype=2&ie=utf-8&oe=utf-8&in=&cl=2&lm=-1&st=-1&cs=1678640081%2C3765101474&os=12190547%2C426740624&simid=3367237481%2C175217088&adpicid=0&lpn=0&ln=652&fr=&fmq=1565249787800_R&fm=result&ic=0&s=undefined&hd=1&latest=0&copyright=0&se=&sme=&tab=0&width=0&height=0&face=undefined&ist=&jit=&cg=&bdtype=0&oriquery=&objurl=http%3A%2F%2Fb-ssl.duitang.com%2Fuploads%2Fitem%2F201811%2F10%2F20181110124915_NuNw4.jpeg&fromurl=ippr_z2C%24qAzdH3FAzdH3Fooo_z%26e3B17tpwg2_z%26e3Bv54AzdH3Fks52AzdH3F%3Ft1%3D8a8m0mlcnm&gsm=1e&rpstart=0&rpnum=0&islist=&querylist=&force=undefined
https://image.baidu.com/search/detail?ct=503316480&z=9&ipn=d&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&step_word=&hs=0&pn=2&spn=0&di=56760&pi=0&rn=1&tn=baiduimagedetail&is=0%2C0&istype=2&ie=utf-8&oe=utf-8&in=&cl=2&lm=-1&st=-1&cs=1392268776%2C1636379493&os=8832284%2C1407244711&simid=3411309683%2C394722903&adpicid=0&lpn=0&ln=652&fr=&fmq=1565249787800_R&fm=result&ic=0&s=undefined&hd=1&latest=0&copyright=0&se=&sme=&tab=0&width=0&height=0&face=undefined&ist=&jit=&cg=&bdtype=0&oriquery=&objurl=http%3A%2F%2Fi3.bbs.fd.zol-img.com.cn%2Fg5%2FM00%2F00%2F00%2FChMkJ1mdq7yIPiWoAAbvPaQyHU0AAf7HADEgHkABu9V949.jpg&fromurl=ippr_z2C%24qAzdH3FAzdH3Fkkf_z%26e3Bz5s_z%26e3Bv54_z%26e3BvgAzdH3Ff3kkfAzdH3F180lc_m80aa_7t1_ywg23tg28ada_z%26e3Bip4s&gsm=1e&rpstart=0&rpnum=0&islist=&querylist=&force=undefined
https://image.baidu.com/search/detail?ct=503316480&z=9&ipn=d&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&step_word=&hs=0&pn=4&spn=0&di=90860&pi=0&rn=1&tn=baiduimagedetail&is=0%2C0&istype=2&ie=utf-8&oe=utf-8&in=&cl=2&lm=-1&st=-1&cs=1221790608%2C2201828612&os=4260874520%2C241316579&simid=3392209277%2C577512865&adpicid=0&lpn=0&ln=652&fr=&fmq=1565249787800_R&fm=result&ic=0&s=undefined&hd=1&latest=0&copyright=0&se=&sme=&tab=0&width=0&height=0&face=undefined&ist=&jit=&cg=&bdtype=0&oriquery=&objurl=http%3A%2F%2Fb-ssl.duitang.com%2Fuploads%2Fitem%2F201810%2F16%2F20181016031010_jrA8R.jpeg&fromurl=ippr_z2C%24qAzdH3FAzdH3Fooo_z%26e3B17tpwg2_z%26e3Bv54AzdH3Fks52AzdH3F%3Ft1%3D8a8n9b8lc0&gsm=1e&rpstart=0&rpnum=0&islist=&querylist=&force=undefined
Looking at the detail-page URLs of several images:
objurl: the source address of the image; the URL-encoded value that follows it is the image's real URL.
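To make this concrete, the `objurl` parameter can be read out of a detail-page URL with the standard library. This is a small sketch: `extract_objurl` is a helper name introduced here, and `detail_url` is a shortened copy of the first detail-page URL, keeping only a few of its parameters.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the first detail-page URL; most parameters are omitted.
detail_url = (
    'https://image.baidu.com/search/detail?ct=503316480&z=9&ipn=d'
    '&word=%E8%BF%AA%E4%B8%BD%E7%83%AD%E5%B7%B4&pn=1'
    '&objurl=http%3A%2F%2Fb-ssl.duitang.com%2Fuploads%2Fitem%2F201811'
    '%2F10%2F20181110124915_NuNw4.jpeg'
)


def extract_objurl(url):
    """Return the percent-decoded objurl parameter, or None if it is absent."""
    query = parse_qs(urlsplit(url).query)
    values = query.get('objurl')
    return values[0] if values else None


print(extract_objurl(detail_url))
```

`parse_qs` undoes the percent-encoding, so the printed value is a plain `http://` address.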
The objURL values in the JSON data, however, look like this:
ippr_z2C$qAzdH3FAzdH3Fk-ffs_z&e3B17tpwg2_z&e3Bv54AzdH3F7rs5w1fAzdH3Ftpj4AzdH3Fda8bacAzdH3Fa9AzdH3Fda8baca9ddcd88_pSVbt_z&e3B3rj2
ippr_z2C$qAzdH3FAzdH3Fk-ffs_z&e3B17tpwg2_z&e3Bv54AzdH3F7rs5w1fAzdH3Ftpj4AzdH3Fda8b88AzdH3F8aAzdH3Fda8b888a8d9l8c_N7No9_z&e3B3rj2
ippr_z2C$qAzdH3FAzdH3Ftn_z&e3Bkkf_z&e3Bu1_z&e3Bz5s-t42_z&e3Bv54_z&e3BvgAzdH3F2cAzdH3FMaaAzdH3FaaAzdH3FaaAzdH3FCiMhJ841q0yIPtW5AAkePwQyHUaAAu0HADE2HhAB7lVl9l_z&e3B3r2
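These are clearly not valid URLs: Baidu Images has encrypted the data, so the real URL has to be decoded.
1. First, the encoding table for the images:
The mapping data can be found in the response 'cores_ddf2bfa.js'.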
base_encode = {
    '_z2C$q': ':',
    '_z&e3B': '.',
    'AzdH3F': '/',
    'w': 'a',
    'k': 'b',
    'v': 'c',
    '1': 'd',
    'j': 'e',
    'u': 'f',
    '2': 'g',
    'i': 'h',
    't': 'i',
    '3': 'j',
    'h': 'k',
    's': 'l',
    '4': 'm',
    'g': 'n',
    '5': 'o',
    'r': 'p',
    'q': 'q',
    '6': 'r',
    'f': 's',
    'p': 't',
    '7': 'u',
    'e': 'v',
    'o': 'w',
    '8': '1',
    'd': '2',
    'n': '3',
    '9': '4',
    'c': '5',
    'm': '6',
    '0': '7',
    'b': '8',
    'l': '9',
    'a': '0',
    '-': '-'
}
2. Decode the fetched image addresses
def parse_strins(objurl):
    # Replace the multi-character tokens first
    if '_z2C$q' in objurl:
        objurl = objurl.replace('_z2C$q', ':')
    if '_z&e3B' in objurl:
        objurl = objurl.replace('_z&e3B', '.')
    if 'AzdH3F' in objurl:
        objurl = objurl.replace('AzdH3F', '/')
    # Then map the remaining characters one by one
    res = ''
    for s in objurl:
        if s in base_encode:
            res += base_encode[s]
        else:
            res += s
    return res
A few examples:
fromurls = [
    'ippr_z2C$qAzdH3FAzdH3Fk-ffs_z&e3B17tpwg2_z&e3Bv54AzdH3F7rs5w1fAzdH3Ftpj4AzdH3Fda8bacAzdH3Fa9AzdH3Fda8baca9ddcd88_pSVbt_z&e3B3rj2',
    'ippr_z2C$qAzdH3FAzdH3Fk-ffs_z&e3B17tpwg2_z&e3Bv54AzdH3F7rs5w1fAzdH3Ftpj4AzdH3Fda8b88AzdH3F8aAzdH3Fda8b888a8d9l8c_N7No9_z&e3B3rj2',
    'ippr_z2C$qAzdH3FAzdH3Ftn_z&e3Bkkf_z&e3Bu1_z&e3Bz5s-t42_z&e3Bv54_z&e3BvgAzdH3F2cAzdH3FMaaAzdH3FaaAzdH3FaaAzdH3FCiMhJ841q0yIPtW5AAkePwQyHUaAAu0HADE2HhAB7lVl9l_z&e3B3r2'
]
for u in fromurls:
    resUrl = parse_strins(u)
    print(resUrl)
'''
http://b-ssl.duitang.com/uploads/item/201805/04/20180504225211_tSV8i.jpeg
http://b-ssl.duitang.com/uploads/item/201811/10/20181110124915_NuNw4.jpeg
http://i3.bbs.fd.zol-img.com.cn/g5/M00/00/00/ChMkJ1mdq7yIPiWoAAbvPaQyHU0AAf7HADEgHkABu9V949.jpg
'''
The image addresses are decoded correctly, and the resulting images have the selected extra-large size.
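As an alternative formulation of the same substitution, the per-character pass can be expressed with `str.translate`. This is a sketch rather than the method used by the site's own script: `TOKENS`, `CHAR_TABLE` and `decode_objurl` are names introduced here, and the two strings passed to `str.maketrans` simply restate the single-character pairs of the mapping table in the same order.

```python
# Multi-character tokens must be replaced before the per-character pass,
# otherwise their letters would be translated and no longer match.
TOKENS = {'_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}

# Single-character pairs from the mapping table: encoded -> plain.
CHAR_TABLE = str.maketrans(
    'wkv1ju2it3hs4g5rq6fp7eo8dn9cm0bla',
    'abcdefghijklmnopqrstuvw1234567890',
)


def decode_objurl(encoded):
    """Decode an obfuscated objURL string into a plain URL."""
    for token, plain in TOKENS.items():
        encoded = encoded.replace(token, plain)
    return encoded.translate(CHAR_TABLE)


example = ('ippr_z2C$qAzdH3FAzdH3Fk-ffs_z&e3B17tpwg2_z&e3Bv54AzdH3F7rs5w1f'
           'AzdH3Ftpj4AzdH3Fda8b88AzdH3F8aAzdH3Fda8b888a8d9l8c_N7No9_z&e3B3rj2')
print(decode_objurl(example))
```

Characters that are not in the table, such as uppercase letters and underscores, pass through `translate` unchanged, which matches the behavior of the loop-based version.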
Complete code:
import requests
import json
from urllib.parse import quote
import urllib.request
import os
# Baidu Images encoding rules
base_encode = {
    '_z2C$q': ':',
    '_z&e3B': '.',
    'AzdH3F': '/',
    'w': 'a',
    'k': 'b',
    'v': 'c',
    '1': 'd',
    'j': 'e',
    'u': 'f',
    '2': 'g',
    'i': 'h',
    't': 'i',
    '3': 'j',
    'h': 'k',
    's': 'l',
    '4': 'm',
    'g': 'n',
    '5': 'o',
    'r': 'p',
    'q': 'q',
    '6': 'r',
    'f': 's',
    'p': 't',
    '7': 'u',
    'e': 'v',
    'o': 'w',
    '8': '1',
    'd': '2',
    'n': '3',
    '9': '4',
    'c': '5',
    'm': '6',
    '0': '7',
    'b': '8',
    'l': '9',
    'a': '0',
    '-': '-'
}
objURL = []  # stores the objURL values taken from the JSON data
url_list = []  # stores the decoded image URLs
# Disguise the client and work around hotlink protection
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
          'Referer': 'https://image.baidu.com/'}
# Return the page source for the given URL
def get_html(url):
    # A timeout keeps the crawler from hanging on a stalled connection
    rep = requests.get(url, headers=header, timeout=10)
    if rep.status_code == 200:
        return rep.text
    else:
        return None
# Collect the objURL values from the JSON data into the list
def get_objURL(html):
    # Remove backslashes before parsing
    if '\\' in html:
        html = html.replace('\\', '')
    print(html)
    datas = json.loads(html)
    ress = datas['data'][:-1]  # skip the last entry
    for res in ress:
        objURL.append(res["objURL"])
# Decode an objURL value and store the result in the list of decoded URLs
def parse_strins(objurl):
    # Replace the multi-character tokens first
    if '_z2C$q' in objurl:
        objurl = objurl.replace('_z2C$q', ':')
    if '_z&e3B' in objurl:
        objurl = objurl.replace('_z&e3B', '.')
    if 'AzdH3F' in objurl:
        objurl = objurl.replace('AzdH3F', '/')
    # Then map the remaining characters one by one
    res = ''
    for s in objurl:
        if s in base_encode:
            res += base_encode[s]
        else:
            res += s
    url_list.append(res)
# Create and return the directory where the images are stored
def create_path(names):
    path = os.path.join(os.getcwd(), '{}'.format(names))
    if not os.path.exists(path):
        os.mkdir(path)
    return path
# Download a single image file
def download_image(path, image_url, i):
    end_type = os.path.splitext(image_url)[-1]
    imageName = os.path.join(path, f'{i}{end_type}')
    try:
        urllib.request.urlretrieve(image_url, imageName)
        print('Downloaded image {}...'.format(i + 1))
    except Exception as e:
        # Report the failure instead of silently ignoring it
        print('Failed to download {}: {}'.format(image_url, e))
# Tie the functions above together
def main():
    while True:
        names = input('Enter the search keyword: ')
        if not names:
            break
        path = create_path(names)
        try:
            pages = int(input('Enter the number of pages to crawl: '))
        except ValueError:
            print('Please enter a valid page count [int]')
            print('-' * 100)
        else:
            # Start every search with empty lists so earlier results are not downloaded again
            objURL.clear()
            url_list.clear()
            for i in range(pages):
                print('Crawling page {}...'.format(i + 1))
                page = i * 30
                urls = 'http://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={0}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word={0}&s=&se=&tab=&width=1920&height=1080&face=0&istype=2&qc=&nc=&fr=&expermode=&force=&cg=star&pn={1}&rn=30&gsm=1f'.format(quote(names), page)
                print(urls)
                html = get_html(urls)
                if html:
                    get_objURL(html)
            # Decode and download once all pages have been collected
            for obj in objURL:
                parse_strins(obj)
            for m in range(len(url_list)):
                download_image(path, url_list[m], m)
if __name__ == '__main__':
    main()
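One detail worth noting about the download step: `urllib.request.urlretrieve` is called with only the URL and the file name, so the User-Agent and Referer values are not sent with the image requests. If the image hosts check those headers (an assumption that is not verified here), a variant along the following lines could be used instead. `download_with_headers` and `HEADERS` are names introduced for this sketch, and whether a Baidu Referer is appropriate for third-party hosts is also an assumption.

```python
import shutil
import urllib.request

# Same values as the header dict in the complete code above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    'Referer': 'https://image.baidu.com/',
}


def download_with_headers(image_url, file_name, timeout=10):
    """Fetch image_url with custom request headers and stream it to file_name."""
    req = urllib.request.Request(image_url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp, \
            open(file_name, 'wb') as out:
        # Copy in chunks so large images are not held in memory at once
        shutil.copyfileobj(resp, out)
    return file_name
```

The function raises on network errors, so a caller such as download_image would still wrap it in a try/except block.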