Let's start with the search term "资产债券表" (balance sheet) as an example:
Search for "资产债券表" in Baidu Images. As you scroll down the page, more images keep loading, so the page is loaded dynamically. Press F12, open the Network tab, and filter the requests for XHR.
You can see that the request is a GET request and the response is a JSON document; inspecting the response body, its "data" field contains the URL of each image we want.
So we can use the get() method of the requests module to simulate the browser's request and fetch the corresponding JSON data.
The modules we may need:
import re
import os
import requests
The get() method of the requests module fetches a URL's response. The values inside params can be obtained in the developer tools by right-clicking the request and copying all of its query parameters. We then pull the image URLs out of the response under the "data" key.
url = 'https://image.baidu.com/search/acjson'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}
params = {
    "tn": "resultjson_com",
    "logid": "11555092689241190059",
    "ipn": "rj",
    "ct": "201326592",
    "is": "",
    "fp": "result",
    "queryWord": "资产债券表",
    "cl": "2",
    "lm": "-1",
    "ie": "utf-8",
    "oe": "utf-8",
    "adpicid": "",
    "st": "-1",
    "z": "",
    "ic": "0",
    "hd": "",
    "latest": "",
    "copyright": "",
    "word": "资产债券表",
    "s": "",
    "se": "",
    "tab": "",
    "width": "",
    "height": "",
    "face": "0",
    "istype": "2",
    "qc": "",
    "nc": "1",
    "fr": "",
    "expermode": "",
    "force": "",
    "pn": 0,
    "rn": "60",
    "gsm": "1e",
    "1617626956685": ""
}
result = requests.get(url, headers=header, params=params).json()
url_list = []
for data in result['data'][:-1]:  # the last element of data is empty, so skip it
    url_list.append(data['thumbURL'])
url_list is now a list of 60 elements, each of which is the link to one image.
The last step is to iterate over url_list and download and save each image:
def getImg(url, idx, path):
    # download one image and save it as <idx>.jpg under path
    img = requests.get(url, headers=header)
    with open(path + str(idx) + '.jpg', 'wb') as file:
        file.write(img.content)

path = 'image/'  # target directory (must already exist)
for i in range(len(url_list)):
    getImg(url_list[i], i, path)
Note:
1. The "word" and "queryWord" request parameters are the search keyword;
2. "pn" is the start index;
3. "rn" is the number of images to fetch (at most 60 per page; to get more images, just repeat the request with a larger "pn").
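These three notes are all the pagination state there is, so the parameters for page n can be derived mechanically. A minimal sketch (page_params is a hypothetical helper of mine, and only the fields discussed in the notes are included; the real request also carries the remaining fields from the params dict above):

```python
# Sketch of the pagination parameters described in the notes above.
def page_params(keyword, page, per_page=60):
    return {
        "word": keyword,             # the search keyword
        "queryWord": keyword,        # must match "word"
        "pn": str(page * per_page),  # start index of this page
        "rn": str(per_page),         # images per page, capped at 60 by the server
    }

print(page_params("资产债券表", 2)["pn"])  # → 120
```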
The result looks like this:
The complete code:
import re
import os
import requests
import tqdm

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}

def getImg(url, idx, path):
    # download one image and save it as <idx>.jpg under path
    img = requests.get(url, headers=header)
    with open(path + str(idx) + '.jpg', 'wb') as file:
        file.write(img.content)

search = input("Search keyword: ")
number = int(input("Number of images: "))
path = 'image/' + search + '/'
if not os.path.exists(path):
    os.makedirs(path)
bar = tqdm.tqdm(total=number)
page = 0
while True:
    if number == 0:
        break
    url = 'https://image.baidu.com/search/acjson'
    params = {
        "tn": "resultjson_com",
        "logid": "11555092689241190059",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": search,
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "0",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": search,
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": str(60 * page),  # start index for this page
        "rn": number,          # requested count; the server caps this at 60
        "gsm": "1e",
        "1617626956685": ""
    }
    result = requests.get(url, headers=header, params=params).json()
    url_list = []
    for data in result['data'][:-1]:  # the last element of data is empty, skip it
        url_list.append(data['thumbURL'])
    for i in range(len(url_list)):
        getImg(url_list[i], 60 * page + i, path)
        bar.update(1)
        number -= 1
        if number == 0:
            break
    page += 1
print("\nfinish!")
Note: tqdm is a fast, extensible progress bar for Python. It adds a progress indicator to long loops; you only need to wrap an arbitrary iterator in tqdm(iterator).
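A minimal sketch of both ways to use it (wrapping an iterator, and the manual total/update style used in the downloader above):

```python
import tqdm

# 1) Wrap any iterable: the bar advances automatically on each iteration.
total = 0
for i in tqdm.tqdm(range(100)):
    total += i

# 2) Manual mode, as in the downloader above: declare a total up front
#    and call update() whenever one unit of work finishes.
bar = tqdm.tqdm(total=10)
for _ in range(10):
    bar.update(1)
bar.close()

print(total)  # → 4950
```

The bar is drawn on stderr, so it does not interfere with the program's normal output.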
Here is a GIF showing it in action:
After running this many times, you will find that the line
result = requests.get(url, headers=header, params=params).json()
eventually starts failing with a JSON parsing error. For example, when crawling 100 plant species with 10 images each, it would crash after 40 or so with something like: json.decoder.JSONDecodeError: Invalid \escape: line 1 column 44 (char 43)
In other words, the response contains an escape sequence that the standard json parser rejects.
Tweak the code as follows:
result = requests.get(url, headers=header, params=params).text
result = demjson.decode(result)
With the demjson module, which is more tolerant of malformed JSON, crawling 100 plant species with 10 images each completed without errors. (Note that demjson is no longer maintained; on recent Python 3 versions you may need its fork, demjson3.)
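If you prefer to stay on the standard library, here is a sketch of an alternative, assuming (as the error message suggests) that the only problem is stray backslashes that are not valid JSON escapes. loads_lenient is a hypothetical helper of mine, not part of requests or json:

```python
import json
import re

def loads_lenient(text):
    # Double any backslash that does not start a valid JSON escape
    # sequence (" \ / b f n r t u), then parse with the stdlib json module.
    fixed = re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', text)
    return json.loads(fixed)

broken = '{"name": "C:\\some\\path"}'  # \s and \p are invalid JSON escapes
print(loads_lenient(broken)["name"])   # → C:\some\path
```

This keeps valid escapes like \n and \uXXXX intact while rescuing the invalid ones that made json.loads raise.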
Final code:
import re
import os
import demjson
import requests
import tqdm

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}

def getImg(url, idx, path):
    # download one image and save it as <idx>.jpg under path
    img = requests.get(url, headers=header)
    with open(path + str(idx) + '.jpg', 'wb') as file:
        file.write(img.content)

search = input("Search keyword: ")
number = int(input("Number of images: "))
path = 'image/' + search + '/'
if not os.path.exists(path):
    os.makedirs(path)
bar = tqdm.tqdm(total=number)
page = 0
while True:
    if number == 0:
        break
    url = 'https://image.baidu.com/search/acjson'
    params = {
        "tn": "resultjson_com",
        "logid": "11555092689241190059",
        "ipn": "rj",
        "ct": "201326592",
        "is": "",
        "fp": "result",
        "queryWord": search,
        "cl": "2",
        "lm": "-1",
        "ie": "utf-8",
        "oe": "utf-8",
        "adpicid": "",
        "st": "-1",
        "z": "",
        "ic": "0",
        "hd": "",
        "latest": "",
        "copyright": "",
        "word": search,
        "s": "",
        "se": "",
        "tab": "",
        "width": "",
        "height": "",
        "face": "0",
        "istype": "2",
        "qc": "",
        "nc": "1",
        "fr": "",
        "expermode": "",
        "force": "",
        "pn": str(60 * page),  # start index for this page
        "rn": number,          # requested count; the server caps this at 60
        "gsm": "1e",
        "1617626956685": ""
    }
    result = requests.get(url, headers=header, params=params).text
    result = demjson.decode(result)  # lenient parsing: tolerates invalid \escapes
    url_list = []
    for data in result['data'][:-1]:  # the last element of data is empty, skip it
        url_list.append(data['thumbURL'])
    for i in range(len(url_list)):
        getImg(url_list[i], 60 * page + i, path)
        bar.update(1)
        number -= 1
        if number == 0:
            break
    page += 1
print("\nfinish!")