Contents
1. Project Design
2. Code Implementation
3. Debugging
Project Design
1. Functional description
1) Goal: fetch the site's pages and download the images to local disk
2) Approach:
a. Analyze the page with the F12 developer tools to find where the site serves its image data (the page is rendered dynamically)
b. Handle pagination
c. Parse out the image download links
d. Technical route: requests + json
e. Feasibility: the site's robots.txt permits friendly crawling — User-Agent: * Disallow: /account/* Sitemap: https://unsplash.com/sitemaps/sitemap.xml
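The robots rules quoted above can also be checked programmatically with the standard library. A minimal sketch, feeding in the quoted rules directly rather than fetching them over the network:

```python
from urllib.robotparser import RobotFileParser

# Parse the rules quoted from unsplash.com/robots.txt without a network request.
# Note: RobotFileParser matches Disallow paths as plain prefixes and does not
# expand the '*' wildcard, so '/account/*' is treated as a literal prefix.
rules = [
    "User-Agent: *",
    "Disallow: /account/*",
    "Sitemap: https://unsplash.com/sitemaps/sitemap.xml",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://unsplash.com/napi/photos"))  # True
```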
2. Program structure
1) Request the image-listing URL and fetch pages in a loop
2) Load each page with json.loads and extract the image paths
3) Download the images to local disk
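Before the full script, here is what step 2) looks like in isolation. The payload below is a hand-written, trimmed-down imitation of one /napi/photos entry — the field names match what the script relies on, but the values are made up:

```python
import json

# A trimmed, made-up imitation of one element of the /napi/photos JSON array;
# the real response carries many more fields per photo.
page = '''[
  {"id": "abc123",
   "urls": {"raw":  "https://images.unsplash.com/photo-1?raw",
            "full": "https://images.unsplash.com/photo-1?full"}}
]'''

for photo in json.loads(page):
    name = photo["id"]            # later used as the local file name
    link = photo["urls"]["raw"]   # one of several resolution variants
    print(name, link)
```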
Code Implementation
# -*- coding: utf-8 -*-
import json
import os
import ssl

import requests


class Spider(object):
    def __init__(self):
        self.user_agent = {'user-agent': 'Mozilla/5.0'}
        self.url = 'https://unsplash.com/napi/photos?page={}&per_page=12'

    def getHtml(self, url):
        try:
            # Skip certificate verification globally (see the debugging notes below)
            ssl._create_default_https_context = ssl._create_unverified_context
            req = requests.get(url, headers=self.user_agent, timeout=20)
            req.raise_for_status()
            req.encoding = req.apparent_encoding
            return req
        except Exception as e:
            print('getHtml raised an exception: {}'.format(e))

    def parserHtml(self):
        # placeholder, currently unused
        pass

    def printPic(self, name, pic):
        try:
            name = name + '.jpg'
            if not os.path.exists(name):
                with open(name, 'wb') as f:
                    f.write(pic)
        except OSError as e:
            print('printPic raised an exception: {}'.format(e))


sp = Spider()
for num in range(1):
    url = sp.url.format(num)
    print(url)
    html = sp.getHtml(url)
    if html:
        for dic in json.loads(html.text):
            name = dic.get('id')
            urls = list(dic.get('urls').values())  # dict_values is not indexable, hence list()
            pic = sp.getHtml(urls[0])
            if pic:
                # Full-resolution images are large; each download takes a while
                sp.printPic(name, pic.content)
Debugging
1. SSL certificate verification failure
getHtml raised an exception: HTTPSConnectionPool(host='images.unsplash.com', port=443): Max retries exceeded with url: /photo-1560252118-b3ea5bc51cd1?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9 (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
1) First attempt: pass verify=False to requests.get()
— a few images downloaded, but without verification most requests still failed with the same error
2) Second attempt: import ssl and set ssl._create_default_https_context = ssl._create_unverified_context
— solved; verify=False is no longer needed
Note: close Fiddler before running; it rewrites the proxy port and the requests fail
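If you do stick with the per-request route from 1), the verify=False setting — and the InsecureRequestWarning it triggers on every call — can be configured once on a Session instead of on each call. A minimal sketch (an alternative, not the fix the author settled on):

```python
import requests
import urllib3

# Silence the InsecureRequestWarning that every verify=False request emits
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False  # applies to every request made through this session
session.headers.update({'user-agent': 'Mozilla/5.0'})
# session.get(url, timeout=20) now skips certificate verification
```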
2. Possible optimization
# closing() turns any object with a close() method into a context manager,
# so it can be used in a with statement. Not yet sure whether it is necessary here.
from contextlib import closing

resp = sp.getHtml(urls[0])
if resp:  # guard first: closing(None) would raise AttributeError on exit
    with closing(resp) as pic:
        sp.printPic(name, pic.content)
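One concrete reason to reach for closing() is streaming: with stream=True the response body is only fetched inside the with block, and closing() guarantees the connection is released even if an error occurs mid-download. A sketch under that assumption — download_image is an illustrative helper, not part of the original script:

```python
from contextlib import closing

import requests


def download_image(url, path, session=requests):
    # stream=True defers the body download; closing() makes sure the
    # underlying connection is released even if an exception is raised
    with closing(session.get(url, stream=True, timeout=20)) as resp:
        resp.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
```

As a partial answer to "is it necessary": since requests 2.18 a Response is itself a context manager, so `with session.get(url, stream=True) as resp:` achieves the same cleanup without closing().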