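Example: download all product images in the mobile phone category of JD.com (京东商城) to the local disk.
The URLs of later pages can be guessed from the URL of the first page, which is not difficult.
URL of page 1: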
https://list.jd.com/list.html?cat=9987,653,655
URL of page 2:
https://list.jd.com/list.html?cat=9987,653,655&page=2&sort=sort_rank_asc&trans=1&JL=6_0_0#J_main
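The two URLs suggest that only the page query parameter needs to change from page to page. As a minimal sketch, assuming the extra sort, trans, and JL parameters can be omitted, the page URLs could be built as follows. The helper name page_url is hypothetical and not part of the original notes.

```python
from urllib.parse import urlencode

BASE = "https://list.jd.com/list.html"


def page_url(page, cat="9987,653,655"):
    """Build the list-page URL for a page number (hypothetical helper)."""
    # safe="," keeps the commas in the category id unescaped, as in the original URLs.
    query = urlencode({"cat": cat, "page": page}, safe=",")
    return BASE + "?" + query


for n in range(1, 4):
    print(page_url(n))
```

Building the query with urlencode instead of string formatting makes it easier to add further parameters later without worrying about escaping.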
Example code:
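import os
import re
import urllib.error
import urllib.request


def crawl(url, page):
    # Download the list page and decode the bytes into text.
    # (str() on bytes would give the b'...' repr, not the page text.)
    html1 = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    # No capture group here, so findall() returns the whole match: the "plist"
    # div plus a little of what follows it. pattern2 below has a group, so
    # findall() returns only the text inside the parentheses.
    pattern1 = r'<div id="plist".*?<div class="page clearfix">'
    result1 = re.compile(pattern1, re.S).findall(html1)
    if not result1:
        print("Page %s: product list not found." % page)
        return
    result1 = result1[0]
    pattern2 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.*?\.jpg)">'
    imagelist = re.compile(pattern2).findall(result1)
    os.makedirs("./img", exist_ok=True)  # urlretrieve does not create directories
    for x, imageurl in enumerate(imagelist, start=1):
        imagename = "./img/" + str(page) + "-" + str(x) + ".jpg"
        imageurl = "http://" + imageurl
        try:
            urllib.request.urlretrieve(imageurl, filename=imagename)
        except urllib.error.URLError as e:
            # HTTPError also carries .code; every URLError carries .reason.
            print("Failed:", imageurl, getattr(e, "code", ""), e.reason)


for i in range(1, 6):
    url = "https://list.jd.com/list.html?cat=9987,653,655&page=%s" % str(i)
    crawl(url, i)
print("All images downloaded.")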
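The comment in crawl about parentheses refers to how re.findall treats capture groups. The sketch below illustrates this on a made-up HTML fragment; the host img.example.com and the fragment itself are assumptions for illustration, not real JD.com markup.

```python
import re

# Made-up fragment that mimics the structure the two patterns expect.
sample = (
    '<div id="plist">\n'
    '<img width="220" height="220" data-img="1" data-lazy-img="//img.example.com/a.jpg">\n'
    '<img width="220" height="220" data-img="1" data-lazy-img="//img.example.com/b.jpg">\n'
    '</div>\n'
    '<div class="page clearfix">'
)

# No group: findall returns the entire matched text.
# re.S lets "." match newlines so the match can span several lines.
whole = re.findall(r'<div id="plist".*?<div class="page clearfix">', sample, re.S)

# One group: findall returns only what the parentheses captured.
urls = re.findall(r'data-lazy-img="//(.*?\.jpg)">', whole[0])

print(len(whole))
print(urls)
```

Narrowing the text with the first pattern before applying the second keeps images from other parts of the page (ads, recommendations) out of the result.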
Code for downloading the source of a web page with urllib:
import urllib.request
url = 'https://list.jd.com/list.html?cat=9987,653,655&page=1'
html = urllib.request.urlopen(url).read()
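# read() returns <class 'bytes'>, so decode it into str
html = html.decode('utf-8')
# If the page contains bytes that are not valid UTF-8, skip them instead:
# html = html.decode('utf-8', 'ignore')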
print(html)
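The reason for calling decode, and for the optional 'ignore' argument, can be seen without any network access. The following is a small sketch on a hand-made byte string; the sample bytes are an assumption chosen to include one invalid byte.

```python
raw = "京东".encode("utf-8") + b"\xff"  # valid UTF-8 text plus one invalid byte

# str() on bytes gives the repr (b'...'), not the decoded text.
as_repr = str(b"<div>")
print(as_repr)

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' silently drops bytes that cannot be decoded.
lenient = raw.decode("utf-8", "ignore")
print(strict_ok, lenient)
```

Dropping bytes is convenient for scraping but can hide a wrong encoding guess, so it is probably best used only after strict decoding has failed.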
Code for downloading an image with urllib:
import urllib.request
imgurl = 'https://img11.360buyimg.com/n7/jfs/t4534/93/3556552833/67545/111fa009/590300b9Nde91dc43.jpg'
filename = "./" + imgurl.split("/")[-1]
# urlretrieve fetches the URL and saves the response to the given file
urllib.request.urlretrieve(imgurl, filename=filename)
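Taking the last segment after "/" works for the URL above, but it would keep a query string if the image URL had one. A slightly more defensive sketch, assuming the file name is always the last component of the URL path, is shown here. The helper name filename_from_url is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL (hypothetical helper)."""
    # urlsplit separates the path from any ?query or #fragment part.
    path = urlsplit(url).path
    return posixpath.basename(path)


imgurl = 'https://img11.360buyimg.com/n7/jfs/t4534/93/3556552833/67545/111fa009/590300b9Nde91dc43.jpg'
print(filename_from_url(imgurl))
print(filename_from_url(imgurl + '?v=1'))
```

The result can be passed to urllib.request.urlretrieve as the filename argument in the same way as in the snippet above.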