Fetching images from the ivsky.com picture site (天堂图片网) with Python's requests and BeautifulSoup libraries.
Site URL: https://www.ivsky.com/
(1) Analyze the page's source code
Inspecting the source shows that each image link sits in an <img> tag wrapped inside an <li> element, so we can parse the HTML with BeautifulSoup to extract the URLs.
The code to collect the image links is as follows:
def fillImageList(html, imalist):
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        imalist.append(img['src'])
The collected links are stored in the imalist list.
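As a quick sanity check, fillImageList can be run against a tiny HTML snippet that mimics the site's <li>/<img> structure (the URL below is a made-up example, not a real link from the site):

```python
from bs4 import BeautifulSoup

def fillImageList(html, imalist):
    # Parse the page and collect the src attribute of every <img> tag
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        imalist.append(img['src'])

# Hypothetical snippet in the same <li><img> shape as the real page
html = '<ul><li><a href="/tupian/demo.html"><img src="//img.ivsky.com/img/demo.jpg"></a></li></ul>'
imalist = []
fillImageList(html, imalist)
print(imalist)  # ['//img.ivsky.com/img/demo.jpg']
```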
(2) Download the images from the collected link list and save them to a local folder.
The code is as follows:
def downImage(imalist):
    for image_url in imalist:
        url = 'http:' + image_url
        root = 'C://image//'
        path = root + image_url.split('/')[-1]
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            with open(path, 'wb') as f:
                f.write(r.content)
            print("File saved successfully")
        else:
            print("File already exists")
Because the image links scraped from the page are protocol-relative (they start with // and lack the http: prefix), the code prepends it with url = 'http:' + image_url to form a complete image URL.
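A more robust way to complete protocol-relative links is urllib.parse.urljoin from the standard library, which inherits the scheme from the page the link was scraped from instead of hard-coding http:. The URL below is an illustrative example:

```python
from urllib.parse import urljoin

page_url = 'https://www.ivsky.com/'
# A protocol-relative link ('//...') inherits the page's scheme, https here:
full_url = urljoin(page_url, '//img.ivsky.com/img/demo.jpg')
print(full_url)  # https://img.ivsky.com/img/demo.jpg
```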
(3) The complete code is as follows:
from bs4 import BeautifulSoup
import requests
import os

def getHTMLText(url):
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    return r.text

def fillImageList(html, imalist):
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        imalist.append(img['src'])

def downImage(imalist):
    count = 1
    for image_url in imalist:
        url = 'http:' + image_url
        root = 'C://image//'
        path = root + image_url.split('/')[-1]
        print('\rProgress: {:.2f}%'.format(count * 100 / len(imalist)), end='')
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            with open(path, 'wb') as f:
                f.write(r.content)
        else:
            print("File already exists")
        count = count + 1

def main():
    imalist = []
    url = input('Enter the URL of the site to crawl: ')
    html = getHTMLText(url)
    fillImageList(html, imalist)
    downImage(imalist)

main()