The scraping result is shown below (don't ask why the image got split into two parts and carries a watermark that won't come off; that's how the site makes money):
The target URL is the following (images from 千图网/58pic.com; as a side note, some of the images really are split in half!!!):
https://www.58pic.com/piccate/10-0-0-p1.html
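The `-p1` suffix in that URL is the page index, so if you later want to crawl more than the first listing page, the page URLs can be generated from the pattern. A small sketch (the page count of 3 is just an example; check the site for the real number of pages):

```python
# Listing pages follow the pattern ...-p1.html, ...-p2.html, and so on.
def page_urls(last_page):
    base = "https://www.58pic.com/piccate/10-0-0-p{}.html"
    return [base.format(i) for i in range(1, last_page + 1)]

for u in page_urls(3):
    print(u)
```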
Remember to press F12 first and inspect where the content we need lives in the page source. From there we can see that each item in the image list carries the link we need, e.g. href="//www.58pic.com/sucai/19381028.html":
<a href="//www.58pic.com/sucai/19381028.html" class="thumb-box" data-id="19381028" target="_blank">
<div class="card-trait"><div class="card-tag-business a190"></div></div><img src="//preview.qiantucdn.com/58picmark/back_origin_pic/19/38/10/28M58PICdRdmHaG8694atMaRk.JPG!qt324new_nowater" alt="炫彩星空背景" data-i="0" class="back-static" width="324" height-no="432"><div class="card-bg"></div><div class="card-handle"><div class="handle-dum qt-btn btn-green-linear download-page" data-id="19381028" data-yc="1" data-bg="0"><span class="xfdl"><i class="icon-down"></i>免费下载</span></div><div class="handle-fav" data-action="addFav" data-id="19381028"><span></span></div>
</div>
</a>
On the detail page, the image link we need is src="//preview.qiantucdn.com/58picmark/back_origin_pic/19/38/10/28M58PICdRdmHaG8694atMaRk.JPG!w1024_small" on the element with class="show-area-pic":
<img src="//preview.qiantucdn.com/58picmark/back_origin_pic/19/38/10/28M58PICdRdmHaG8694atMaRk.JPG!w1024_small" class="show-area-pic" id="show-area-pic" alt="炫彩星空背景" title="炫彩星空背景" width="650">
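As a side note, once BeautifulSoup has found a tag you don't strictly need to convert it to a string and regex the attributes back out; a tag behaves like a dict, so the attributes can be read directly. A small sketch on the snippet above (uses the built-in html.parser so lxml isn't needed here):

```python
from bs4 import BeautifulSoup

snippet = ('<img src="//preview.qiantucdn.com/58picmark/back_origin_pic/19/38/10/'
           '28M58PICdRdmHaG8694atMaRk.JPG!w1024_small" class="show-area-pic" '
           'id="show-area-pic" alt="炫彩星空背景" title="炫彩星空背景" width="650">')
tag = BeautifulSoup(snippet, 'html.parser').find('img', class_='show-area-pic')
print(tag['src'])    # the image URL, without the protocol prefix
print(tag['title'])  # the title we use for the file name
```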
Remember to create a folder named picture first to store the images.
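Instead of creating the folder by hand, the script could also create it at startup; a one-line sketch:

```python
import os

# Create the picture/ folder if it does not exist yet; exist_ok avoids
# an error when the folder is already there.
os.makedirs('picture', exist_ok=True)
```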
The scraping code is as follows:
import re
import requests
from bs4 import BeautifulSoup


def test_pachong():
    url = "https://www.58pic.com/piccate/10-0-0-p1.html"
    headers = {'user-agent': 'my-test/0.0.1',
               'Referer': 'https://www.58pic.com/piccate/10-0-0-p1.html'}
    # Note: headers must be passed as a keyword argument, otherwise
    # requests.get() treats the dict as the second positional parameter.
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    infoData = soup.find_all(name='a', attrs={'class': 'thumb-box'})
    for tag in infoData:
        picinfo = str(tag)
        if 'target="_blank"' in picinfo:
            rex1 = re.compile('href="(.+?)"')
            pic_url = rex1.findall(picinfo)[0]
            try:
                get_picInfo(pic_url)
            except Exception:
                print('Problem url: {}'.format(pic_url))


def get_picInfo(pic_url):
    headers = {'user-agent': 'my-test/0.0.1',
               'Referer': 'https://www.58pic.com/piccate/10-0-0-p1.html'}
    url = "https:" + pic_url  # the href in the listing omits the protocol
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')
    infoData = soup.find_all(name='img', attrs={'class': 'show-area-pic'})
    num = len(infoData)
    for j in range(num):
        info = str(infoData[j])
        rex1 = re.compile('title="(.+?)"')
        rex2 = re.compile('src="(.+?)"')
        img_title = rex1.findall(info)[0] + '.jpg'
        img_url = "https:" + rex2.findall(info)[0]
        print(img_title, '--url--', img_url)
        # num > 1 means the site has split the picture into parts; I haven't
        # had time to stitch the parts back together yet. Number the extra
        # parts so they don't overwrite the first file.
        if num > 1 and j > 0:
            img_title = rex1.findall(info)[0] + str(j) + '.jpg'
        save_img('picture\\' + img_title,
                 requests.get(img_url, headers=headers).content)


test_pachong()  # note: save_img() below must be defined before this runs
The last step is saving the image bytes we fetched via requests.get(img_url, headers=headers).content. The code is as follows:
def save_img(file_name, img):
    '''Save the downloaded image bytes to disk.'''
    with open(file_name, 'wb') as f:  # avoid shadowing the function name here
        f.write(img)
    print('Downloading {}'.format(file_name))
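For the images that arrive in two halves, stitching them back together is mostly a matter of pasting the two parts onto one canvas. This isn't part of the script above; it's a hedged sketch using Pillow that assumes both halves have the same width and the first file is the top half:

```python
from PIL import Image

def stitch_vertical(top_path, bottom_path, out_path):
    """Paste two image halves of equal width into one tall image."""
    top = Image.open(top_path)
    bottom = Image.open(bottom_path)
    canvas = Image.new('RGB', (top.width, top.height + bottom.height))
    canvas.paste(top, (0, 0))
    canvas.paste(bottom, (0, top.height))  # place the second half below the first
    canvas.save(out_path)
```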