A Python Crawler for Model Photos: the Server Edition
This crawler is aimed at adults with a reasonable amount of self-control (some of the images are risqué; everything here is purely a crawler case study).
Preface
The original tutorial
The original tutorial is here: 饱暖思淫欲之美女图片的Python爬虫实例(一)
This installment targets a server deployment and is an upgraded version of the original.
Problems
- The original crawler had a few logic problems
- Crawling images on a laptop wastes resources (bandwidth, disk, memory, plus some CPU)
- Crawls of this site kept failing with "远程主机关闭现有连接" (the remote host closed an existing connection), forcing a manual restart each time
Approach
Move the crawler onto a server and let the server run it uninterrupted.
Goals
- Crawl every image on the site
- Crawl continuously, without supervision
- Reconnect and resume after a network drop
Implementation
Hardware
Server details
A while back I rounded up a few idle PCs from the lab and cobbled them together into an XPEnology box (a "black" Synology; anyone who has used one will recognize it from the photos) to serve as the server (search the web for XPEnology installation guides).
$ uname -a
# Linux Lab503Server 3.10.102 #15284 SMP Sat May 19 04:44:02 CST 2018 x86_64 GNU/Linux synology_broadwell_3617xs
$ cat /proc/cpuinfo
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping : 7
microcode : 0x2d
cpu MHz : 3401.000
cache size : 8192 KB
Code
Header
Synology has trouble parsing Chinese in source files, so after a quick search I added an encoding declaration at the top (lines 1 ~ 4 of the snippet below). Lines 7 ~ 8 suppress the red warnings that appear once SSL verification is disabled (the original note said "ssh", but SSL is what is meant), and socket is imported to set the timeout. Note that reload(sys) / sys.setdefaultencoding() exist only on Python 2; a Python 3 variant follows the snippet.
# coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import os
import time
import urllib3
urllib3.disable_warnings()
import socket
import requests
import re
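As noted above, the sys block is Python 2 only. If the NAS happens to run Python 3 instead (an assumption; check with python --version), a minimal sketch of the equivalent header simply drops that block, since Python 3 strings are Unicode by default:
# Python 3 variant of the header (hypothetical; only if the NAS runs Python 3)
import os
import time
import socket
import re
import requests
import urllib3
urllib3.disable_warnings()  # still needed to silence the verify=False warnings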
Initial configuration
- Following 【Python爬虫错误】ConnectionResetError: [WinError 10054] 远程主机强迫关闭了一个现有的连接, this fixes ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.')
- The request timeout is 60 s, i.e. a connection attempt is abandoned after 60 s (the socket-level default below is set to 20*time_out = 1200 s)
- Modified with reference to python 爬虫:https; HTTPSConnectionPool(host=‘z.jd.com’, port=443)
headers
'Connection': 'close' changes how connections are handled so that fewer stay open, which avoids the remote host refusing connections after too many requests.
# Storage directory
dir = r"/volume2/Server/Study/Python/"
url = "https://pic.xrmn5.com"
# Fixes ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.')
# Reference:
# https://blog.csdn.net/IllegalName/article/details/77164521
# Set the timeout
time_out = 60
socket.setdefaulttimeout(20*time_out)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84',
    'Connection': 'close'
}
URL = "https://www.xrmn5.com/XiuRen/"
WebURL = "https://www.xrmn5.com/"
Fetching the front page and the total number of album pages to crawl
Per the link in the warning message, https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings, I added requests.packages.urllib3.disable_warnings()
to suppress the red warnings produced when SSL verification is disabled.
Disabling verification and setting the timeout happen in the call below, which works around the refused-connection-after-too-many-requests problem:
requests.get(URL,headers=headers,timeout=time_out,verify=False)
time.sleep(1)
# Suppress the red warnings shown when SSL verification is disabled
# https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
requests.packages.urllib3.disable_warnings()
Get_url = requests.get(URL, headers=headers, timeout=time_out, verify=False)
Get_url.encoding = 'utf-8'
Get_html = Get_url.text
# Close the current connection
Get_url.close()
# print(Get_html)
patrenForPageNum = '</a><a href=\"(.*?)\">'
Get_PageNum = re.compile(patrenForPageNum, re.S).findall(Get_html)
temp = str(Get_PageNum[len(Get_PageNum)-1])
# Total number of index pages on the site
PageNum = "".join(list(filter(str.isdigit, temp)))
# Build the list of all index pages and store it in GetAllPage
AllPageTemp = []
GetAllPage = ()
for i in range(int(PageNum)):
    if i > 0:
        AllPageTemp.append(WebURL + "/XiuRen/index" + str(i+1) + ".html")
GetAllPage += tuple(AllPageTemp)
print(len(GetAllPage))
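To make the digit-filtering step concrete, here is a self-contained, hedged example (the HTML fragment is made up, modeled on the pagination markup the regex targets):
import re

# Hypothetical tail of the pagination bar; the last link points at the last index page
html = '</a><a href="/XiuRen/index2.html">2</a><a href="/XiuRen/index12.html">12</a>'
links = re.compile('</a><a href=\"(.*?)\">', re.S).findall(html)
temp = str(links[-1])                               # '/XiuRen/index12.html'
PageNum = "".join(list(filter(str.isdigit, temp)))  # keep only the digits: '12'
print(PageNum)                                      # the site has 12 index pages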
The core part
Variable reference
- urls: the current page's album entries, each holding the album page urls[i][0], the title urls[i][1], and the cover address urls[i][2]
- inforName: the model info for each album on the current page: stage name inforName[i][0] and title inforName[i][1]
- likeNum: for each album on the current page, the creation date likeNum[i][0] and the view count likeNum[i][1]
- getImgDir: local directory the album's images are saved to
- file_num: the digit string from which the album's image count xx (as in "(xxP)") is sliced
- imgUrl: download URL of the album's cover image
- imgName: filename the cover is saved under
- IntoPageUrl: URL of the album page behind the cover
- Get_InPage: the response for that album page
- Get_InPagehtml: that album page as text
- AllPage: all page links of the album; a link is AllPage[k][0]
- imgPageUrl: the image links on the current album page
- PageNum: the number of images on the current album page
- GetPageImg: an image's full link, url + imgPageUrl[l]
- PageImgeName: the image's local path, getImgDir + imgPageUrl[l].split('/')[-1]
- Get_PImg: the binary response fetched from GetPageImg, i.e. the image data itself
- NewPaperUrl: URL of the album's next page
- Get_url: the response for the site's next index page, from which urls etc. are extracted again
- Get_html: that next index page as text
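Because file_num[12:] looks cryptic, here is a hedged walk-through with made-up values (the model name, date, and title are hypothetical; the [12:] slice assumes exactly 1 digit from /volume2, 8 from the date, and 3 from the album number):
# Hypothetical album metadata in the shape the regexes produce
dir = r"/volume2/Server/Study/Python/"
inforName_i = ('Emily', 'No.385 Emily (87P)')   # (stage name, title)
likeNum_i = ('2021.08.20', '1314')              # (creation date, view count)

getImgDir = dir + inforName_i[0] + '/' + likeNum_i[0] + '/' + inforName_i[1] + '/'
file_num = "".join(list(filter(str.isdigit, getImgDir)))
print(file_num)       # '2' + '20210820' + '385' + '87' = '22021082038587'
print(file_num[12:])  # '87', the image count used by the skip-if-complete check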
Implementation notes
- Comparing the number of files already in the directory, len(os.listdir(getImgDir)), against the album's image count, int(file_num[12:]), lets fully downloaded albums be skipped, which saves time
- Every fetch is wrapped in try … except Exception as e: if the get fails inside the try, the error drops into except Exception, which simply tries once more (a refactored sketch of this pattern follows)
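The same try/except-and-retry block is repeated many times below. As a hedged refactoring sketch (the helper name fetch_with_retry is mine, not part of the original script), the pattern could be collapsed into one function:
import time
import requests

def fetch_with_retry(target_url, headers, timeout, retries=2, delay=1):
    # Try a GET up to `retries` times; re-raise the last error if all attempts fail
    last_err = None
    for _ in range(retries):
        try:
            time.sleep(delay)  # stay polite between attempts
            requests.packages.urllib3.disable_warnings()
            return requests.get(target_url, headers=headers,
                                timeout=timeout, verify=False)
        except Exception as e:
            print('fetch %s failed: %s, retrying' % (target_url, e))
            last_err = e
    raise last_err
The listing below keeps the original inline try/except form so it matches the blog's code as written.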
The code
# Start crawling from the front page
for pagenum in range(int(PageNum)):
    urls = re.findall('<li class="i_list list_n2"><a href=\"(.*?)\" alt=(.*?) title=.*?><img class="waitpic" src=\"(.*?)\"', Get_html)
    patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
    patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
    inforName = re.compile(patren1, re.S).findall(Get_html)
    likeNum = re.compile(patren2, re.S).findall(Get_html)
    print(urls)
    print(inforName)
    print(likeNum)
    num = len(likeNum)
    # Note: the blog platform had mangled "onload" here with Greek omicrons; fixed
    patren3 = '<img onload=.*? alt=.*? title=.*? src=\"(.*?)\" />'
    for i in range(num):
        # Only albums with more than 500 views are fetched
        if (int(likeNum[i][1]) > 500):
            getImgDir = dir + str(inforName[i][0]) + '/' + str(likeNum[i][0]) + '/' + str(inforName[i][1] + '/')
            file_num = "".join(list(filter(str.isdigit, getImgDir)))
            # Create the album directory
            if not os.path.exists(getImgDir):
                os.makedirs(getImgDir)
            else:
                print("Directory already exists:", getImgDir)
                # Skip albums that are already fully downloaded
                if (len(os.listdir(getImgDir)) >= (int(file_num[12:]))):
                    continue
            imgUrl = url + urls[i][2]
            imgName = getImgDir + urls[i][2].split('/')[-1]
            print(imgUrl, imgName)
            # Fetch the cover image
            if os.path.isfile(imgName):
                print("Cover already exists:", imgName)
            else:
                time.sleep(1)
                try:
                    requests.packages.urllib3.disable_warnings()
                    Get_Img = requests.get(imgUrl, headers=headers, timeout=time_out, verify=False)
                    with open(imgName, 'wb') as f:
                        f.write(Get_Img.content)
                    Get_Img.close()
                except Exception as e:
                    print('get the first img with the error: %s ' % e)
                    time.sleep(1)
                    requests.packages.urllib3.disable_warnings()
                    Get_Img = requests.get(imgUrl, headers=headers, timeout=time_out, verify=False)
                    with open(imgName, 'wb') as f:
                        f.write(Get_Img.content)
                    Get_Img.close()
            # Enter the album's own page
            IntoPageUrl = WebURL + urls[i][0]
            print("Current album page:", IntoPageUrl)
            time.sleep(1)
            try:
                requests.packages.urllib3.disable_warnings()
                Get_InPage = requests.get(IntoPageUrl, headers=headers, timeout=time_out, verify=False)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
                Get_InPage.close()
            except Exception as e:
                print('get the img page with the error: %s ' % e)
                time.sleep(1)
                requests.packages.urllib3.disable_warnings()
                Get_InPage = requests.get(IntoPageUrl, headers=headers, timeout=time_out, verify=False)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
                Get_InPage.close()
            AllPage = re.findall('</a><a href=\"(.*?)\">([0-9]*)', Get_InPagehtml)
            for k in range(len(AllPage)):
                imgPageUrl = re.compile(patren3, re.S).findall(Get_InPagehtml)
                PageNum = len(imgPageUrl)
                # Loop over the images on this album page and save them
                for l in range(PageNum):
                    GetPageImg = url + imgPageUrl[l]
                    print(GetPageImg)
                    PageImgeName = getImgDir + imgPageUrl[l].split('/')[-1]
                    print(PageImgeName)
                    # Fetch the image itself
                    if os.path.isfile(PageImgeName):
                        print("Image already exists:", PageImgeName)
                        continue
                    else:
                        try:
                            time.sleep(1)
                            requests.packages.urllib3.disable_warnings()
                            Get_PImg = requests.get(GetPageImg, headers=headers, timeout=time_out, verify=False)
                            with open(PageImgeName, 'wb') as f:
                                f.write(Get_PImg.content)
                            Get_PImg.close()
                        except Exception as e:
                            time.sleep(1)
                            requests.packages.urllib3.disable_warnings()
                            Get_PImg = requests.get(GetPageImg, headers=headers, timeout=time_out, verify=False)
                            print('get the next img with the error: %s ' % e)
                            with open(PageImgeName, 'wb') as f:
                                f.write(Get_PImg.content)
                            Get_PImg.close()
                if k == len(AllPage) - 1:
                    print("Current info:", AllPage[k])
                    continue
                # Continue to the album's next page
                NewPaperUrl = WebURL + AllPage[k][0]
                print("Starting next page:", NewPaperUrl)
                time.sleep(1)
                requests.packages.urllib3.disable_warnings()
                Get_InPage = requests.get(NewPaperUrl, headers=headers, timeout=time_out)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
                Get_InPage.close()
    # GetAllPage holds one entry fewer than PageNum (page 1 has no indexN.html), so guard the index
    if pagenum >= len(GetAllPage):
        break
    print("Starting next round:", GetAllPage[pagenum])
    try:
        time.sleep(1)
        requests.packages.urllib3.disable_warnings()
        Get_url = requests.get(GetAllPage[pagenum], headers=headers, timeout=time_out)
        Get_url.encoding = 'utf-8'
        Get_html = Get_url.text
        Get_url.close()
    except Exception as e:
        print('get the next info page with the error: %s ' % e)
        requests.packages.urllib3.disable_warnings()
        Get_url = requests.get(GetAllPage[pagenum], headers=headers, timeout=time_out)
        Get_url.encoding = 'utf-8'
        Get_html = Get_url.text
        Get_url.close()
Appendix: the complete code
# coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import os
import time
import urllib3
urllib3.disable_warnings()
import socket
import requests
import re

# Storage directory
dir = r"/volume2/Server/Study/Python/"
url = "https://pic.xrmn5.com"
# Fixes ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.')
# Reference:
# https://blog.csdn.net/IllegalName/article/details/77164521?depth_1-utm_source=distribute.pc_relevant.none-task&utm_source=distribute.pc_relevant.none-task
# Set the timeout
time_out = 60
socket.setdefaulttimeout(20*time_out)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.84',
    'Connection': 'close'
}
URL = "https://www.xrmn5.com/XiuRen/"
WebURL = "https://www.xrmn5.com/"
time.sleep(1)
# Suppress the red warnings shown when SSL verification is disabled
# https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
requests.packages.urllib3.disable_warnings()
Get_url = requests.get(URL, headers=headers, timeout=time_out, verify=False)
Get_url.encoding = 'utf-8'
Get_html = Get_url.text
# Close the current connection
Get_url.close()
# print(Get_html)
patrenForPageNum = '</a><a href=\"(.*?)\">'
Get_PageNum = re.compile(patrenForPageNum, re.S).findall(Get_html)
temp = str(Get_PageNum[len(Get_PageNum)-1])
PageNum = "".join(list(filter(str.isdigit, temp)))
# Build the list of all index pages and store it in GetAllPage
AllPageTemp = []
GetAllPage = ()
for i in range(int(PageNum)):
    if i > 0:
        AllPageTemp.append(WebURL + "/XiuRen/index" + str(i+1) + ".html")
GetAllPage += tuple(AllPageTemp)
print(len(GetAllPage))
for pagenum in range(int(PageNum)):
    urls = re.findall('<li class="i_list list_n2"><a href=\"(.*?)\" alt=(.*?) title=.*?><img class="waitpic" src=\"(.*?)\"', Get_html)
    patren1 = '<div class="postlist-imagenum"><span>(.*?)</span></div></a><div class="case_info"><div class="meta-title">\[.*?\](.*?)</a></div>'
    patren2 = '<div class="meta-post"><i class="fa fa-clock-o"></i>(.*?)<span class="cx_like"><i class="fa fa-eye"></i>(.*?)</span>'
    inforName = re.compile(patren1, re.S).findall(Get_html)
    likeNum = re.compile(patren2, re.S).findall(Get_html)
    print(urls)
    print(inforName)
    print(likeNum)
    num = len(likeNum)
    patren3 = '<img onload=.*? alt=.*? title=.*? src=\"(.*?)\" />'
    for i in range(num):
        if (int(likeNum[i][1]) > 500):
            getImgDir = dir + str(inforName[i][0]) + '/' + str(likeNum[i][0]) + '/' + str(inforName[i][1] + '/')
            file_num = "".join(list(filter(str.isdigit, getImgDir)))
            # Create the album directory
            if not os.path.exists(getImgDir):
                os.makedirs(getImgDir)
            else:
                print("Directory already exists:", getImgDir)
                # Skip albums that are already fully downloaded
                if (len(os.listdir(getImgDir)) >= (int(file_num[12:]))):
                    continue
            imgUrl = url + urls[i][2]
            imgName = getImgDir + urls[i][2].split('/')[-1]
            print(imgUrl, imgName)
            # Fetch the cover image
            if os.path.isfile(imgName):
                print("Cover already exists:", imgName)
            else:
                time.sleep(1)
                try:
                    requests.packages.urllib3.disable_warnings()
                    Get_Img = requests.get(imgUrl, headers=headers, timeout=time_out, verify=False)
                    with open(imgName, 'wb') as f:
                        f.write(Get_Img.content)
                    Get_Img.close()
                except Exception as e:
                    print('get the first img with the error: %s ' % e)
                    time.sleep(1)
                    requests.packages.urllib3.disable_warnings()
                    Get_Img = requests.get(imgUrl, headers=headers, timeout=time_out, verify=False)
                    with open(imgName, 'wb') as f:
                        f.write(Get_Img.content)
                    Get_Img.close()
            # Enter the album's own page
            IntoPageUrl = WebURL + urls[i][0]
            print("Current album page:", IntoPageUrl)
            time.sleep(1)
            try:
                requests.packages.urllib3.disable_warnings()
                Get_InPage = requests.get(IntoPageUrl, headers=headers, timeout=time_out, verify=False)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
                Get_InPage.close()
            except Exception as e:
                print('get the img page with the error: %s ' % e)
                time.sleep(1)
                requests.packages.urllib3.disable_warnings()
                Get_InPage = requests.get(IntoPageUrl, headers=headers, timeout=time_out, verify=False)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
                Get_InPage.close()
            AllPage = re.findall('</a><a href=\"(.*?)\">([0-9]*)', Get_InPagehtml)
            for k in range(len(AllPage)):
                imgPageUrl = re.compile(patren3, re.S).findall(Get_InPagehtml)
                PageNum = len(imgPageUrl)
                # Loop over the images on this album page and save them
                for l in range(PageNum):
                    GetPageImg = url + imgPageUrl[l]
                    print(GetPageImg)
                    PageImgeName = getImgDir + imgPageUrl[l].split('/')[-1]
                    print(PageImgeName)
                    # Fetch the image itself
                    if os.path.isfile(PageImgeName):
                        print("Image already exists:", PageImgeName)
                        continue
                    else:
                        try:
                            time.sleep(1)
                            requests.packages.urllib3.disable_warnings()
                            Get_PImg = requests.get(GetPageImg, headers=headers, timeout=time_out, verify=False)
                            with open(PageImgeName, 'wb') as f:
                                f.write(Get_PImg.content)
                            Get_PImg.close()
                        except Exception as e:
                            time.sleep(1)
                            requests.packages.urllib3.disable_warnings()
                            Get_PImg = requests.get(GetPageImg, headers=headers, timeout=time_out, verify=False)
                            print('get the next img with the error: %s ' % e)
                            with open(PageImgeName, 'wb') as f:
                                f.write(Get_PImg.content)
                            Get_PImg.close()
                if k == len(AllPage) - 1:
                    print("Current info:", AllPage[k])
                    continue
                # Continue to the album's next page
                NewPaperUrl = WebURL + AllPage[k][0]
                print("Starting next page:", NewPaperUrl)
                time.sleep(1)
                requests.packages.urllib3.disable_warnings()
                Get_InPage = requests.get(NewPaperUrl, headers=headers, timeout=time_out)
                Get_InPage.encoding = 'utf-8'
                Get_InPagehtml = Get_InPage.text
                Get_InPage.close()
    # GetAllPage holds one entry fewer than PageNum (page 1 has no indexN.html), so guard the index
    if pagenum >= len(GetAllPage):
        break
    print("Starting next round:", GetAllPage[pagenum])
    try:
        time.sleep(1)
        requests.packages.urllib3.disable_warnings()
        Get_url = requests.get(GetAllPage[pagenum], headers=headers, timeout=time_out)
        Get_url.encoding = 'utf-8'
        Get_html = Get_url.text
        Get_url.close()
    except Exception as e:
        print('get the next info page with the error: %s ' % e)
        requests.packages.urllib3.disable_warnings()
        Get_url = requests.get(GetAllPage[pagenum], headers=headers, timeout=time_out)
        Get_url.encoding = 'utf-8'
        Get_html = Get_url.text
        Get_url.close()
Server deployment
Running in the background
- Connect to the server with Xshell and switch to root (search the web for details)
sudo -i
- Create the target directory /volume2/Server/Study/Python/ (search the web if needed)
On Synology, each volume is mounted as /volume + num; I put it on disk 2, hence /volume2
mkdir -p /volume2/Server/Study/Python/
- Push the .py file into that directory with Xftp (search the web for details)
- Run the .py file in the background and write its output to a log file
Some packages may turn out to be missing; worse, pip itself may be absent and need installing first (search the web, or see 群晖Nas下安装Python3及 PIP)
Once every package installs cleanly, run this in Xshell (nohup detaches the job from the terminal, -u unbuffers Python's output so the log updates live, and 2>&1 folds stderr into the log):
nohup python -u /volume2/Server/Study/Python/AllImgGet-WebAll.py > /volume2/Server/Study/Python/out.log 2>&1 &
- Check the log (search the web for details)
cat /volume2/Server/Study/Python/out.log
- Leave Xshell (very important)
Exit with the exit command; if you simply close the Xshell window, the background job gets killed along with the session
Stopping the background job
- Find the background process
ps aux | grep py
- Note its PID and kill it
kill 14428
Results on the server
As for what the crawled images actually look like, I'll leave that to your own research.