Without further ado, here is the code:
import re
import os
import requests

start_url = "http://pic.onlylady.com/2019/0520/"
yuan_url = "http://pic.onlylady.com"
depth = 50  # depth is the number of galleries to crawl
start_num = 3959092
dir_name = "d:/hello50"
furl = start_url + str(start_num) + ".shtml"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
for i in range(depth):
    response = requests.get(furl)
    html = response.text
    # head_jpg_png is the URL of the first image in the gallery
    head_jpg_png = re.findall(r'<a href="(https?://.*?\.jpg|https?://.*?\.png)" class="fullscreen" id="fullscreen" target=".*?">', html)
    head_jpg_png = ''.join(head_jpg_png)
    print("head_jpg:" + head_jpg_png)
    # every <img> tag of the gallery carries the first image's URL in its rel/rev attributes
    my_look = r'<img src="(https?://.*?\.jpg|https?://.*?\.png)" rel="{}" rev="{}" alt=".*?">'.format(re.escape(head_jpg_png), re.escape(head_jpg_png))
    jpg_urls = re.findall(my_look, html)
    print(jpg_urls)
    # download every image in this gallery
    for url in jpg_urls:
        url_split = url.split('/')
        file_name = url_split[-1]
        size = "985x695"
        url_split[3] = size  # swap in the desired image size
        url = '/'.join(url_split)
        response = requests.get(url)
        with open(dir_name + '/' + file_name, 'wb') as f:
            f.write(response.content)
    # the "next gallery" link becomes the URL for the next iteration
    furl = re.findall(r'<a title=".*?" href="(.*?)"><img width=".*?" height=".*?" src=".*?" alt=".*?">', html)
    furl = ''.join(furl)
    furl = yuan_url + furl
    print(furl)
This script crawls some of the images on http://pic.onlylady.com, starting from http://pic.onlylady.com/2019/0520/3959092.shtml.
Looking at the site, you can see that the images are presented as galleries, each containing several pictures. Within one gallery the image URLs are sequential, so they can be handled in a single pass; the URLs of different galleries are not sequential, so each gallery has to be fetched and downloaded separately.
Now let's walk through the code.
import re
import os
import requests
First we import three libraries: re, os, and requests. re is the regular-expression library, used to match the content we want inside the page's HTML; os is the operating-system interface, used here to create the folder the downloaded images are saved into; requests is a third-party Python library that makes fetching URL resources particularly convenient.
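As a quick illustration of how re.findall is used throughout this script, here is a minimal, self-contained example; the HTML snippet is made up, but it has the same shape as the gallery pages on the site:

```python
import re

# A made-up snippet shaped like the gallery pages' first-image link.
html = '<a href="http://i.example.com/a/1.jpg" class="fullscreen" id="fullscreen" target="_blank">'

# With one capture group in the pattern, findall returns a list of the
# group's matches -- here, just the image URL.
urls = re.findall(r'<a href="(https?://.*?\.jpg|https?://.*?\.png)" class="fullscreen"', html)
print(urls)  # ['http://i.example.com/a/1.jpg']
```

Note the raw string (r'...') so that \. reaches the regex engine as a literal-dot escape rather than being interpreted by Python first.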
yuan_url = "http://pic.onlylady.com"
depth = 50  # depth is the number of galleries to crawl
dir_name = "d:/hello50"
furl = "http://pic.onlylady.com/2019/0520/3959092.shtml"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
This part defines a few variables:
yuan_url holds the site's base URL; it is later prepended to the relative link of the next gallery.
depth is the number of galleries to crawl; set it to whatever you like.
dir_name is the directory the downloaded images are saved to.
furl starts out as the URL of the first gallery page, and is updated to the next gallery's URL at the end of each loop iteration.
The if statement at the end creates d:/hello50 if it has not been created before.
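As a side note, the exists-then-mkdir pair can be collapsed into a single call with os.makedirs, which behaves the same when the directory already exists (the path below is a demo path under the temp directory, not d:/hello50):

```python
import os
import tempfile

# exist_ok=True creates the directory if needed and does nothing
# (no exception) when it already exists.
dir_name = os.path.join(tempfile.gettempdir(), "hello50_demo")
os.makedirs(dir_name, exist_ok=True)
os.makedirs(dir_name, exist_ok=True)  # second call is a harmless no-op
print(os.path.isdir(dir_name))  # True
```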
for i in range(depth):
    response = requests.get(furl)
    html = response.text
    # head_jpg_png is the URL of the first image in the gallery
    head_jpg_png = re.findall(r'<a href="(https?://.*?\.jpg|https?://.*?\.png)" class="fullscreen" id="fullscreen" target=".*?">', html)
    head_jpg_png = ''.join(head_jpg_png)
    print("head_jpg:" + head_jpg_png)
    # every <img> tag of the gallery carries the first image's URL in its rel/rev attributes
    my_look = r'<img src="(https?://.*?\.jpg|https?://.*?\.png)" rel="{}" rev="{}" alt=".*?">'.format(re.escape(head_jpg_png), re.escape(head_jpg_png))
    jpg_urls = re.findall(my_look, html)
    print(jpg_urls)
    # download every image in this gallery
    for url in jpg_urls:
        url_split = url.split('/')
        file_name = url_split[-1]
        size = "985x695"
        url_split[3] = size  # swap in the desired image size
        url = '/'.join(url_split)
        response = requests.get(url)
        with open(dir_name + '/' + file_name, 'wb') as f:
            f.write(response.content)
    # the "next gallery" link becomes the URL for the next iteration
    furl = re.findall(r'<a title=".*?" href="(.*?)"><img width=".*?" height=".*?" src=".*?" alt=".*?">', html)
    furl = ''.join(furl)
    furl = yuan_url + furl
    print(furl)
This is the core of the whole script. The outer for loop visits one gallery per iteration, and its body does the work for a single gallery; the inner for loop downloads each image of that gallery in turn. I'll go through the individual statements in more detail tomorrow.
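The URL manipulation inside the inner loop — split on '/', overwrite the size segment, join back together — can be seen in isolation. The URL below is made up; the assumption, taken from the script, is that the real image URLs carry the size as the fourth '/'-separated segment:

```python
# A made-up image URL with the size as the fourth '/'-separated segment (index 3).
url = "http://images.example.com/120x90/2019/0520/photo1.jpg"

url_split = url.split('/')
file_name = url_split[-1]   # "photo1.jpg" -- later used as the local file name
url_split[3] = "985x695"    # swap the thumbnail size for the full-size variant
url = '/'.join(url_split)
print(url)  # http://images.example.com/985x695/2019/0520/photo1.jpg
```

Note that index 3 works because "http://host/..." splits into ['http:', '', 'host', ...], so the first path segment after the host lands at index 3.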