睡觉也在爬虫之二（爬一组图片）

最新推荐文章于 2024-04-30 20:20:49 发布

活着Viva

最新推荐文章于 2024-04-30 20:20:49 发布

阅读量271

点赞数

分类专栏：睡觉也在爬虫文章标签：爬虫 python 开发语言

本文链接：https://blog.csdn.net/weixin_43623271/article/details/122494386

版权

睡觉也在爬虫专栏收录该内容

5 篇文章 1 订阅

订阅专栏

爬一组图片

爬虫思路
实现过程
完整代码
效果

书接上回
https://blog.csdn.net/weixin_43623271/article/details/122492700

爬虫思路

找到组图的页码栏

当前是第一页，也是第一张图的地址
在这里插入图片描述

找到总页数的元素

在这里插入图片描述
用xpath语法匹配

查看每张图url的规律

第一张图url
在这里插入图片描述
第二张图url

可以推测出图url规律
236000_x.html

循环访问图片url并找出存放jpg的元素

每个图片的地址源代码中都会有其jpg地址
在这里插入图片描述
这里也用xpath匹配上图片的标题，写入文件的时候可以用此名作为文件名

循环写入文件

请求相应的jpg内容，并写入文件

实现过程

定义请求头，url变量

headers = {
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
    'cookie':'Hm_lvt_c08bad6ac66a035b30e72722f365229b=1642073389; Hm_lpvt_c08bad6ac66a035b30e72722f365229b=1642073420',
    'referer':'https://www.tupianzj.com/meinv/mm/meizitu/',
}
picburp_num = '236000'
url = 'https://www.xxx.com/meinv/20211208/'+picburp_num+'.html'

源码解析成html

# 将源码解析成html
r = requests.get(url,headers=headers)
html = etree.HTML(r.content.decode('gb2312'))

找到总页数和图片元素

# 用xpath选择器匹配图片元素
pic_tag = html.xpath('//div[@id="bigpic"]//img/@src') #列表

# 总页数
pic_num_tag = html.xpath('//div[@class="pages"]//a/text()')[0]
num = re.findall(r'\d+',pic_num_tag)[0]

创建目录，用于存储图片

#创建文件夹用于保存图片
dir_path = 'e:/meizitu'
if not os.path.exists(dir_path):
    os.mkdir(dir_path)
    print(f"[+] 创建{dir_path.split('/')[1]}文件夹")

循环总页数,并写入文件

# 第一张图片236000.html ,第二张图片236000_2.html,第一张图片没规律所以分类讨论
for i in range(1,int(num)+1):
    if i == 1:
        url = 'https://www.xxx.com/meinv/20211208/236000.html'
    else:
        url = 'https://www.xxx.com/meinv/20211208/236000_'+str(i)+'.html'
    r = requests.get(url,headers=headers)
    html = etree.HTML(r.content.decode('gb2312'))
    # 用xpath选择器匹配指定元素
    pic_tag = html.xpath('//div[@id="bigpic"]//img/@src') #列表
    pic_jpg_src = pic_tag[0]
    pic_name_tag = html.xpath('//div[@class="list_con bgff"]/h1/text()')
    pic_name = pic_name_tag[0]
    pic_html_path = dir_path+'/'+pic_name+'.jpeg'
    pic_content = requests.get(pic_jpg_src,headers=headers)
    with open(pic_html_path,'wb') as f:
        f.write(pic_content.content)
        print(f'[+] {pic_name}下载完毕')

完整代码

import requests
from lxml import etree
import os
import re

headers = {
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
    'cookie':'Hm_lvt_c08bad6ac66a035b30e72722f365229b=1642073389; Hm_lpvt_c08bad6ac66a035b30e72722f365229b=1642073420',
    'referer':'https://www.tupianzj.com/meinv/mm/meizitu/',
}
picburp_num = '236000'
url = 'https://www.xxx.com/meinv/20211208/'+picburp_num+'.html'

# 将源码解析成html
r = requests.get(url,headers=headers)
html = etree.HTML(r.content.decode('gb2312'))

# 用xpath选择器匹配指定元素
pic_tag = html.xpath('//div[@id="bigpic"]//img/@src') #列表

# 总页数
pic_num_tag = html.xpath('//div[@class="pages"]//a/text()')[0]
num = re.findall(r'\d+',pic_num_tag)[0]

#创建文件夹用于保存图片
dir_path = 'e:/meizitu'
if not os.path.exists(dir_path):
    os.mkdir(dir_path)
    print(f"[+] 创建{dir_path.split('/')[1]}文件夹")

# 第一张图片236000.html ,第二张图片236000_2.html,第一张图片没规律所以分类讨论
for i in range(1,int(num)+1):
    if i == 1:
        url = 'https://www.tupianzj.com/meinv/20211208/236000.html'
    else:
        url = 'https://www.tupianzj.com/meinv/20211208/236000_'+str(i)+'.html'
    r = requests.get(url,headers=headers)
    html = etree.HTML(r.content.decode('gb2312'))
    # 用xpath选择器匹配指定元素
    pic_tag = html.xpath('//div[@id="bigpic"]//img/@src') #列表
    pic_jpg_src = pic_tag[0]
    pic_name_tag = html.xpath('//div[@class="list_con bgff"]/h1/text()')
    pic_name = pic_name_tag[0]
    pic_html_path = dir_path+'/'+pic_name+'.jpeg'
    pic_content = requests.get(pic_jpg_src,headers=headers)
    with open(pic_html_path,'wb') as f:
        f.write(pic_content.content)
        print(f'[+] {pic_name}下载完毕')

效果

在这里插入图片描述

活着Viva

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
睡觉也在爬虫之二（爬一组图片）

爬一组图片爬虫思路找到组图的页码栏找到总页数的元素查看每张图url的规律循环访问图片url并找出存放jpg的元素循环写入文件实现过程完整代码效果书接上回https://blog.csdn.net/weixin_43623271/article/details/122492700爬虫思路找到组图的页码栏当前是第一页，也是第一张图的地址找到总页数的元素用xpath语法匹配查看每张图url的规律第一张图url第二张图url可以推测出图url规律236000_x.html循环访问
复制链接

扫一扫