Scraping Images from 58pic (千图网)

Latest recommended article published on 2025-05-07 13:20:02


Category: Web scraping. Tags: python, scrapy, regular expressions

Article link: https://blog.csdn.net/weixin_41641028/article/details/129421260


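This article shows two ways to download a single specified image from 58pic: one with the requests library, which supports User-Agent (UA) spoofing, and a simpler one with urllib. It then outlines an approach to batch scraping 58pic images, stresses how dynamically loaded data has to be handled, and works through the pet camping materials collection as an example, where the image address sits in the "data-original" attribute.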


Scraping a single specified image from 58pic

Method 1: more code to write, but supports UA spoofing

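import requests

# URL of the image to download
url = 'https://www.example.com/image.jpg'
# Browser-like User-Agent for UA spoofing (example value, replace as needed)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
# Send an HTTP GET request to the URL
response = requests.get(url, headers=headers, timeout=10)
# Check if the request was successful
if response.status_code == 200:
    # Open a file in binary mode and write the response content to it
    with open('image.jpg', 'wb') as f:
        f.write(response.content)
    print('Image saved successfully')
else:
    print('Failed to download image')

In this example, we use the requests library to send an HTTP GET request to the image URL, passing a browser-like User-Agent in headers, which is the UA spoofing this method allows. We check that the request succeeded (status code 200), then open a file named image.jpg in binary mode ('wb') with open() and write the response content to it with the file object's write() method.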

Note that the file name can be anything you like, but the file extension should match the format of the downloaded image.
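One way to keep the extension in line with the image format is to derive it from the Content-Type header of the response. The sketch below is only an illustration: the mapping table, the function names extension_for and filename_for, and the ".bin" fallback are assumptions, not part of the original article.

```python
# Hypothetical mapping from media type to file extension; extend as needed.
CONTENT_TYPE_TO_EXT = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def extension_for(content_type, default=".bin"):
    """Return a file extension for a Content-Type header value."""
    # Drop parameters such as "; charset=binary" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_TO_EXT.get(media_type, default)


def filename_for(stem, content_type):
    """Build a file name such as 'image.jpg' from a stem and a Content-Type."""
    return stem + extension_for(content_type)
```

With requests, the header value could be read as response.headers.get("Content-Type", "") and passed to filename_for before opening the output file.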

Method 2: simpler to write, but no UA spoofing (urlretrieve takes no headers argument)

import urllib.request

# URL of the image to download
url = "https://www.example.com/image.jpg"

# Filename to save the image to
filename = "image.jpg"

# Download the image and save it to the specified file
urllib.request.urlretrieve(url, filename)

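In this example, replace the url variable with the URL of the image you want to download, and set the filename variable to the name the file should have on disk. The urlretrieve function downloads the file and saves it under that name.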
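Although urlretrieve has no headers parameter, the standard library offers a possible workaround: build an opener with a custom User-Agent and install it globally. In CPython, urlretrieve goes through urlopen, so the installed opener should apply to it. The sketch below assumes that behaviour; the USER_AGENT value and the helper names make_opener and download are hypothetical.

```python
import urllib.request

# Example browser-like User-Agent string (an assumption; use any value you need).
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def make_opener(user_agent=USER_AGENT):
    """Build an opener whose requests carry the given User-Agent."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples sent with every request.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def download(url, filename, user_agent=USER_AGENT):
    """Install the opener globally, then download with urlretrieve."""
    urllib.request.install_opener(make_opener(user_agent))
    urllib.request.urlretrieve(url, filename)
```

Calling download("https://www.example.com/image.jpg", "image.jpg") would then behave like Method 2, with the custom header attached. Note that install_opener changes the default for every later urlopen call in the process.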

Batch scraping images from 58pic

Example application: batch scraping images

Pet camping materials collection URL: https://www.58pic.com/c/25981671

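General approach

Parse the address of every image on the page.

Send a request to each image address to get the image data, then save it to disk (persistent storage).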
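Between these two steps, each parsed address usually has to be turned into a full URL and a local file name. The following is a minimal sketch of that glue, assuming the addresses may be relative or protocol-relative; the function names and the example.com addresses used with them are hypothetical.

```python
import os
from urllib.parse import urljoin, urlparse


def absolute_url(page_url, src):
    """Resolve src (relative, protocol-relative or absolute) against page_url."""
    return urljoin(page_url, src)


def local_name(image_url):
    """Use the last path segment of the URL, without the query string."""
    return os.path.basename(urlparse(image_url).path)


def plan_downloads(page_url, srcs, folder="./image"):
    """Return (absolute_url, local_path) pairs for the parsed addresses."""
    plan = []
    for src in srcs:
        url = absolute_url(page_url, src)
        plan.append((url, f"{folder}/{local_name(url)}"))
    return plan
```

Each pair returned by plan_downloads can then be handed to whichever download method is preferred, requests.get or urlretrieve.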

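Note:

The page source shown in the Elements tab of the developer tools includes dynamically loaded data: it is the complete page after the responses of all requests have been rendered.

The page source shown under Response in the Network panel does not include dynamically loaded data: it is the response of that single request alone, which is exactly what you get when you request that URL with the requests module.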

import requests
import urllib.request
import os
import re

url = 'https://www.58pic.com/c/25981671'
response = requests.get(url)

# Get the page source
page_html = response.text

# Create a folder to store the images
os.makedirs('./image', exist_ok=True)

# Extract the image addresses with a regular expression
ex = r'<img class="lazy".*?data-original="(.*?)".*?/>'
img_src_list = re.findall(ex, page_html, re.S)
# print(img_src_list)
# Loop over the image addresses and download each one
for img_src in img_src_list:
    # Prepend the scheme, which the extracted addresses lack
    img_src = f"https:{img_src}"
    print(img_src)
    img_name = img_src.split('/')[-1]
    # Append .png when the name does not already end with it
    if not img_name.endswith('.png'):
        img_name = img_name+".png"

    print(f'Downloading {img_name}...')
    # Request the image address with urlretrieve and save the image data to disk
    urllib.request.urlretrieve(img_src, f"./image/{img_name}")

    print(f'{img_name} downloaded successfully.')

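On the 58pic pet camping materials collection page, the page source shown in the Elements tab of the developer tools includes dynamically loaded data.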

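In the page source shown under Response in the Network panel, press Ctrl+F and search for "data-original": the src attribute of each <img> tag holds no image address; the image address is in the "data-original" attribute instead.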
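As an alternative to a regular expression, the "data-original" attribute can also be read with the standard library's HTML parser, which does not depend on attribute order or tag case. This is a sketch on a made-up snippet: SAMPLE_HTML, its example.com addresses, and the names LazyImageParser and extract_lazy_image_urls are assumptions for illustration, not the real 58pic markup.

```python
from html.parser import HTMLParser

# Hypothetical snippet imitating lazily loaded images.
SAMPLE_HTML = """
<div class="list">
  <img class="lazy" src="" data-original="//img.example.com/a/1.jpg" alt="one"/>
  <IMG class="lazy" data-original="//img.example.com/a/2.png">
  <img src="//img.example.com/logo.png"/>
</div>
"""


class LazyImageParser(HTMLParser):
    """Collect the data-original attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "img":
            return
        url = dict(attrs).get("data-original")
        if url:
            self.urls.append(url)


def extract_lazy_image_urls(html):
    """Return the data-original values found in html, in document order."""
    parser = LazyImageParser()
    parser.feed(html)
    return parser.urls
```

Images that only carry a plain src, such as the logo in the sample, are skipped, so the result contains just the lazily loaded addresses.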

Batch scraping images from md.itlun.cn (妈蛋表情网)

"""Scrape the images on http://md.itlun.cn/a/nhtp/ and save them to a folder."""
import requests
import os
import re

main_url = 'http://md.itlun.cn/a/nhtp/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
response = requests.get(url=main_url, headers=headers)
response.encoding = "gbk"
# Get the page source
page_html = response.text
# print(page_html)

# Data parsing: extract the image addresses
# First attempt:
# ex = '<li>.*?<img.*?src="(.*?)" style.*?</li>'
# re.S lets "." match newlines as well
# img_src_list = re.findall(ex, page_html, re.S)
# Note: if you are sure the regex is right, check the page source it runs on;
# the page most likely contains dynamically loaded data.
# The response in the Network panel shows the img and li tags in uppercase,
# while the regex above matches lowercase tags, so it found nothing.
# ex = '<LI>.*?<IMG.*?src="(.*?)" style.*?</LI>'
# img_src_list = re.findall(ex, page_html, re.S)
# Problem: every extracted image address was the same. Checking the response
# again shows that the real image addresses are loaded dynamically by JS.
ex = '<script .*?src = "(.*?)"; </script>'
# re.S lets "." match newlines as well
img_src_list = re.findall(ex, page_html, re.S)
os.makedirs("./imgs", exist_ok=True)
for img_src in img_src_list:
    # print(img_src)
    # The extracted addresses are incomplete: the "http:" prefix is missing
    img_src = "http:" + img_src
    # Send the request to download the image, reusing the spoofed User-Agent
    img_response = requests.get(img_src, headers=headers)
    # Take the image name from the last segment of the address
    img_name = img_src.split('/')[-1]
    # Path the image is saved to
    img_path = f'./imgs/{img_name}'
    # Write the image to the local folder
    with open(img_path, 'wb') as fp:
        fp.write(img_response.content)
    print(img_name + ' downloaded!')
print('All downloads finished!')
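The comments in the script describe a first pattern that failed because the tags in the response were uppercase. One way to avoid that class of mismatch is the re.I (IGNORECASE) flag, sketched below on a made-up snippet. SAMPLE_HTML, its example.com addresses, and the name find_image_srcs are assumptions; the snippet does not reproduce the real page, where the actual addresses were found in script tags.

```python
import re

# Hypothetical snippet mixing uppercase and lowercase tags.
SAMPLE_HTML = """
<UL>
<LI><A href="/a/1.html"><IMG src="//img.example.com/1.gif" style="width:100px"></A></LI>
<li><a href="/a/2.html"><img src="//img.example.com/2.gif" style="width:100px"></a></li>
</UL>
"""

# re.S lets "." match newlines; re.I makes the tag names case-insensitive.
PATTERN = re.compile(r'<li>.*?<img.*?src="(.*?)" style.*?</li>', re.S | re.I)


def find_image_srcs(html):
    """Return the src values of images inside list items, whatever the tag case."""
    return PATTERN.findall(html)
```

The same two flags can be passed to re.findall directly, for example re.findall(ex, page_html, re.S | re.I).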