Notes on Scraping Images with Python

${老夫的少女心}

Article link: https://blog.csdn.net/qq_32828053/article/details/118669931

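I. Workflow
1. Get familiar with the page you want to scrape
2. Find the image URLs in the page source and try opening them
3. Write the Python script
4. Run the script to download the images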

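II. Getting familiar with the page
Before scraping, you need to understand the page that holds the images and use that to locate them, i.e. find each image's download URL and try it by hand. If that works, the same steps can be turned into a Python script and run in bulk.
Take Moetu Club (萌图社), https://moetu.club/612.html, as an example.
This is what the page looks like when opened:
[Screenshot: the Moetu Club page]
The images on it are the ones we want to download.
Image URLs are usually tucked away in the page source, so go straight to viewing the source.
[Screenshot: viewing the page source]
At first it is a wall of code that is hard to read. Don't panic; here is how to find the images step by step.
The distinctive text closest to the images on the page is 【月餅文蝶】. Copy it and search for it in the page source.

Scroll down a little from the match and you will see several blue links. They look familiar: they are the image links.

Open one of them to check.

It is indeed the image we want, so this stage is done: once we have the image URLs, we can download them.

On some sites the page source contains only part of each image URL, and you have to assemble the full URL yourself.
Example: 'https://tva1.sinaimg.cn/large/' + variable + '.jpg'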
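
The concatenation above can be sketched as a small helper. This is only an illustration: `build_image_url` and `IMAGE_BASE` are hypothetical names, not part of the original script, and the sample ID is copied from the output shown later in this article.

```python
# Hypothetical helper: rebuild a full image URL from the partial ID found in the page source.
IMAGE_BASE = 'https://tva1.sinaimg.cn/large/'   # fixed prefix from the example above


def build_image_url(image_id, ext='.jpg'):
    """Join the fixed prefix, the scraped ID and the file extension."""
    return IMAGE_BASE + image_id + ext


# An ID as it might appear on its own in a page source
ids = ['006RKGBpgy1gopkymkutkj30lo0uttgh']
print([build_image_url(i) for i in ids])
```

If a site uses a different prefix or extension, only `IMAGE_BASE` and `ext` would need to change.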

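III. Writing the Python script

1. Planning the functions
Downloading the images breaks down into these steps:
(1) Open the page at a given URL, read its source, and extract the image URLs with a regular expression
(2) Download the extracted URLs
(3) Pass the page URL to function (1), then pass its return value to function (2) to download
2. The function that gets the image URLs: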

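import urllib.request   # opening URLs
import re               # regular expressions for extracting data

# Collect the image URLs found on a page
def img_png(url):
    page = urllib.request.urlopen(url)   # send a request to the remote URL and get the response
    html_a = page.read()                 # raw bytes of the page source
    html_b = html_a.decode('utf-8')      # decode the bytes into text
    # A regex can cut out the text between a left and a right boundary; (.*?) is the captured part.
    # Here  img src="  and  " /><br  are the boundaries, and what sits between them is the image URL.
    png = re.findall(r'img src="(.*?)" /><br', html_b)
    print('这是我们获取到的图片链接地址')   # prints "These are the image URLs we found"
    print(png)
    return png

url = 'https://moetu.club/612.html'
a = img_png(url)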

Here is the output of running it (the first line is the message the script prints):

这是我们获取到的图片链接地址
['https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymkutkj30lo0uttgh.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymixp2j30lo0feq5z.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymn15vj30lo0v30zw.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymli98j30lo0v30yv.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymmidtj30lo0v3tdy.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymo3qij30lo0v379y.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymrt8uj30lo0v3jyx.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyomlvij30lo0v3tfe.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymxdlkj30lo0v3jyi.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymzxnfj30lo0v3n4i.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymza5cj30lo0v310d.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyn2tqdj30lo0v3agn.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyn0ulfj30lo0v3wje.jpg']

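At this point, opening any single link shows the image directly.
That completes the function that gets the image links from a page URL; just call it whenever you need it.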
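
To see how the two boundaries in the pattern behave without touching the network, here is a minimal offline sketch. The HTML fragment and the `example.com` URLs are made up for illustration, and `extract_img_urls` is a hypothetical name; only the regular expression itself comes from `img_png`.

```python
import re

# Made-up HTML fragment that mimics the structure the pattern expects
sample_html = (
    '<p><img src="https://example.com/a.jpg" /><br />\n'
    '<img src="https://example.com/b.jpg" /><br />\n'
    '<img class="logo" src="https://example.com/logo.png"></p>'
)


def extract_img_urls(html):
    """Return whatever sits between  img src="  and  " /><br  in the text."""
    return re.findall(r'img src="(.*?)" /><br', html)


found = extract_img_urls(sample_html)
print(found)   # ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

The logo is skipped because its tag does not match both boundaries. The same strictness means the pattern has to be adjusted whenever a site's markup differs from the page used in this article.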

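3. The function that downloads the images

import os               # folder handling
import random           # random numbers for file names
import urllib.request   # downloading

# Download every image in a list of URLs
def img_downlad(list):
    # folder where the files are saved
    path = 'D:\\test1'
    # create the folder if it does not exist yet
    if not os.path.isdir(path):
        os.makedirs(path)
    path = path + '\\'   # final folder prefix for the saved images
    x = 0                # number used as the file name
    # loop over the image URLs and download them one by one
    for list_d in list:
        print(list_d)
        try:   # catch errors so that one failed download does not stop the run
            print('----------图片开始下载啦----------')   # prints "image download starting"
            urllib.request.urlretrieve(list_d, '{0}{1}.jpg'.format(path, x))   # save the image to the target path
            x = x + random.randint(5, 999999)   # add a random step so the next file name differs
        except Exception:
            print('下载失败了')   # prints "download failed"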

You can hand-write a list of image links and call the download function on it to check that the download works.
That completes the download function.
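
The random numbers in `img_downlad` keep file names apart, but they say nothing about which image is which. One alternative, sketched here under the assumption that each image URL ends in a usable file name (as the sinaimg links in the output above do), is to reuse the last path segment of the URL. `filename_from_url` is a hypothetical helper, not part of the original script.

```python
import os
from urllib.parse import urlparse


def filename_from_url(img_url, default='image.jpg'):
    """Use the last path segment of the URL as the file name."""
    path = urlparse(img_url).path     # drops the scheme, host and any query string
    name = os.path.basename(path)     # text after the final slash
    return name or default            # fall back when the URL ends with a slash


print(filename_from_url(
    'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymkutkj30lo0uttgh.jpg'))
```

Inside the download loop, such a name could be joined to the target folder with `os.path.join(path, filename_from_url(list_d))`.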

4. Calling the two functions to download the images

url = 'https://moetu.club/612.html'   # the page to download from
a = img_png(url)      # call img_png to get the image URLs on the page
img_downlad(a)        # call img_downlad to save the images locally

Here is the output:

这是我们获取到的图片链接地址
['https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymkutkj30lo0uttgh.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymixp2j30lo0feq5z.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymn15vj30lo0v30zw.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymli98j30lo0v30yv.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymmidtj30lo0v3tdy.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymo3qij30lo0v379y.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymrt8uj30lo0v3jyx.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyomlvij30lo0v3tfe.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymxdlkj30lo0v3jyi.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymzxnfj30lo0v3n4i.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymza5cj30lo0v310d.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyn2tqdj30lo0v3agn.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyn0ulfj30lo0v3wje.jpg']
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymkutkj30lo0uttgh.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymixp2j30lo0feq5z.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymn15vj30lo0v30zw.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymli98j30lo0v30yv.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymmidtj30lo0v3tdy.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymo3qij30lo0v379y.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymrt8uj30lo0v3jyx.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyomlvij30lo0v3tfe.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymxdlkj30lo0v3jyi.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymzxnfj30lo0v3n4i.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymza5cj30lo0v310d.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyn2tqdj30lo0v3agn.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gopkyn0ulfj30lo0v3wje.jpg
----------图片开始下载啦----------

The final result:
[Screenshot: the downloaded images]
5. The complete code

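import random           # random numbers for file names
import urllib.request   # opening URLs and downloading
import re               # regular expressions for extracting data
import os               # folder handling

# Collect the image URLs found on a page
def img_png(url):
    page = urllib.request.urlopen(url)   # send a request to the remote URL and get the response
    html_a = page.read()                 # raw bytes of the page source
    html_b = html_a.decode('utf-8')      # decode the bytes into text
    # A regex can cut out the text between a left and a right boundary; (.*?) is the captured part.
    # Here  img src="  and  " /><br  are the boundaries, and what sits between them is the image URL.
    png = re.findall(r'img src="(.*?)" /><br', html_b)
    print('这是我们获取到的图片链接地址')   # prints "These are the image URLs we found"
    print(png)
    return png


# Download every image in a list of URLs
def img_downlad(list):
    # folder where the files are saved
    path = 'D:\\test1'
    # create the folder if it does not exist yet
    if not os.path.isdir(path):
        os.makedirs(path)
    path = path + '\\'   # final folder prefix for the saved images
    x = 0                # number used as the file name
    # loop over the image URLs and download them one by one
    for list_d in list:
        print(list_d)
        try:   # catch errors so that one failed download does not stop the run
            print('----------图片开始下载啦----------')   # prints "image download starting"
            urllib.request.urlretrieve(list_d, '{0}{1}.jpg'.format(path, x))   # save the image to the target path
            x = x + random.randint(5, 999999)   # add a random step so the next file name differs
        except Exception:
            print('下载失败了')   # prints "download failed"

# the page whose images we want to download
url = 'https://moetu.club/543.html'
# call img_png to get the image links on the page
a = img_png(url)
# call img_downlad to save the files locally
img_downlad(a)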

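IV. Extensions
1. Getting the URL of every image page from a listing page
This is very similar to getting image links from a page: fetch the listing page's source, extract the article URLs it contains, and collect them in a list.
Then loop over that list with for, and for each URL extract the image links and download them.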

url = 'https://moetu.club/category/illustration/page/2'
# open the listing page and read its source
page = urllib.request.urlopen(url).read().decode('utf-8')
# extract the article URLs from the listing
url_list = re.findall(r'href="(.*?)" rel="nofollow"', page)
# loop over the URLs: get the image links with img_png, then download them with img_downlad
for url_list_one in url_list:
    try:
        a = img_png(url_list_one)
        img_downlad(a)
    except Exception:
        print('下载出错了')   # prints "download error"

Output:

这是我们获取到的图片链接地址
['https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1ngus2j30lo0ueq9d.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nfis1j30lo0uetev.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nmxucj30lo0uen4f.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nhhmpj30lo0ue0z8.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1n16lxj30lo0ukq86.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1mkg8vj30lo0uetd9.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1mgln6j30lo0esq6m.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1ncvp3j30lo0urq92.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nqaatj30lo0uen6h.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nclcnj30lo0ujq90.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nmr7pj30lo0uegsi.jpg']
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1ngus2j30lo0ueq9d.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nfis1j30lo0uetev.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nmxucj30lo0uen4f.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nhhmpj30lo0ue0z8.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1n16lxj30lo0ukq86.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1mkg8vj30lo0uetd9.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1mgln6j30lo0esq6m.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1ncvp3j30lo0urq92.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nqaatj30lo0uen6h.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nclcnj30lo0ujq90.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmqt1nmr7pj30lo0uegsi.jpg
----------图片开始下载啦----------
这是我们获取到的图片链接地址
['https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ndoo2j30lo0utwlt.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nd32kj30lo0v6die.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ncq1vj30lo0v6wlg.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nfr5gj30lo0v6ag7.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nf12yj30lo0v6450.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nec2oj30lo0v6qap.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nlpg1j30lo0v6n3p.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nn4dlj30lo0v6dkq.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nu8ujj30lo0v60xc.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nqdnaj30lo0vednt.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5np4tbj30lo0v6n57.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nopxtj30lo0f4tcl.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nx8eaj30lo0fiq80.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nuxabj30lo0v6794.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nurkaj30lo0v6jvg.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ny748j30lo0vpdly.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nzqtvj30lo0v6dlo.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5o0toij30lo0v6tck.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5o18xqj30lo0v642r.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5o4ruij30lo0v60w3.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5o6giej30lo0v6q8d.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5o5visj30lo0v6dki.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5o8cm0j30lo0v6gsz.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5obk0bj30lo0v6wls.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5oa8zfj30lo0v6jx7.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ock1lj30lo0v6tes.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5odwn4j30lo0v643j.jpg', 'https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ofv37j30lo0vlwja.jpg']
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ndoo2j30lo0utwlt.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nd32kj30lo0v6die.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5ncq1vj30lo0v6wlg.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nfr5gj30lo0v6ag7.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nf12yj30lo0v6450.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nec2oj30lo0v6qap.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nlpg1j30lo0v6n3p.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nn4dlj30lo0v6dkq.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nu8ujj30lo0v60xc.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5nqdnaj30lo0vednt.jpg
----------图片开始下载啦----------
https://tva1.sinaimg.cn/large/006RKGBpgy1gmof5np4tbj30lo0v6n57.jpg
----------图片开始下载啦----------

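The complete code:

import random           # random numbers for file names
import urllib.request   # opening URLs and downloading
import re               # regular expressions for extracting data
import os               # folder handling

# Collect the image URLs found on a page
def img_png(url):
    page = urllib.request.urlopen(url)   # send a request to the remote URL and get the response
    html_a = page.read()                 # raw bytes of the page source
    html_b = html_a.decode('utf-8')      # decode the bytes into text
    # A regex can cut out the text between a left and a right boundary; (.*?) is the captured part.
    # Here  img src="  and  " /><br  are the boundaries, and what sits between them is the image URL.
    png = re.findall(r'img src="(.*?)" /><br', html_b)
    print('这是我们获取到的图片链接地址')   # prints "These are the image URLs we found"
    print(png)
    return png


# Download every image in a list of URLs
def img_downlad(list):
    # folder where the files are saved
    path = 'D:\\test1'
    # create the folder if it does not exist yet
    if not os.path.isdir(path):
        os.makedirs(path)
    path = path + '\\'   # final folder prefix for the saved images
    # Numbering restarts at 0 on every call, so each new page overwrites 0.jpg
    # (and any later name that happens to repeat).
    x = 0
    # loop over the image URLs and download them one by one
    for list_d in list:
        print(list_d)
        try:   # catch errors so that one failed download does not stop the run
            print('----------图片开始下载啦----------')   # prints "image download starting"
            urllib.request.urlretrieve(list_d, '{0}{1}.jpg'.format(path, x))   # save the image to the target path
            x = x + random.randint(5, 999999)   # add a random step so the next file name differs
        except Exception:
            print('下载失败了')   # prints "download failed"

url = 'https://moetu.club/category/illustration/page/2'
# open the listing page and read its source
page = urllib.request.urlopen(url).read().decode('utf-8')
# extract the article URLs from the listing
url_list = re.findall(r'href="(.*?)" rel="nofollow"', page)
# loop over the URLs: get the image links with img_png, then download them with img_downlad
for url_list_one in url_list:
    try:
        a = img_png(url_list_one)
        img_downlad(a)
    except Exception:
        print('下载出错了')   # prints "download error"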
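
Two things can go wrong with the extracted listing links: the same article may be linked more than once, and links may be relative rather than absolute. Whether either happens on this particular site is an assumption, not something the output above shows. The sketch below handles both with `urljoin` and a seen-set; `article_urls` and the sample HTML are hypothetical, and the pattern uses `[^"]*` instead of `.*?` so that a match cannot run across several tags.

```python
import re
from urllib.parse import urljoin


def article_urls(listing_url, listing_html):
    """Extract article links, make them absolute, and drop repeats (order kept)."""
    hrefs = re.findall(r'href="([^"]*)" rel="nofollow"', listing_html)
    seen = set()
    result = []
    for href in hrefs:
        full = urljoin(listing_url, href)   # absolute links pass through unchanged
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Made-up listing fragment: one relative link, one absolute link, one repeat
sample = (
    '<a href="/612.html" rel="nofollow">A</a>'
    '<a href="https://moetu.club/543.html" rel="nofollow">B</a>'
    '<a href="/612.html" rel="nofollow">A again</a>'
)
print(article_urls('https://moetu.club/category/illustration/page/2', sample))
# ['https://moetu.club/612.html', 'https://moetu.club/543.html']
```

The returned list could replace `url_list` in the loop, so that each article page is fetched only once.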

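2. Naming the saved images after the page
Before the final call to img_png, extract the page title.
Pass that title into img_downlad and use it in the image file name. The complete code: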

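import random           # random numbers for file names
import urllib.request   # opening URLs and downloading
import re               # regular expressions for extracting data
import os               # folder handling

# Collect the image URLs found on a page
def img_png(url):
    page = urllib.request.urlopen(url)   # send a request to the remote URL and get the response
    html_a = page.read()                 # raw bytes of the page source
    html_b = html_a.decode('utf-8')      # decode the bytes into text
    # A regex can cut out the text between a left and a right boundary; (.*?) is the captured part.
    # Here  img src="  and  " /><br  are the boundaries, and what sits between them is the image URL.
    png = re.findall(r'img src="(.*?)" /><br', html_b)
    print('这是我们获取到的图片链接地址')   # prints "These are the image URLs we found"
    print(png)
    return png

# Download every image in a list of URLs, using the page title in the file names
def img_downlad(list, name):
    # folder where the files are saved
    path = 'D:\\test1'
    # create the folder if it does not exist yet
    if not os.path.isdir(path):
        os.makedirs(path)
    path = path + '\\'   # final folder prefix for the saved images
    x = 0                # number used in the file name
    # loop over the image URLs and download them one by one
    for list_d in list:
        print(list_d)
        try:   # catch errors so that one failed download does not stop the run
            print('----------图片开始下载啦----------')   # prints "image download starting"
            jpg_name = '{0}{1}{2}.jpg'.format(path, x, name)   # file name: number followed by the title
            urllib.request.urlretrieve(list_d, jpg_name)       # save the image to the target path
            x = x + random.randint(5, 999999)   # add a random step so the next file name differs
        except Exception:
            print('下载失败了')   # prints "download failed"

url = 'https://moetu.club/category/illustration/page/2'
# open the listing page and read its source
page = urllib.request.urlopen(url).read().decode('utf-8')
# extract the article URLs from the listing
url_list = re.findall(r'href="(.*?)" rel="nofollow"', page)
# loop over the URLs: get the image links with img_png, then download them with img_downlad
for url_list_one in url_list:
    try:
        try:   # guard against failing to get a title
            # read the page title so it can be passed to the download function
            m = urllib.request.urlopen(url_list_one).read().decode('utf-8')
            # findall returns a list; take the first match (IndexError if there is none)
            name = re.findall(r'<title>(.*?) &#', m)[0]
        except Exception:
            name = '随便吧' + str(random.randint(1, 100))   # fallback name ("whatever") plus a random number
        # call the two functions
        a = img_png(url_list_one)
        img_downlad(a, name)
    except Exception:
        print('下载出错了')   # prints "download error"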
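
A page title can contain characters that Windows does not allow in file names (such as `:` or `?`) as well as HTML entities. A minimal clean-up step, assuming the title is used as part of a file name as above, might look like this. `safe_file_stem` is a hypothetical helper, not part of the original script.

```python
import html
import re


def safe_file_stem(title, fallback='untitled'):
    """Turn a page title into text that is safe to use inside a file name."""
    text = html.unescape(title)                  # '&amp;' becomes '&', and so on
    text = re.sub(r'[\\/:*?"<>|]', '_', text)    # characters Windows forbids in file names
    text = text.strip(' .')                      # trailing dots and spaces also cause trouble
    return text or fallback


print(safe_file_stem('月餅文蝶: a/b?'))   # 月餅文蝶_ a_b_
```

The result could be passed to `img_downlad` in place of the raw `name`.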

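3. More than one way to download
Besides the urlretrieve approach shown above, you can also fetch the image bytes with requests.get and write them to a file with open: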

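    for list_d in list:
        print(list_d)
        x = random.randint(5, 999999)   # random number used in the file name
        try:   # catch errors so that one failed download does not stop the run
            print('----------图片开始下载啦----------')   # prints "image download starting"
            content = requests.get(list_d).content         # raw bytes of the image
            jpg_name = path + str(name) + str(x) + '.jpg'  # file name: title followed by the number
            with open(jpg_name, "wb") as code:
                code.write(content)
            # urllib.request.urlretrieve(list_d, jpg_name)   # the earlier approach
        except Exception:
            print('下载失败了')   # prints "download failed"

The complete code: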

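import random           # random numbers for file names
import urllib.request   # opening URLs
import re               # regular expressions for extracting data
import os               # folder handling
import requests         # third-party HTTP library used for the download


# Collect the image URLs found on a page
def img_png(url):
    page = urllib.request.urlopen(url)   # send a request to the remote URL and get the response
    html_a = page.read()                 # raw bytes of the page source
    html_b = html_a.decode('utf-8')      # decode the bytes into text
    # A regex can cut out the text between a left and a right boundary; (.*?) is the captured part.
    # Here  img src="  and  " /><br  are the boundaries, and what sits between them is the image URL.
    png = re.findall(r'img src="(.*?)" /><br', html_b)
    print('这是我们获取到的图片链接地址')   # prints "These are the image URLs we found"
    print(png)
    return png

# Download every image in a list of URLs, using the page title in the file names
def img_downlad(list, name):
    # folder where the files are saved
    path = 'D:\\test1'
    # create the folder if it does not exist yet
    if not os.path.isdir(path):
        os.makedirs(path)
    path = path + '\\'   # final folder prefix for the saved images
    # loop over the image URLs and download them one by one
    for list_d in list:
        print(list_d)
        x = random.randint(5, 999999)   # random number used in the file name
        try:   # catch errors so that one failed download does not stop the run
            print('----------图片开始下载啦----------')   # prints "image download starting"
            content = requests.get(list_d).content         # raw bytes of the image
            jpg_name = path + str(name) + str(x) + '.jpg'  # file name: title followed by the number
            with open(jpg_name, "wb") as code:
                code.write(content)
            # urllib.request.urlretrieve(list_d, jpg_name)   # the earlier approach
        except Exception:
            print('下载失败了')   # prints "download failed"

url = 'https://moetu.club/category/illustration/page/2'
# open the listing page and read its source
page = urllib.request.urlopen(url).read().decode('utf-8')
# extract the article URLs from the listing
url_list = re.findall(r'href="(.*?)" rel="nofollow"', page)
# loop over the URLs: get the image links with img_png, then download them with img_downlad
for url_list_one in url_list:
    try:
        try:   # guard against failing to get a title
            # read the page title so it can be passed to the download function
            m = urllib.request.urlopen(url_list_one).read().decode('utf-8')
            # findall returns a list; take the first match (IndexError if there is none)
            name = re.findall(r'<title>(.*?) &#', m)[0]
        except Exception:
            name = '随便吧' + str(random.randint(1, 100))   # fallback name ("whatever") plus a random number
        # call the two functions
        a = img_png(url_list_one)
        img_downlad(a, name)
    except Exception:
        print('下载出错了')   # prints "download error"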
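
A third variant stays within the standard library: open the URL with `urllib.request.urlopen` and copy the response into a file opened with `open`. The sketch below also sets a timeout and a `User-Agent` header. Whether this site needs either is an assumption, and `fetch_to_file` is a hypothetical name.

```python
import os
import shutil
import urllib.request


def fetch_to_file(img_url, dest_path, timeout=10):
    """Download img_url to dest_path and return the number of bytes written."""
    # Some servers reject requests without a browser-like User-Agent (assumed here)
    req = urllib.request.Request(img_url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=timeout) as resp, \
            open(dest_path, 'wb') as out:
        shutil.copyfileobj(resp, out)   # stream the response to disk in chunks
    return os.path.getsize(dest_path)


# Example call (not executed here because it needs network access):
# fetch_to_file('https://tva1.sinaimg.cn/large/006RKGBpgy1gopkymkutkj30lo0uttgh.jpg',
#               'D:\\test1\\example.jpg')
```

Because the copy is streamed, a large image is never held in memory all at once, and the timeout keeps a stalled connection from hanging the loop.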
