Scraping a static wallpaper site for batch downloads

Latest recommended article published on 2022-11-14 18:52:58

酷不酷炫

Article tags: web scraping, python

Article link: https://blog.csdn.net/weixin_42404145/article/details/81292626

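I'm still new to this, so I keep writing small projects for practice, and I really am making progress: at the start I understood nothing, and now I can gradually work out the whole process on my own. This time I scraped a small image site and batch-saved some nice wallpapers.
Also, programming really is a matter of accumulation. This simple image scrape taught me about one more bug that could show up at any time.
It concerns global variables: a function that only reads a global does not need a global index declaration, but a function that modifies it must have one.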
The full script is as follows:

import requests
import lxml  # not called directly; BeautifulSoup uses it as the 'lxml' parser backend
from bs4 import BeautifulSoup
import os
import time

index = 1  # global counter used later to name the wallpapers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}

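def ceshi(url):
    # Fetch a page and return its HTML text, or None on any failure.
    try:
        response = requests.get(url, headers=headers, timeout=10)
        # I sometimes saw 304 here. I looked it up and it is cache-related (Not Modified),
        # so I count it as success too. I don't fully understand it yet and will come back to explain.
        if response.status_code in (200, 304):
            return response.text
        return None
    except requests.RequestException:
        return None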

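def parse_one_page(html):
    global index  # the earlier error came from not declaring index as global here
    soup = BeautifulSoup(html, 'lxml')
    imgs = soup.find_all(class_='lazy')  # by inspection, the image tags carry class "lazy"
    os.makedirs(r'F:\pictureall', exist_ok=True)  # make sure the output folder exists
    for img in imgs:
        img = img.attrs['src']  # by inspection, the src attribute value is the image link
        img = 'https:' + img  # src has no scheme, so prepend it to get the full link
        picture = requests.get(img, headers=headers)
        # Remove a stray f://picture.jpg if present (not one of the numbered files saved below).
        if os.path.exists(r'f://picture.jpg'):
            os.remove(r'f://picture.jpg')
        with open(r"F:\pictureall\{}.jpg".format(index), 'wb') as jpg:  # folder path plus file name
            jpg.write(picture.content)  # write the raw image bytes
            print("Successfully saved No. %d" % index)
            index += 1  # bump the global counter used to name the pictures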


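def main(i):
    # By inspection, every page URL follows the index_N.html pattern below, except the first page.
    if i == '1':
        url = 'https://www.woyaogexing.com/shouji/z/omliuxing/'
    else:
        url = 'https://www.woyaogexing.com/shouji/z/omliuxing/index_' + i + '.html'
    html = ceshi(url)
    # print(html)
    if html is None:  # the request failed; skip this page instead of crashing in the parser
        return
    parse_one_page(html)

if __name__ == '__main__':
    for i in range(1, 6):
        main(str(i))  # move on to the next page URL
        time.sleep(1)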
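
The two URL rules the script relies on (the first page has no index suffix, and the image src values lack a scheme) can also be written as small pure functions that are easy to check without touching the network. This is only a sketch: page_url and absolute_image_url are hypothetical helper names, img.example.com is a made-up host, and urljoin from the standard library is swapped in for the manual 'https:' + img concatenation.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.woyaogexing.com/shouji/z/omliuxing/'  # listing URL from the script


def page_url(page):
    # Hypothetical helper: page 1 is the bare listing URL,
    # later pages follow the index_N.html pattern.
    if page == 1:
        return BASE_URL
    return urljoin(BASE_URL, 'index_{}.html'.format(page))


def absolute_image_url(src, page=BASE_URL):
    # urljoin resolves scheme-relative ("//host/a.jpg") links against the
    # page URL and leaves already-absolute links unchanged.
    return urljoin(page, src)


print(page_url(1))
print(page_url(3))
print(absolute_image_url('//img.example.com/a.jpg'))  # made-up host for illustration
```

With helpers like these, main could take an integer page number directly, and a src that already includes its scheme would not end up with a doubled prefix.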