Python Web Scraping Study Notes, Example 1: Enterdesk (回车桌面)

Latest recommended article published on 2023-12-11 11:02:34

其曰

Tags: python, web scraping, programming language

Article link: https://blog.csdn.net/qq_57610048/article/details/122341973

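These are some notes from my Python learning. I am recording them here and sharing them in the hope that they help anyone who needs them. I wrote this example as hands-on practice after finishing a combined tutorial on bs4, re, and requests, imitating the tutorial's case study, so all three libraries appear here and the code may be more verbose than necessary.

If you spot a mistake, please point it out and leave your suggestions in the comments; I am still a beginner myself.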

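Step 1: Analyze the website

Target URL: https://mm.enterdesk.com/bizhi/64445.html

This is a small site with quite a few ads, but we only need the 8 target images on the page.

You can pick images from any other category as well; for my own practice I went with the beauty-photo category.

Next, press F12 to open the browser's developer tools (the shortcut varies between browsers; I am using Chrome).

With the developer tools open, use the element picker in the Elements panel (step 1 in the original screenshot) and click an image. The panel jumps to that image's markup (step 3 in the original screenshot), which gives us the image's location: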

<img src="https://up.enterdesk.com/edpic_360_360/c7/18/5d/c7185d0159b964858577609aa8e5f43a.jpg" title="海边性感美女写真" style="width: auto; height: 84px;">

To be safe, and to understand the markup better, let's use the element picker on the second image and see where it lives:

<img src="https://up.enterdesk.com/edpic_360_360/81/52/cd/8152cdf2149a4e6755d773bb734b2ad5.jpg" title="海边性感美女写真" style="width: auto; height: 84px;">

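The two tags have the same structure. Clicking the link inside the surrounding a tag jumps straight to the image, so that works as expected too.

Now look at the page markup as a whole. There are 8 div tags, one per image. The first two are special cases (the first shows the large preview and the second is the "next" slide; the logic is a bit odd, but all that matters is that they are special), so their markup differs slightly from the rest. That does not affect us, because all of them share the class name swiper-slide, which means we can collect them later by searching for that class.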
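The class-name lookup described above can be tried on a small hand-written fragment before touching the real page. The markup below is a simplified stand-in, not the site's actual HTML, and the example.com URLs are placeholders; only the swiper-slide class name comes from the analysis above.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: three slides, each wrapping one image,
# plus an unrelated ad image that should NOT be picked up.
sample_html = """
<div class="swiper-wrapper">
  <div class="swiper-slide"><a href="#"><img src="https://example.com/1.jpg" title="a"></a></div>
  <div class="swiper-slide"><a href="#"><img src="https://example.com/2.jpg" title="b"></a></div>
  <div class="swiper-slide"><a href="#"><img src="https://example.com/3.jpg" title="c"></a></div>
</div>
<div class="ad-banner"><img src="https://example.com/ad.jpg"></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) filters tags by CSS class.
slides = soup.find_all("div", class_="swiper-slide")
print(len(slides))

# Read the src attribute of the img inside each slide.
slide_srcs = [div.find("img")["src"] for div in slides]
print(slide_srcs)
```

Because the search is restricted to the swiper-slide class, the ad image in the last div is left out of the result.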

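That completes the page analysis. Next we use Python to fetch the images.

Step 2: Write the code to fetch the images

1. First, write a function that fetches the full HTML source of the page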


import requests

def get_data():

    # Target page
    baseurl = "https://mm.enterdesk.com/bizhi/64445.html"

    # Put the browser's User-Agent into headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
    }
    # Fetch the page with requests.get
    res = requests.get(baseurl, headers=headers)
    # Decode the response bytes as UTF-8 and store the resulting text in html
    html = res.content.decode('utf-8')

    return html

(If you do not know how to find your browser's User-Agent, look it up, or simply reuse the one in my code.)
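To see what the headers dictionary changes: without it, requests identifies itself with its own default User-Agent rather than a browser's. The sketch below builds a request without sending it, so it needs no network access. The shortened User-Agent string is a placeholder for the full one you copy from your own browser (in Chrome it is listed among the request headers of any request in the Network tab of the developer tools).

```python
import requests

# What requests sends when no User-Agent is supplied, e.g. "python-requests/2.x.y".
print(requests.utils.default_user_agent())

# Placeholder value; paste the full string from your own browser here.
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Build the request without sending it, then inspect the outgoing headers.
prepared = requests.Request(
    "GET", "https://mm.enterdesk.com/bizhi/64445.html", headers=headers
).prepare()

# Header names are case-insensitive, so "User-Agent" finds "user-agent".
print(prepared.headers["User-Agent"])
```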

2. Next, parse the HTML we fetched


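from bs4 import BeautifulSoup
import re

def Data_analyse(html):
    # Parse the page source with BeautifulSoup
    soup =  BeautifulSoup(html, "html.parser")
    # List that collects the URL of each image
    pic_link = []
    # Find every div whose class is swiper-slide and collect its image URL
    for val in soup.find_all("div",class_="swiper-slide"):
        # Convert the tag to a string so the regular expression can search it
        val = str(val)
        # Regular expression that captures the image URL inside src="..."
        findImgSrc = re.compile(r'<img.*src="(.*?)"')
        # re.findall returns a list, so take its first item (the image URL)
        link = re.findall(findImgSrc, val)[0]
        pic_link.append(link)

    # Return only after the loop has visited every div
    # (returning inside the loop would stop after the first image)
    return pic_link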

This step relies on regular expressions, which you can look up separately (I do not really know them myself).
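Since the pattern is the least obvious part of the function, here is a small sketch of how it behaves, using only the standard re module. The sample tags are shortened stand-ins with placeholder URLs, and the stricter pattern at the end is one possible alternative, not something taken from the tutorial.

```python
import re

# The pattern used in Data_analyse: "<img", any characters, then src="...".
# The parentheses capture the lazily matched text between the quotes.
find_img_src = re.compile(r'<img.*src="(.*?)"')

one_tag = '<img src="https://example.com/a.jpg" title="demo" style="width: auto;">'
one_result = find_img_src.findall(one_tag)
print(one_result)
# ['https://example.com/a.jpg']

# Caveat: the first ".*" is greedy, so with two images in one string it
# runs ahead to the last src and the first URL is lost.
two_tags = '<img src="https://example.com/a.jpg"><img src="https://example.com/b.jpg">'
greedy_result = find_img_src.findall(two_tags)
print(greedy_result)
# ['https://example.com/b.jpg']

# A stricter alternative (an assumption, not the original code):
# [^>]* cannot run past the end of the current tag.
strict_img_src = re.compile(r'<img[^>]*?\ssrc="([^"]*)"')
strict_result = strict_img_src.findall(two_tags)
print(strict_result)
# ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

Inside the function the caveat does not bite as long as each swiper-slide div contains a single img tag, which is what the page analysis suggests.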

3. Fetch the images and save them locally

def get_pic(pic_link):
    # Counter used to number the downloaded files
    count = 0
    # Loop over the image URLs in the list
    for li in pic_link:
        count += 1
        response = requests.get(li)
        # Write the image to disk in binary mode
        with open("C:\\Users\\Administrator\\Desktop\\pic_{}.jpg".format(count), 'wb') as f:
            f.write(response.content)

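This uses the .content attribute of a requests response, which holds the response body as raw bytes; see the requests documentation for details.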
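As a rough illustration of what .content gives you, the sketch below uses hand-made bytes objects as stand-ins for response.content, so it runs without any network access. The file name pic_1.jpg mirrors the naming in get_pic, and the temporary directory replaces the hard-coded Desktop path purely for the demo.

```python
import tempfile
from pathlib import Path

# Stand-in for response.content of an HTML page: raw bytes, not text.
html_bytes = "<title>回车桌面</title>".encode("utf-8")
# Decoding turns the bytes into a str, as get_data does for the page source.
html_text = html_bytes.decode("utf-8")
# Each Chinese character takes 3 bytes in UTF-8, so the lengths differ.
print(len(html_bytes), len(html_text))
# 27 19

# Stand-in for response.content of an image; real JPEG data starts with FF D8.
image_bytes = b"\xff\xd8\xff\xe0 fake image data"

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "pic_{}.jpg".format(1)
    # 'wb' writes the bytes unchanged; a text-mode file would reject bytes.
    with open(target, "wb") as f:
        f.write(image_bytes)
    saved_bytes = target.read_bytes()

# The file on disk holds exactly the bytes that were "downloaded".
print(saved_bytes == image_bytes)
```

This is why get_data decodes the page before parsing it, while get_pic writes response.content to the file as is.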

Step 3: Complete code

from bs4 import BeautifulSoup
import requests
import re


def get_data(baseurl):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
    }
    res = requests.get(baseurl, headers=headers)
    html = res.content.decode('utf-8')
    return html

def Data_analyse(html):
    soup =  BeautifulSoup(html, "html.parser")
    pic_link = []
    for val in soup.find_all("div",class_="swiper-slide"):
        val = str(val)
        findImgSrc = re.compile(r'<img.*src="(.*?)"')
        link = re.findall(findImgSrc, val)[0]
        pic_link.append(link)
    return pic_link

def get_pic(pic_link):
    count = 0
    for li in pic_link:
        count += 1
        response = requests.get(li)
        with open("C:\\Users\\Administrator\\Desktop\\pic_{}.jpg".format(count), 'wb') as f:
            f.write(response.content)


if __name__ == '__main__':
    baseurl = "https://mm.enterdesk.com/bizhi/64445.html"
    html = get_data(baseurl)
    pic_link = Data_analyse(html)
    get_pic(pic_link)