Using a crawler to download the images in a blog post

Latest recommended article published 2021-10-10 17:30:11

Zeker62

Category: Network Security Study. Tags: python, regular expressions, http

Article link: https://blog.csdn.net/ZripenYe/article/details/119532344

Here is the code first. You can copy it and run it to see the result.

'''
Fetch the content of the blog page.

User-Agent string sent with each request:
Mozilla/5.0 (Windows NT 8.1; Win32; x86) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62
'''


import urllib.request
import re

class GetHtml(object):
    def __init__(self,URL,HEAD):
        self.url=URL
        self.head=HEAD

    def get_index(self):
        self.request=urllib.request.Request(self.url)
        self.request.add_header("user-agent",self.head)
        self.response=urllib.request.urlopen(self.request)
        return self.response.read()

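    def get_list(self):  # collect the image URLs found in the page
        # Raw bytes pattern, because read() returns bytes; the dots are
        # escaped so that they match a literal "." only.
        self.imglist = re.findall(rb"https://img-blog\.csdnimg\.cn/\w{32}\.png", self.get_index())
        # Decode every match from bytes to str.
        self.strimglist = [str(i, encoding="utf8") for i in self.imglist]
        return self.strimglist

    def get_image(self):  # download every image and save it as 1.png, 2.png, ...
        page_url = self.url  # remember the article URL
        for num, img_url in enumerate(self.get_list(), start=1):
            # get_index() fetches whatever self.url points at, so reuse it
            # to download the image bytes.
            self.url = img_url
            with open(str(num) + ".png", "wb") as f:
                f.write(self.get_index())
        self.url = page_url  # restore the article URL so the object can be reused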

            


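# Adjacent string literals are joined by Python, so the User-Agent value
# contains no stray indentation from the line break.
html = GetHtml(
    "https://blog.csdn.net/ZripenYe/article/details/119455438",
    "Mozilla/5.0 (Windows NT 8.1; Win32; x86) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62",
)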

#print(html.get_index())
#print(html.get_list())
html.get_image()

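Two modules are imported: urllib.request and re

The urllib.request module opens URLs over HTTP, which lets the script request a page much as a browser would.
The re module provides regular expression matching.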

The GetHtml class receives the information it needs through its constructor.

Two values are passed in:

the URL (here, one of my blog posts)
a spoofed User-Agent header value

html = GetHtml("https://blog.csdn.net/ZripenYe/article/details/119455438", "Mozilla/5.0 (Windows NT 8.1; Win32; x86) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62")
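
As a minimal sketch of what happens to these two values, the snippet below builds a request object and attaches the User-Agent string without sending anything over the network. The variable names `url`, `user_agent` and `request` are illustrative and are not taken from the class above.

```python
import urllib.request

# Illustrative values: the article URL and the browser-like User-Agent string.
url = "https://blog.csdn.net/ZripenYe/article/details/119455438"
user_agent = (
    "Mozilla/5.0 (Windows NT 8.1; Win32; x86) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 "
    "Safari/537.36 Edg/92.0.902.62"
)

# Building a Request does not touch the network; it only records the URL
# and the headers that urlopen() would send later.
request = urllib.request.Request(url)
request.add_header("User-Agent", user_agent)

# Request stores header names in capitalized form ("User-agent"),
# and get_header() expects that form.
print(request.get_header("User-agent") == user_agent)
print(request.host)
```

Sending a browser-like User-Agent matters because some sites answer differently, or refuse, when they see the default Python client string.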

Fetching the page

urllib.request.urlopen() sends the request and returns a response object, which is stored in self.response. Calling self.response.read() then returns the response body as bytes.
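
The following sketch shows that read() returns bytes. It is an illustration under one assumption: a `data:` URL stands in for the blog page so that the example runs offline; the real script requests the article URL instead.

```python
import urllib.request

# Assumption for illustration: a data: URL replaces the blog page so that
# no network access is needed. urlopen() handles it like any other URL.
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello%20img") as response:
    body = response.read()

print(type(body).__name__)   # the body is a bytes object, not a str
text = body.decode("utf-8")  # decode explicitly when text is needed
print(text)
```

Because the body is bytes, anything that searches it directly, such as a regular expression, has to work on bytes as well.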

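Matching the image links with a regular expression

Inspecting the page shows that the image URLs all have the form https://img-blog.csdnimg.cn/**.png, where ** stands for a 32-character name, so the regular expression is:
self.imglist = re.findall(rb"https://img-blog\.csdnimg\.cn/\w{32}\.png", self.get_index())
The b prefix makes the pattern a bytes pattern, which is required because get_index() returns bytes. The r prefix keeps \w from being treated as a string escape, and the escaped dots match a literal "." only.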
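
A short sketch of why the pattern has to be bytes: the `page` fragment and its 32-character image name are made up for illustration. The pattern is written as a raw bytes literal with escaped dots, so each dot matches only a literal period.

```python
import re

# Hypothetical page fragment; the 32-character image name is made up.
page = b'<img src="https://img-blog.csdnimg.cn/0123456789abcdef0123456789abcdef.png">'

# A bytes pattern works on bytes data.
matches = re.findall(rb"https://img-blog\.csdnimg\.cn/\w{32}\.png", page)

# A str pattern applied to bytes data is rejected with a TypeError.
try:
    re.findall(r"https://img-blog\.csdnimg\.cn/\w{32}\.png", page)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(matches)
print(mixed_ok)
```

The alternative would be to decode the page to str first and use a str pattern; matching on bytes avoids having to know the page encoding up front.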

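The matches are then collected into a list and decoded as UTF-8, which turns each bytes object into a str and so drops the b prefix.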
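
The decoding step can be sketched as follows. The two URLs in `imglist` are hypothetical stand-ins for what re.findall would return, and `filenames` only illustrates the 1.png, 2.png, ... numbering used when the images are saved.

```python
# Hypothetical matches, in the form re.findall returns for a bytes pattern.
imglist = [
    b"https://img-blog.csdnimg.cn/0123456789abcdef0123456789abcdef.png",
    b"https://img-blog.csdnimg.cn/fedcba9876543210fedcba9876543210.png",
]

# str(item, encoding="utf8") and item.decode("utf-8") give the same result.
strimglist = [item.decode("utf-8") for item in imglist]

# Number the output files from 1, as the download loop does.
filenames = [str(num) + ".png" for num, _ in enumerate(strimglist, start=1)]

print(strimglist[0])
print(filenames)
```

The decoded strings can be passed straight to urllib.request.Request, which expects the URL as a str.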