Scraping Web Page Images with Regular Expressions

Latest recommended article published on 2022-11-25 00:11:20

唐鸿23

Tags: python, web crawler, programming language

Article link: https://blog.csdn.net/qq_44775960/article/details/125569727

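Basic Principles
1. Every web page is HTML. HTML is first of all one large string, so a response can be parsed with ordinary string processing, regular expressions included. HTML is also a markup language that shares its origins with XML, so its content can be processed through the DOM as well.
2. Every crawler is built on hyperlinks, which let it jump from page to page and from site to site. Give me one website and I can crawl the whole world.
3. To implement a whole-site crawler, first collect every URL within the site and remove the duplicates, then crawl the content and save it locally or in a database for whatever comes next.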
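The three principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the script that follows: the helper names `links_by_regex`, `links_by_parser`, and `crawl_site`, the `LinkParser` class, and the in-memory `FAKE_SITE` are all hypothetical, and `fetch` is passed in as a parameter so the sketch runs without network access.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


def links_by_regex(html):
    """Principle 1, string view: pull href values out with a regex."""
    return re.findall(r'href="(.*?)"', html)


class LinkParser(HTMLParser):
    """Principle 1, markup view: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def links_by_parser(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl_site(start_url, fetch, max_pages=100):
    """Principles 2 and 3: follow hyperlinks, stay on one site, skip duplicates.

    `fetch` is any callable that takes a URL and returns its HTML as a string.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages[url] = fetch(url)
        for href in links_by_parser(pages[url]):
            # Resolve relative links and drop "#fragment" parts before deduplicating.
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A tiny in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    'https://example.com/': '<a href="/a">A</a> <a href="/b">B</a> <a href="/a#top">A again</a>',
    'https://example.com/a': '<a href="/">home</a> <a href="https://other.example/x">external</a>',
    'https://example.com/b': '<a href="/a">A</a>',
}

pages = crawl_site('https://example.com/', FAKE_SITE.__getitem__)
print(sorted(pages))
```

Keeping a `seen` set next to the queue is what prevents the crawler from looping forever between pages that link to each other. For a real site, `fetch` could be something like `lambda u: requests.get(u, headers=headers, timeout=10).text`.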

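# Request the page ************
import os
import re

import requests

headers = {
    # Tells the server who is making the request.
    'User-Agent': 'wahaha'
    # Full request headers, as printed by response.request.headers:
    # {'User-Agent': 'wahaha', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
}
# Fetch the home page.
response = requests.get('https://www.vmgirls.com/', headers=headers, timeout=10)
html = response.text
# print(html)
# Collect the href of every link that is followed by a title attribute.
urld = re.findall(r'href="(.*?)" title=".*?"', html)
# print(type(urld))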
for url in urld:
    if 'http' in url:
        urlq = url.split('"')[0]
        try:  # Keep one failing page from aborting the whole run.
            # Fetch the sub-page.
            response = requests.get(url=urlq, headers=headers, timeout=10)
            # print(response.request.headers)
            html = response.text
            # print(html)
            # Parse the page ***************
            # Regex match: capture the image URL carried in the "src" query parameter.
            urls = re.findall(r'https://.*?\?src=(.*?)&', html)
            # urls = re.findall('<img src="(.*?)" data-pic=".*?" alt=".*?" title=".*?">', html)
            # print(urls)
            # Use the text of the last matching link as the folder name.
            dirname = re.findall('target=".*?">(.*?)</a></span><i', html)[-1]
            # print(dirname)
            os.makedirs(dirname, exist_ok=True)
            # Save the images ***********
            for img_url in urls:
                # time.sleep(1)
                # Image file name: the last path segment of the URL.
                filename = img_url.split('/')[-1]
                # print(filename)
                response = requests.get(url=img_url, headers=headers, timeout=10)
                # 'w' = write, 'b' = binary mode
                with open(os.path.join(dirname, filename), 'wb') as f:
                    f.write(response.content)
        except Exception as e:
            print(e)
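
The parsing and file-naming steps of the script can also be pulled out into small functions that are checked without touching the network. The following is a minimal sketch, assuming image links appear in the page as `...?src=<image-url>&...`, which is what the script's regex implies. The function names `extract_image_urls` and `safe_filename`, the `default` parameter, and the sample HTML are hypothetical.

```python
import os
import re
from urllib.parse import unquote, urlparse


def extract_image_urls(html):
    """Return the de-duplicated values of `?src=...&` parameters, in page order."""
    found = re.findall(r'\?src=(.*?)&', html)
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(unquote(u) for u in found))


def safe_filename(url, default='image.jpg'):
    """Take the last path component of the URL, ignoring any query string."""
    name = os.path.basename(urlparse(url).path)
    return name or default


# Hypothetical markup; the real page layout may differ.
sample = (
    '<img src="https://t.example/thumb?src=https://img.example/2022/a.jpg&w=300">'
    '<img src="https://t.example/thumb?src=https://img.example/2022/b.png&w=300">'
    '<img src="https://t.example/thumb?src=https://img.example/2022/a.jpg&w=600">'
)

for image_url in extract_image_urls(sample):
    print(image_url, '->', safe_filename(image_url))
```

De-duplicating before downloading avoids fetching the same image once per thumbnail size, and taking the file name from the parsed URL path keeps query strings out of the names written to disk.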