Python: Read Local HTML Files and Extract Image URLs
Description: The local folder D:\mg contains several levels of subfolders. We need to read the HTML files in that folder and its subfolders, extract the src attribute from each img tag, and save the image files.
First, import the packages:
the os library to walk the local file system
the BeautifulSoup library to parse the local HTML files
the lxml library to convert the local file contents to a string
the requests library to request the img links
import os
from bs4 import BeautifulSoup
from lxml import etree
import requests
Step 1: List all the files. The walk function quickly finds every file, and an if statement filters out non-HTML files.
for root, dirs, files in os.walk("D:\\mg"):
    for file in files:
        htmlfile = os.path.join(root, file)
        if htmlfile.endswith(".html"):
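As a side note, the same traversal can be written with pathlib, whose rglob handles both the recursion and the extension filter. This is only a sketch against a throwaway temporary directory, not the D:\mg tree from this article:

```python
import tempfile
from pathlib import Path

# Build a small throwaway tree just to demonstrate the pattern.
tmp = Path(tempfile.mkdtemp())
(tmp / "sub").mkdir()
(tmp / "index.html").write_text("<html></html>")
(tmp / "sub" / "page.html").write_text("<html></html>")
(tmp / "sub" / "style.css").write_text("")

# rglob("*.html") recurses like os.walk but filters by suffix itself,
# so no manual extension check is needed.
html_files = sorted(p.name for p in tmp.rglob("*.html"))
print(html_files)  # ['index.html', 'page.html']
```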
Step 2: Read the HTML file and convert its contents to a string
parser = etree.HTMLParser()
html = etree.parse(htmlfile, parser=parser)
html_txt = etree.tostring(html, encoding="utf-8")
info = html_txt.decode("utf-8")
Step 3: Parse the HTML file
Inspecting the files shows that the img tags whose src attribute we want sit inside div tags with class col-md-4 col-sm-4 col-xs-6 pro-list or with class scale. We use an if/else: when div_tags comes back empty for the first class, we fall back to the other one.
newline is an empty list we must declare before processing; it stores the URL links, which will be de-duplicated later.
soup = BeautifulSoup(info, "html.parser")
div_tags = soup.find_all("div", class_="col-md-4 col-sm-4 col-xs-6 pro-list")
if div_tags:
    for div in div_tags:
        img_src = div.find_all("img", src=True)
        for img in img_src:
            imginfo = img["src"]
            newline.append(imginfo)
else:
    div_tags = soup.find_all("div", class_="scale")
    for div in div_tags:
        img_src = div.find_all("img", src=True)
        for img in img_src:
            imginfo = img["src"]
            newline.append(imginfo)
After the steps above, all of the URL links have been collected.
Next we need to write the save function.
raise_for_status() checks whether the HTTP request succeeded. If the returned status code is in the 2xx range (request succeeded), it does nothing; if there is an error (for example a 4xx or 5xx status code), it raises an exception (requests.exceptions.HTTPError) to signal that the request failed.
for chunk in response.iter_content(chunk_size=8192): iterates over the body of the HTTP response in chunks of the given size (chunk_size=8192). The response.iter_content method lets us read the response data piece by piece, which is useful when handling large files.
def save_image(url, save_path, filename):
    response = requests.get(url, stream=True)  # stream so iter_content reads in chunks
    response.raise_for_status()
    os.makedirs(save_path, exist_ok=True)  # create the target folder if it is missing
    with open(os.path.join(save_path, filename), 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
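The exception raised by raise_for_status() can be demonstrated without any network traffic by building a Response object by hand; this hypothetical snippet shows the exception type a download loop should be prepared to catch so that one bad link does not abort the whole run:

```python
import requests

# Build a bare Response and fake a 404 status code to show what
# raise_for_status() throws on a 4xx/5xx response (no network needed).
resp = requests.models.Response()
resp.status_code = 404

try:
    resp.raise_for_status()
except requests.exceptions.HTTPError as err:
    print("caught HTTPError:", err)
```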
Finally:
newline now holds all of the URLs. We use Python's set() to convert the list into a set, whose main purpose here is de-duplication.
We then loop over the set and call save_image, our save function. i.split("/")[-1] splits off the last part of the URL link to use as the file name. Once the loop finishes, all of the image files sit under D:\img, each named by i.split("/")[-1].
Examples of i.split("/")[-1]:
1-210P41623290-L.png
1-220ZQ012150-L.png
1-21102Q54A41H.png
1-210Q91043140-L.jpg
..........
..........
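Both the set() de-duplication and the split("/")[-1] filename extraction can be seen in a tiny example (the URLs here are hypothetical, not from the scraped site):

```python
# Three URLs, one of them a duplicate.
urls = [
    "http://example.com/upload/1-210P41623290-L.png",
    "http://example.com/upload/1-220ZQ012150-L.png",
    "http://example.com/upload/1-210P41623290-L.png",  # duplicate
]

unique = set(urls)              # set() drops the duplicate entry
print(len(urls), len(unique))   # 3 2

for u in sorted(unique):
    print(u.split("/")[-1])     # last path segment becomes the file name
```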
The loop code:
setmap = set(newline)
coun = 0
for i in setmap:
    if "http" in i:
        coun += 1
        filename = i.split("/")[-1]
        print(f"Processing {i}")
        save_image(i, "D:\\img", filename)
The complete code:
import os
from bs4 import BeautifulSoup
from lxml import etree
import requests

newline = []  # collects every extracted src value

def save_image(url, save_path, filename):
    response = requests.get(url, stream=True)  # stream so iter_content reads in chunks
    response.raise_for_status()
    os.makedirs(save_path, exist_ok=True)  # create the target folder if it is missing
    with open(os.path.join(save_path, filename), 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

# Walk the folder tree and parse every HTML file.
for root, dirs, files in os.walk("D:\\mg"):
    for file in files:
        htmlfile = os.path.join(root, file)
        if htmlfile.endswith(".html"):
            parser = etree.HTMLParser()
            html = etree.parse(htmlfile, parser=parser)
            html_txt = etree.tostring(html, encoding="utf-8")
            info = html_txt.decode("utf-8")
            soup = BeautifulSoup(info, "html.parser")
            div_tags = soup.find_all("div", class_="col-md-4 col-sm-4 col-xs-6 pro-list")
            if div_tags:
                for div in div_tags:
                    img_src = div.find_all("img", src=True)
                    for img in img_src:
                        imginfo = img["src"]
                        newline.append(imginfo)
            else:
                div_tags = soup.find_all("div", class_="scale")
                for div in div_tags:
                    img_src = div.find_all("img", src=True)
                    for img in img_src:
                        imginfo = img["src"]
                        newline.append(imginfo)

# De-duplicate the URLs and download each image.
setmap = set(newline)
coun = 0
for i in setmap:
    if "http" in i:
        coun += 1
        filename = i.split("/")[-1]
        print(f"Processing {i}")
        save_image(i, "D:\\img", filename)
print(coun)