Python: Read Local HTML Files and Extract Image URLs
Description: The local folder D:\mg contains several levels of subfolders. We need to read the HTML files in that folder and its subfolders, extract the src attribute from each img tag, and save the image files.
First, import the packages:
the os library to walk the local file system
the BeautifulSoup library to parse the local HTML files
the lxml library to convert the local file contents to a string
the requests library to request the img links
import os
from bs4 import BeautifulSoup
from lxml import etree
import requests
Step 1: List all the files. The walk function quickly finds every file, and an if statement filters out non-HTML files.
for root, dirs, files in os.walk("D:\\mg"):
    for file in files:
        htmlfile = os.path.join(root, file)
        if htmlfile.endswith(".html"):
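As a side note, the same traversal can be written with pathlib, whose rglob handles both the recursion and the extension filter. This is only a sketch against a throwaway temporary directory, not the D:\mg tree from this article:

```python
import tempfile
from pathlib import Path

# Build a small throwaway tree just to demonstrate the pattern.
tmp = Path(tempfile.mkdtemp())
(tmp / "sub").mkdir()
(tmp / "index.html").write_text("<html></html>")
(tmp / "sub" / "page.html").write_text("<html></html>")
(tmp / "sub" / "style.css").write_text("")

# rglob("*.html") recurses like os.walk but filters by suffix itself,
# so no manual extension check is needed.
html_files = sorted(p.name for p in tmp.rglob("*.html"))
print(html_files)  # ['index.html', 'page.html']
```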
Step 2: Read the HTML file and convert its contents to a string
parser = etree.HTMLParser()
html = etree.parse(htmlfile, parser=parser)
html_txt = etree.tostring(html, encoding="utf-8")
info = html_txt.decode("utf-8")
Step 3: Parse the HTML file
Inspecting the files shows that the img tags whose src attribute we want sit inside div tags with class col-md-4 col-sm-4 col-xs-6 pro-list or with class scale. We use an if/else: when div_tags comes back empty for the first class, we fall back to the other one.
newline is an empty list we must declare before processing; it stores the URL links, which will be de-duplicated later.
soup = BeautifulSoup(info, "html.parser")
div_tags = soup.find_all("div", class_="col-md-4 col-sm-4 col-xs-6 pro-list")
if div_tags:
    for div in div_tags:
        img_src = div.find_all("img", src=True)
        for img in img_src:
            imginfo = img["src"]
            newline.append(imginfo)
else:
    div_tags = soup.find_all("div", class_="scale")
    for div in div_tags:
        img_src = div.find_all("img", src=True)
        for img in img_src:
            imginfo = img["src"]
            newline.append(imginfo)
After the steps above, all of the URL links have been collected.
Next we need to write the save function.
raise_for_status() checks whether the HTTP request succeeded. If the returned status code is in the 2xx range (request succeeded), it does nothing; if there is an error (for example a 4xx or 5xx status code), it raises an exception (requests.exceptions.HTTPError) to signal that the request failed.
for chunk in response.iter_content(chunk_size=8192): iterates over the body of the HTTP response in chunks of the given size (chunk_size=8192). The response.iter_content method lets us read the response data piece by piece, which is useful when handling large files.
def save_image(url, save_path, filename):
    response = requests.get(url, stream=True)  # stream so iter_content reads in chunks
    response.raise_for_status()
    os.makedirs(save_path, exist_ok=True)  # create the target folder if it is missing
    with open(os.path.join(save_path, filename), 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
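The exception raised by raise_for_status() can be demonstrated without any network traffic by building a Response object by hand; this hypothetical snippet shows the exception type a download loop should be prepared to catch so that one bad link does not abort the whole run:

```python
import requests

# Build a bare Response and fake a 404 status code to show what
# raise_for_status() throws on a 4xx/5xx response (no network needed).
resp = requests.models.Response()
resp.status_code = 404

try:
    resp.raise_for_status()
except requests.exceptions.HTTPError as err:
    print("caught HTTPError:", err)
```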
Finally:
newline now holds all of the URLs. We use Python's set() to convert the list into a set, whose main purpose here is de-duplication.
We then loop over the set and call save_image, our save function. i.split("/")[-1] splits off the last part of the URL link to use as the file name. Once the loop finishes, all of the image files sit under D:\img, each named by i.split("/")[-1].
Examples of i.split("/")[-1]:
1-210P41623290-L.png
1-220ZQ012150-L.png
1-21102Q54A41H.png
1-210Q91043140-L.jpg
..........
..........
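Both the set() de-duplication and the split("/")[-1] filename extraction can be seen in a tiny example (the URLs here are hypothetical, not from the scraped site):

```python
# Three URLs, one of them a duplicate.
urls = [
    "http://example.com/upload/1-210P41623290-L.png",
    "http://example.com/upload/1-220ZQ012150-L.png",
    "http://example.com/upload/1-210P41623290-L.png",  # duplicate
]

unique = set(urls)              # set() drops the duplicate entry
print(len(urls), len(unique))   # 3 2

for u in sorted(unique):
    print(u.split("/")[-1])     # last path segment becomes the file name
```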
The loop code:
setmap = set(newline)
coun = 0
for i in setmap:
    if "http" in i:
        coun += 1
        filename = i.split("/")[-1]
        print(f"Processing {i}")
        save_image(i, "D:\\img", filename)
The complete code:
import os
from bs4 import BeautifulSoup
from lxml import etree
import requests

newline = []  # collects every extracted src value

def save_image(url, save_path, filename):
    response = requests.get(url, stream=True)  # stream so iter_content reads in chunks
    response.raise_for_status()
    os.makedirs(save_path, exist_ok=True)  # create the target folder if it is missing
    with open(os.path.join(save_path, filename), 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

# Walk the folder tree and parse every HTML file.
for root, dirs, files in os.walk("D:\\mg"):
    for file in files:
        htmlfile = os.path.join(root, file)
        if htmlfile.endswith(".html"):
            parser = etree.HTMLParser()
            html = etree.parse(htmlfile, parser=parser)
            html_txt = etree.tostring(html, encoding="utf-8")
            info = html_txt.decode("utf-8")
            soup = BeautifulSoup(info, "html.parser")
            div_tags = soup.find_all("div", class_="col-md-4 col-sm-4 col-xs-6 pro-list")
            if div_tags:
                for div in div_tags:
                    img_src = div.find_all("img", src=True)
                    for img in img_src:
                        imginfo = img["src"]
                        newline.append(imginfo)
            else:
                div_tags = soup.find_all("div", class_="scale")
                for div in div_tags:
                    img_src = div.find_all("img", src=True)
                    for img in img_src:
                        imginfo = img["src"]
                        newline.append(imginfo)

# De-duplicate the URLs and download each image.
setmap = set(newline)
coun = 0
for i in setmap:
    if "http" in i:
        coun += 1
        filename = i.split("/")[-1]
        print(f"Processing {i}")
        save_image(i, "D:\\img", filename)
print(coun)