Using selenium to save a web page as a screenshot, a full-page screenshot, an HTML file, and an MHTML file


Hydrion-Qlz


Category: python  Tags: selenium html chrome

Article link: https://blog.csdn.net/qq_46311811/article/details/128632346


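I have recently been doing web page analysis that requires crawling many pages, so I used selenium to automate it. For each page, the script saves a screenshot of the first screen, a full-page screenshot (including all scrollable areas), the HTML source, and an MHTML file. The pages to crawl are listed in pagelist.txt, one per line as category, name, URL: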

邮箱,qq邮箱,https://mail.qq.com/
邮箱,阿里邮箱,https://mail.aliyun.com/alimail/auth/login
邮箱,163邮箱,https://mail.163.com/
邮箱,新浪邮箱,https://mail.sina.com.cn/
搜索引擎,百度,https://www.baidu.com/
搜索引擎,搜狗,https://www.sogou.com/
搜索引擎,bing,https://www.bing.com/
商城,淘宝,https://world.taobao.com/
商城,小米商城,https://www.mi.com/shop
商城,京东,https://www.jd.com/
商城,唯品会,https://www.vip.com/
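
Since each line of the list is a plain category,name,url record, it can help to validate the rows before driving the browser. The following is a minimal sketch, not part of the original script: the function name parse_page_list and the rule of skipping blank or incomplete rows are assumptions made for illustration.

```python
import csv
import io


def parse_page_list(text):
    """Parse 'category,name,url' rows (hypothetical helper).

    Blank lines, rows without exactly three non-empty fields,
    and rows whose URL is not http(s) are skipped.
    """
    pages = []
    for row in csv.reader(io.StringIO(text)):
        # Strip surrounding whitespace from every field
        fields = [field.strip() for field in row]
        if len(fields) != 3 or not all(fields):
            continue
        kind, name, url = fields
        if not url.startswith(("http://", "https://")):
            continue
        pages.append((kind, name, url))
    return pages


# One valid row, one blank line, one incomplete row
sample = "搜索引擎,百度,https://www.baidu.com/\n\n商城,京东\n"
print(parse_page_list(sample))
```

Only the first row of the sample survives; the other two lines are dropped instead of raising an IndexError later in the crawl.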

The complete code is below; keep only the parts you need.

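import csv
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Build the webdriver (headless Chrome); adjust driver_path to your chromedriver location
driver_path = r"C:\Program Files\Google\Chrome\Application\chromedriver.exe"
options = Options()
options.add_argument('--headless')
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(driver_path), options=options)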


def save_page(kind, name, url):
    driver.get(url)
    save_path = f"../source/{kind}/{name}/{name}"
    # Create the output directory if it does not exist yet
    dir_name = os.path.dirname(save_path)
    os.makedirs(dir_name, exist_ok=True)

    # First-screen screenshot only (no scrolling), 1920x1080
    driver.set_window_size(1920, 1080)
    driver.get_screenshot_as_file(save_path + "_short.png")

    # Full-page screenshot (including scrolled areas): resize the window to the page's full size
    width = driver.execute_script("return document.documentElement.scrollWidth")
    height = driver.execute_script("return document.documentElement.scrollHeight")
    driver.set_window_size(width, height)
    driver.get_screenshot_as_file(save_path + "_full.png")

    # Save as HTML
    source_code = driver.page_source
    with open(save_path + ".html", mode='w', encoding='utf-8') as html_file:
        html_file.write(source_code)

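    # Save as MHTML: capture a snapshot through the Chrome DevTools Protocol
    res = driver.execute_cdp_cmd('Page.captureSnapshot', {})
    # Write the snapshot locally; newline='' keeps the MHTML line endings unchanged
    with open(save_path + ".mhtml", 'w', newline='', encoding='utf-8') as sf:
        sf.write(res['data'])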


if __name__ == '__main__':
    # Read the list of pages to save
    with open("pagelist.txt", mode='r', encoding='utf-8') as f:
        csv_reader = csv.reader(f)
        for line in csv_reader:
            print(line)
            save_page(line[0], line[1], line[2])

    driver.quit()
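
The script builds all four output names from one save_path prefix, so every page ends up in its own ../source/category/name/ folder. The sketch below only illustrates that layout without starting a browser. The helper name output_paths and its root parameter are hypothetical and not part of the original code, and PurePosixPath is used just to keep the separators the same on every platform.

```python
from pathlib import PurePosixPath


def output_paths(kind, name, root="../source"):
    """Return the four files the script writes for one page (hypothetical helper)."""
    # Mirrors save_path = f"../source/{kind}/{name}/{name}" from the script
    base = PurePosixPath(root) / kind / name
    return {
        "short": str(base / f"{name}_short.png"),  # first-screen screenshot
        "full": str(base / f"{name}_full.png"),    # full-page screenshot
        "html": str(base / f"{name}.html"),        # page source
        "mhtml": str(base / f"{name}.mhtml"),      # CDP snapshot
    }


for label, path in output_paths("search", "bing").items():
    print(label, path)
```

Listing the paths up front makes it easy to check whether a page was already saved before fetching it again.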