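I've been doing some web page analysis recently and needed to crawl a large number of pages, so I used selenium for the job. For each page, the script saves a screenshot of the first screen, a full-page screenshot (including all scrollable areas), the HTML source, and an MHTML snapshot. The page list, pagelist.txt, holds one category,name,URL entry per line: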
email,QQ Mail,https://mail.qq.com/
email,Alibaba Mail,https://mail.aliyun.com/alimail/auth/login
email,163 Mail,https://mail.163.com/
email,Sina Mail,https://mail.sina.com.cn/
search engine,Baidu,https://www.baidu.com/
search engine,Sogou,https://www.sogou.com/
search engine,Bing,https://www.bing.com/
shopping,Taobao,https://world.taobao.com/
shopping,Xiaomi Store,https://www.mi.com/shop
shopping,JD,https://www.jd.com/
shopping,Vipshop,https://www.vip.com/
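Since every line of the list has the same three-field shape, it can be worth checking the file before launching a browser. The following is only a minimal sketch of that idea: `PageEntry` and `parse_page_list` are hypothetical names that do not appear in the original script, and the http(s) check is an assumption about what counts as a valid entry.

```python
import csv
import io
from typing import Iterable, List, NamedTuple
from urllib.parse import urlparse


class PageEntry(NamedTuple):
    """One row of the page list (hypothetical helper type)."""
    kind: str
    name: str
    url: str


def parse_page_list(lines: Iterable[str]) -> List[PageEntry]:
    """Parse category,name,URL rows, skipping blank lines."""
    entries = []
    for row in csv.reader(lines):
        if not row:  # csv.reader yields [] for a blank line
            continue
        if len(row) != 3:
            raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
        kind, name, url = (field.strip() for field in row)
        # Assumption: only http(s) URLs are meaningful for the crawler
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {url!r}")
        entries.append(PageEntry(kind, name, url))
    return entries


# A small in-memory sample instead of the real pagelist.txt
sample = io.StringIO(
    "search engine,Baidu,https://www.baidu.com/\n"
    "\n"
    "shopping,JD,https://www.jd.com/\n"
)
entries = parse_page_list(sample)
print(entries[0].name, entries[1].url)
```

With the real file, the same function could be called on the object returned by `open("pagelist.txt", encoding="utf-8", newline="")`.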
The complete code is below; keep whichever parts you need.
import csv
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Build the webdriver (headless Chrome)
driver_path = r"C:\Program Files\Google\Chrome\Application\chromedriver.exe"
options = Options()
options.add_argument('--headless')
# Note: recent Selenium 4 releases no longer accept the driver path as a
# positional argument; there, use
# webdriver.Chrome(service=Service(driver_path), options=options)
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(driver_path, options=options)


def save_page(kind, name, url):
    driver.get(url)
    save_path = f"../source/{kind}/{name}/{name}"
    # Create the output directory if it does not exist yet
    os.makedirs(os.path.dirname(save_path), exist_ok=True)

    # First screen only (no scrolling), 1920x1080
    driver.set_window_size(1920, 1080)
    driver.get_screenshot_as_file(save_path + "_short.png")

    # Full-page screenshot: resize the window to the full scrollable area
    width = driver.execute_script("return document.documentElement.scrollWidth")
    height = driver.execute_script("return document.documentElement.scrollHeight")
    driver.set_window_size(width, height)
    driver.get_screenshot_as_file(save_path + "_full.png")

    # Save as HTML
    source_code = driver.page_source
    with open(save_path + ".html", mode='w', encoding='utf-8') as html_file:
        html_file.write(source_code)

    # Save as MHTML via the Chrome DevTools Protocol
    res = driver.execute_cdp_cmd('Page.captureSnapshot', {})
    # Write the snapshot to a local file
    with open(save_path + ".mhtml", 'w', encoding='utf-8', newline='') as sf:
        sf.write(res['data'])


if __name__ == '__main__':
    # Read the list of pages to save
    with open("pagelist.txt", mode='r', encoding='utf-8') as f:
        csv_reader = csv.reader(f)
        for line in csv_reader:
            print(line)
            save_page(line[0], line[1], line[2])
    driver.quit()
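One thing to keep in mind with the main loop above: a single page that times out or fails to load raises an exception and stops the whole run. Below is a rough sketch of a more tolerant loop, not part of the original script. `run_batch` is a hypothetical helper, and `save_fn` stands in for `save_page` so the example can run without a browser.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def run_batch(
    rows: Iterable[Sequence[str]],
    save_fn: Callable[[str, str, str], None],
) -> List[Tuple[str, str]]:
    """Call save_fn for every row and collect failures instead of aborting."""
    failures = []
    for row in rows:
        if len(row) < 3:
            # Malformed line: record it and move on
            failures.append((",".join(row), "expected category,name,URL"))
            continue
        kind, name, url = row[0], row[1], row[2]
        try:
            save_fn(kind, name, url)
        except Exception as exc:  # keep going; report everything at the end
            failures.append((url, f"{type(exc).__name__}: {exc}"))
    return failures


def fake_save(kind, name, url):
    """Stand-in for save_page that fails on one made-up URL."""
    if "bad" in url:
        raise RuntimeError("page load failed")


rows = [
    ["search engine", "Bing", "https://www.bing.com/"],
    ["test", "Broken", "https://bad.example/"],
    ["only", "two"],
]
failures = run_batch(rows, fake_save)
for target, reason in failures:
    print(target, "->", reason)
```

In the real script, `run_batch(csv.reader(f), save_page)` could replace the `for` loop, with `driver.quit()` still called afterwards.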