Server-Side Web Page Screenshots with Python + Selenium

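The server-side browser capture described here combines selenium and html2canvas to render images inside a Chrome web environment.

selenium is one of the most widely used open-source web UI automation test suites. This article uses selenium's Python SDK and drives selenium from short code snippets.

html2canvas is a JavaScript library that takes a "screenshot" of a web page, or part of one, in the browser.

The overall approach is to set up a chromedriver environment on the server and use selenium to open the page, run JavaScript, and so on. Injecting the html2canvas script into the page then makes it possible to render any DOM element of the current page to an image, so any region of a web page can be captured on the server. This article describes how to combine these tools on Linux to capture an arbitrary region as an image; the steps on other systems are similar.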

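Environment setup

Installing chromedriver

First install the browser environment. The required version (it has to match the installed Chrome version) can be downloaded from the official site
https://chromedriver.chromium.org. If the Linux host has no Chinese fonts, install them so that pages do not render as garbled text.

Installing selenium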

pip install selenium

html2canvas

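See
https://html2canvas.hertzen.com/
for how to use html2canvas.

Running the capture

Initialization

Load the browser environment and open the page. If chromedriver is not on the system PATH, pass its location to webdriver.Chrome() via the executable_path parameter (Selenium 4 and later take the path through a Service object instead).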

import time
from selenium import webdriver

# Chrome options: headless mode, no sandbox, and no reliance on /dev/shm
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Pass the options with options=; the older chrome_options= keyword is deprecated
driver = webdriver.Chrome(options=chrome_options)
# Load the page
driver.get("https://github.com/trending")
time.sleep(1)

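Injecting the script

First, bring in the html2canvas library by creating a script tag and appending it to the page's head, which makes html2canvas available in the page.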

html2canvas = """var s=window.document.createElement('script');
s.src='https://html2canvas.hertzen.com/dist/html2canvas.min.js';
window.document.head.appendChild(s);"""
driver.execute_script(html2canvas)
time.sleep(1)

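Next, select the DOM element you need and generate its image data. This uses html2canvas: once the conversion to a canvas succeeds, an element holding the image data is appended to body. Any element type will do, since its only job is to store the image data. html2canvas() returns a promise, so the image data cannot be handed straight back as the return value of Python's execute_script(). Sharing a DOM element is how the data is passed across.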

# The promise resolves after execute_script() has returned, so the script returns nothing;
# the data URL is stored in a DOM element and read back in the next step.
buildImage = """html2canvas(document.querySelector('#side_nav')).then(canvas => {
var data = canvas.toDataURL('image/jpeg', 0.98);
var input = document.createElement('input');
input.setAttribute('type', 'text');
input.setAttribute('id', 'html2canvas_data');
input.setAttribute('value', data);
document.body.appendChild(input);
});"""
driver.execute_script(buildImage)
time.sleep(1)

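Finally, fetch the image. Read the image data stored in the element with id html2canvas_data that the previous step created. The return value of execute_script() is the image as base64 data.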

imageData = """return document.getElementById('html2canvas_data').getAttribute('value');"""
data = driver.execute_script(imageData)
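
As an alternative to parking the data in a DOM element and sleeping, selenium's execute_async_script() passes the injected script a completion callback as its last argument, and the Python call blocks until that callback fires or the script timeout elapses. The sketch below only builds the JavaScript string. build_capture_script is a hypothetical helper name, not part of selenium or html2canvas, and it assumes html2canvas has already been loaded into the page.

```python
import json


def build_capture_script(selector, quality=0.98):
    """Return JS for driver.execute_async_script() that yields a JPEG data URL.

    Hypothetical helper: the script calls the callback with null when the
    element is missing or html2canvas rejects.
    """
    # json.dumps produces a quoted string literal usable inside the JS source.
    js_selector = json.dumps(selector)
    return (
        "var done = arguments[arguments.length - 1];\n"
        "var el = document.querySelector(" + js_selector + ");\n"
        "if (!el) { done(null); return; }\n"
        "html2canvas(el).then(function (canvas) {\n"
        "  done(canvas.toDataURL('image/jpeg', " + str(float(quality)) + "));\n"
        "}).catch(function () { done(null); });\n"
    )


script = build_capture_script('#side_nav')
# With a live driver (assumption: html2canvas is already injected):
#   driver.set_script_timeout(30)
#   data = driver.execute_async_script(script)
```

With this variant the extra input element and the second execute_script() call are not needed, at the cost of having to choose a script timeout.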

Quitting Chrome

driver.quit()

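Generating the image

The last script executed in the previous step returns the image as a base64 value. Base64-decoding it yields the image. Note that the returned string begins with a header describing the image: data:image/jpeg;base64,. Strip that prefix before decoding.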
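
A minimal sketch of that decoding step, using only the standard library. The function names data_url_to_bytes and save_data_url are made up for illustration, and the sample payload is a few placeholder bytes rather than a real image.

```python
import base64


def data_url_to_bytes(data_url):
    """Strip the 'data:<mime>;base64,' header and decode the payload."""
    header, sep, payload = data_url.partition(',')
    if not sep or not header.startswith('data:') or not header.endswith(';base64'):
        raise ValueError('not a base64 data URL')
    return base64.b64decode(payload)


def save_data_url(data_url, path):
    """Write the decoded image bytes to path."""
    with open(path, 'wb') as f:
        f.write(data_url_to_bytes(data_url))


# Self-check with a made-up payload (placeholder bytes, not a real JPEG).
sample = 'data:image/jpeg;base64,' + base64.b64encode(b'\xff\xd8demo').decode('ascii')
decoded = data_url_to_bytes(sample)
```

In the flow above, the value returned by the final execute_script() call would be passed to save_data_url together with an output path such as 'capture.jpg'.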

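Notes

  1. After each script is executed, add a delay so that it has time to finish. Alternatively, use selenium's explicit wait, WebDriverWait, to watch for the element to appear.
  2. If the site enforces security policies such as cross-origin or domain restrictions, you need to find a way around them before the html2canvas.js library can be loaded.
  3. In real deployments, host the script on the internal network to speed up loading.
  4. In real deployments, the chromedriver process may still be running after driver.quit() and then has to be killed manually.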
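
The fixed delays from note 1 can be replaced by polling. The sketch below is a dependency-free stand-in for the idea behind WebDriverWait: it repeats execute_script() until the script returns a value or a timeout expires. wait_for_script_value, READ_IMAGE_JS, and FakeDriver are hypothetical names; FakeDriver exists only so the helper can be tried without a browser.

```python
import time


def wait_for_script_value(driver, script, timeout=10.0, poll=0.2):
    """Poll driver.execute_script(script) until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        value = driver.execute_script(script)
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError('script returned no value within %.1fs' % timeout)
        time.sleep(poll)


# Returns null until the element holding the image data has been created.
READ_IMAGE_JS = (
    "var el = document.getElementById('html2canvas_data');"
    "return el ? el.getAttribute('value') : null;"
)


class FakeDriver:
    """Stand-in for a WebDriver, used only to demonstrate the helper offline."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def execute_script(self, script):
        self.calls += 1
        if self.calls >= self.ready_after:
            return 'data:image/jpeg;base64,AAAA'
        return None


fake = FakeDriver(ready_after=3)
result = wait_for_script_value(fake, READ_IMAGE_JS, timeout=2.0, poll=0.01)
```

With selenium installed, WebDriverWait(driver, 10).until(lambda d: d.execute_script(READ_IMAGE_JS)) expresses the same wait using the library's own helper.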

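Docker integration

Package all of the software above into a Docker image. A new environment then needs no reinstallation, and unexpected problems caused by environment differences are avoided.

If the business requires it, combine the image generation method above with your own logic inside Docker to provide a browser capture service tailored to your needs. For example, run an HTTP service that takes a page URL and a DOM element selector and returns the image data for that element.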
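
One possible shape for such an HTTP service, sketched with the standard library only. The /capture path, the url and selector query parameters, port 8000, and every function name here are assumptions made for illustration; capture() is a placeholder where the selenium + html2canvas steps from this article would go.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def parse_capture_request(path):
    """Extract (url, selector) from '/capture?url=...&selector=...'; None if invalid."""
    parsed = urlparse(path)
    if parsed.path != '/capture':
        return None
    query = parse_qs(parsed.query)
    url = query.get('url', [''])[0]
    selector = query.get('selector', ['body'])[0]
    # Only plain web URLs are accepted in this sketch.
    if not url.startswith(('http://', 'https://')):
        return None
    return url, selector


def capture(url, selector):
    """Placeholder: run the selenium + html2canvas steps and return JPEG bytes."""
    raise NotImplementedError


class CaptureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        request = parse_capture_request(self.path)
        if request is None:
            self.send_error(400, 'expected /capture?url=...&selector=...')
            return
        try:
            image = capture(*request)
        except Exception:
            self.send_error(500, 'capture failed')
            return
        self.send_response(200)
        self.send_header('Content-Type', 'image/jpeg')
        self.send_header('Content-Length', str(len(image)))
        self.end_headers()
        self.wfile.write(image)


def serve(port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer(('0.0.0.0', port), CaptureHandler).serve_forever()
```

Calling serve() inside the container would expose the endpoint. Because each request drives a browser, a real service would also want to limit concurrency and restrict which URLs may be captured.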

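The following builds on the existing chromedriver image
robcherry/docker-chromedriver to make a simple environment that can run the browser capture program. It adds a font directory and installs selenium on top of chromedriver.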

FROM robcherry/docker-chromedriver

COPY chinese /usr/share/fonts/chinese

RUN curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py && \
    python /tmp/get-pip.py && \
    pip install selenium && \
    mkfontscale && \
    mkfontdir && \
    fc-cache -fv

CMD ["/usr/local/bin/supervisord", "-c", "/etc/supervisord.conf"]

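Here chinese is a font directory; font files from the fonts directory of a Windows system can be used. Then build the image:

docker build -t demo/html2canvas .

Finally, mount the Python program above into the container, run it inside the container, and check the result.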
