Capturing Web Page Screenshots with PhantomJS and Robot

凉城的夜

Category: Java. Tags: java, phantom, screenshot

Article link: https://blog.csdn.net/sinat_32651363/article/details/84761223

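Crawler development, among other work, sometimes calls for capturing a web page as an image and storing it as a snapshot. I ran into exactly this requirement recently: the pages visited during a crawl had to be saved as snapshots. After going through some documentation, I am sharing two approaches here, one in Python and one in Java.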

Python version

Python needs selenium, which can be installed with pip (pip install selenium). Three browser drivers are covered below, in two groups:

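1. Chrome or Firefox. Both open a local browser, and each needs its driver executable (chromedriver or geckodriver) downloaded separately, otherwise an exception is thrown. Examples are easy to find online, so I won't repeat them here; placing the downloaded driver in the same directory as python.exe is enough to run. Both of these capture only the visible part of the browser window: if the page has a scrollbar, the content further down is not captured.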

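2. PhantomJS. This solves the scrolling problem: it captures the entire content of the current page automatically and does not open a browser window. Apart from the phantomjs executable itself, which must be on the PATH, nothing else needs installing. Note that webdriver.PhantomJS is only available in Selenium 3.x and earlier; it was removed in Selenium 4. For requests with cookies, or for POST versus GET, please consult other material.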

from selenium import webdriver

# PhantomJS: captures the whole page, including the part below the scrollbar
br = webdriver.PhantomJS()
br.maximize_window()
br.get("http://www.baidu.com")
br.save_screenshot("./baidu01.png")
br.quit()  # shut the driver down so no stray process is left behind

# Chrome: captures only the visible area
br = webdriver.Chrome()
br.maximize_window()
br.get("http://www.baidu.com")
br.save_screenshot("./baidu02.png")
br.quit()

# Firefox: captures only the visible area
br = webdriver.Firefox()
br.maximize_window()
br.get("http://www.baidu.com")
br.save_screenshot("./baidu03.png")
br.quit()

Java version

There are two main approaches in Java:

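The first is Robot. Robot can do much more than take screenshots: it can drive the keyboard and mouse, and capturing the screen is only a small part of what it offers. The example simply opens the browser, captures what is currently displayed, saves it as an image, and closes the browser. A page with a scrollbar therefore cannot be captured in full, because that would mean simulating the scrolling and capturing in a loop, which is cumbersome and not very general. Then again, Robot is mainly about control rather than screenshots, so it may come in handy when you need to simulate clicks at fixed positions.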

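import java.awt.AWTException;
import java.awt.Desktop;
import java.awt.Dimension;
import java.awt.Graphics;
import java.awt.Image;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.event.KeyEvent;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import javax.imageio.ImageIO;

public class CutPicture {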
    /**
     * Captures the screen and saves it as an image
     * @param url the page to open
     * @throws MalformedURLException
     * @throws IOException
     * @throws URISyntaxException
     * @throws AWTException
     */
    public static void saveHtmlImg(String url) throws IOException, URISyntaxException, AWTException {
        Desktop.getDesktop().browse(new URL(url).toURI());
        Robot robot = new Robot();
        robot.delay(2000);
        Dimension d = new Dimension(Toolkit.getDefaultToolkit().getScreenSize());
        int width = (int) d.getWidth();
        int height = (int) d.getHeight();
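        // Switch the browser to full screen with F11; a keystroke needs a press followed by a release
        robot.keyPress(KeyEvent.VK_F11);
        robot.keyRelease(KeyEvent.VK_F11);
        robot.delay(2000);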
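        /**
         * Put every browser operation here,
         * e.g. scroll the mouse wheel down three notches: robot.mouseWheel(3);
         * Add a robot.delay() after each operation.
         * Screen capture is only a small part of Robot; it can take real control of the machine, so use it with care
         */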
        Image image = robot.createScreenCapture(new Rectangle(0, 0, width, height));
        BufferedImage bi = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics g = bi.createGraphics();
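        /**
         * x: x origin of the drawn image
         * y: y origin of the drawn image; -100 (with height + 100 to compensate) pushes
         *    the browser address bar out of the picture, stretching the image slightly
         */
        g.drawImage(image, 0, -100, width, height+100, null);
        g.dispose();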
        // Save the image
        File fileOut = new File("./imgHtml");
        fileOut.mkdirs();
        ImageIO.write(bi, "jpg", new File("./imgHtml/001.jpg"));
        // Close the browser
        closeBrowse();
    }

    /**
     * Closes the browser (Windows only: force-kills common browser processes by name)
     */
    public static void closeBrowse(){
        try {
            Runtime.getRuntime().exec("taskkill /F /IM chrome.exe");
            Runtime.getRuntime().exec("taskkill /F /IM iexplore.exe");
            Runtime.getRuntime().exec("taskkill /F /IM 360se.exe");
            Runtime.getRuntime().exec("taskkill /F /IM firefox.exe");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }


    public static void main(String[] args) throws AWTException, IOException, URISyntaxException {
        String url = "http://www.baidu.com";
        String url2 = "http://top.baidu.com/?fr=mhd_card";
        saveHtmlImg(url);

    }

}
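The Robot example above captures only the visible area. As mentioned, covering a scrollable page would mean scrolling and capturing in a loop, then joining the pieces. Below is a minimal sketch of just the joining step, not part of the original code: ScreenshotStitcher and stitchVertically are hypothetical names, and the sketch assumes every capture has the same width and that consecutive captures do not overlap (in practice the last scroll step usually overlaps and would need trimming).

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.List;

public class ScreenshotStitcher {

    /**
     * Stacks the given captures top to bottom into a single image.
     * Assumption: all tiles share the width of the first tile and do not overlap.
     */
    public static BufferedImage stitchVertically(List<BufferedImage> tiles) {
        if (tiles.isEmpty()) {
            throw new IllegalArgumentException("at least one tile is required");
        }
        int width = tiles.get(0).getWidth();
        int totalHeight = 0;
        for (BufferedImage tile : tiles) {
            totalHeight += tile.getHeight();
        }
        BufferedImage result = new BufferedImage(width, totalHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = result.createGraphics();
        int y = 0;
        for (BufferedImage tile : tiles) {
            // Draw each tile directly below the previous one
            g.drawImage(tile, 0, y, null);
            y += tile.getHeight();
        }
        g.dispose();
        return result;
    }
}
```

Each tile would come from robot.createScreenCapture(...) taken after a robot.mouseWheel(...) call and a robot.delay(...); how far one wheel notch scrolls depends on the browser and system settings, which is a large part of why this route is awkward.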

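The other is PhantomJS, essentially the Java counterpart of the Python version, except that phantomjs has to be downloaded beforehand (links are easy to find online). Screenshots taken this way cover the whole scrollable page. What gets captured, how much, and how is controlled by a JavaScript file passed to phantomjs, and plenty of demos online show how to write one following the official examples. Java merely invokes phantomjs; how the capture works is entirely up to that script.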

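import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Random;

public class PhantomTools {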

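    /**
     * Percent-encodes every character above 255 (Chinese, for example) as its UTF-8 bytes.
     * Despite the method name, the result is %XX escapes, not Unicode code points.
     * @param url the URL to encode
     * @return the URL with those characters replaced by %XX sequences
     */
    public static String chineseToUnicode(String url){
        StringBuilder sb = new StringBuilder();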
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c >= 0 && c <= 255) {
                sb.append(c);
            } else {
                byte[] b;
                try {
                    b = String.valueOf(c).getBytes("utf-8");
                } catch (Exception ex) {
                    System.out.println(ex);
                    b = new byte[0];
                }
                for (int j = 0; j < b.length; j++) {
                    int k = b[j];
                    if (k < 0)
                        k += 256;
                    sb.append("%" + Integer.toHexString(k).toUpperCase());
                }
            }
        }
        return sb.toString();
    }

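    /**
     * Takes the screenshot. Adjust the two paths below to match your own project layout.
     * phantomjsPath: path to phantomjs.exe
     * cutPictureJsPath: path to the screenshot JavaScript file
     * @param url the page to capture
     * @throws IOException if phantomjs cannot be started or its output cannot be read
     */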
    public static void  cutPicturl(String url) throws IOException {
        String basePath = System.getProperty("user.dir")+"\\";
        // Path to phantomjs.exe
        String phantomjsPath = basePath+"pictureHtmlSpider\\src\\PhantomCutPicture\\staticPhantomJs\\phantomjs.exe";
        // Path to the screenshot script, e.g. E:\JAVA_PROJECT\imgHtmlSpider\pictureHtmlSpider\src\PhantomCutPicture\staticPhantomJs\rasterize.js
        String cutPictureJsPath = basePath+"pictureHtmlSpider\\src\\PhantomCutPicture\\staticPhantomJs\\rasterize.js";
        // Target URL
        String urlPath = chineseToUnicode(url);
        // Output directory
        String filePath = "./imgHtml";
        /** If you rename the imgHtml folder, also update the if check in the cutPictureMain() method */
        File fileOut = new File(filePath);
        fileOut.mkdirs();
        // Output file; only ten names are possible (1000.png to 1009.png), so earlier files may be overwritten
        Random random = new Random();
        String savePath = filePath+"/"+(random.nextInt(10)+1000)+".png";
        System.out.println(savePath);
        /**
         * Invoke phantomjs
         * phantomjsPath: path to phantomjs
         * cutPictureJsPath: path to rasterize.js; full-page capture is configured there, and custom captures mean editing that file
         *   Requests that need cookies are also set up in that script, which is useful for crawlers
         * urlPath: the page to request
         * savePath: where the image is saved
         */
        // Pass the arguments as a list so that paths containing spaces are not split apart
        Process process = new ProcessBuilder(phantomjsPath, cutPictureJsPath, urlPath, savePath)
                .redirectErrorStream(true)
                .start();

        InputStream is = process.getInputStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        int tmp=0;
        while (br.readLine() != null)
        {
            tmp+=1;
            System.out.println("==> Start cutPicture :"+tmp+" page");
        }
    }
    public static void main(String[] args) throws IOException {
        String url = "http://www.baidu.com";
        String url2 = "http://top.baidu.com/?fr=mhd_card";
        // cutPicturl already encodes the URL internally, so pass it through unchanged
        cutPicturl(url2);
    }
}
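The chineseToUnicode helper above walks the string one char at a time, so a character outside the Basic Multilingual Plane (an emoji, say) is seen as two separate surrogates and does not encode correctly. Below is a minimal sketch of a code-point-based variant, offered as an illustration rather than a drop-in replacement: NonAsciiEncoder and encodeNonAscii are hypothetical names, and unlike the original it also escapes the characters between 128 and 255.

```java
import java.nio.charset.StandardCharsets;

public class NonAsciiEncoder {

    /**
     * Leaves ASCII characters untouched and replaces every other code point
     * with the %XX escapes of its UTF-8 bytes.
     */
    public static String encodeNonAscii(String url) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 chars
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        }
        return sb.toString();
    }
}
```

For example, encodeNonAscii("http://example.com/搜索") returns "http://example.com/%E6%90%9C%E7%B4%A2". Reserved characters such as ':' and '/' are left alone on purpose, which is why java.net.URLEncoder is not simply applied to the whole URL.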

The rasterize.js script that the PhantomJS approach relies on:

"use strict";
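var page = require('webpage').create(),
    system = require('system'),
    // pageWidth and pageHeight must be declared: assigning to undeclared names throws under "use strict"
    address, output, size, pageWidth, pageHeight;

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';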

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: rasterize.js URL filename [paperwidth*paperheight|paperformat] [zoom]');
    console.log('  paper (pdf output) examples: "5in*7.5in", "10cm*20cm", "A4", "Letter"');
    console.log('  image (png/jpg output) examples: "1920px" entire page, window width 1920px');
    console.log('                                   "800px*600px" window, clipped to 800x600');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    page.viewportSize = { width: 800, height: 900 };
    if (system.args.length > 3 && system.args[2].substr(-4) === ".pdf") {
        size = system.args[3].split('*');
        page.paperSize = size.length === 2 ? { width: size[0], height: size[1], margin: '0px' }
            : { format: system.args[3], orientation: 'portrait', margin: '1cm' };
    } else if (system.args.length > 3 && system.args[3].substr(-2) === "px") {
        size = system.args[3].split('*');
        if (size.length === 2) {
            pageWidth = parseInt(size[0], 10);
            pageHeight = parseInt(size[1], 10);
            page.viewportSize = { width: pageWidth, height: pageHeight };
            page.clipRect = { top: 0, left: 0, width: pageWidth, height: pageHeight };
        } else {
            console.log("size:", system.args[3]);
            pageWidth = parseInt(system.args[3], 10);
            pageHeight = parseInt(pageWidth * 3/4, 10); // it's as good an assumption as any
            console.log ("pageHeight:",pageHeight);
            page.viewportSize = { width: pageWidth, height: pageHeight };
        }
    }
    if (system.args.length > 4) {
        page.zoomFactor = system.args[4];
    }
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
        } else {
            window.setTimeout(function () {
                page.render(output);
                phantom.exit();
            }, 200);
        }
    });
}

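All of the examples above have been tested. The PhantomJS-based ones capture the entire page, including the part that is only reachable by scrolling.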
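One detail of rasterize.js worth spelling out is its optional third argument for image output: "1920px" sets only the window width and still renders the whole page, while "800px*600px" also clips the result to that rectangle. The sketch below restates that branch in Java purely to make the rule concrete. ViewportSpec is a hypothetical class that is not used by the code above, and it is slightly stricter than the script in that it rejects non-numeric input instead of parsing leading digits.

```java
public class ViewportSpec {
    public final int width;
    public final int height;
    public final boolean clipped; // true when the output is cut to width x height

    private ViewportSpec(int width, int height, boolean clipped) {
        this.width = width;
        this.height = height;
        this.clipped = clipped;
    }

    /** Parses "1920px" or "800px*600px", mirroring the px branch of rasterize.js. */
    public static ViewportSpec parse(String size) {
        String[] parts = size.split("\\*");
        if (parts.length == 2) {
            // Explicit width and height: viewport and clip rectangle are both set
            return new ViewportSpec(px(parts[0]), px(parts[1]), true);
        }
        // Width only: the script assumes a 4:3 viewport and does not clip
        int w = px(size);
        return new ViewportSpec(w, w * 3 / 4, false);
    }

    /** Strips an optional trailing "px" and parses the rest as an integer. */
    private static int px(String s) {
        String t = s.trim();
        if (t.endsWith("px")) {
            t = t.substring(0, t.length() - 2);
        }
        return Integer.parseInt(t);
    }
}
```

With the Java wrapper shown earlier, such a value would be passed as one more argument after the output path.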
