Capturing HAR Data with Selenium 3 and a Headless Browser

weixin_33752045

Tags: python java web crawler

Original link: http://blog.51cto.com/dengshuangfu/2353496

This post briefly walks through capturing HAR logs with Selenium 3 and a headless browser.

1. Add the required dependencies

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>com.github.crashvb</groupId>
    <artifactId>chromedriver</artifactId>
    <version>2.41</version>
</dependency>
<dependency>
    <groupId>net.lightbody.bmp</groupId>
    <artifactId>browsermob-core</artifactId>
    <version>2.1.5</version>
</dependency>
<dependency>
    <groupId>edu.umass.cs.benchlab</groupId>
    <artifactId>harlib</artifactId>
    <version>1.1.2</version>
</dependency>

2. Using ChromeDriver

1) Download chromedriver
To check your browser version, enter the following address in Chrome's address bar:

chrome://version/

2) Download the chromedriver release that matches your browser version

http://chromedriver.chromium.org/downloads
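
Matching the driver to the browser comes down to comparing Chrome's major version with the range a given chromedriver release says it supports. The sketch below shows one way to do that check in code. The class and method names are hypothetical, and the range passed in `main` is a placeholder: take the real range from the release notes of the driver you download.

```java
// Hypothetical helper for illustration; not part of Selenium or chromedriver.
final class ChromeVersionCheck {

    /** Extracts the major version from a string such as "72.0.3626.109". */
    static int majorVersion(String fullVersion) {
        String trimmed = fullVersion.trim();
        int dot = trimmed.indexOf('.');
        String major = dot < 0 ? trimmed : trimmed.substring(0, dot);
        return Integer.parseInt(major);
    }

    /** True if the Chrome major version lies in [minSupported, maxSupported]. */
    static boolean isSupported(String chromeVersion, int minSupported, int maxSupported) {
        int major = majorVersion(chromeVersion);
        return major >= minSupported && major <= maxSupported;
    }

    public static void main(String[] args) {
        // Version string as shown by chrome://version/
        String chrome = "72.0.3626.109";
        // Placeholder range; copy the real one from the driver's release notes.
        int min = 71;
        int max = 73;
        System.out.println("Chrome major version: " + majorVersion(chrome));
        System.out.println("Driver supports it: " + isSupported(chrome, min, max));
    }
}
```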

For example, I keep mine under the project's resources directory; the full path appears in step 3).

3) Set the system properties

// Location of the chromedriver executable
System.setProperty("webdriver.chrome.driver",
        "src/main/resources/drivers/chromedriver/v2.46/win32/chromedriver.exe");
// Driver logging
System.setProperty("webdriver.chrome.logfile", "D:\\chromedriver.log");
System.setProperty("webdriver.chrome.verboseLogging", "true");
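
The hard-coded path above only works on Windows. If the same project has to run on other machines, the driver path can be derived from the operating system name. This is a minimal sketch under stated assumptions: the class name is hypothetical, and the `mac64` and `linux64` directory names are assumed to sit next to the `win32` directory used above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper; the directory layout is an assumption modelled on
// the "v2.46/win32/chromedriver.exe" path used in this post.
final class DriverPathResolver {

    /** Maps an os.name value to a platform directory name. */
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return "mac64";   // assumed directory name
        }
        if (os.contains("win")) {
            return "win32";
        }
        return "linux64";     // assumed directory name
    }

    /** Builds baseDir/version/platform/executable for the given OS name. */
    static Path resolve(String baseDir, String version, String osName) {
        String platform = platformDir(osName);
        String executable = platform.equals("win32") ? "chromedriver.exe" : "chromedriver";
        return Paths.get(baseDir, version, platform, executable);
    }

    public static void main(String[] args) {
        Path driver = resolve("src/main/resources/drivers/chromedriver",
                "v2.46", System.getProperty("os.name"));
        System.setProperty("webdriver.chrome.driver", driver.toString());
        System.out.println(System.getProperty("webdriver.chrome.driver"));
    }
}
```

The macOS check comes first so that an OS name containing "darwin" is never mistaken for Windows.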

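4) Create the proxy

// Route browser traffic through the proxy; the HAR recorded on the proxy side
// is what we use to analyze the web application's behavior.
// BrowserMobProxyServer is in the net.lightbody.bmp package.
BrowserMobProxyServer proxy = new BrowserMobProxyServer();
proxy.setTrustAllServers(true);
proxy.start();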

5) Create the ChromeDriver

Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);
ChromeOptions options = new ChromeOptions();
options.setAcceptInsecureCerts(true);
options.setHeadless(true);
options.setProxy(seleniumProxy);

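/*
 * (1) NONE:   Selenium returns as soon as the HTML has been downloaded, without waiting for it to be parsed.
 * (2) EAGER:  Selenium waits until the DOM tree is built, i.e. the DOMContentLoaded event has fired;
 *             only the HTML itself is downloaded and parsed.
 * (3) NORMAL: the default. Selenium waits until the whole page has loaded, meaning the HTML and its
 *             sub-resources (JS files, images, etc.) are downloaded and parsed; AJAX requests are not included.
 */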
options.setPageLoadStrategy(PageLoadStrategy.NORMAL);
options.addArguments("--disable-gpu");
options.addArguments("--disable-infobars");
// Ignore errors from untrusted certificates
options.addArguments("--ignore-certificate-errors");
// options.addArguments("--window-size=1920,1080");
// Start maximized
// options.addArguments("--start-maximized");
// Skip the default-browser check
options.addArguments("--no-default-browser-check");
options.addArguments("--disable-cache");
options.addArguments("--disk-cache-size=0");
options.addArguments("--disable-icon-ntp");
options.addArguments("--disable-ntp-favicons");

// Set the user agent; it has to be passed as the value of the --user-agent switch
String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36";
options.addArguments("--user-agent=" + userAgent);

WebDriver driver = new ChromeDriver(options);

driver.manage().timeouts().implicitlyWait(1, TimeUnit.MINUTES);
driver.manage().timeouts().pageLoadTimeout(60, TimeUnit.SECONDS);

proxy.enableHarCaptureTypes(CaptureType.REQUEST_CONTENT, CaptureType.RESPONSE_CONTENT,
                    CaptureType.RESPONSE_HEADERS);

proxy.newHar();
try {
    driver.get("http://www.baidu.com");
} catch (TimeoutException e) {
    System.out.println("Failed to load the address: timed out, " + e.getMessage());
}
Har har = proxy.endHar();
HarLog harLog = har.getLog();

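// List of all captured requests
// (Har, HarLog, HarEntry and the related classes are in net.lightbody.bmp.core.har)
List<HarEntry> entries = harLog.getEntries();
for (HarEntry entry : entries) {
    // Response details
    HarResponse harResponse = entry.getResponse();
    // Response status code
    int status = harResponse.getStatus();

    HarContent harContent = harResponse.getContent();
    // Length of the response body
    long contentSize = harContent.getSize();

    // MIME type of the returned content
    String mimeType = harContent.getMimeType();

    // IP address of the server that was connected to (the DNS result) [HAR 1.2]
    String serverIp = entry.getServerIPAddress();

    // Request details
    HarRequest harRequest = entry.getRequest();
    // Request URL
    String reqUrl = harRequest.getUrl();

    // Detailed timing of the request/response round trip.
    // Note: the phases below are read in microseconds, the total further down in milliseconds.
    HarTimings harTimings = entry.getTimings();
    // >>>> Time spent queued waiting for a network connection; -1 if not supported
    long blocked = harTimings.getBlocked(TimeUnit.MICROSECONDS);

    // >>>> DNS resolution time; -1 if it does not apply to the current request
    long dns = harTimings.getDns(TimeUnit.MICROSECONDS);

    // >>>> Time needed to create the TCP connection; -1 if not supported
    long connect = harTimings.getConnect(TimeUnit.MICROSECONDS);

    // >>>> Time needed to send the HTTP request to the server
    long send = harTimings.getSend(TimeUnit.MICROSECONDS);

    // >>>> Time spent waiting for the server's response (until the first packet arrives)
    long wait = harTimings.getWait(TimeUnit.MICROSECONDS);

    // >>>> Time needed to read the entire response from the server (or cache)
    long receive = harTimings.getReceive(TimeUnit.MICROSECONDS);

    // Total time from issuing the request to completing the response:
    // blocked + dns + connect + send + wait + receive, skipping any -1 values
    long totalTime = entry.getTime(TimeUnit.MILLISECONDS);

    // Time needed for SSL/TLS negotiation. When this field is defined, the time is also
    // included in connect (for backward compatibility with HAR 1.1).
    // -1 if it does not apply to the current request.
    long ssl = harTimings.getSsl(TimeUnit.MICROSECONDS);
}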

driver.quit();
proxy.stop();
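
The comments in the loop describe how the total time relates to the individual phases: it is the sum of the phases, skipping any that are reported as -1, and `ssl` is not added again because it is already part of `connect`. The sketch below restates that rule as plain Java, which can help when checking or aggregating the values yourself. The class and method names are hypothetical, and all arguments are assumed to be in the same time unit.

```java
// Hypothetical helper that mirrors the HAR rule for total entry time.
final class HarTimingMath {

    /**
     * Sums the timing phases of one entry, skipping phases reported as -1
     * (not applicable). The ssl phase is deliberately not a parameter,
     * because its time is already contained in connect.
     */
    static long totalTime(long blocked, long dns, long connect,
                          long send, long waiting, long receive) {
        long total = 0;
        for (long phase : new long[] {blocked, dns, connect, send, waiting, receive}) {
            if (phase >= 0) {
                total += phase;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // blocked is not applicable here (-1), so only the other five phases count.
        long total = totalTime(-1, 5, 20, 1, 100, 30);
        System.out.println("total = " + total);
    }
}
```

Inside the loop this would be called with the values read from `harTimings`, provided they are all requested in the same `TimeUnit`.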

3. Using FirefoxDriver (geckodriver)

The main difference from the Chrome setup is how the driver is created.
1) Set the system properties

// Note: this must be the path where the browser is installed
public static final String BROWSER_PATH = "C:\\Program Files\\Mozilla Firefox\\firefox.exe";
public static final String GECKODRIVER_PATH = "C:\\Program Files\\Mozilla Firefox\\geckodriver.exe";

System.setProperty("webdriver.firefox.bin", BROWSER_PATH);
System.setProperty("webdriver.gecko.driver", GECKODRIVER_PATH);

2) Create the driver

// seleniumProxy is created from the BrowserMob proxy exactly as in the Chrome section
FirefoxOptions options = new FirefoxOptions();
options.setAcceptInsecureCerts(true);
options.setHeadless(true);
options.setProxy(seleniumProxy);

WebDriver driver = new FirefoxDriver(options);

Everything else is the same as in the Chrome section and is not repeated here.

Reposted from: https://blog.51cto.com/dengshuangfu/2353496
