首先初步认识htmlunit
htmlunit 是一款开源的java 页面分析工具,读取页面后,可以有效的使用htmlunit分析页面上的内容。项目可以模拟浏览器运行,被誉为java浏览器的开源实现。是一个没有界面的浏览器,运行速度迅速。是junit的扩展之一
采用的是Rhinojs引擎。模拟js运行
常规意义上,该项目可以用来进行页面的测试工作,实现网页自动化测试,(包括JS)
但是一般来说,在小型爬虫项目中,这种框架十分常用,可以有效的分析出 dom的标签,并且有效的运行页面上的js以便得到一些需要执行JS才能得到的值。
我们研究的爬虫应用中 仅仅用Httpclient+jsoup是不够的,因为有些页面数据是js加载出来的 httpclient解析不出来,比如百度云用户页面,淘宝页面,连开源中国博客的博客内容都是Js加载或者处理的,所以httpClient是解析不出来的,这里htmlunit就是一个很好的方案,内嵌js浏览器,模拟Js运行,把结果执行出来。这个就是我们想要的。所以搞爬虫,htmlunit必须要掌握好;(摘录自http://blog.java1234.com)
项目具体组成(本次爬虫采用的是maven,目前这个项目的maven是复制的,由于使用之前做过的,所以很多是不那么重要的):
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>UserEchart</groupId>
<artifactId>fly</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>war</packaging>
<name>fly Maven Webapp</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>
<properties>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<maven.compiler.encoding>UTF-8</maven.compiler.encoding>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spring.version>5.0.2.RELEASE</spring.version>
<mybatis.version>3.4.5</mybatis.version>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<!-- alipay -->
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.10</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.10</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.1</version>
</dependency>
<dependency>
<groupId>com.google.zxing</groupId>
<artifactId>core</artifactId>
<version>2.1</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.7</version>
</dependency>
<dependency>
<groupId>org.hamcrest</groupId>
<artifactId>hamcrest-core</artifactId>
<version>1.3</version>
</dependency>
<!-- Spring -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-context</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-jdbc</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-webmvc</artifactId>
<version>${spring.version}</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.6</version>
</dependency>
<!--<dependency>-->
<!--<groupId>net.sourceforge.jexcelapi</groupId>-->
<!--<artifactId>jxl</artifactId>-->
<!--<version>2.6.12</version>-->
<!--</dependency>-->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>${spring.version}</version>
<scope>test</scope>
</dependency>
<!-- mybatis -->
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis</artifactId>
<version>${mybatis.version}</version>
</dependency>
<!-- mybatis-spring -->
<dependency>
<groupId>org.mybatis</groupId>
<artifactId>mybatis-spring</artifactId>
<version>1.3.0</version>
</dependency>
<!-- MySQL驱动 -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
<scope>runtime</scope>
</dependency>
<!-- druid连接池 -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.0.26</version>
</dependency>
<!-- aspectJ织入 -->
<dependency>
<groupId>org.aspectj</groupId>
<artifactId>aspectjweaver</artifactId>
<version>1.8.7</version>
</dependency>
<!-- JSON处理 -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.9.5</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.41</version>
</dependency>
<!-- JUnit4测试工具 -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<!-- Servlet-API -->
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>javax.servlet-api</artifactId>
<version>3.0.1</version>
<scope>provided</scope>
</dependency>
<!-- JSTL标签库 -->
<dependency>
<groupId>jstl</groupId>
<artifactId>jstl</artifactId>
<version>1.2</version>
</dependency>
<!-- 日志 -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.21</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.21</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.alipay.sdk/alipay-sdk-java -->
<!-- https://mvnrepository.com/artifact/com.alipay.sdk/alipay-sdk-java -->
<dependency>
<groupId>com.alipay.sdk</groupId>
<artifactId>alipay-sdk-java</artifactId>
<version>3.0.0</version>
</dependency>
<!-- lombok插件 -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.16.6</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>RELEASE</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>RELEASE</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>RELEASE</version>
<scope>compile</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/activation/activation -->
<dependency>
<groupId>activation</groupId>
<artifactId>activation</artifactId>
<version>1.0.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-email -->
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-email</artifactId>
<version>1.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/javax.mail/javax.mail-api -->
<dependency>
<groupId>javax.mail</groupId>
<artifactId>javax.mail-api</artifactId>
<version>1.4.4</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>RELEASE</version>
<scope>compile</scope>
</dependency>
<!-- lucene -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>4.10.4</version>
</dependency>
<!-- lucene高亮相关的包 -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>4.10.4</version>
</dependency>
<!-- IK -->
<dependency>
<groupId>com.janeluo</groupId>
<artifactId>ikanalyzer</artifactId>
<version>2012_u6</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>4.10.4</version>
</dependency>
<!--使用mybatis分页插件-->
<!-- https://mvnrepository.com/artifact/com.github.pagehelper/pagehelper -->
<dependency>
<groupId>com.github.pagehelper</groupId>
<artifactId>pagehelper</artifactId>
<version>5.1.8</version>
</dependency>
<!-- jsoup和HtmlUnit爬虫包-->
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.11.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.33</version>
</dependency>
<!--获取请求头的工具类-->
<!-- https://mvnrepository.com/artifact/eu.bitwalker/UserAgentUtils -->
<dependency>
<groupId>eu.bitwalker</groupId>
<artifactId>UserAgentUtils</artifactId>
<version>1.21</version>
</dependency>
<!-- ehcache核心jar包 -->
<dependency>
<groupId>net.sf.ehcache</groupId>
<artifactId>ehcache-core</artifactId>
<version>2.6.11</version>
</dependency>
<!-- MyBatis与ehcache整合jar包 -->
<dependency>
<groupId>org.mybatis.caches</groupId>
<artifactId>mybatis-ehcache</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>cglib</groupId>
<artifactId>cglib</artifactId>
<version>2.2.2</version>
</dependency>
<!--ECharts工具类-->
<dependency>
<groupId>com.github.abel533</groupId>
<artifactId>ECharts</artifactId>
<version>3.0.0.2</version>
</dependency>
<dependency>
<groupId>commons-dbutils</groupId>
<artifactId>commons-dbutils</artifactId>
<version>1.7</version>
</dependency>
<!-- 地理位置工具类-->
<dependency>
<groupId>com.maxmind.geoip2</groupId>
<artifactId>geoip2</artifactId>
<version>2.8.1</version>
</dependency>
<!--标签转换工具类-->
<dependency>
<groupId>com.0opslab</groupId>
<artifactId>opslabJutil</artifactId>
<version>1.0.8</version>
</dependency>
</dependencies>
<build>
<plugins>
<!--Java编译器插件 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
<compilerArguments>
<extdirs>${project.basedir}/src/main/webapp/WEB-INF/lib</extdirs>
</compilerArguments>
</configuration>
</plugin>
<plugin>
<groupId>org.mybatis.generator</groupId>
<artifactId>mybatis-generator-maven-plugin</artifactId>
<version>1.3.2</version>
<configuration>
<configurationFile>src/main/resources/generatorConfig.xml
</configurationFile>
<verbose>true</verbose>
<overwrite>false</overwrite>
</configuration>
<dependencies>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
<scope>runtime</scope>
</dependency>
</dependencies>
</plugin>
<!-- 添加一个tomcat插件 -->
<plugin>
<groupId>org.apache.tomcat.maven</groupId>
<artifactId>tomcat7-maven-plugin</artifactId>
<configuration>
<!-- tomcat启动的端口 -->
<port>80</port>
<!-- 应用的上下文路径 -->
<path>/</path>
<!--解决中文乱码-->
<uriEncoding>UTF-8</uriEncoding>
</configuration>
<!-- 在打包的时候运行这个容器 -->
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>run</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
SpiderTest(重点重点重点)
package test;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import java.io.IOException;
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:application.xml")
public class SpiderTest {
@Test
public void spider(){
WebClient webClient=new WebClient();//实例化客户端
try {
HtmlPage htmlPage=webClient.getPage("http://baidu.com");//抓取页面
System.out.println("网页html"+htmlPage.asXml());//抓取网页html
System.out.println("网页文本"+htmlPage.asText());
} catch (IOException e) {
e.printStackTrace();
}finally {
webClient.close();//关闭客户端
}
}
}
显示的记过如下: