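First, some background: this crawler came out of a company requirement to scrape data from the web. Scraping is not one of Java's strengths, so many of its crawler frameworks are not very mature. After a lot of searching, I found that HtmlUnit and jsoup can be combined to get at data that a page loads asynchronously via Ajax. HtmlUnit acts as a headless browser that runs the page's JavaScript, and jsoup parses the resulting HTML.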
Dependencies used
<dependencies>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.29</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.13.1</version>
    </dependency>
</dependencies>
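Before going through the steps one by one, here is a rough sketch of how the two libraries can fit together once both dependencies are on the classpath. It is only an illustration under assumptions, not the exact code used in this project: the class name `AjaxPageSketch`, the helper names `renderPage` and `extractTexts`, the URL `https://example.com/list`, the selector `li.item`, and the 10-second wait are all placeholders I made up for the example.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: HtmlUnit renders the page (including Ajax), jsoup parses the result.
class AjaxPageSketch {

    // Parse HTML with jsoup and collect the text of every element matching cssQuery.
    static List<String> extractTexts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            texts.add(element.text());
        }
        return texts;
    }

    // Load the page in HtmlUnit, give background JavaScript (Ajax calls) time to
    // finish, and return the rendered DOM serialized as XML.
    // Assumes the URL returns an HTML page; otherwise the cast to HtmlPage fails.
    static String renderPage(String url, long waitMillis) throws IOException {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            HtmlPage page = webClient.getPage(url);
            // Blocks until background JS jobs finish or waitMillis elapses.
            webClient.waitForBackgroundJavaScript(waitMillis);
            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and selector: replace with the real target page.
        String html = renderPage("https://example.com/list", 10_000);
        for (String text : extractTexts(html, "li.item")) {
            System.out.println(text);
        }
    }
}
```

Keeping the rendering and the parsing in separate methods means the jsoup part can be tried on a saved HTML string without touching the network. How long to wait for the Ajax calls depends on the target site, so the timeout here is only a starting guess.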
1. First, create the WebClient, which acts as the simulated browser client
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true); // enable the JS interpreter (true by default)
webClient.getOptions().setCssEnabled(false); // disable CSS support
webClient.getOptions().setThrowExceptionOnScriptError(false); // whether to throw an exception when a script fails (false: do not throw)
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); // do not throw on failing HTTP status codes
webClient.getOption