Java web crawler: scraping dynamic AJAX-loaded pages with PhantomJS and jsoup

Latest recommended article published on 2021-10-17 11:46:00

weixin_39654245

Tags: phantomjs, java, crawler

Article link: https://blog.csdn.net/weixin_39654245/article/details/114080574

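This article describes how to use Java together with PhantomJS to crawl pages whose content is loaded dynamically via AJAX. First, download PhantomJS and set up its environment, then write a JS script that processes the page. In the Java code, the PhantomJS script is executed through the Runtime class to obtain the dynamic HTML, which is then parsed with the jsoup library. The sample code shows how to read a page, save text, and download images.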

Crawling AJAX-loaded dynamic pages with Java on Windows requires a helper tool. The tool used in this article is PhantomJS (introductions to PhantomJS are easy to find online).

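After downloading, unzip the archive. To make later use easier, I suggest keeping it in a folder of its own; for example, I put it in a separate folder named phantomjs on the F: drive. Then go into phantomjs\bin and run phantomjs.exe.

When its console window opens, PhantomJS is ready to run JS code. (If you plan to use it often, add it to the PATH environment variable.)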
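If PhantomJS has been added to PATH as suggested, the Java side does not need a hard-coded install location. The following is a minimal sketch of how the executable could be looked up; the class name `ExecutableLocator` and the method `find` are hypothetical helpers made up for illustration and are not part of PhantomJS or the JDK.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: looks for an executable in a PATH-style directory list.
class ExecutableLocator {

    /** Returns the first regular file named like one of candidateNames, if any. */
    static Optional<Path> find(String pathValue, String... candidateNames) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidateNames) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Ignore PATH entries that are not valid paths.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> exe = find(System.getenv("PATH"), "phantomjs.exe", "phantomjs");
        System.out.println(exe.map(Path::toString).orElse("phantomjs not found on PATH"));
    }
}
```

If nothing is found, falling back to an explicit path such as F:\phantomjs\bin\phantomjs.exe is still an option.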

Next comes the actual crawling.

First, write a JS script (for example, parser.js):

    // parser.js: open the URL passed on the command line and print the rendered HTML
    var system = require('system');
    var address = system.args[1];
    var page = require('webpage').create();
    var url = address;

    page.settings.resourceTimeout = 1000 * 10; // 10 seconds

    // If a resource times out, print whatever has been rendered so far and exit
    page.onResourceTimeout = function (e) {
        console.log(page.content);
        phantom.exit(1);
    };

    page.open(url, function (status) {
        // Page is loaded!
        if (status !== 'success') {
            console.log('Unable to post!');
        } else {
            console.log(page.content);
        }
        phantom.exit();
    });

Then the Java code (my parser.js is on the F: drive):

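    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    // Read a dynamic page
    public static String dynamicHtml(String url) {
        Runtime rt = Runtime.getRuntime();
        StringBuilder html = new StringBuilder();
        try {
            // Pass the command as an array so the URL always stays a single argument
            Process process = rt.exec(new String[] {
                    "F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url});
            InputStream in = process.getInputStream();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
                String tmp;
                while ((tmp = br.readLine()) != null) {
                    // Keep the line breaks so adjacent lines are not glued together
                    html.append(tmp).append('\n');
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html.toString();
    }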
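The method above waits for as long as PhantomJS keeps running. As a variation, here is a minimal sketch using `ProcessBuilder` with a timeout, where the output is redirected to a temporary file so that waiting and reading cannot block each other. The class name `ProcessCapture`, the method `runAndCapture`, and the 30-second limit are assumptions made for illustration; the paths in `main` are the ones used in this article.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs an external command and returns its output as text.
class ProcessCapture {

    static String runAndCapture(List<String> command, long timeoutSeconds) {
        File out = null;
        try {
            out = File.createTempFile("capture", ".txt");
            Process process = new ProcessBuilder(command)
                    // stderr is merged into stdout here so error messages are not lost;
                    // drop this call if only stdout is wanted.
                    .redirectErrorStream(true)
                    .redirectOutput(out)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IllegalStateException("Timed out: " + command);
            }
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + command, e);
        } finally {
            if (out != null) {
                out.delete();
            }
        }
    }

    public static void main(String[] args) {
        // Adjust these paths to the local installation.
        String html = runAndCapture(Arrays.asList(
                "F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", "http://www.baidu.com"), 30);
        System.out.println(html.length() + " characters captured");
    }
}
```

The returned string can be handed to jsoup in the same way as the result of dynamicHtml.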

Processing logic (parsing with jsoup):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public static void ReadAjaxDynamicHtml(String htmlUrl) {
        String imageHtml = dynamicHtml(htmlUrl);
        Document imageDoc = Jsoup.parse(imageHtml);
        // To select elements by class, use:
        // Elements childrenImg = imageDoc.select(".class");
        // System.err.println(childrenImg.html());
        // System.err.println(childrenImg.text());
        // To select elements by tag, for example img:
        // Elements childrenImg = imageDoc.select("img");
        System.err.println(imageDoc);
        /* further processing goes here */
        // ...
    }
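The `src` values collected from selected `img` elements are often relative, so they usually have to be resolved against the page URL before they can be downloaded. The sketch below does this with `java.net.URI` from the JDK only; the class name `ImageUrlResolver` and the sample values are assumptions for illustration. (jsoup can also resolve URLs itself via `absUrl` when the document is parsed with a base URI.)

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns img src values into absolute URLs.
class ImageUrlResolver {

    static List<String> toAbsolute(String pageUrl, List<String> srcValues) {
        URI base = URI.create(pageUrl);
        // A base without a path (e.g. "http://www.baidu.com") is normalized to end in "/".
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        List<String> result = new ArrayList<>();
        for (String src : srcValues) {
            // Skip empty values and inline data: images, which need no download.
            if (src == null || src.trim().isEmpty() || src.trim().startsWith("data:")) {
                continue;
            }
            try {
                result.add(base.resolve(src.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip values that are not valid URI references.
            }
        }
        return result;
    }
}
```

For example, resolving `img/logo.png` against `http://www.baidu.com/a/b.html` gives `http://www.baidu.com/a/img/logo.png`, and `//cdn.example.com/x.png` inherits the `http` scheme of the page.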

Example call from the main method:

    public static void main(String[] args) {
        String htmlUrl = "http://www.baidu.com";
        ReadAjaxDynamicHtml(htmlUrl);
    }

Running this prints the HTML of the rendered page to the console.

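Jar dependency for reference:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>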

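That completes the test. Crawling pages often involves saving text and images, so here is sample code for writing text to a file and downloading an image to the local disk: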

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    /**
     * @param text     the text to write
     * @param fileName the file name
     * @throws IOException if the file cannot be created or written
     */
    public static void Writer(String text, String fileName) throws IOException {
        // Path of the generated file
        String path = "F:\\" + fileName + System.currentTimeMillis() + ".txt";
        File file = new File(path);
        if (!file.exists()) {
            file.getParentFile().mkdirs();
        }
        file.createNewFile();
        // try-with-resources flushes and closes the writers, even on failure
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8"))) {
            bw.write(text);
            bw.flush();
        }
    }
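As a variation on the text-writing method above, the same thing can be sketched with `java.nio.file.Files`, which creates missing directories and writes the bytes in one call each. The class name `TextFiles` and the choice to pass the target directory as a parameter (instead of the fixed F:\ drive) are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: saves text as a UTF-8 file with a timestamped name.
class TextFiles {

    /** Writes text to directory/fileName<timestamp>.txt and returns the created path. */
    static Path save(Path directory, String fileName, String text) {
        try {
            // Creates the directory and any missing parents; no-op if it already exists.
            Files.createDirectories(directory);
            Path target = directory.resolve(fileName + System.currentTimeMillis() + ".txt");
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `TextFiles.save(Paths.get("F:\\"), "page", html)` would mirror the behavior of the original method.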

    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;

    /**
     * @param urlList the image URL
     * @param path    the storage path
     */
    private static void downloadPicture(String urlList, String path) {
        URL url = null;
        try {
            url = new URL(urlList);
            DataInputStream dataInputStream = new DataInputStream(url.openStream());
            File file = new File(path);
            if (!file.exists()) {
                file.getParentFile().mkdirs();
            }
            // file.createNewFile();
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            ByteArrayOutputStream output = new ByteArrayOutputStream();

            byte[] buffer = new byte[1024];
            int length;

            // Read until end of stream (-1) and collect all bytes
            while ((length = dataInputStream.read(buffer)) != -1) {
                output.write(buffer, 0, length);
            }
            // If a Base64 string of the whole image is also needed (java.util.Base64, Java 8+):
            // String encode = Base64.getEncoder().encodeToString(output.toByteArray());
            fileOutputStream.write(output.toByteArray());
            dataInputStream.close();
            fileOutputStream.close();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Of course, the method parameters can be customized as needed.
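As one example of such a customization, the image download can be sketched so that it streams straight to disk with `Files.copy` instead of buffering the whole image in memory, and returns the number of bytes written. The class name `ImageDownloader` and the `Path` parameter are assumptions for illustration, not part of the original code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: downloads the resource at imageUrl into the target file.
class ImageDownloader {

    /** Returns the number of bytes written to target. */
    static long download(String imageUrl, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            // try-with-resources closes the stream even if the copy fails
            try (InputStream in = URI.create(imageUrl).toURL().openStream()) {
                return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because failures surface as exceptions here rather than a printed stack trace, the caller can decide whether to retry or skip an image.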
