Java: how to use headless browsers for crawling the web and scraping data from websites (repost)

https://www.linkedin.com/pulse/java-how-use-headless-browsers-crawling-web-scraping-data-taluyev/

Have you ever thought about writing software to scrape data from web pages? I guess everyone has thought about crawling the web at some point.

The simplest way to get data from a remote page is to open your preferred web browser, load the target page, select the text you need, and copy and paste it into a text editor for further transformation. Just kidding :)

Seriously, how do we automate this routine? Let's list the main tasks our crawler has to solve.

  1. Load data from the remote host. It is no secret how to do this...
  2. Parse the loaded HTML and build a DOM (Document Object Model).
  3. Get the data by traversing the DOM or by using CSS selectors.
  4. Save the data or pass it on to other tasks.

Parsing static HTML is a fairly easy task. There are Java libraries that do it very well. I recommend taking a look at http://jsoup.org; it is enough for simple cases.
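For the static case, the four steps can be sketched with nothing but the JDK (Java 11 or later). This is only an illustration, not a recommended setup: the class name StaticPageCrawler, the choice to extract link targets, and the default URL are all assumptions made for the example, and the JDK's built-in Swing HTML parser stands in for a real DOM library. It reports tags through callbacks instead of building a DOM, so with jsoup steps 2 and 3 would become a parsed document queried with CSS selectors.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StaticPageCrawler {

    // Step 1: load the raw HTML from the remote host.
    public static String load(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Steps 2 and 3: parse the HTML and collect the href of every <a> tag.
    // The Swing parser is callback based; it does not build a DOM.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // The default URL is a placeholder; pass the real target as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        // Step 4: here the data is simply printed; a real crawler would store or forward it.
        extractLinks(load(url)).forEach(System.out::println);
    }
}
```

Note that this only sees what is in the HTML as delivered by the server. Anything added later by JavaScript never reaches extractLinks.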

What about HTML that is generated by JavaScript and is not present in the page source? We need to use a browser or implement one :) Fortunately, we do not have to implement our own browser just to build a crawler. Such browsers already exist. Our heroes: http://phantomjs.org and https://slimerjs.org

How do we organize communication between a Java program and a headless browser? Enter GhostDriver. Both browsers support this driver out of the box. GhostDriver is a "relative" of WebDriver. WebDriver is well known among test engineers, so there are plenty of code examples and manuals. We can use Maven to integrate GhostDriver into the crawler application.
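To make the communication a little more concrete: GhostDriver speaks the WebDriver JSON wire protocol, which is plain JSON over HTTP. The sketch below only builds the four basic requests (open a session, navigate, read the page source, close the session) with the JDK's HttpClient API (Java 11 or later) and prints them; it does not send anything. The class name GhostDriverRequests, the address http://localhost:8910 (assuming PhantomJS was started with `phantomjs --webdriver=8910`), and the session id are assumptions made for the example. In a real crawler you would most likely let the Selenium Java bindings issue these calls for you.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

public class GhostDriverRequests {

    private final URI server;

    public GhostDriverRequests(String serverUrl) {
        this.server = URI.create(serverUrl);
    }

    // POST /session: ask the browser to open a new session.
    public HttpRequest newSession() {
        return post("/session", "{\"desiredCapabilities\":{\"browserName\":\"phantomjs\"}}");
    }

    // POST /session/{id}/url: navigate the session to a page.
    public HttpRequest navigate(String sessionId, String pageUrl) {
        return post("/session/" + sessionId + "/url", "{\"url\":\"" + escapeJson(pageUrl) + "\"}");
    }

    // GET /session/{id}/source: ask for the page source as the browser currently holds it.
    public HttpRequest pageSource(String sessionId) {
        return HttpRequest.newBuilder(server.resolve("/session/" + sessionId + "/source")).GET().build();
    }

    // DELETE /session/{id}: close the session.
    public HttpRequest quit(String sessionId) {
        return HttpRequest.newBuilder(server.resolve("/session/" + sessionId)).DELETE().build();
    }

    private HttpRequest post(String path, String json) {
        return HttpRequest.newBuilder(server.resolve(path))
                .header("Content-Type", "application/json")
                .POST(BodyPublishers.ofString(json))
                .build();
    }

    // Minimal escaping for the two characters that would break a JSON string literal.
    private static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Assumed address of a locally running GhostDriver.
        GhostDriverRequests api = new GhostDriverRequests("http://localhost:8910");
        // Placeholder: the real id comes back in the response to newSession().
        String sessionId = "hypothetical-session-id";
        HttpRequest[] requests = {
            api.newSession(),
            api.navigate(sessionId, "https://example.com/"),
            api.pageSource(sessionId),
            api.quit(sessionId)
        };
        for (HttpRequest request : requests) {
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

Sending each request with HttpClient.send and reading the session id out of the first response is left out on purpose, since the exact response handling depends on the driver version.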

There are differences between http://phantomjs.org and https://slimerjs.org. They are well documented on the FAQ page of the SlimerJS project.

It also makes sense to consider the JavaScript framework casperjs.org, a navigation scripting and testing utility for PhantomJS and SlimerJS, written in JavaScript.

What if we want to use neither PhantomJS nor SlimerJS? There are alternatives:

At this point I suggest we take a pause. We now have enough information to dive into implementing web crawler applications.

Analytics starts from data gulps :)

Please like and share if you find my article useful :-)

Reposted from: https://www.cnblogs.com/davidwang456/p/8696512.html
