Parsing Dynamic Web Pages with Selenium

Latest recommended article published on 2024-08-14 13:28:24

daisyZH


Reposted from: http://blog.csdn.net/dreamd1987/article/details/8202111?reload

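For parsing static web pages, Jsoup is usually all we need.

For pages whose content is loaded dynamically, however, Jsoup alone is not enough, because it parses only the HTML it is given and does not execute the page's scripts.

So how do we parse such a page and extract the information on it?

After reading discussions online, I decided to drive a real browser and read the updated page content from it.

In the end I chose Selenium to drive the browser.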

Selenium is actually a browser automation tool built for testing web applications, so using it only for crawling is a bit of overkill!

Selenium official site: http://seleniumhq.org/

See the official site for how to install and use Selenium.

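We generally use the Selenium RC (Remote Control) package to drive the browser. Note that Selenium RC has since been deprecated in favor of WebDriver.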

Once the package is installed, here is a small example:

    package com.example.tests;
    // We specify the package of our tests

    import com.thoughtworks.selenium.*;
    // This is the driver's import. You'll use this for instantiating a
    // browser and making it do what you need.

    import java.util.regex.Pattern;
    // Selenium-IDE adds the Pattern import because it's sometimes used for
    // regex validations. You can remove it if it's not used in your script.

    public class NewTest extends SeleneseTestCase {
        // We create our Selenium test case

        public void setUp() throws Exception {
            // We instantiate and start the browser
            setUp("http://www.google.com/", "*firefox");
        }

        public void testNew() throws Exception {
            // These are the real test steps
            selenium.open("/");
            selenium.type("q", "selenium rc");
            selenium.click("btnG");
            selenium.waitForPageToLoad("30000");
            assertTrue(selenium.isTextPresent("Results * for selenium rc"));
        }
    }

This is the test class we will work from.

We can then write our own program along the same lines:

    package Test1;

    import com.thoughtworks.selenium.*;
    // This is the driver's import. You'll use this for instantiating a
    // browser and making it do what you need.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    // import com.mongodb.BasicDBObject; // only needed if the MongoDB lines below are enabled

    @SuppressWarnings("deprecation")
    public class NewTest extends SeleneseTestCase {
        // We create our Selenium test case

        private static final String VENUE_URL =
                "https://foursquare.com/v/singapore-zoo/4b05880ef964a520b8ae22e3";

        public void setUp() throws Exception {
            // We instantiate and start the browser
            setUp(VENUE_URL, "*chrome");
            // selenium.waitForPageToLoad("30000");
        }

        public void testNew() throws Exception {
            selenium.open(VENUE_URL);
            selenium.windowMaximize();

            // The original listing breaks off here; the next three lines follow the
            // getHtmlSource/Jsoup.parse snippet shown later in this post.
            String html = selenium.getHtmlSource();
            Document doc = Jsoup.parse(html);
            gettips(doc);
        }

        public static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        // Prints the text of every element with the CSS class "tipText"
        public static void gettips(Document doc) {
            Elements tips = doc.select(".tipText");
            int count = 0;
            // BasicDBObject document4 = new BasicDBObject();
            for (Element link : tips) {
                String text = link.text();
                count++;
                // document4.put(String.valueOf(count), text);
                print("%s \r\n", text);
            }
        }
    }

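When you run it, the program opens a Firefox browser and automatically performs the operations you scripted. (In Selenium RC, the "*chrome" launcher refers to Firefox running with elevated privileges, not to Google Chrome.)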
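The commented-out waitForPageToLoad call in the listing above points at a timing issue: dynamically loaded content may not be in the page yet when the HTML is read. One common way to handle this is to poll for a condition with a timeout. The following is only a minimal sketch using the Java standard library; the names WaitSketch and waitUntil are hypothetical, and the time-based condition in main is a stand-in for a real check against the browser.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class WaitSketch {
    // Polls the condition until it returns true or the timeout elapses.
    // Returns true if the condition was met, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final long start = System.nanoTime();
        // Stand-in for "the dynamic content has arrived": true after about 200 ms.
        // With Selenium RC the condition might instead be something like
        // () -> selenium.isElementPresent("css=.tipText")  (an assumption, not tested here).
        boolean loaded = waitUntil(
                () -> System.nanoTime() - start >= TimeUnit.MILLISECONDS.toNanos(200),
                2000, 50);
        System.out.println("loaded = " + loaded);
    }
}
```

Reading the page source only after such a wait succeeds makes it more likely that the elements you want, such as the tips in the example above, are actually present.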

The full API reference is available here:

http://release.seleniumhq.org/selenium-remote-control/0.9.2/doc/java/

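Also, since we want to scrape pages, we need to understand XPath: locating buttons and other elements in Selenium often relies on XPath expressions, alongside simpler locators such as the element names used in the first example.

Once I have studied it more carefully I will try to write a post on how to use XPath; there are also plenty of tutorials online.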
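To give a feel for what an XPath expression selects, here is a small sketch that uses the XPath support built into the Java standard library rather than Selenium. The class name XPathSketch and the page fragment are made up for illustration, and javax.xml requires well-formed markup, which real HTML pages often are not. In Selenium RC the same kind of expression would be passed as a locator, for example with an "xpath=" prefix.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical, well-formed page fragment used only for illustration.
    static final String PAGE =
            "<html><body>"
            + "<form id='search'>"
            + "<input type='text' name='q'/>"
            + "<button name='btnG'>Search</button>"
            + "</form>"
            + "</body></html>";

    // Evaluates an XPath expression against the markup and returns the
    // string value of the first matching node (empty string if none).
    static String evaluate(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select the button by its name attribute and print its text.
        System.out.println(evaluate(PAGE, "//button[@name='btnG']"));           // Search
        // Select the name attribute of the input inside the search form.
        System.out.println(evaluate(PAGE, "//form[@id='search']/input/@name")); // q
    }
}
```

The two expressions show the most common building blocks: "//" to search anywhere in the document, "[@attr='value']" to filter by attribute, and "/@attr" to read an attribute value.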

Likewise, if you are not comfortable with Selenium's way of locating elements, you can use the following code:

    // Get the HTML source of the page currently loaded in the browser
    String str = selenium.getHtmlSource();
    // Parse it into a Jsoup Document
    Document doc = Jsoup.parse(str);

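The first statement fetches the HTML source of the current page.

The second statement converts that HTML into a doc variable that Jsoup can work with.

After that you can parse the page with Jsoup in the way you are used to.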

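This is somewhat wasteful, though, because Selenium can already locate and read page elements on its own.

Selenium can also simulate actions such as logging in.

In short, whatever you can do in a browser, Selenium can do for you.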
