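Scraping projects fall into two groups: sites that require a simulated login and sites that do not. If you are not familiar with WebDriver for Java, see http://www.yiibai.com/selenium/. Sites without a login are the easier case to scrape. For sites that need one there are a few options: first, the HttpClient toolkit can simulate the login; second, WebDriver can simulate it by driving a real browser. When a login is hard to reproduce, use WebDriver. Below is the code I wrote, for reference: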
package main;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Iterator;
import java.util.concurrent.TimeUnit;
/*import org.apache.http.client.CookieStore;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.cookie.BasicClientCookie;*/
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;
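public class webdriverdemo {
    public static void main(String[] args) throws InterruptedException, FileNotFoundException, IOException {
        // Test using a proxy IP (the proxy setup itself is not shown here)
        WebDriver driver = new FirefoxDriver();
        driver.get("login page URL");
        System.out.println("one--");
        // Crude fixed wait so the login page can finish loading
        Thread.sleep(5000);
        // Type the phone number into the phone-number field of the login page
        driver.findElement(By.xpath("XPath of the phone-number field on the login page")).sendKeys("phone");
        System.out.println("two--");
        Thread.sleep(2000);
        // Type the password into the password field
        driver.findElement(By.xpath("XPath of the password field")).sendKeys("password");
        System.out.println("three--");
        Thread.sleep(2000);
        // Submit the login; this absolute XPath is specific to the target page
        driver.findElement(By.xpath("/html/body/div[2]/div[1]/div/div/div[2]/div/div[2]/div[2]/div[2]/div[5]")).click();
        // Start writing the scraping rules here
    }
}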
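Note: the parts marked in red (the placeholder strings in the code above) are what you have to write yourself. I used Firefox for the scraping, and the setup needs some care. I used two jars, selenium-server-standalone-2.42.2.jar and selenium-java-2.42.2.jar, and they place requirements on the Firefox version. I used Firefox 14.0.1; with a newer Firefox the program may fail with errors.
There are tutorials online on how to find the matching browser version, for example http://blog.csdn.net/u011781257/article/details/53610212. I use a different method:
Step 1: unpack selenium-java-2.42.2.jar. Under selenium-java-2.42.2\org\openqa\selenium\firefox there is an archive named webdriver.xpi.
Step 2: unpack webdriver.xpi and open the install.rdf file inside it with Notepad. Its contents are shown below and include the minimum and maximum supported Firefox versions: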
<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:em="http://www.mozilla.org/2004/em-rdf#">
<Description about="urn:mozilla:install-manifest">
<em:id>fxdriver@googlecode.com</em:id>
<em:version>2.42.2</em:version>
<em:type>2</em:type>
<em:name>Firefox WebDriver</em:name>
<em:description>WebDriver implementation for Firefox</em:description>
<em:creator>Simon Stewart</em:creator>
<em:unpack>true</em:unpack>
<!-- Firefox -->
<em:targetApplication>
<Description>
<em:id>{ec8030f7-c20a-464f-9b0e-13a3a9e97384}</em:id>
<em:minVersion>3.0</em:minVersion>
<em:maxVersion>31.*</em:maxVersion>
</Description>
</em:targetApplication>
<!-- Platforms where we're not compiling the native events -->
<em:targetPlatform>Darwin</em:targetPlatform>
<em:targetPlatform>SunOS</em:targetPlatform>
<em:targetPlatform>FreeBSD</em:targetPlatform>
<!-- Platforms where we are -->
<em:targetPlatform>WINNT_x86-msvc</em:targetPlatform>
<em:targetPlatform>Linux</em:targetPlatform>
<!-- We're probably missing lots of platforms here -->
</Description>
</RDF>
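The same lookup can be scripted instead of unpacking the archives by hand. The sketch below is only an illustration: it assumes the jar layout described in the two steps above (org/openqa/selenium/firefox/webdriver.xpi inside the jar, with install.rdf at the root of the xpi), and the class FirefoxRange and its methods are hypothetical helpers, not part of Selenium.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Selenium.
class FirefoxRange {
    // Namespace bound to the "em:" prefix in install.rdf
    static final String EM_NS = "http://www.mozilla.org/2004/em-rdf#";
    // Assumed location of the Firefox extension inside the selenium-java jar
    static final String XPI_ENTRY = "org/openqa/selenium/firefox/webdriver.xpi";

    /** Returns {minVersion, maxVersion} from the text of an install.rdf file. */
    static String[] parseRange(String rdfXml) {
        return parseRange(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns {minVersion, maxVersion} from an install.rdf stream. */
    static String[] parseRange(InputStream rdf) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            doc = factory.newDocumentBuilder().parse(rdf);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse install.rdf", e);
        }
        return new String[] { firstText(doc, "minVersion"), firstText(doc, "maxVersion") };
    }

    private static String firstText(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(EM_NS, localName);
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("Missing em:" + localName);
        }
        return nodes.item(0).getTextContent().trim();
    }

    /** Reads install.rdf from webdriver.xpi inside the jar without unpacking to disk. */
    static String[] rangeFromJar(String jarPath) throws Exception {
        try (ZipFile jar = new ZipFile(jarPath)) {
            ZipEntry xpi = jar.getEntry(XPI_ENTRY);
            if (xpi == null) {
                throw new IllegalArgumentException(XPI_ENTRY + " not found in " + jarPath);
            }
            // The xpi is itself a zip archive, so walk its entries as a nested stream
            try (ZipInputStream in = new ZipInputStream(jar.getInputStream(xpi))) {
                for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                    if (e.getName().equals("install.rdf")) {
                        return parseRange(in);
                    }
                }
            }
        }
        throw new IllegalArgumentException("install.rdf not found in " + XPI_ENTRY);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java FirefoxRange.java <path to selenium-java jar>");
            return;
        }
        String[] range = rangeFromJar(args[0]);
        System.out.println("Supported Firefox versions: " + range[0] + " to " + range[1]);
    }
}
```

Run it with the path to selenium-java-2.42.2.jar as the only argument. If a different Selenium release keeps the extension elsewhere, XPI_ENTRY would need adjusting.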
Firefox download: http://ftp.mozilla.org/pub/firefox/releases/46.0.1/win64/zh-CN/
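For comparison, here is a rough sketch of the first option mentioned at the start, simulating the login with an HTTP client instead of a browser. The commented-out org.apache.http imports suggest Apache HttpClient was meant; this sketch swaps in the JDK's built-in java.net.http.HttpClient (Java 11 or later) so that it runs without extra jars. The login URL, the form field names phone and password, and the class FormLoginSketch are all assumptions for illustration. It only covers a login that is a plain form POST and does not handle CSRF tokens, captchas or JavaScript-driven logins, which are the cases where WebDriver is the better fit.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; the field names and URL depend entirely on the target site.
class FormLoginSketch {
    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds the POST request that submits the login form. */
    static HttpRequest loginRequest(String loginUrl, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(fields)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java FormLoginSketch.java <login form URL>");
            return;
        }
        // The CookieManager keeps the session cookies set by the login response,
        // so later requests sent through the same client stay logged in.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("phone", "phone");       // assumed field name and placeholder value
        fields.put("password", "password"); // assumed field name and placeholder value

        HttpResponse<String> response =
                client.send(loginRequest(args[0], fields), HttpResponse.BodyHandlers.ofString());
        System.out.println("Login response status: " + response.statusCode());
    }
}
```

After a successful login, pages behind the login would be fetched with further client.send calls on the same client so the stored cookies are reused.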