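1. Introduction to Selenium
Official Selenium documentation: https://www.selenium.dev/documentation/en/
Traditional crawlers handle static pages easily, but pages whose content is rendered dynamically are much harder to scrape. Combining Selenium with a crawler solves this problem: Selenium drives a real browser, so the page loads and runs just as it would for a real user.
For more information, see http://www.selenium.org.cn/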
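2. Checking the Chrome Version
Right-click here and choose the About item (the version is also shown on the chrome://version page).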
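Check the engine version number. What matters is the first three components, i.e. 75.0.3770.
3. Downloading ChromeDriver
Taobao mirror download page: http://npm.taobao.org/mirrors/chromedriver/
The listing shows three entries that start with 75.0.3770. Choose the most recently updated one, 75.0.3770.140, open it, and pick the build for your operating system.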
This machine runs Windows, so the Windows build is used here.
Place chromedriver.exe in the directory where the Chrome browser is installed.
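The version-matching rule described above (compare the first three components, then take the newest matching driver) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the version strings are sample values rather than a real mirror listing, and "most recently updated" is approximated here by the highest fourth component.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Selenium or ChromeDriver.
class DriverVersionPicker {

    // Keep the first three dot-separated components, e.g. "81.0.4044.129" -> "81.0.4044".
    static String buildPrefix(String browserVersion) {
        String[] parts = browserVersion.trim().split("\\.");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected at least three components: " + browserVersion);
        }
        return parts[0] + "." + parts[1] + "." + parts[2];
    }

    // The fourth component as a number, or 0 when it is absent.
    static int patchOf(String version) {
        String[] parts = version.split("\\.");
        return parts.length > 3 ? Integer.parseInt(parts[3]) : 0;
    }

    // Among the available driver versions, keep those sharing the browser's
    // three-component prefix and return the one with the highest fourth component.
    static Optional<String> pickDriver(String browserVersion, List<String> available) {
        String prefix = buildPrefix(browserVersion) + ".";
        return available.stream()
                .filter(v -> v.startsWith(prefix))
                .max(Comparator.comparingInt(DriverVersionPicker::patchOf));
    }

    public static void main(String[] args) {
        // Sample values for illustration only.
        List<String> available = Arrays.asList(
                "80.0.3987.106", "81.0.4044.20", "81.0.4044.69", "81.0.4044.138");
        // prints "81.0.4044.138"
        System.out.println(pickDriver("81.0.4044.129", available).orElse("no match"));
    }
}
```

Comparing the fourth component numerically matters: a plain string comparison would rank "69" above "138".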
4. Java Code Sample
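import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class FlightProcess implements PageProcessor {
    // Retry a failed download up to 3 times; sleep 100 ms between requests
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        // Tell Selenium where the ChromeDriver executable is
        System.setProperty("webdriver.chrome.driver", "D:/Chrome/chromedriver.exe");
        // Point ChromeOptions at the browser executable
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setBinary("D:/Chrome/ChromeCore/ChromeCore.exe");
        // Start the driver, which launches the browser
        WebDriver webDriver = new ChromeDriver(chromeOptions);
        try {
            webDriver.get(page.getUrl().toString());
            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            System.out.println("result:");
            System.out.println(webElement.getAttribute("outerHTML"));
        } finally {
            // quit() closes every window and ends the session, so a separate close() is unnecessary
            webDriver.quit();
        }
    }

    @Override
    public Site getSite() {
        ······
    }
}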
5. Exceptions Encountered
- ChromeDriver and Chrome versions do not match
error
org.openqa.selenium.SessionNotCreatedException: session not created: This version of ChromeDriver only supports Chrome version 81
Fix: download the driver again, making sure its version matches the browser version.
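When this exception appears in a log, the Chrome major version that the installed driver expects can be read straight from the message. The sketch below is a hypothetical helper, not a Selenium API, and it assumes the message keeps the wording shown in the error above.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the supported version out of the error text.
class DriverMismatchMessage {

    // Matches the wording of the error message quoted above.
    private static final Pattern SUPPORTED =
            Pattern.compile("only supports Chrome version (\\d+)");

    // Returns the supported major version, or an empty result if the text does not match.
    static OptionalInt supportedMajorVersion(String message) {
        if (message == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = SUPPORTED.matcher(message);
        return matcher.find()
                ? OptionalInt.of(Integer.parseInt(matcher.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        String message = "session not created: This version of ChromeDriver only supports Chrome version 81";
        supportedMajorVersion(message)
                .ifPresent(v -> System.out.println("Driver expects Chrome major version " + v));
    }
}
```

If the reported number differs from the first component of the browser version, the driver is the piece to replace.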
- The Chrome browser location cannot be found and must be configured manually
error
org.openqa.selenium.WebDriverException: unknown error: cannot find Chrome binary
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.openqa.selenium.remote.W3CHandshakeResponse.lambda$errorHandler$0(W3CHandshakeResponse.java:62)
at org.openqa.selenium.remote.HandshakeResponse.lambda$getResponseFunction$0(HandshakeResponse.java:30)
at org.openqa.selenium.remote.ProtocolHandshake.lambda$createSession$0(ProtocolHandshake.java:126)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464)
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:128)
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:74)
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:136)
at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:552)
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:213)
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:131)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123)
at com.cjm.flightspider.webmagic.FlightProcess.process(FlightProcess.java:28)
at us.codecraft.webmagic.Spider.onDownloadSuccess(Spider.java:414)
at us.codecraft.webmagic.Spider.processRequest(Spider.java:406)
at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Fix: configure the browser location with the ChromeOptions class (setBinary), and make sure the path to the browser executable is correct.
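A wrong path is easier to diagnose before the driver starts than from the stack trace above. The following is a minimal sketch using only the standard library: the class name is hypothetical, and the two paths are the ones from the code sample, which will differ on other machines.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, run before creating the ChromeDriver.
class ChromePathCheck {

    // Returns the subset of the given paths that do not point to an existing regular file.
    static List<String> missingFiles(String... paths) {
        List<String> missing = new ArrayList<>();
        for (String path : paths) {
            if (!Files.isRegularFile(Paths.get(path))) {
                missing.add(path);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Paths taken from the code sample; adjust them to the local installation.
        List<String> missing = missingFiles(
                "D:/Chrome/chromedriver.exe",
                "D:/Chrome/ChromeCore/ChromeCore.exe");
        if (missing.isEmpty()) {
            System.out.println("Driver and browser executables found.");
        } else {
            System.out.println("Missing files: " + missing);
        }
    }
}
```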
- Malformed URL
A URL in an incorrect format causes the following exception.
error
org.openqa.selenium.InvalidArgumentException: invalid argument
Fix: check the URL format, for example that it includes the scheme (http:// or https://).
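One way to catch a malformed URL before handing it to the driver is a quick check with java.net.URI. The sketch below is an assumption about what counts as "well formed" for a crawler (an http or https scheme plus a host); it is stricter than what a browser accepts, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Selenium or WebMagic.
class UrlCheck {

    // True when the text parses as a URI with an http/https scheme and a host.
    static boolean looksNavigable(String url) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return isHttp && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters such as spaces end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksNavigable("https://www.selenium.dev/documentation/en/")); // prints "true"
        System.out.println(looksNavigable("www.selenium.dev"));                            // prints "false"
    }
}
```

The second call fails because the text has no scheme, so it is parsed as a relative path rather than an address.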
- Element not found
error
org.openqa.selenium.NoSuchElementException: no such element: Unable to locate element: {"method":"xpath","selector":"//flight_card_content"}
Fix: check the element name used in the selector and make sure it matches the page.
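In XPath, //flight_card_content selects elements whose tag name is flight_card_content. If that name is actually a class attribute value (an assumption here, since the original page is not shown), the expression finds nothing. The sketch below demonstrates the difference with the JDK's built-in XPath engine on a tiny made-up document; it stands in for the browser and does not use Selenium.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class using only the JDK's XPath support.
class XPathSelectorDemo {

    // Counts the nodes in the given well-formed markup that match the XPath expression.
    static int countMatches(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Double count = (Double) xpath.evaluate(
                    "count(" + expression + ")",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up markup for illustration only.
        String page = "<html><body><div class='flight_card_content'>demo</div></body></html>";
        System.out.println(countMatches(page, "//flight_card_content"));             // prints "0"
        System.out.println(countMatches(page, "//*[@class='flight_card_content']")); // prints "1"
    }
}
```

On dynamic pages the same exception can also occur when the selector is correct but the element has not been rendered yet, so timing is worth checking as well.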