Scraping Images from WeChat Official Account Articles with Java and Selenium

Biturd

Categories: Crawler, Java. Tags: crawler

Article link: https://blog.csdn.net/qq_42873554/article/details/106028527

The project link is at the end of this article.

I. Preparation

1. Download the Selenium browser driver

Chrome

http://chromedriver.storage.googleapis.com/index.html

Firefox

https://github.com/mozilla/geckodriver/releases/

IE

http://selenium-release.storage.googleapis.com/index.html

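After downloading the driver, place it in the startup directory of the corresponding browser (the folder that contains the browser's executable). The driver version has to match the installed browser version.

2. Create a Maven project

Add the following dependencies to pom.xml: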

    <dependencies>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.141.59</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter</artifactId>
            <version>RELEASE</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>

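II. Outline of the approach

The goal is to scrape the images from every article linked on the article list page.

1. Scrape all article links and store them in links [List<String>]

2. Visit each entry in links and collect that article's image URLs into picLinks [List<List<String>>]

3. Download the images from all the collected URLs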
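
As a rough sketch of how steps 2 and 3 fit together (this is not the project's actual code), the snippet below keeps the links and picLinks shapes from the list above. The class name ImageCrawlSketch, the extractImages callback, and the outDir/<article index>/<image index>.img naming scheme are all assumptions made for illustration. Only the JDK is used here; in the real project the callback would be backed by Selenium or jsoup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class ImageCrawlSketch {

    /** Step 2: map every article link to the list of image URLs found in that article. */
    static List<List<String>> collectPicLinks(List<String> links,
                                              Function<String, List<String>> extractImages) {
        List<List<String>> picLinks = new ArrayList<>();
        for (String link : links) {
            // extractImages is a hypothetical callback that opens the article and returns its image URLs
            picLinks.add(extractImages.apply(link));
        }
        return picLinks;
    }

    /** Step 3: save every image under outDir/<articleIndex>/<imageIndex>.img and return the number saved. */
    static int downloadAll(List<List<String>> picLinks, Path outDir) throws IOException {
        int saved = 0;
        for (int a = 0; a < picLinks.size(); a++) {
            // One sub-directory per article keeps images from different articles apart
            Path articleDir = Files.createDirectories(outDir.resolve(String.valueOf(a)));
            List<String> pics = picLinks.get(a);
            for (int p = 0; p < pics.size(); p++) {
                try (InputStream in = URI.create(pics.get(p)).toURL().openStream()) {
                    // The ".img" suffix is a placeholder; the real image type is not inspected here
                    Files.copy(in, articleDir.resolve(p + ".img"), StandardCopyOption.REPLACE_EXISTING);
                    saved++;
                }
            }
        }
        return saved;
    }
}
```

Keeping picLinks nested (one inner list per article) means the article index is still available at download time, which makes it easy to group the saved files by article.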

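III. Pitfalls

1. Why I did not use a plain HTTP request library

WeChat official account links are awkward to work with: they go through many redirects, and I did not take the time to trace them. If anyone knows how to obtain an article's real URL, please let me know.

On a related note, here is a handy tool that converts a curl command (copied from Charles or from Chrome) into code: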

https://curl.trillworks.com/

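2. The article list page loads dynamically as you scroll

Use Selenium to run JavaScript that scrolls to the bottom of the page; three or four scrolls are usually enough. Pause for a moment after each scroll so the newly loaded content has time to appear.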

    JavascriptExecutor driver_js = (JavascriptExecutor) webDriver;
    driver_js.executeScript("window.scrollTo(0, document.body.scrollHeight)");

    for (int i = 0; i < 3; i++) {
        // Scroll to the bottom of the page to trigger loading of the next batch
        driver_js.executeScript("window.scrollTo(0, document.body.scrollHeight)");
        try {
            Thread.sleep(3000); // wait 3 seconds for the new content to load
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
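
A fixed number of scrolls can stop too early on a long list. One possible variation, sketched here as an assumption rather than what the project does, is to keep scrolling until the page height stops growing. The class ScrollUntilStable and its parameters are hypothetical names; the scroll-and-measure step is passed in as a LongSupplier so that the loop itself needs nothing beyond the JDK.

```java
import java.util.function.LongSupplier;

class ScrollUntilStable {

    /**
     * Calls scrollAndMeasure (scroll to the bottom, then report the page height) until the
     * height is the same as in the previous round or maxRounds is reached.
     * Returns the number of rounds performed.
     *
     * With Selenium, the supplier could look like this (js is a JavascriptExecutor):
     *   () -> {
     *       js.executeScript("window.scrollTo(0, document.body.scrollHeight)");
     *       return ((Number) js.executeScript("return document.body.scrollHeight")).longValue();
     *   }
     */
    static int scroll(LongSupplier scrollAndMeasure, int maxRounds, long pauseMillis)
            throws InterruptedException {
        long lastHeight = -1;
        int rounds = 0;
        while (rounds < maxRounds) {
            long height = scrollAndMeasure.getAsLong();
            rounds++;
            if (height == lastHeight) {
                break; // nothing new was loaded during the last pause
            }
            lastHeight = height;
            Thread.sleep(pauseMillis); // give the page time to load the next batch
        }
        return rounds;
    }
}
```

The maxRounds cap matters: a list that never stops growing would otherwise keep the loop running indefinitely.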

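3. Run the browser in headless mode (that is, without showing a browser window)

I also set the webdriver.chrome.driver system property in code here, so that the driver can always be found: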

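    // Requires org.openqa.selenium.WebDriver, org.openqa.selenium.chrome.ChromeDriver
    // and org.openqa.selenium.chrome.ChromeOptions.
    // Lazily creates one shared driver; webDriver is a static field of the enclosing class.
    public static WebDriver getChromeDriver(){
        if (webDriver == null) {
            System.setProperty("webdriver.chrome.driver", "C:\\Users\\lenovo\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe");
            // For Firefox, create a FirefoxDriver() instead
            // Chrome
            ChromeOptions chromeOptions = new ChromeOptions();
            chromeOptions.addArguments("--headless"); // run without a visible window
            webDriver = new ChromeDriver(chromeOptions);
        }
        return webDriver;
    }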
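
The hard-coded path above only works on one machine. As a small optional sketch (an assumption on my part, not part of the original project), the driver location could be picked from a list of candidates, for example an already-set webdriver.chrome.driver system property and then an environment variable. DriverPathResolver and the CHROMEDRIVER_PATH variable name are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class DriverPathResolver {

    /**
     * Returns the first candidate that points to an existing regular file.
     * Null or empty candidates are skipped.
     *
     * Possible use before creating the driver (CHROMEDRIVER_PATH is a made-up variable name):
     *   DriverPathResolver.resolve(System.getProperty("webdriver.chrome.driver"),
     *                              System.getenv("CHROMEDRIVER_PATH"))
     *       .ifPresent(p -> System.setProperty("webdriver.chrome.driver", p.toString()));
     */
    static Optional<Path> resolve(String... candidates) {
        for (String candidate : candidates) {
            if (candidate == null || candidate.isEmpty()) {
                continue;
            }
            Path path = Paths.get(candidate);
            if (Files.isRegularFile(path)) {
                return Optional.of(path);
            }
        }
        return Optional.empty();
    }
}
```

If nothing is found, the Optional is empty and the caller can fall back to the hard-coded path or fail with a clear message.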