Getting the publish time of a WeChat Official Account article with jsoup

quifar123

Category: spring boot. Tags: Official Account articles

Article link: https://blog.csdn.net/chen846262292/article/details/103454873

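For an Official Account article found through weixin.sogou.com, viewing the page source shows that the title, content, author, WeChat ID, and cover image can all be extracted easily.

The one exception is the publish time: its node is empty in the static HTML, as shown here: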

<em id="publish_time" class="rich_media_meta rich_media_meta_text"></em>

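Since jsoup cannot read the value from the node directly, take it from the script instead.

The publish time is assigned in JavaScript, so another approach is to pull down all the script elements and cut the value out of the one that sets it.

The JavaScript that assigns the value is as follows: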

function _typeof(e){
return e&&"undefined"!=typeof Symbol&&e.constructor===Symbol?"symbol":typeof e;
}
!function(e){
if("object"===("undefined"==typeof module?"undefined":_typeof(module)))module.exports=e;else{
if(window.__second_open__)return;
var t="1575860164",n="1575539255",s="2019-12-05";
e(t,n,s,document.getElementById("publish_time"));
}
}(function(e,t,n,s){
var i="",o=86400,f=new Date(1e3*e),r=1*t,l=n||"";
f.setHours(0),f.setMinutes(0),f.setSeconds(0);
var c=f.getTime()/1e3;
f.setDate(1),f.setMonth(0);
var u=f.getTime()/1e3;
if(r>=c)i="今天";else if(r>=c-o)i="昨天";else if(r>=c-2*o)i="前天";else if(r>=c-3*o)i="3天前";else if(r>=c-4*o)i="4天前";else if(r>=c-5*o)i="5天前";else if(r>=c-6*o)i="6天前";else if(r>=c-14*o)i="1周前";else if(r>=u){
var d=l.split("-");
i="%s月%s日".replace("%s",parseInt(d[1],10)).replace("%s",parseInt(d[2],10));
}else i=l;
s&&(s.innerText=i,setTimeout(function(){
s.onclick=function(){
s.innerText=l;
};
},10));
});
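To make the script's behaviour easier to follow, here is a rough Java port of its labelling logic. It is only a sketch: the class and method names are made up for illustration, and it assumes that t is the current time and n the publish time, both in Unix seconds, which is how the script appears to use them. The zone parameter stands in for the browser's local time zone.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Illustrative port of the page script's labelling logic; names are not from the article. */
final class RelativeLabelSketch {

    private static final long DAY = 86_400L;

    /**
     * @param nowSeconds     the script's t value (assumed: current time, Unix seconds)
     * @param publishSeconds the script's n value (assumed: publish time, Unix seconds)
     * @param publishDate    the script's s value, assumed well-formed yyyy-MM-dd
     * @param zone           stands in for the browser's local time zone
     */
    static String label(long nowSeconds, long publishSeconds, String publishDate, ZoneId zone) {
        LocalDate today = Instant.ofEpochSecond(nowSeconds).atZone(zone).toLocalDate();
        long startOfToday = today.atStartOfDay(zone).toEpochSecond();
        long startOfYear = today.withDayOfYear(1).atStartOfDay(zone).toEpochSecond();

        if (publishSeconds >= startOfToday) return "今天";
        if (publishSeconds >= startOfToday - DAY) return "昨天";
        if (publishSeconds >= startOfToday - 2 * DAY) return "前天";
        for (int days = 3; days <= 6; days++) {
            if (publishSeconds >= startOfToday - days * DAY) return days + "天前";
        }
        if (publishSeconds >= startOfToday - 14 * DAY) return "1周前";
        if (publishSeconds >= startOfYear) {
            // Same year: show month and day without leading zeros
            String[] parts = publishDate.split("-");
            return Integer.parseInt(parts[1]) + "月" + Integer.parseInt(parts[2]) + "日";
        }
        // Earlier years: show the full date string
        return publishDate;
    }

    public static void main(String[] args) {
        // Values taken from the script above, evaluated in UTC+8
        System.out.println(label(1575860164L, 1575539255L, "2019-12-05", ZoneOffset.ofHours(8)));
    }
}
```

Because the visible label is computed in the browser, the static HTML that jsoup parses never contains it; only the raw values inside the script are available.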

In the script above, the value that follows ,s=" is the publish date we need.

The extraction code is as follows:

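    /**
     * Extracts the article's publish date from the HTML.
     *
     * @param document the HTML document object (the article's full page)
     * @return the publish date as yyyy-MM-dd, or null if it cannot be found
     */
    public static String getPublishTime(Document document) {
        // Marker that precedes the date in the inline script: ...,s="2019-12-05";
        String marker = ",s=\"";
        Elements scripts = document.select("script");
        for (Element script : scripts) {
            String html = script.html();

            // Only the script that writes to the publish_time node is of interest
            if (html.contains("document.getElementById(\"publish_time\")")) {
                int fromIndex = html.indexOf(marker);
                if (fromIndex < 0) {
                    // Marker missing: skip rather than return a garbage substring
                    continue;
                }

                // StrUtil comes from the Hutool library (hutool.cn)
                return StrUtil.subWithLength(html, fromIndex + marker.length(), 10);
            }
        }
        return null;
    }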
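As a variant of the extraction above, the following sketch uses only the JDK's regex and java.time classes, so it works on the script text however it was obtained. It assumes the assignment keeps the exact form var t="...",n="...",s="..." shown earlier, and that n is the publish time in Unix seconds; neither is guaranteed by WeChat. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; class and method names are illustrative, not from the article. */
final class PublishTimeSketch {

    // Matches the assignment seen in the page script: var t="...",n="...",s="yyyy-MM-dd"
    private static final Pattern ASSIGNMENT = Pattern.compile(
            "var\\s+t=\"(\\d+)\",n=\"(\\d+)\",s=\"(\\d{4}-\\d{2}-\\d{2})\"");

    /** Returns the date string assigned to s, if the assignment is present. */
    static Optional<String> extractDate(String script) {
        Matcher m = ASSIGNMENT.matcher(script);
        return m.find() ? Optional.of(m.group(3)) : Optional.empty();
    }

    /** Assumption: n is the publish time in Unix seconds. Converts it to UTC+8 local time. */
    static Optional<LocalDateTime> extractDateTime(String script) {
        Matcher m = ASSIGNMENT.matcher(script);
        if (!m.find()) {
            return Optional.empty();
        }
        long epochSeconds = Long.parseLong(m.group(2));
        return Optional.of(
                LocalDateTime.ofInstant(Instant.ofEpochSecond(epochSeconds), ZoneOffset.ofHours(8)));
    }

    public static void main(String[] args) {
        String script = "var t=\"1575860164\",n=\"1575539255\",s=\"2019-12-05\";";
        System.out.println(extractDate(script).orElse("not found"));
        System.out.println(extractDateTime(script).map(LocalDateTime::toString).orElse("not found"));
    }
}
```

With the sample values from the script, the first line printed is 2019-12-05 and the second is 2019-12-05T17:47:35, which agrees with the date in s. Returning Optional instead of null makes the "not found" case explicit for callers.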

Below is the code for extracting the article's other fields:

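    /**
     * Extracts the article title from the HTML.
     *
     * @param document the HTML document object
     * @return the trimmed title, or an empty string if no title element is found
     */
    public static String getTitle(Document document) {
        Elements titles = document.getElementsByClass("rich_media_title");
        // first() returns null for an empty result, whereas get(0) would throw
        Element title = titles.first();
        return title == null ? "" : StrUtil.trim(title.text());
    }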

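    /**
     * Extracts the article's cover image URL from the HTML.
     *
     * @param document the HTML document object
     * @return the trimmed og:image URL, or an empty string if the meta tag is missing
     */
    public static String getTitleImage(Document document) {
        Elements metas = document.getElementsByAttributeValue("property", "og:image");
        // first() returns null for an empty result, so the null check below is reachable
        Element element = metas.first();
        if (element == null) {
            return "";
        }
        String content = element.attr("content");
        return StrUtil.trim(content);
    }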

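    /**
     * Extracts the article's author (the Official Account's display name) from the HTML.
     *
     * @param document the HTML document object
     * @return the trimmed display name
     */
    public static String getWeChatName(Document document) {
        Element author = document.getElementById("js_name");
        // Assert comes from the Hutool library and throws if the element is missing
        Assert.notNull(author, "Official Account not found; please enter a valid Official Account article link");
        return StrUtil.trim(author.text());
    }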

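    /**
     * Extracts the Official Account's WeChat ID from the HTML.
     *
     * @param document the HTML document object
     * @return the trimmed WeChat ID
     */
    public static String getWeChatAccount(Document document) {
        Elements metaValues = document.getElementsByClass("profile_meta_value");
        Assert.notEmpty(metaValues, "Official Account not found; please enter a valid Official Account article link");
        // The first profile_meta_value element holds the WeChat ID
        String text = metaValues.get(0).text();
        Assert.notBlank(text, "Official Account not found; please enter a valid Official Account article link");
        return StrUtil.trim(text);
    }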