Using Chrome's Console to Grab a Rough Copy of an Academic Document from a Certain Site

薛定谔之死猫

Category: Scripting Language Programming. Tags: Canvas, RFC2397, JavaScript, Ruby, Base64

Article link: https://blog.csdn.net/mscf/article/details/107293683

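Sometimes a search engine turns up an academic document that is useful but not critical. You want a copy for non-commercial reference, yet cannot afford the relatively steep download fee. In that situation, an approach like the one below can serve as a stopgap.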

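First, open the site in Chrome and navigate to a document that is free to view, then display it at maximum size (so the captured images are as sharp as possible). Press F12 to open Chrome DevTools and enter code like the following in the Console to grab the raw image data of each page.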

(function (console) {
    // Save each page canvas as a text file containing its data URL.
    // Assumes the pages are canvas elements with ids "page_1", "page_2", ...
    console.export_canvas_base64 = function (page_count, name_pattern) {
        if (typeof page_count !== "number" || typeof name_pattern !== "string") {
            console.error("Console.export_canvas_base64: expected (number, string)")
            return
        }
        for (var i = 1; i <= page_count; i++) {
            var canvas = document.getElementById("page_" + i)
            if (!canvas || typeof canvas.toDataURL !== "function") {
                console.error("Console.export_canvas_base64: no canvas for page " + i)
                return
            }
            // An unsupported type such as "text/plain" falls back to PNG anyway,
            // so request PNG explicitly; the result is "data:image/png;base64,...".
            var data = canvas.toDataURL("image/png")
            var blob = new Blob([data], { type: "text/plain" }),
                a = document.createElement("a")
            a.download = name_pattern + i + ".txt"
            a.href = window.URL.createObjectURL(blob)
            // Trigger the download; a.click() replaces the deprecated initMouseEvent.
            a.click()
        }
    }
})(console)
// Example: console.export_canvas_base64(8, "a")

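At this point the browser has produced each page image as a data: URL, the scheme defined in RFC 2397, and saved that plain text as a local file. Any script that can decode Base64 can then turn these text files into binary image data.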

# encoding: utf-8
require 'base64'

##################################################
# Adjust these three settings to match the files saved by the browser.
file_path = 'C:\\Users\\xxx\\Downloads\\'
file_count = 8
file_pattern = 'a'
##################################################

1.upto(file_count) do |i|
  base = file_path + file_pattern + i.to_s
  # Each .txt file holds one data URL: "data:image/png;base64,<payload>".
  s = File.read(base + '.txt')
  # Strip the RFC 2397 prefix so that only the Base64 payload remains.
  s = s.sub(/\Adata:image\/png;base64,/, '')
  # Write the decoded bytes in binary mode.
  File.binwrite(base + '.png', Base64.decode64(s))
end
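
The script above assumes every file starts with exactly data:image/png;base64,. If the canvas were exported as another type (Chrome's toDataURL also accepts image/jpeg, for example), that prefix would differ. As a rough sketch under that assumption, the prefix can be parsed instead of hard-coded. The helper name parse_data_url and the EXTENSIONS table are illustrative and not part of the original script, and only Base64-encoded data URLs are handled.

```ruby
require 'base64'

# Illustrative mapping from media type to file extension (an assumption,
# extend as needed).
EXTENSIONS = {
  'image/png'  => '.png',
  'image/jpeg' => '.jpg',
  'image/webp' => '.webp'
}.freeze

# Hypothetical helper: split a Base64 data URL (RFC 2397) into its media
# type and decoded bytes. Raises ArgumentError for anything else.
def parse_data_url(text)
  match = text.strip.match(/\Adata:([^;,]*)[^,]*;base64,(.*)\z/m)
  raise ArgumentError, 'not a Base64 data URL' unless match

  # RFC 2397 defaults to text/plain when the media type is omitted.
  media_type = match[1].empty? ? 'text/plain' : match[1]
  [media_type, Base64.decode64(match[2])]
end

# Tiny demo: the payload below is just the 8-byte PNG file signature.
media_type, bytes = parse_data_url('data:image/png;base64,iVBORw0KGgo=')
puts media_type                      # image/png
puts EXTENSIONS.fetch(media_type)    # .png
puts bytes.bytesize                  # 8
```

In the conversion loop, the returned media type could pick the output extension and the returned bytes could be passed straight to File.binwrite.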

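Use Adobe Acrobat to merge these image files into a single PDF, which is good enough for needs such as printing. To go further and recover the text, run an OCR tool over the images. There are always more solutions than problems; this post is just a note for the record and does not go into more detail.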
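
Before merging, it can be worth checking that every decoded file really is a PNG, since a truncated download or a leftover prefix produces a file that image tools reject. The following is a minimal sketch, not part of the original workflow. It relies only on the PNG file layout: an 8-byte signature followed by an IHDR chunk whose first two fields are the width and height. The helper name png_dimensions is illustrative.

```ruby
# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Hypothetical helper: return [width, height] when the bytes begin with a
# PNG signature and an IHDR chunk, or nil otherwise.
def png_dimensions(bytes)
  bytes = bytes.b
  return nil unless bytes.bytesize >= 24
  return nil unless bytes.byteslice(0, 8) == PNG_SIGNATURE
  return nil unless bytes.byteslice(12, 4) == 'IHDR'.b

  # Width and height are big-endian 32-bit integers at offsets 16 and 20.
  bytes.byteslice(16, 8).unpack('NN')
end

# Demo with a hand-built header instead of a real file.
sample = PNG_SIGNATURE + [13].pack('N') + 'IHDR'.b + [800, 1131].pack('NN')
p png_dimensions(sample)       # => [800, 1131]
p png_dimensions('not a png')  # => nil
```

With the file naming used earlier, calling png_dimensions(File.binread(path)) for pages 1 through the page count, in numeric order, also gives the correct page sequence for the merge; a plain alphabetical listing would place a10.png before a2.png.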
