[PDF.js in Practice] Batch-reading text from PDF files and indexing it into Elasticsearch

iShare_123

Category: PDFjs. Tags: PDF full-text search, Elasticsearch, CORS configuration, pdf.js, Ajax

Article link: https://blog.csdn.net/xyxing87/article/details/116238339

To support full-text search over PDF files, all of the text in each PDF must be indexed into Elasticsearch.

0. CORS configuration

The approach sends Ajax POST requests to Elasticsearch and uses pdf.js to extract the text.

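Requests sent from the browser run into cross-origin restrictions. Open the configuration file elasticsearch.yml, append the following two lines to allow cross-origin access, and restart the Elasticsearch server so the change takes effect.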

http.cors.enabled: true
http.cors.allow-origin: "*"
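
After the restart, it can help to confirm from the browser that cross-origin requests actually get through. The following is only a minimal sketch: `canReachElasticsearch` is a hypothetical helper name, and it assumes Elasticsearch listens on http://localhost:9200 without authentication. Note that `"*"` allows every origin, which is convenient for local experiments but broader than most deployments need.

```javascript
// Hypothetical helper (not part of pdf.js or Elasticsearch).
// Resolves to true when the browser may read a cross-origin response from
// Elasticsearch, and to false when the request is blocked or fails.
// `fetchImpl` defaults to the global fetch and can be swapped out for testing.
async function canReachElasticsearch(baseUrl, fetchImpl = fetch) {
    try {
        const response = await fetchImpl(baseUrl, { method: 'GET' });
        // Any 2xx status means the response was readable by the page.
        return response.ok;
    } catch (err) {
        // A request blocked by CORS, or an unreachable server, rejects here.
        return false;
    }
}

// Usage in the browser console (assumed address):
// canReachElasticsearch('http://localhost:9200').then(console.log);
```

If the call resolves to false, the browser's developer console usually shows whether the cause was a CORS rejection or a connection failure.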

1. Directory layout (the directory also contains some unrelated files):

2. Code

<!DOCTYPE html>
<html>
<head>
    <!-- CORS is governed by server response headers; a meta tag cannot grant it. -->
    <meta charset="UTF-8">
    <title>'Hello, world!' example</title>
</head>
<body>
    <h1>'Hello, world!' example</h1>
    <canvas id="the-canvas" style="border: 1px solid black; direction: ltr;"></canvas>
    <script src="./node_modules/pdfjs/build/pdf.js"></script>
    <script src="jquery.js"></script>
    <script id="script">

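        // If the PDF is served from an absolute URL on a remote server,
        // that server must send the appropriate CORS headers.
        var url = './F资料.pdf';

        // The workerSrc property must be set.
        pdfjsLib.GlobalWorkerOptions.workerSrc =
            './node_modules/pdfjs/build/pdf.worker.js';

        // Download the PDF asynchronously.
        var loadingTask = pdfjsLib.getDocument(url);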

        loadingTask.promise.then(function (pdf) {
            // Total number of pages, read through the public API
            // instead of the underscore-prefixed internal fields.
            console.log('Total pages: ' + pdf.numPages);

            // Walk every page (pdf.js page numbers are 1-based).
            // `let` gives each iteration its own binding and avoids an implicit global.
            for (let i = 1; i <= pdf.numPages; i++) {
                pdf.getPage(i).then(function (page) {
                    // Zero-based page index, stored with the document as curIndex.
                    const pageIndex = page.pageNumber - 1;
                    console.log('pageIndex:' + pageIndex);

                    return page.getTextContent().then(function (textContent) {
                        // Concatenate the text fragments of this page.
                        const content = textContent.items
                            .map(function (item) { return item.str; })
                            .join('');
                        console.log(content);

                        const character = { curIndex: pageIndex, content: content };
                        $.ajax({
                            type: 'post',
                            contentType: 'application/json',
                            data: JSON.stringify(character),
                            // Index "character", mapping type "character".
                            url: 'http://localhost:9200/character/character',
                            success() {
                                console.log('OK.');
                            },
                            error(xhr, status, err) {
                                console.error('Indexing page ' + pageIndex + ' failed:', status, err);
                            }
                        });
                    });
                }).catch(function (err) {
                    console.error('Reading page ' + i + ' failed:', err);
                });
            }
        });  // loadingTask.promise.then...
    </script>
</body>

</html>
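
The listing above sends one request per page. For larger PDFs, one possible variation is to collect the pages first and send them in a single request to Elasticsearch's `_bulk` endpoint, which expects newline-delimited JSON. The sketch below only builds that payload. `buildBulkBody` is a hypothetical helper name, and it assumes an Elasticsearch version that accepts bulk actions without a mapping type (7.x or later).

```javascript
// Hypothetical helper: build the newline-delimited JSON payload for the
// Elasticsearch _bulk endpoint.
// `pages` is an array of { curIndex, content } objects, one per PDF page.
function buildBulkBody(indexName, pages) {
    const lines = [];
    for (const page of pages) {
        // Action line: index the next document into `indexName`.
        lines.push(JSON.stringify({ index: { _index: indexName } }));
        // Source line: the document itself.
        lines.push(JSON.stringify({ curIndex: page.curIndex, content: page.content }));
    }
    // Every line, including the last one, must end with a newline.
    return lines.map(function (line) { return line + '\n'; }).join('');
}

// Usage with jQuery, as in the listing above (assumed address):
// $.ajax({
//     type: 'post',
//     contentType: 'application/x-ndjson',
//     data: buildBulkBody('character', pages),
//     url: 'http://localhost:9200/_bulk'
// });
```

Sending one bulk request avoids a round trip per page, at the cost of holding all page text in memory until the request is built.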

3. Result
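
One way to check that the pages were indexed is to run a `match` query against the `content` field. The sketch below only builds the request body. `buildSearchRequest` is a hypothetical helper name, and the index name `character` and the fields `curIndex` and `content` are taken from the code in section 2.

```javascript
// Hypothetical helper: build a full-text search request body for the
// documents indexed above (fields: curIndex, content).
function buildSearchRequest(keyword) {
    return {
        // Full-text match on the extracted page text.
        query: { match: { content: keyword } },
        // Ask Elasticsearch to return highlighted snippets of the matches.
        highlight: { fields: { content: {} } },
        // Return only the page index, not the whole page text.
        _source: ['curIndex']
    };
}

// Usage with jQuery (assumed address):
// $.ajax({
//     type: 'post',
//     contentType: 'application/json',
//     data: JSON.stringify(buildSearchRequest('keyword')),
//     url: 'http://localhost:9200/character/_search',
//     success(result) { console.log(result.hits.hits); }
// });
```

Each hit then carries the `curIndex` of the matching page, which is enough to jump to that page in a viewer.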
