python科研向论文检索篇——提取PDF文字以供全文信息检索

ruierx

已于 2022-02-22 15:28:38 修改

阅读量3.9k

点赞数 4

分类专栏： python学习科研向数据处理文章标签： python

于 2022-02-22 15:26:27 首次发布

本文链接：https://blog.csdn.net/ruierx/article/details/123069184

版权

python学习同时被 2 个专栏收录

4 篇文章

订阅专栏

科研向数据处理

4 篇文章

订阅专栏

本文介绍如何使用Python库pdfplumber从PDF文件中提取文本，实现本地论文全文检索，以提高科研效率。通过实例展示了工具安装、基本用法及一个实际操作流程，包括关键词输入、检索模式选择和结果保存。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

python科研向论文检索篇——提取PDF文字以供全文信息检索

作者：ruierx

项目背景

写论文期间，下载了很多论文，但是很多读过一次就忘记了，隐约有印象在某篇论文里见过一个现象，但就是想不起来在哪里看的，这个时候如果有一个本地可以根据关键词检索所有论文全文的工具就好了。刚好，嘿嘿，咱会python，那就干吧！

流程图

工具包安装

本文使用pdfplumber实现PDF文件的文本提取，注意此包无法识别扫描生成的图片PDF。
安装 ------ pip install pdfplumber
另外为了复制拷贝文件方便，引入了python自带包shutil。

pdfplumber基本用法

import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])

pdfpliumber.open()用于打开pdf文件，此时该PDF文件生成一个类，可以使用它的属性调用它的内容，比如：

.pages为每个页面组成的列表，可用列表的方式调用每一页pdf，每一页又是一个.page的类，有属性.page_number表示页码，.width/.height表示页宽和页高等
一些常用的方法有:
page.extract_text()------提取页面中的文本，存储为一个字符串。
page.extract_words()------返回所有的单词及相关信息
page.extract_tables()------提取页面的表格

加亿点点细节

import pdfplumber
import shutil
from os import listdir, chdir, getcwd, system, path, mkdir

chdir("./1/")
raw_list = listdir()
print('文件总数：', len(raw_list))

while 1:
    keyword = input("请输入搜索关键字：").strip()
    insure = input("y/n")
    if insure == 'y':
        break

while 1:
    print("请选择检索模式:")
    print("1. 重新检索")
    print("2. 继续上次的检索")
    print("3. 将上次检索的结果存至关键词文件夹")
    choice = input("--->").strip()
    if choice == '1':
        filelist = listdir()
        with open("process.txt", 'w') as pro:
            pro.truncate()
        break
    elif choice == '2':
        with open("../process.txt", 'r') as pro:
            filelist = pro.read().strip().split('\n')
            for f in filelist:
                raw_list.remove(f)
        filelist = raw_list
        break
    elif choice == '3':
        with open("../process.txt", 'r') as pro:
            filelist = pro.read().strip().split('\n')
            for f in filelist:
                raw_list.remove(f)
        filelist = raw_list

        isExists = path.exists(f'../{keyword}')
        if not isExists:
            mkdir(f'../{keyword}')
        for file in filelist:
            shutil.copyfile(file, f'../{keyword}/{file}')
        system("pause")
        exit(f"{keyword}--相关文件已存储至相应文件夹！")
    else:
        print('输入错误，重新输入！')

print("剩余文件数：", len(filelist))

n = 0
ns = len(filelist)
for file in filelist:
    n += 1
    print(f"正在检索({n}/{ns})中是否含有{keyword}------>" + file)
    with pdfplumber.open(file) as pdf:
        # firstpage = pdf.pages[0]
        # print(type(firstpage.extract_text()))
        sign = 0
        for page in pdf.pages:
            text = page.extract_text()
            if keyword in text:
                with open("../result.txt", 'a+') as res:
                    res.write(keyword + '--->\n' + file +
                              '----->页数:' + str(page.page_number) + '\n\n')
                print(keyword + '--->\n' + file +
                      '----->页数:' + str(page.page_number) + '\n\n')
                sign = 1
                break
    if sign == 0:
        with open("../process.txt", 'a+') as pro:
            pro.write(file + '\n')
        print("无相关信息！")

system("pause")