Searching papers by keyword and extracting the sentences that contain them

import pandas as pd

1. Walk through all file names in a folder and build the path of each file

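import os
from os import path

# Return the full path of every entry in the folder `url`
def scaner_file(url):
    # List all entries directly under the given path
    entries = os.listdir(url)
    # Named `paths` so the built-in `list` is not shadowed
    paths = []
    for f in entries:
        # Join the folder path and the entry name
        real_url = path.join(url, f)
        # Collect the result
        paths.append(real_url)
    return paths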
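
os.listdir returns every entry in the folder, which would include sub-folders, the keyword file and any .txt results left over from an earlier run. A minimal sketch of a filtered scan follows. It assumes that only .docx papers should be processed and that the keyword file is called key_words.docx, as in step 3. The name `list_docx_files` and its `exclude` parameter are hypothetical and not part of the original script.

```python
from pathlib import Path

def list_docx_files(folder, exclude=("key_words.docx",)):
    """Return the sorted paths of the .docx files directly inside `folder`."""
    paths = []
    for entry in Path(folder).iterdir():
        # Skip sub-folders and anything that is not a regular file
        if not entry.is_file():
            continue
        # Keep Word documents only
        if entry.suffix.lower() != ".docx":
            continue
        # Skip excluded names and Word lock files (their names start with "~$")
        if entry.name in exclude or entry.name.startswith("~$"):
            continue
        paths.append(str(entry))
    return sorted(paths)
```

Sorting makes the processing order reproducible, since the order returned by the file system is not guaranteed.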

2.1 Read the content of a Word file by path and split it into sentences

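import docx
# pip install python-docx
# Read the document at the given path
def read_data(url):
    file = docx.Document(url)
    # Collect the text of each paragraph
    data = []
    for para in file.paragraphs:
        data.append(para.text)
    # Only the first paragraph is split; '.' is used as the sentence delimiter
    data = data[0].split('.')
    return data
read_data("./papers/3.docx")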
['The Anti-Atlas Mountains constitute a Late Proterozoic suture zone produced by northward subduction of oceanic lithosphere culminating in the Pan-African orogeny',
 ' Southward migration of thrust slices associated with the destruction of the fbrearc terrane resulted in the uplift and erosion of previously deposited basin sediments',
 ' These sediments were subsequently reincorporated into collisional basin deposits of the Tiddiline Formation',
 'The Tiddiline Formation consists of coarsening-upwards sequences of maroon siltstones, sandstones and intraforma- tional conglomerates',
 ' These rocks unconfbrmably overlie metamorphosed volcaniclastic rocks of the relict fbrearc basin and accretionary terrane',
 ' Syn- and post-deposlional deformation has resulted in folding about gently-plunging fold axes',
 ' Folds were subsequently cut by strike-slip faults that strike at a high angle to the basin axis',
 ' Deformation of the Tiddiline Formation is attributed to transpressional suturing of the relict fbrearc terrane to the West African Craton to the south',
 ' Collisional basins of the Anti-Atlas Mountains serve as ancient analogs fbr the destruction of fbrearc basins in an oblique- convergent margin setting, such as those of the Western Pacific region',
 '']
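
The output above shows the side effects of splitting on '.': the full stops are dropped, most sentences keep a leading space, and the list ends with an empty string. A decimal such as 3.5 would also be cut in two. The following is a rough sketch of an alternative that works on the text of every paragraph. The name `split_sentences` is hypothetical, and the regular expression is a simple heuristic rather than a full sentence tokenizer: abbreviations followed by a space (for example "e.g. ") are still treated as sentence ends.

```python
import re

def split_sentences(paragraphs):
    """Split a list of paragraph strings into stripped sentences."""
    sentences = []
    for para in paragraphs:
        # Split at whitespace that follows '.', '!' or '?';
        # the punctuation mark stays attached to its sentence.
        for part in re.split(r"(?<=[.!?])\s+", para.strip()):
            if part:
                sentences.append(part)
    return sentences
```

Because the split happens only where whitespace follows the punctuation mark, numbers like 3.5 stay intact and empty paragraphs contribute nothing.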

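2.2 Read the abstract from a PDF paper by path and split it into sentences (not used)

Unfortunately, the papers I need to process do not share a consistent layout, so this module ended up unused. Instead, I first copy the passages to be searched into Word files and run the script on those.

A pdfplumber.PDF object has two main properties, .metadata and .pages.
.metadata is a dictionary holding the PDF's metadata.
.pages is a list with one entry per page.

Each pdfplumber.Page object has several main properties.
.page_number  the page number
.width  the page width
.height  the page height
.objects/.chars/.lines/.rects  the objects found on the page. .chars, .lines and .rects are each a list holding one dictionary per object (characters, lines, rectangles), including its position; .objects collects these by object type.

.extract_text()  extracts the text of the page by collating all of its character objects into a single string
.extract_words()  returns all the words together with their related information
.extract_tables()  extracts the tables on the page
.to_image()  returns an instance of the PageImage class, for visual debugging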

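import pdfplumber
# pip install pdfplumber
def read_pdf(url):
    with pdfplumber.open(url) as pdf:
        # Take the first page of the PDF
        first_page = pdf.pages[0]
        text = first_page.extract_text() or ''  # str
        start = text.find('ABSTRACT')
        end = text.find('INTRODUCTION')
        # find() returns -1 when a marker is missing; return no sentences in that case
        if start == -1 or end == -1:
            return []
        # Slice out the content only, skipping 'ABSTRACT' and the 2 characters after it
        data = text[start+10:end]
        # Replace line breaks with spaces so words on adjacent lines are not glued together
        data = data.replace('\n', ' ')
        # Split into sentences
        data = data.split('.')
        return data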
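
The marker-based slicing can be tried without a PDF by applying it to a plain string. The sketch below is one possible variant, not the function used above: `extract_abstract` and its parameters are hypothetical names, it skips the marker by its actual length instead of a fixed offset, and it falls back to the end of the text when the end marker is missing. It still assumes the abstract is introduced by the literal word ABSTRACT, which will not hold for every journal layout.

```python
def extract_abstract(page_text, start_marker="ABSTRACT", end_marker="INTRODUCTION"):
    """Return the text between the two markers, or None if the start marker is missing."""
    start = page_text.find(start_marker)
    if start == -1:
        return None
    # Move past the marker itself
    start += len(start_marker)
    # Look for the end marker only after the start marker
    end = page_text.find(end_marker, start)
    if end == -1:
        end = len(page_text)
    # Flatten line breaks to spaces so adjacent words stay separated
    body = page_text[start:end].replace("\n", " ")
    # Drop surrounding whitespace and a colon that may follow the marker
    return body.strip().lstrip(":").strip()
```

A usage example with made-up page text:

```python
page = "Title\nABSTRACT: The basin is old.\nIt was folded.\nINTRODUCTION\nBody text"
print(extract_abstract(page))
```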

3. Get the keywords

import docx
# pip install python-docx
# Read the keyword document at the given path
def read_keywords(url):
    file = docx.Document(url)
    # Each paragraph of the document holds one keyword
    data = []
    for para in file.paragraphs:
        data.append(para.text)
    return data
read_keywords('./papers/key_words.docx')
['Continental rifts',
 'Nascent ocean basins',
 'Property value',
 'Intraplate continental margins',
 'Intracratonic basins ',
 'Continental platforms ',
 'Active ocean basins',
 'Oceanic islands',
 'seamounts',
 'aseismic ridges',
 'and plateaus',
 'Dormant ocean basins ',
 'Transtensional basins',
 'Transpressional basins',
 'Transrotational basins',
 'Trenches ',
 'Trench-slope basins',
 'Forearc basins',
 'Intraarc basins',
 'Backarc basins',
 'Retroforeland basins',
 'Remnant ocean basins',
 'Proforeland basins ',
 'Wedgetop basins',
 'Hinterland basins',
 'Aulacogens',
 'Impactogens',
 'Collisional broken foreland',
 'Halokinetic basins ',
 'Bolide basins',
 'Successor basins',
 'Shelf-slope-rise configuration',
 'Transform configuration ',
 'Embankment configuration',
 'Oceanic intraarc basins ',
 'Continental intraarc basins',
 'Oceanic backarc basins',
 'Continental backarc',
 'Retroarc foreland basins ',
 'Collisional retroforeland',
 'Broken-retroforeland']
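
Several entries in the list above end with a space (for example 'Trenches '), so in a plain substring search they only match when a space follows the keyword in the sentence. An empty paragraph in the keyword file would yield '', which is a substring of every sentence. A small clean-up step, sketched here under the assumption that keywords should be compared without surrounding whitespace, avoids both problems. `normalize_keywords` is a hypothetical helper name.

```python
def normalize_keywords(raw_keywords):
    """Strip whitespace, drop empty entries and remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for kw in raw_keywords:
        kw = kw.strip()
        key = kw.casefold()
        # Skip blank lines and keywords that were already seen in another casing
        if not kw or key in seen:
            continue
        seen.add(key)
        cleaned.append(kw)
    return cleaned
```

The original order is preserved, so the first spelling of a keyword is the one that is kept.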

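4. Process the data: select the sentences that contain a keyword and save them

def handle_data(url):
    # Read the document at the given path
    data = read_data(url)
    print('Current url =', url)
    # Load the keywords
    keywords = read_keywords('./papers/key_words.docx')
    ret_list = []
    # Select the sentences that contain a keyword
    for i in data:
        for j in keywords:
#             print('Current sentence =', i, '\nCurrent keyword =', j)
            # Case-insensitive comparison
            if i.casefold().find(j.casefold()) != -1:
                # Add to the list, restoring the full stop and adding a line break
                ret_list.append(i + '.\n')
                print(j, "Original sentence:", i)
                # One matching keyword is enough for this sentence
                break
    return ret_list

5. Main routine

# urls is the list of file paths returned by scaner_file() in step 1
# Loop over all files
for url in urls:
    # Check whether any sentence contains a keyword
    sentences = handle_data(url)
    if sentences == []:
        print(f'File {url}: no keyword found')
    else:
        # Drop the '.docx' suffix (url[0:-5]) and save the result as a .txt file
        with open(url[0:-5] + ".txt", 'w', encoding='utf-8') as f:
            # Write every matching sentence, not just the first one
            f.writelines(sentences)
Current url = ./papers/1.docx
Retroarc foreland basins  Original sentence:  The source regions for retroarc foreland basins generally, and the Magallanes-Austral Basin specifically, can be broadly divided into (1) the magmatic arc, (2) the fold-and-thrust belt, and (3) sources around the periphery of foreland flexural subsidence
Current url = ./papers/2.docx
File ./papers/2.docx: no keyword found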
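
handle_data stops at the first keyword found in a sentence, so it does not show which other keywords the same sentence contains. When that information is useful, the matching step can be written as a pure function over lists of strings. This is only a sketch: `find_matches` is a hypothetical name, and it assumes that surrounding whitespace in sentences and keywords should be ignored.

```python
def find_matches(sentences, keywords):
    """Return (sentence, matched_keywords) pairs for sentences with at least one keyword."""
    results = []
    for sentence in sentences:
        folded = sentence.casefold()
        # Case-insensitive substring test; blank keywords are ignored
        hits = [kw.strip() for kw in keywords
                if kw.strip() and kw.strip().casefold() in folded]
        if hits:
            results.append((sentence.strip(), hits))
    return results
```

Keeping file reading separate from matching also makes the logic easy to check on a few hand-written sentences before running it on a whole folder.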