[python] Finding the most frequent words in papers: automatically open PDFs, convert them to txt, and write the results back

Article link: https://blog.csdn.net/StrongerIrene/article/details/102600336

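Emmm, so this script opens the PDFs in a folder, dumps their text into txt files, analyzes each paper on its own and then all of them together, and reports the words that occur most often.

1. The frequency analysis is adapted from code found online, with my own changes. For reading PDFs, see

https://blog.csdn.net/MrLevo520/article/details/52136414 (well written)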

2. Escaping: use the r prefix, or write every backslash as \\

https://blog.csdn.net/caibaoH/article/details/78335094
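As a small illustration of the escaping point, here is a sketch of three ways to spell the same Windows-style folder path. The folder name `me\papers` is made up for the example, and the trailing separator is there only because the script below builds file paths by plain string concatenation.

```python
# Three spellings of the same (hypothetical) Windows folder path.
escaped = "C:\\Users\\me\\papers\\"     # every backslash doubled
raw = r"C:\Users\me\papers" + "\\"      # raw string; a raw literal cannot end in a single backslash
forward = "C:/Users/me/papers/"         # forward slashes, which Python's open() also accepts on Windows

print(escaped)
print(escaped == raw)                          # both spell the same string
print(forward.replace("/", "\\") == escaped)   # same path, different separator
```

Without the r prefix or the doubling, `\U` in `"C:\Users"` is read as the start of a Unicode escape and Python 3 rejects the literal.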

3. File reading and writing

https://blog.csdn.net/qq_41094541/article/details/92131216

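If you want the output refreshed on every run, w+ does that directly, while a+ keeps the old content and appends to it.

With pdfminer there is a catch, though: the text is written piece by piece, one text box at a time, and the file is reopened for each write. Switch to w+ and every reopen wipes what was written before, so only a line or two is left at the end.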

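r: open the file read-only;

r+: open the file for reading and writing, with the pointer at the start of the file;

w+: open the file for reading and writing; if the file exists, its contents are deleted; if it does not exist, it is created;

a: open the file for appending, with the pointer at the end of the file, so new content follows the existing content;

a+: open the file for reading and writing; if the file exists, writes are appended; if it does not exist, it is created;

More detail: https://www.runoob.com/python/python-func-open.html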
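To make the difference between w+ and a+ concrete, here is a minimal sketch that reopens a file once per chunk, roughly the way the script reopens the output once per text box. The helper name `write_chunks` and the sample strings are made up for the illustration, and a temporary directory is used so nothing real is touched.

```python
import os
import tempfile

# Scratch file in a temporary directory (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

def write_chunks(mode, chunks):
    """Reopen the file once per chunk, then read back what survived."""
    for chunk in chunks:
        with open(path, mode, encoding="utf-8") as f:
            f.write(chunk + "\n")
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

chunks = ["first box", "second box", "third box"]
with_w = write_chunks("w+", chunks)   # every reopen truncates the file
os.remove(path)
with_a = write_chunks("a+", chunks)   # every reopen appends to the file

print(repr(with_w))   # 'third box\n'
print(repr(with_a))   # 'first box\nsecond box\nthird box\n'
```

With w+ only the last chunk is left, which matches the "only a line or two" symptom described above.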

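4. Deletion:

w+ takes care of this by itself, since it truncates the file. a+ does not, so the script removes each paper's old .txt before appending to it.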

5. The "-" problem. So annoying, it took me ages.
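The "-" in question is the hyphen left behind when a word is split across a line break in the PDF. A minimal sketch of one way to rejoin such words is shown here; `join_hyphenated` is a made-up helper for illustration, not the exact logic of the script below.

```python
def join_hyphenated(lines):
    """Join a word split by a trailing '-' with the start of the next line."""
    joined = []
    carry = ""                      # text waiting for its continuation
    for line in lines:
        text = carry + line.strip()
        if text.endswith("-"):
            carry = text[:-1]       # drop the hyphen, wait for the next line
        else:
            if text:
                joined.append(text)
            carry = ""
    if carry:                       # the last line ended with a hyphen
        joined.append(carry)
    return joined

print(join_hyphenated(["visual-\n", "ization of data\n", "plain line\n"]))
# ['visualization of data', 'plain line']
```

One known limitation of this approach: a real compound that happens to break at its hyphen (say "data-" / "driven") is glued together as well, because the text alone does not say which hyphens are genuine.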

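And yet, qvq... errors everywhere...

And then it is really slow -.-

I am not sure what the point is. Saving my hands a little effort, maybe...

There was one 60 MB file that I simply copied out by hand, and doing it manually is actually fine... but for batch processing... well, treat it as practice. The time I spent writing the code compared with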

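Reference answer: https://vizsec.dbvis.de/

Or generate the result directly online: https://wordart.com/create (though this site does not handle "-").

That site has another problem: stemming. Turning that option off works better,

otherwise user and users are both reduced to use and counted under use.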

See https://blog.csdn.net/m0_37744293/article/details/79065002

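Lemmatization reduces a word in any inflected form to its base form (which still carries the full meaning), whereas stemming extracts the stem or root of a word (which does not necessarily carry the full meaning). Lemmatization and stemming are the two main approaches to word-form normalization. Both merge word forms effectively, and they are related but distinct. "Stem" here means the root part of a word.

They also differ in where they are applied. Both are used in information retrieval and text processing, but with different emphasis. Stemming is used more in information retrieval, for example in Solr and Lucene, to broaden search matches at a coarse granularity. Lemmatization is used mainly in text mining and natural language processing, for finer-grained and more accurate text analysis and representation.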
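To show why stemming merges use, user and users, here is a toy sketch. Both `toy_stem` (a crude suffix stripper) and `toy_lemma` (a tiny lookup table) are made up for this illustration; they are not what wordart.com, Lucene, or any real NLP library does, and a real stemmer such as Porter's has many more rules.

```python
from collections import Counter

# Hypothetical mini lemma dictionary: inflected form -> base form.
TOY_LEMMAS = {"users": "user", "uses": "use", "used": "use", "using": "use"}

def toy_lemma(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in ("rs", "r", "s", "d"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["use", "user", "users", "uses"]
by_stem = Counter(toy_stem(w) for w in words)
by_lemma = Counter(toy_lemma(w) for w in words)

print(by_stem)    # Counter({'use': 4})
print(by_lemma)   # Counter({'use': 2, 'user': 2})
```

The stemmed counts fold all four forms into use, which is the effect described above, while the lemma lookup keeps user apart from use.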

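Finally, my crap-quality code

To use it, you only need to change the path.

If you run it more than once on the same folder, delete all.txt first. Nothing else needs attention (the rest is handled automatically).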

# coding:UTF-8
import re
import os
import io
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator


# First, count how often each word occurs in an English text.
def sumFrequency(pathStr,filename):
    wordCounts={}    # empty dict mapping each word to the number of times it occurs
    # common words that are left out of the count
    normalCounts={"the","and","a","of","to","in","for","is","be",
                  "have","their","all","also","more","which","each",
                  "not","at","that","an","by","as","on","we","can",
                  "are","such","our","or","this","from","with","it",
                  "these","other"}
    count=100         # how many entries to keep (most frequent first)
    rec=""            # text carried over from lines that ended with a hyphen
    with io.open(pathStr+filename+".txt",'r',encoding='utf-8') as file:
        for line in file:
            line=line.strip()
            if len(line)<=1:
                continue
            line=rec+line
            rec=""
            if line[-1]=="-": # a word was split across lines: drop the hyphen and join it with the next line
                rec=line[:-1]
                continue
            # process each complete line with lineprocess()
            lineprocess(line.lower(),wordCounts,normalCounts)
    if rec:  # the last line ended with a hyphen, so process what is still pending
        lineprocess(rec.lower(),wordCounts,normalCounts)

    # sort the (word, count) pairs by count, highest first, and keep at most `count` of them
    items=sorted(wordCounts.items(),key=lambda kv:(kv[1],kv[0]),reverse=True)[:count]
    strrrr=""
    for word,freq in items:
        strrrr=strrrr+(word+"\t"+str(freq)+'\n')
    print(filename)
    with io.open(pathStr+filename+"_result.txt",'w+',encoding='utf-8') as f:
        f.write(filename+'\n'+'**********'+'\n'+strrrr+'\n')


def lineprocess(line,wordCounts,normalCounts):
    # remove punctuation and digits; "-" is deliberately not in this list so hyphenated words survive
    for ch in "~@#$%^&*()_+=<>?/,.:;{}[]|·1234567890'\"":
        line=line.replace(ch,"")
    words=line.split()  # split on whitespace to get the list of words in this line
    for word in words:
        if word in normalCounts:  # skip the common words
            continue
        wordCounts[word]=wordCounts.get(word,0)+1
    # once this has run for every line, wordCounts holds the count of every word in the text

def Pdf2Txt(Path,Save_name):
    print(Save_name)
    parser = PDFParser(Path)  # create a PDF parser for the open file
    document = PDFDocument(parser) # create a PDF document object that stores the document structure
    if not document.is_extractable:  # check whether the file allows text extraction
        raise PDFTextExtractionNotAllowed
    rsrcmgr=PDFResourceManager() # resource manager that stores shared resources
    laparams=LAParams() # layout analysis parameters
    device=PDFPageAggregator(rsrcmgr,laparams=laparams) # PDF device object that collects the layout of each page
    interpreter=PDFPageInterpreter(rsrcmgr,device) # PDF interpreter object
    for page in PDFPage.create_pages(document):  # process every page
        interpreter.process_page(page)
        layout=device.get_result()  # the LTPage object for this page
        for x in layout:
            if isinstance(x,LTTextBoxHorizontal):
                text=x.get_text()
                newText=text.replace(" -","")  # drop a "-" that directly follows a space, together with that space
                print(newText+"a---------------------")
                with io.open(Save_name,'a+',encoding='utf-8') as f:
                    f.write(newText+'\n') # [written straight into this paper's txt]
                with io.open(pathStr+"all.txt",'a+',encoding='utf-8') as f:
                    f.write(text+'\n') # written into the summary file; note this is append mode

pathStr='C:\\Users\\change-this-to-your-own-path\\' # [Only this string needs changing. It must end with a path separator, because file names are appended to it directly.]
files = os.listdir(pathStr) # names of all files in the folder


for file in files:
    if not file.endswith(".pdf"):  # skip txt files and anything else
        continue
    file=file[:-4]          # strip ".pdf"
    # To rerun, [just delete all.txt; everything else is handled automatically.]
    if os.path.exists(pathStr+file+".txt"):
        os.remove(pathStr+file+".txt")
    with open(pathStr+file+".pdf", 'rb') as Path:
        Pdf2Txt(Path, pathStr+file+".txt")
    sumFrequency(pathStr,file)  # pass in the file name without its extension

sumFrequency(pathStr,"all")  # statistics over all papers together
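
For comparison, the counting and top-N steps of the script can also be sketched with `collections.Counter` from the standard library. This is an alternative sketch, not the code above: the function name `top_words`, the shortened stop-word set, and the sample sentence are all made up for the example, and the regular expression keeps hyphenated words such as data-driven as one token.

```python
import re
from collections import Counter

# Shortened, hypothetical stop-word set; the script above uses a longer one.
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}

def top_words(text, n=100):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # runs of letters, optionally joined by single hyphens
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

sample = "The user study and the user interface of a data-driven user study"
print(top_words(sample, 2))   # [('user', 3), ('study', 2)]
```

`most_common` does the sorting and slicing in one call, so asking for more entries than there are distinct words simply returns all of them.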