Extracting PDF and TXT content with Python, counting word frequencies, and drawing a word cloud

sinnp

Published 2024-04-28 09:25:15


Tags: python windows pdf

Article link: https://blog.csdn.net/sinnp/article/details/138267603


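This article walks through the complete process of reading the text of PDF files with Python and the PyMuPDF library, segmenting the Chinese text with jieba and counting word frequencies, removing stop words and words within a specific frequency range, and finally generating a word cloud.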
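
The filtering step mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the author's code: the helper name `filter_frequencies`, its `low`/`high` parameters, the inclusive range, and the toy token list are all assumptions made for the example.

```python
from collections import Counter


def filter_frequencies(freq, stopwords, low, high):
    """Drop stop words and words whose count lies in [low, high] (inclusive).

    Hypothetical helper: the name and the inclusive range are assumptions.
    """
    return {
        word: count
        for word, count in freq.items()
        if word not in stopwords and not (low <= count <= high)
    }


# Toy tokens standing in for the output of a word segmenter
tokens = ["数据", "的", "分析", "数据", "的", "的", "模型", "数据"]
freq = Counter(tokens)

# Remove the stop word "的" and every word that occurs exactly once
filtered = filter_frequencies(freq, stopwords={"的"}, low=1, high=1)
print(filtered)
```

Keeping the filter as a separate function makes it easy to adjust the stop-word set or the frequency bounds without touching the counting step.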


① Reading the contents of PDF files with Python

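import fitz  # PyMuPDF


def read_pdf(path_list):
    text = ""
    for file_path in path_list:
        file_text = ""
        with fitz.open(file_path) as doc:
            for page in doc:
                file_text += page.get_text()

        # Some PDFs have no extractable text, so raise an error
        # as soon as a file yields nothing
        if file_text == "":
            raise Exception(f"The content of {file_path} is empty!")
        text += file_text
    return text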
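
The title also mentions TXT files, which need no PDF library at all. The following is a minimal sketch of a companion reader, assuming the files are encoded as UTF-8 or GBK; the function name `read_txt` and its `encodings` parameter are hypothetical and do not come from the original post.

```python
from pathlib import Path


def read_txt(path_list, encodings=("utf-8", "gbk")):
    """Concatenate the text of several .txt files, trying each encoding in turn.

    Hypothetical helper: the candidate encodings are an assumption.
    """
    text = ""
    for file_path in path_list:
        raw = Path(file_path).read_bytes()
        for encoding in encodings:
            try:
                text += raw.decode(encoding)
                break
            except UnicodeDecodeError:
                continue
        else:
            # None of the candidate encodings could decode this file
            raise ValueError(f"Could not decode {file_path} with {encodings}")
    return text
```

The returned string has the same shape as the output of `read_pdf`, so both can be concatenated and passed to the same segmentation step.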

② After reading the content, segment it with jieba and count word frequencies


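import re
from collections import Counter

import jieba


def preprocess_text(text):
    # Keep only Chinese characters (U+4E00 to U+9FA5); everything else is dropped
    text = re.sub(r'[^\u4e00-\u9fa5]', '', text)
    words = jieba.cut(text)

    # Count how often each token produced by jieba occurs
    fenge = Counter(words)

    # Post-processing follows

    # Sort the (word, count) pairs by frequency, highest first
    words = sorted(fenge.items(), key=lambda x: x[1], reverse=True)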
    
    out = {}
    for word,count in words:
        out[word]=co