Python Practice: Data Processing and Mining

2z2z2

Published 2024-07-29 13:36:27


Article link: https://blog.csdn.net/2302_82050581/article/details/140769096


Download the paper: Niepel, M., et al. (2017). "Common and cell-type specific responses to anti-cancer drugs revealed by high throughput transcript profiling." Nat Commun 8(1): 1186. Remove the reference list at the end of the article, then generate a word cloud of the article.

import wordcloud
import PyPDF2


def pdf_to_txt():  # Convert the PDF file to a TXT file
    # Start with an empty string
    text = ""

    # Open the PDF file; the with-block closes it automatically
    with open("E:/Python Data/s41467-017-01383-w.pdf", "rb") as pdf_file:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Loop over every page and extract its text
        for page in pdf_reader.pages:
            text += page.extract_text()

    # Write the text to a TXT file; the with-block flushes and closes it
    with open("E:/Python Data/s41467-017-01383-w.txt", "w", encoding="UTF-8") as txt_writer:
        txt_writer.write(text)


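def txt_process():  # Text processing
    with open("E:/Python Data/s41467-017-01383-w.txt", "r", encoding="UTF-8") as f:
        t = f.read()
    t = t.lower()  # Convert all letters to lowercase
    for ch in ",./;[]=<>?:\"{}_|+`!@\\#$%^&*() ":
        t = t.replace(ch, " ")  # Replace punctuation with spaces
    # Turn line breaks and tabs into spaces so the words are separated;
    # the last replace joins a space followed by the "ﬁ" ligature into "fi"
    t = t.replace("\n", " ").replace("\r", " ").replace("\t", " ").replace(" ﬁ", "fi")
    end = t.find("references")  # Cut at the first occurrence of "references"
    if end != -1:  # find() returns -1 when the word is absent; keep the full text then
        t = t[:end]  # Drop the reference list at the end of the article
    return t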


def generate_wordclouds(text):
    w = wordcloud.WordCloud(background_color="White", width=2560, height=1440, collocations=False)  # Word cloud settings
    w.generate(text)
    w.to_file("E:/Python Data/pywordclouods.png")


def main():
    pdf_to_txt()
    text = txt_process()
    print(text)
    generate_wordclouds(text)


if __name__ == "__main__":
    main()

Running the script writes the word cloud image to pywordclouods.png.
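The cleaning step decides which words reach the word cloud, so it can help to inspect the word frequencies directly before rendering an image. The following is a minimal sketch that uses only the standard library. The names `clean_text` and `top_words` and the sample string are hypothetical and introduced for illustration; the regex-based tokenization is an alternative to the character-by-character replacement above, not the method used in the script.

```python
import re
from collections import Counter


def clean_text(raw, marker="references"):
    """Lowercase the text, expand the 'fi' ligature, and cut at the marker."""
    text = raw.lower().replace("\ufb01", "fi")  # expand the ligature character
    cut = text.find(marker)  # first occurrence, as in txt_process
    if cut != -1:  # keep everything when the marker is absent
        text = text[:cut]
    return text


def top_words(text, n=3, min_len=3):
    """Return the n most common alphabetic words with at least min_len letters."""
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if len(w) >= min_len).most_common(n)


# Hypothetical sample text standing in for the extracted article
sample = "Drug response: drug dose and cell type. Cell lines vary.\nReferences\n1. Cell paper."
print(top_words(clean_text(sample), n=2))
```

In this sample, "drug" and "cell" each appear twice before the marker, and the third "cell" inside the reference entry is not counted because the text is cut first.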

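The current directory contains a text file named score1.txt that stores a class's computer course scores in three columns: student ID (学号), regular score (平时成绩), and final exam score (期末成绩). Compute the overall score (rounded to an integer) with the regular score weighted 30% and the final exam score weighted 70%, and write two columns, student ID and overall score, to another file, score2.txt. Also print the total number of students, the number of students in each band of the overall score (90 and above, 80-89, 70-79, 60-69, below 60), and the class average (rounded to an integer).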

import pandas as pd


def txt_to_csv():  # Convert the tab-separated TXT file into a standard CSV file
    with open("E:/Python Data/score1.txt", "r", encoding="UTF-8") as ifs:  # Open the input file
        datas = ifs.read()  # Read all the data
    new_datas = datas.replace("\t", ",")  # Replace tabs with commas
    with open("E:/Python Data/score1.csv", "w", encoding="UTF-8") as ofs:  # Open the output file for writing
        ofs.write(new_datas)  # Write the data; the with-block closes the file


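def data_write():  # Compute the overall score and write it to a new file
    # Column names: 学号 = student ID, 平时成绩 = regular score, 期末成绩 = final exam score
    df = pd.read_csv("E:/Python Data/score1.csv")  # Read the data into a DataFrame
    df.set_index("学号", inplace=True)  # Use the student ID as the index
    df["总评成绩"] = df["平时成绩"] * 0.3 + df["期末成绩"] * 0.7  # Overall score (总评成绩)
    df["总评成绩"] = df["总评成绩"].round(0).astype(int)  # Round to the nearest integer
    df2 = df.drop(["平时成绩", "期末成绩"], axis=1)  # Drop the regular and final score columns
    df2.to_csv("E:/Python Data/score2.txt", sep="\t")  # The task asks for score2.txt; write it tab-separated
    return df2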


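def data_process(df):  # Print the statistics the task asks for
    # A local variable avoids nesting double quotes inside the f-strings,
    # which is a syntax error before Python 3.12
    s = df["总评成绩"]  # Overall score column
    print(f"Total number of students: {len(df)}")  # Row count, i.e. the class size
    print(f"Students scoring 90 or above: {(s >= 90).sum()}")  # Count per score band
    print(f"Students scoring 80-89: {((s < 90) & (s >= 80)).sum()}")
    print(f"Students scoring 70-79: {((s < 80) & (s >= 70)).sum()}")
    print(f"Students scoring 60-69: {((s < 70) & (s >= 60)).sum()}")
    print(f"Students scoring below 60: {(s < 60).sum()}")
    print(f"Class average: {int(round(s.mean()))}")  # Mean rounded to an integer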


def main():
    txt_to_csv()
    df = data_write()
    data_process(df)


if __name__ == "__main__":
    main()
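One detail worth checking in the rounding step: NumPy and pandas round exact halves to the nearest even integer, and a weighted sum computed in binary floating point may land slightly off a half, so the result can differ from conventional round-half-up. If round-half-up is what the course expects, one option is the standard `decimal` module. The sketch below is dependency-free; `overall_score`, `band_counts`, and the sample records are hypothetical and not part of the original solution.

```python
from decimal import Decimal, ROUND_HALF_UP


def overall_score(regular, final):
    """Weighted score (30% regular, 70% final), rounded half up to an int."""
    total = Decimal(str(regular)) * Decimal("0.3") + Decimal(str(final)) * Decimal("0.7")
    return int(total.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def band_counts(scores):
    """Count scores in the bands 90+, 80-89, 70-79, 60-69, and below 60."""
    bands = {"90+": 0, "80-89": 0, "70-79": 0, "60-69": 0, "<60": 0}
    for s in scores:
        if s >= 90:
            bands["90+"] += 1
        elif s >= 80:
            bands["80-89"] += 1
        elif s >= 70:
            bands["70-79"] += 1
        elif s >= 60:
            bands["60-69"] += 1
        else:
            bands["<60"] += 1
    return bands


# Hypothetical sample records: (student ID, regular score, final exam score)
records = [("001", 80, 85), ("002", 95, 92), ("003", 50, 58)]
scores = [overall_score(r, f) for _, r, f in records]
print(scores)
print(band_counts(scores))
```

For the first sample record the exact weighted sum is 83.5, which this sketch rounds up to 84.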
