python编程数据分析 pandas数据统计

目录

一:数据集准备

二:加载文件

三:分组操作进行统计


一:数据集准备

可以创建一个txt,并放置pycharm工程目录下

下面是博主的数据集测试,所用数据,需要的自取

1001,Chinese,1,80
1001,Chinese,2,81
1001,Chinese,3,79
1001,Chinese,4,86
1001,Math,1,69
1001,Math,2,70
1001,Math,3,79
1001,Math,4,90
1001,English,1,90
1001,English,2,89
1001,English,3,92
1001,English,4,96
1002,Chinese,1,85
1002,Chinese,2,86
1002,Chinese,3,88
1002,Chinese,4,71
1002,Math,1,63
1002,Math,2,96
1002,Math,3,68
1002,Math,4,72
1002,English,1,63
1002,English,2,93
1002,English,3,86
1002,English,4,75
1003,Chinese,1,87
1003,Chinese,2,81
1003,Chinese,3,82
1003,Chinese,4,77
1003,Math,1,69
1003,Math,2,91
1003,Math,3,61
1003,Math,4,79
1003,English,1,68
1003,English,2,82
1003,English,3,87
1003,English,4,96
1004,Chinese,1,81
1004,Chinese,2,77
1004,Chinese,3,92
1004,Chinese,4,68
1004,Math,1,96
1004,Math,2,85
1004,Math,3,85
1004,Math,4,74
1004,English,1,67
1004,English,2,63
1004,English,3,96
1004,English,4,77

二:加载文件

加载文件 read_csv 

import numpy as np
import pandas as pd

# 列名
columns = ['sno', 'subject', 'unit', 'score']
# 1 加载文件
df = pd.read_csv("student.txt", sep=',', names=columns)
print(df.head())
print(df.shape)
    sno  subject  unit  score
0  1001  Chinese     1     80
1  1001  Chinese     2     81
2  1001  Chinese     3     79
3  1001  Chinese     4     86
4  1001     Math     1     69
(48, 4)

如上结果

加载读取数据集head 头部数据,

shape 规格形状  为4列,分别为学号、科目、次数、分数 

三:分组操作进行统计

1 按照科目分组

# 根据科目进行分组
df_subject = df.groupby('subject')
for i in df_subject:
    print(i)

import numpy as np
import pandas as pd

# 列名
columns = ['sno', 'subject', 'unit', 'score']
# 1 加载文件
df = pd.read_csv("student.txt", sep=',', names=columns, index_col=0)
# 分组操作进行统计
# 根据科目进行分组、只获取分数
df_subject = df.groupby('subject')
for i in df_subject:
    print(i)
('Chinese',       subject  unit  score
sno                       
1001  Chinese     1     80
1001  Chinese     2     81
1001  Chinese     3     79
1001  Chinese     4     86
1002  Chinese     1     85
1002  Chinese     2     86
1002  Chinese     3     88
1002  Chinese     4     71
1003  Chinese     1     87
1003  Chinese     2     81
1003  Chinese     3     82
1003  Chinese     4     77
1004  Chinese     1     81
1004  Chinese     2     77
1004  Chinese     3     92
1004  Chinese     4     68)
('English',       subject  unit  score
sno                       
1001  English     1     90
1001  English     2     89
1001  English     3     92
1001  English     4     96
1002  English     1     63
1002  English     2     93
1002  English     3     86
1002  English     4     75
1003  English     1     68
1003  English     2     82
1003  English     3     87
1003  English     4     96
1004  English     1     67
1004  English     2     63
1004  English     3     96
1004  English     4     77)
('Math',      subject  unit  score
sno                      
1001    Math     1     69
1001    Math     2     70
1001    Math     3     79
1001    Math     4     90
1002    Math     1     63
1002    Math     2     96
1002    Math     3     68
1002    Math     4     72
1003    Math     1     69
1003    Math     2     91
1003    Math     3     61
1003    Math     4     79
1004    Math     1     96
1004    Math     2     85
1004    Math     3     85
1004    Math     4     74)

2 各个科目只取分数 

# 根据科目进行分组只获取分数
df_subject = df.groupby('subject')['score']
for i in df_subject:
    print(i)

import numpy as np
import pandas as pd

# 列名
columns = ['sno', 'subject', 'unit', 'score']
# 1 加载文件
df = pd.read_csv("student.txt", sep=',', names=columns, index_col=0)
# 分组操作进行统计
# 根据科目进行分组、只获取分数
df_subject = df.groupby('subject')['score']
for i in df_subject:
    print(i)
('Chinese', sno
1001    80
1001    81
1001    79
1001    86
1002    85
1002    86
1002    88
1002    71
1003    87
1003    81
1003    82
1003    77
1004    81
1004    77
1004    92
1004    68
Name: score, dtype: int64)
('English', sno
1001    90
1001    89
1001    92
1001    96
1002    63
1002    93
1002    86
1002    75
1003    68
1003    82
1003    87
1003    96
1004    67
1004    63
1004    96
1004    77
Name: score, dtype: int64)
('Math', sno
1001    69
1001    70
1001    79
1001    90
1002    63
1002    96
1002    68
1002    72
1003    69
1003    91
1003    61
1003    79
1004    96
1004    85
1004    85
1004    74
Name: score, dtype: int64)

3 各个科目的平均成绩

df_subject_mean = df.groupby('subject')['score'].mean()
print(df_subject_mean)

import numpy as np
import pandas as pd

# 列名
columns = ['sno', 'subject', 'unit', 'score']
# 1 加载文件
df = pd.read_csv("student.txt", sep=',', names=columns, index_col=0)
# 分组操作进行统计
# 根据科目进行分组、只获取分数
# df_subject = df.groupby('subject')['score']
# for i in df_subject:
#     print(i)

df_subject_mean = df.groupby('subject')['score'].mean()
print(df_subject_mean)
subject
Chinese    81.3125
English    82.5000
Math       77.9375
Name: score, dtype: float64

4 各个同学,各个科目的平均成绩
# 分组因素 学号 科目
df_mean = df.groupby(['sno', 'subject'])['score'].mean()
print(df_mean)

import numpy as np
import pandas as pd

# 列名
columns = ['sno', 'subject', 'unit', 'score']
# 1 加载文件
df = pd.read_csv("student.txt", sep=',', names=columns, index_col=0)
# 分组操作进行统计
# 每个同学 每个科目的平均成绩
# 分组因素 学号 科目
df_mean = df.groupby(['sno', 'subject'])['score'].mean()
print(df_mean)

sno   subject
1001  Chinese    81.50
      English    91.75
      Math       77.00
1002  Chinese    82.50
      English    79.25
      Math       74.75
1003  Chinese    81.75
      English    83.25
      Math       75.00
1004  Chinese    79.50
      English    75.75
      Math       85.00
Name: score, dtype: float64
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
本文介绍的是利用Python语言,做成绩分析并生成成绩分析动态图表。Python语言可以利用Pandas、Pyecharts等各种类库,进行数据分析。 本文介绍的成绩分析大体分为三步: 一、拼合单科成绩,合成学年成绩,计算总分,按总分成绩排名次,然后由学年成绩筛选出各个班级的成绩,将学年成绩,各班级成绩存入一个Excel文件中,工作表分别命名为学年成绩,高三(1)班……等 二、利用生成的第一步生成的Excel文件,做成绩分析,保存成绩分析表格。 三、利用成绩分析表格,做成绩分析动态图。 下面是部分源代码: 1、成绩整理与合并 import glob import os import pandas as pd from functools import reduce inputPath="./原始成绩/" writer_lk = pd.ExcelWriter('./整理后的成绩/2020一模理科总成绩及各班级成绩.xlsx') writer_wk = pd.ExcelWriter('./整理后的成绩/2020一模文科总成绩及各班级成绩.xlsx') inputWorkbook=glob.glob(os.path.join(inputPath,"*.xls")) #====================读取全部学生的所有科目成绩=================================== yw_score = pd.read_excel(inputWorkbook[2]) sxlk_score = pd.read_excel(inputWorkbook[1]) sxwk_score = pd.read_excel(inputWorkbook[0]) yy_score = pd.read_excel(inputWorkbook[5]) yy_score['英语'] = (yy_score['英语'] * 1.25).round(0)#英语成绩不计算听力成绩*1.25 lkzh_score = pd.read_excel(inputWorkbook[4]) wkzh_score = pd.read_excel(inputWorkbook[3]) #======================================================================= #====================整理出理科成绩及分班成绩、计算总分、总分排名、班级排名============================= lk_class = ['高三(1)班','高三(2)班','高三(3)班','高三(4)班'] wk_class = ['高三(5)班','高三(6)班'] lk_yw = yw_score.loc[(yw_score.班级.isin(lk_class)), ['班级','姓名','语文']] lk_sx = sxlk_score[['姓名','数学']] lk_yy = yy_score.loc[(yy_score.班级.isin(lk_class)), ['姓名','英语']] lk_k3 = lkzh_score[['姓名','物理','化学','生物','理综']] lk_list = [lk_yw, lk_sx, lk_yy, lk_k3] score_lk = (reduce(lambda left, right: pd.merge(left, right, on='姓名'), lk_list)) score_lk['总分'] = (score_lk['语文'] + score_lk['数学'] + score_lk['英语'] + score_lk['理综']).round(0) def sort_grade(score): score_sort = score.sort_values(by=['总分'], ascending=False) score_sort['年级排名'] = score_sort['总分'].rank(ascending=0,method='min') return score_sort def sort_class_lk(score_garde,name): class_sort = score_garde.loc[score_garde.班级 == name, :] class_sort = class_sort.sort_values(by=['总分'], ascending=False) class_sort['班级排名'] = class_sort['总分'].rank(ascending=0,method='min') class_sort.to_excel(writer_lk, index=None, sheet_name=name) lk_grade_sort = sort_grade(score_lk) lk_grade_sort.to_excel(writer_lk, index=None, sheet_name='学年成绩') for lk in lk_class: class_sort = sort_class_lk(score_lk, lk) writer_lk.save() writer_lk.close() # #============整理出文科成绩及分班成绩、计算总分、总分排名、班级排名================== wk_yw = yw_score.loc[(yw_score.班级.isin(wk_class)), ['班级','姓名','语文']] wk_sx = sxwk_score[['姓名','数学']] wk_yy = yy_score.loc[(yy_score.班级.isin(wk_class)), ['姓名','英语']] wk_k3 = wkzh_score[['姓名','政治','历史','地理','文综']] wk_list = [wk_yw, wk_sx, wk_yy, wk_k3] score_wk = (reduce(lambda left, right: pd.merge(left, right, on='姓名'), wk_list)) score_wk['总分'] = (score_wk['语文'] + score_wk['数学'] + score_wk['英语'] + score_wk['文综']).round(0) def sort_class_wk(score_garde,name): class_sort = score_garde.loc[score_garde.班级 == name, :] class_sort = class_sort.sort_values(by=['总分'], ascending=False) class_sort['班级排名'] = class_sort['总分'].rank(ascending=0,method='min') class_sort.to_excel(writer_wk, index=None, sheet_name=name) wk_grade_sort = sort_grade(score_wk) wk_grade_sort.to_excel(writer_wk, index=None, sheet_name='学年成绩') for wk in wk_class: class_sort = sort_class_wk(wk_grade_sort, wk) writer_wk.save() writer_wk.close() 2、成绩区间分割与统计 #coding:utf-8 import numpy as np import pandas as pd from functools import reduce fpath_lk="./整理后的成绩/2020一模理科总成绩及各班级成绩.xlsx" fpath_wk="./整理后的成绩/2020一模文科总成绩及各班级成绩.xlsx" writer_lk = pd.ExcelWriter('./整理后的成绩/2020一模理科成绩区间分布统计.xlsx') writer_wk = pd.ExcelWriter('./整理后的成绩/2020一模文科成绩区间分布统计.xlsx') lk = pd.read_excel(fpath_lk, None) #获取表格中的所有工作表的内容 wk = pd.read_excel(fpath_wk, None) #===================1.定义区间分割函数===================================== def cut_750(score_750,len): bins_750= [0,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590,600,620,640,660,750] labels_750 = ['0-370','370-379','380-389','390-399','400-409','410-419','420-429','430-439','440-449','450-459','460-469','470-479','480-489','490-499','500-509','510-519','520-529','530-539','540-549','550-559','560-569','570-579','580-589','590-599','600-619','620-639','640-659','660-750'] cut_750 = pd.cut(score_750, bins_750, labels=labels_750, right=False) qj = pd.DataFrame({'区间':pd.value_counts(cut_750).index,'人数':pd.value_counts(cut_750),'百分比':((pd.value_counts(cut_750))/len).round(3).apply(lambda x: format(x, '.2%'))}).sort_values(by='区间', ascending=False) qj = qj.reset_index(drop=True) return qj def cut_150(score_150,len): bins_150 = [0,30,60,90,120,150] labels_150 = ['0-30', '30-60', '60-90', '90-120', '120-150'] cut_150 = pd.cut(score_150, bins_150, labels=labels_150, right=False) qj = pd.DataFrame({'区间':pd.value_counts(cut_150).index,'人数':pd.value_counts(cut_150),'百分比':((pd.value_counts(cut_150))/len).round(3).apply(lambda x: format(x, '.2%'))}).sort_values(by='区间') 其他源代码及始数据已上传,欢迎各位借鉴,第一次编程,希望网友们能指点不足之处,联系qq:912182988
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

chenruhan_QAQ_

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值