Python Score Analyzer: Using Python to Judge from Scores Whether to Pursue Graduate Study

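This article uses Python to analyze student data with features such as GRE and TOEFL scores and to assess the likelihood of continuing to graduate study. The analysis finds that GRE, CGPA, and TOEFL scores have the greatest influence on whether a student pursues further study. The visualizations cover the number of students with research experience, the GRE score distribution, and the relationships between CGPA and university rating, GRE and CGPA, and TOEFL and GRE scores. Among linear regression, random forest, and decision tree regression models, linear regression performs best. Logistic regression, random forest, and decision tree classifiers are also compared, with random forest and logistic regression reaching higher prediction accuracy. Finally, K-means and hierarchical clustering are applied to group the data and lead to similar conclusions.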

Case study: this dataset records each student's scores. Below we analyze it to judge whether a student is suited to pursue graduate study.

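Dataset features

1 GRE score (290 to 340)
2 TOEFL score (92 to 120)
3 University rating (1 to 5)
4 SOP, read in this post as the student's own motivation (1 to 5)
5 LOR, strength of the recommendation letter (1 to 5)
6 CGPA (6.8 to 9.92)
7 Research experience (0 or 1)
8 Chance of Admit, read in this post as the intention to pursue a master's degree (0.34 to 0.97)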

1. Import packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

2. Load and inspect the dataset

df = pd.read_csv("D:\\machine-learning\\score\\Admission_Predict.csv",sep = ",")

print('There are', len(df.columns), 'columns')

# Print every column name on a single line
for c in df.columns:
    sys.stdout.write(str(c) + ', ')

There are 9 columns

Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR , CGPA, Research, Chance of Admit ,

There are 9 columns in total.

df.info()

RangeIndex: 400 entries, 0 to 399

Data columns (total 9 columns):

Serial No. 400 non-null int64

GRE Score 400 non-null int64

TOEFL Score 400 non-null int64

University Rating 400 non-null int64

SOP 400 non-null float64

LOR 400 non-null float64

CGPA 400 non-null float64

Research 400 non-null int64

Chance of Admit 400 non-null float64

dtypes: float64(4), int64(5)

memory usage: 28.2 KB

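Dataset information:

1. The data has 9 columns: serial number, GRE score, TOEFL score, university rating, SOP, LOR, CGPA, research experience, and chance of admit.

2. The dataset contains no null values.

3. There are 400 records in total.

# Tidy the column name (the raw header carries a trailing space, as the column listing above shows)
df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})

# Show the first 5 rows
df.head()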

3. Inspect the correlation between features

fig,ax = plt.subplots(figsize=(10,10))

sns.heatmap(df.corr(),ax=ax,annot=True,linewidths=0.05,fmt='.2f',cmap='magma')

plt.show()

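Conclusions: 1. The features most likely to influence whether a student goes on to a master's degree are the GRE, CGPA, and TOEFL scores.

2. The features with comparatively little influence are LOR, SOP, and Research.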
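The heatmap reading can also be checked numerically by sorting each feature's correlation with the target column. The following is a minimal sketch on a small synthetic frame; the values are made up for illustration and are not taken from Admission_Predict.csv, and the helper name `rank_by_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return absolute Pearson correlations with `target`, strongest first."""
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Synthetic stand-in data: 'strong' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
noise = rng.normal(size=200)
demo = pd.DataFrame({
    "strong": strong,
    "noise": noise,
    "target": 2.0 * strong + 0.1 * rng.normal(size=200),
})

ranking = rank_by_correlation(demo, "target")
print(ranking)
```

On the real data, the same call would presumably be `rank_by_correlation(df, 'Chance of Admit')` once the column has been renamed.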

4. Data visualization and bivariate analysis

4.1 Number of students with research experience

print("Not Having Research:", len(df[df.Research == 0]))
print("Having Research:", len(df[df.Research == 1]))

y= np.array([len(df[df.Research == 0]),len(df[df.Research == 1])])

x= np.arange(2)

plt.bar(x,y)

plt.title("Research Experience")

plt.xlabel("Candidates")

plt.ylabel("Frequency")

plt.xticks(x,('Not having research','Having research'))

plt.show()

Conclusion: 219 students have research experience, and 181 have none from their undergraduate years.
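The two filtered `len(...)` calls can also be replaced by a single `value_counts()` call. A minimal sketch, using a short made-up Series in place of `df['Research']`:

```python
import pandas as pd

# Hypothetical 0/1 research flags standing in for df['Research'].
research = pd.Series([1, 0, 1, 1, 0, 1], name="Research")

# Count each distinct value, ordered by the flag (0 first, then 1).
counts = research.value_counts().sort_index()

labels = {0: "Not having research", 1: "Having research"}
for flag, n in counts.items():
    print(f"{labels[flag]}: {n}")
```

The resulting `counts` Series can be passed straight to `plt.bar` in place of the hand-built array.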

4.2 Students' TOEFL scores

y = np.array([df['TOEFL Score'].min(),df['TOEFL Score'].mean(),df['TOEFL Score'].max()])

x= np.arange(3)

plt.bar(x,y)

plt.title('TOEFL Score')

plt.xlabel('Level')

plt.ylabel('TOEFL Score')

plt.xticks(x,('Worst','Average','Best'))

plt.show()

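Conclusion: the lowest score is 92 and the highest is the full score of 120, so students aiming for further study have strong English scores.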

4.3 GRE scores

df['GRE Score'].plot(kind='hist',bins=200,figsize=(6,6))

plt.title('GRE Score')

plt.xlabel('GRE Score')

plt.ylabel('Frequency')

plt.show()

Conclusion: scores around 310 and 330 are the most common.

4.4 Relationship between CGPA and university rating

plt.scatter(df['University Rating'],df['CGPA'])

plt.title('CGPA Scores for University ratings')

plt.xlabel('University Rating')

plt.ylabel('CGPA')

plt.show()

Conclusion: the better the university, the higher the students' CGPA tends to be.

4.5 Relationship between GRE score and CGPA

plt.scatter(df['GRE Score'],df['CGPA'])

plt.title('CGPA for GRE Scores')

plt.xlabel('GRE Score')

plt.ylabel('CGPA')

plt.show()

Conclusion: the higher the CGPA, the higher the GRE score; the two are strongly correlated.

4.6 Relationship between TOEFL score and GRE score

df[df['CGPA']>=8.5].plot(kind='scatter',x='GRE Score',y='TOEFL Score',color='red')

plt.xlabel('GRE Score')

plt.ylabel('TOEFL Score')

plt.title('CGPA >= 8.5')

plt.grid(True)

plt.show()

Conclusion: in most cases GRE and TOEFL scores are positively correlated, but a high GRE score does not guarantee a high TOEFL score.

4.6 Relationship between university rating and going on to a master's degree

s = df[df['Chance of Admit'] >= 0.75]['University Rating'].value_counts().head(5)

plt.title('University Ratings of Candidates with a 75% acceptance chance')

s.plot(kind='bar',figsize=(20,10),cmap='Pastel1')

plt.xlabel('University Rating')

plt.ylabel('Candidates')

plt.show()

Conclusion: students from higher-rated universities are more likely to pursue further study.

4.7 Relationship between SOP and GPA

plt.scatter(df['CGPA'],df['SOP'])

plt.xlabel('CGPA')

plt.ylabel('SOP')

plt.title('SOP for CGPA')

plt.show()

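Conclusion: students with a high GPA show a stronger personal motivation (SOP) to pursue a master's degree.

4.8 Relationship between SOP and GRE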

plt.scatter(df['GRE Score'],df['SOP'])

plt.xlabel('GRE Score')

plt.ylabel('SOP')

plt.title('SOP for GRE Score')

plt.show()

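Conclusion: students with a stronger motivation to pursue a master's degree tend to have higher GRE scores.

5. Models

5.1 Prepare the dataset

# Read the dataset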

df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')

serialNO= df['Serial No.'].values

df.drop(['Serial No.'],axis=1,inplace=True)

df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})

# Split the dataset
y = df['Chance of Admit'].values
x = df.drop(['Chance of Admit'], axis=1)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Normalize the data: fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import MinMaxScaler
scaleX = MinMaxScaler(feature_range=(0, 1))
x_train[x_train.columns] = scaleX.fit_transform(x_train[x_train.columns])
x_test[x_test.columns] = scaleX.transform(x_test[x_test.columns])
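A common convention for min-max scaling is to learn the minimum and maximum from the training split only and then apply that same mapping to the test split, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up numbers (not values from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data standing in for the train and test splits.
train = np.array([[290.0], [310.0], [340.0]])
test = np.array([[300.0], [345.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min=290 and max=340 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())  # training values land in [0, 1]
print(test_scaled.ravel())   # test values may fall slightly outside [0, 1]
```

A test value above the training maximum (345 here) maps to a number greater than 1, which is expected and harmless for the models used in this post.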

5.2 Regression

5.2.1 Linear regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lr = LinearRegression()
lr.fit(x_train, y_train)
y_head_lr = lr.predict(x_test)

print('Real value of y_test[1]:' + str(y_test[1]) + '-> predict value:' + str(lr.predict(x_test.iloc[[1], :])))
print('Real value of y_test[2]:' + str(y_test[2]) + '-> predict value:' + str(lr.predict(x_test.iloc[[2], :])))
print('r_square score:', r2_score(y_test, y_head_lr))

# R^2 on the training data, for comparison with the test score
y_head_lr_train = lr.predict(x_train)
print('r_square score(train data):', r2_score(y_train, y_head_lr_train))

5.2.2 Random forest regression

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
y_head_rfr = rfr.predict(x_test)

print('Real value of y_test[1]:' + str(y_test[1]) + '-> predict value:' + str(rfr.predict(x_test.iloc[[1], :])))
print('Real value of y_test[2]:' + str(y_test[2]) + '-> predict value:' + str(rfr.predict(x_test.iloc[[2], :])))
print('r_square score:', r2_score(y_test, y_head_rfr))

# R^2 on the training data, for comparison with the test score
y_head_rfr_train = rfr.predict(x_train)
print('r_square score(train data):', r2_score(y_train, y_head_rfr_train))

5.2.3 Decision tree regression

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

dt = DecisionTreeRegressor(random_state=42)
dt.fit(x_train, y_train)
y_head_dt = dt.predict(x_test)

print('Real value of y_test[1]:' + str(y_test[1]) + '-> predict value:' + str(dt.predict(x_test.iloc[[1], :])))
print('Real value of y_test[2]:' + str(y_test[2]) + '-> predict value:' + str(dt.predict(x_test.iloc[[2], :])))
print('r_square score:', r2_score(y_test, y_head_dt))

# R^2 on the training data, for comparison with the test score
y_head_dt_train = dt.predict(x_train)
print('r_square score(train data):', r2_score(y_train, y_head_dt_train))

5.2.4 Comparison of the three regression methods

y =np.array([r2_score(y_test,y_head_lr),r2_score(y_test,y_head_rfr),r2_score(y_test,y_head_dt)])

x= np.arange(3)

plt.bar(x,y)

plt.title('Comparison of Regression Algorithms')

plt.xlabel('Regression')

plt.ylabel('r2_score')

plt.xticks(x,("LinearRegression","RandomForestReg.","DecisionTreeReg."))

plt.show()

Conclusion: among the regression algorithms, linear regression performs best.
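The three fit-and-score steps can be written as one loop over a dictionary of models, which keeps the comparison consistent. The sketch below is a minimal illustration on synthetic, nearly linear data, not on the admissions data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, nearly linear data standing in for the scaled features and target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.05, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestReg.": RandomForestRegressor(n_estimators=100, random_state=42),
    "DecisionTreeReg.": DecisionTreeRegressor(random_state=42),
}

# Fit each model on the same split and record its test-set R^2.
r2_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    r2_by_model[name] = r2_score(y_te, model.predict(X_te))
    print(f"{name}: R^2 = {r2_by_model[name]:.3f}")
```

The dictionary keys double as tick labels, so `plt.bar(list(r2_by_model), list(r2_by_model.values()))` would reproduce a bar chart like the one above.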

5.2.5 Comparing the three regression methods with the actual values

red = plt.scatter(np.arange(0,80,5),y_head_lr[0:80:5],color='red')

blue= plt.scatter(np.arange(0,80,5),y_head_rfr[0:80:5],color='blue')

green= plt.scatter(np.arange(0,80,5),y_head_dt[0:80:5],color='green')

black= plt.scatter(np.arange(0,80,5),y_test[0:80:5],color='black')

plt.title('Comparison of Regression Algorithms')

plt.xlabel('Index of candidate')

plt.ylabel('Chance of admit')

plt.legend([red,blue,green,black],['LR','RFR','DT','REAL'])

plt.show()

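Conclusion: about 70% of the candidates in the dataset are likely to pursue a master's degree; the plot above also shows that some points are still not predicted well.

5.3 Classification algorithms

5.3.1 Prepare the data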

df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')

SerialNO= df['Serial No.'].values

df.drop(['Serial No.'],axis=1,inplace=True)

df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})
y = df['Chance of Admit'].values
x = df.drop(['Chance of Admit'], axis=1)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import MinMaxScaler
scaleX = MinMaxScaler(feature_range=(0, 1))
x_train[x_train.columns] = scaleX.fit_transform(x_train[x_train.columns])
x_test[x_test.columns] = scaleX.transform(x_test[x_test.columns])

# If chance > 0.8 the label is 1, otherwise it is 0
y_train_01 = [1 if each > 0.8 else 0 for each in y_train]
y_test_01 = [1 if each > 0.8 else 0 for each in y_test]
y_train_01 = np.array(y_train_01)
y_test_01 = np.array(y_test_01)
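The list comprehension and the `np.array` conversion can be combined into one vectorized comparison. A minimal sketch; the helper name `binarize` and the sample values are hypothetical:

```python
import numpy as np


def binarize(chances, threshold=0.8):
    """Map admission chances to 1 (strictly above threshold) or 0 (otherwise)."""
    return (np.asarray(chances) > threshold).astype(int)


labels = binarize([0.92, 0.80, 0.65, 0.81])
print(labels)
```

Because the comparison is strict, a chance of exactly 0.80 is labeled 0, matching the `each > 0.8` rule used in the post.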

5.3.2 Logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

lrc = LogisticRegression()
lrc.fit(x_train, y_train_01)
print('score:', lrc.score(x_test, y_test_01))
print('Real value of y_test_01[1]:' + str(y_test_01[1]) + '-> predict value:' + str(lrc.predict(x_test.iloc[[1], :])))
print('Real value of y_test_01[2]:' + str(y_test_01[2]) + '-> predict value:' + str(lrc.predict(x_test.iloc[[2], :])))

# Confusion matrix on the test dataset

cm_lrc=confusion_matrix(y_test_01,lrc.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_lrc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, lrc.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, lrc.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, lrc.predict(x_test)))

# Test for the train dataset:

cm_lrc_train=confusion_matrix(y_train_01,lrc.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_lrc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

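Conclusions: 1. According to the confusion matrix, logistic regression misclassifies 23 samples on the training set, and 72 students are identified as wanting to go on to a master's degree.

2. On the test set, 7 samples are misclassified.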
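Counts like these can be read directly from the confusion matrix array instead of from the heatmap: the off-diagonal entries are the misclassified samples. A minimal sketch with made-up labels (not the post's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)     # rows: real class, columns: predicted class
misclassified = cm.sum() - np.trace(cm)   # everything off the diagonal is an error
predicted_positive = cm[:, 1].sum()       # samples predicted as class 1

print(cm)
print("misclassified:", misclassified)
print("predicted positive:", predicted_positive)
```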

5.3.3 Support vector machine (SVM)

from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

svm = SVC(random_state=1, kernel='rbf')
svm.fit(x_train, y_train_01)
print('score:', svm.score(x_test, y_test_01))
print('Real value of y_test_01[1]:' + str(y_test_01[1]) + '-> predict value:' + str(svm.predict(x_test.iloc[[1], :])))
print('Real value of y_test_01[2]:' + str(y_test_01[2]) + '-> predict value:' + str(svm.predict(x_test.iloc[[2], :])))

# Confusion matrix on the test dataset

cm_svm=confusion_matrix(y_test_01,svm.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_svm,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, svm.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, svm.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, svm.predict(x_test)))

# Test for the train dataset:

cm_svm_train=confusion_matrix(y_train_01,svm.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_svm_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

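Conclusions: 1. According to the confusion matrix, the SVM misclassifies 22 samples on the training set, and 70 students are identified as wanting to go on to a master's degree.

2. On the test set, 8 samples are misclassified.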

5.3.4 Naive Bayes

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

nb = GaussianNB()
nb.fit(x_train, y_train_01)
print('score:', nb.score(x_test, y_test_01))
print('Real value of y_test_01[1]:' + str(y_test_01[1]) + '-> predict value:' + str(nb.predict(x_test.iloc[[1], :])))
print('Real value of y_test_01[2]:' + str(y_test_01[2]) + '-> predict value:' + str(nb.predict(x_test.iloc[[2], :])))

# Confusion matrix on the test dataset

cm_nb=confusion_matrix(y_test_01,nb.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_nb,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, nb.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, nb.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, nb.predict(x_test)))

# Test for the train dataset:

cm_nb_train=confusion_matrix(y_train_01,nb.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_nb_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

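Conclusions: 1. According to the confusion matrix, naive Bayes misclassifies 20 samples on the training set, and 78 students are identified as wanting to go on to a master's degree.

2. On the test set, 7 samples are misclassified.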

5.3.5 Random forest classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rfc = RandomForestClassifier(n_estimators=100, random_state=1)
rfc.fit(x_train, y_train_01)
print('score:', rfc.score(x_test, y_test_01))
print('Real value of y_test_01[1]:' + str(y_test_01[1]) + '-> predict value:' + str(rfc.predict(x_test.iloc[[1], :])))
print('Real value of y_test_01[2]:' + str(y_test_01[2]) + '-> predict value:' + str(rfc.predict(x_test.iloc[[2], :])))

# Confusion matrix on the test dataset

cm_rfc=confusion_matrix(y_test_01,rfc.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_rfc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, rfc.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, rfc.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, rfc.predict(x_test)))

# Test for the train dataset:

cm_rfc_train=confusion_matrix(y_train_01,rfc.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_rfc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

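Conclusions: 1. According to the confusion matrix, the random forest misclassifies 0 samples on the training set, and 88 students are identified as wanting to go on to a master's degree.

2. On the test set, 5 samples are misclassified.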

5.3.6 Decision tree classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3)
dtc.fit(x_train, y_train_01)
print('score:', dtc.score(x_test, y_test_01))
print('Real value of y_test_01[1]:' + str(y_test_01[1]) + '-> predict value:' + str(dtc.predict(x_test.iloc[[1], :])))
print('Real value of y_test_01[2]:' + str(y_test_01[2]) + '-> predict value:' + str(dtc.predict(x_test.iloc[[2], :])))

# Confusion matrix on the test dataset

cm_dtc=confusion_matrix(y_test_01,dtc.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_dtc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, dtc.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, dtc.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, dtc.predict(x_test)))

# Test for the train dataset:

cm_dtc_train=confusion_matrix(y_train_01,dtc.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_dtc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

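Conclusions: 1. According to the confusion matrix, the decision tree misclassifies 20 samples on the training set, and 78 students are identified as wanting to go on to a master's degree.

2. On the test set, 7 samples are misclassified.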

5.3.7 K-nearest neighbors classifier

from sklearn.neighbors import KNeighborsClassifier

# Try k = 1..49 and record the test accuracy for each value
scores = []
for each in range(1, 50):
    knn_n = KNeighborsClassifier(n_neighbors=each)
    knn_n.fit(x_train, y_train_01)
    scores.append(knn_n.score(x_test, y_test_01))

plt.plot(range(1,50),scores)

plt.xlabel('k')

plt.ylabel('Accuracy')

plt.show()

knn= KNeighborsClassifier(n_neighbors=7)

knn.fit(x_train, y_train_01)
print('score 7 :', knn.score(x_test, y_test_01))
print('Real value of y_test_01[1]:' + str(y_test_01[1]) + '-> predict value:' + str(knn.predict(x_test.iloc[[1], :])))
print('Real value of y_test_01[2]:' + str(y_test_01[2]) + '-> predict value:' + str(knn.predict(x_test.iloc[[2], :])))

from sklearn.metrics import confusion_matrix

cm_knn=confusion_matrix(y_test_01,knn.predict(x_test))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_knn,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)

plt.title('Test for Test dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

from sklearn.metrics import recall_score, precision_score, f1_score
print('precision_score is :', precision_score(y_test_01, knn.predict(x_test)))
print('recall_score is :', recall_score(y_test_01, knn.predict(x_test)))
print('f1_score is :', f1_score(y_test_01, knn.predict(x_test)))

# Test for the train dataset:

cm_knn_train=confusion_matrix(y_train_01,knn.predict(x_train))

f,ax= plt.subplots(figsize=(5,5))

sns.heatmap(cm_knn_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)

plt.title('Test for Train dataset')

plt.xlabel('predicted y values')

plt.ylabel('real y value')

plt.show()

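Conclusions: 1. According to the confusion matrix, K-nearest neighbors misclassifies 22 samples on the training set, and 71 students are identified as wanting to go on to a master's degree.

2. On the test set, 7 samples are misclassified.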
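The search over k in this section scores every candidate on the test set. One commonly used alternative, shown here as a swapped-in technique rather than the post's method, is to pick k by cross-validation so the test set stays untouched until the final evaluation. A minimal sketch on synthetic data from `make_classification`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for x_train / y_train_01.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4, random_state=0)

k_values = range(1, 30, 2)  # odd k values avoid ties in binary voting
mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    mean_scores.append(cross_val_score(knn, X, y, cv=5).mean())

best_k = list(k_values)[int(np.argmax(mean_scores))]
print("best k:", best_k)
```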

5.3.8 Comparison of the classifiers

y =np.array([lrc.score(x_test,y_test_01),svm.score(x_test,y_test_01),nb.score(x_test,y_test_01),

dtc.score(x_test,y_test_01),rfc.score(x_test,y_test_01),knn.score(x_test,y_test_01)])

x= np.arange(6)

plt.bar(x,y)

plt.title('Comparison of Classification Algorithms')

plt.xlabel('Classification')

plt.ylabel('Score')

plt.xticks(x,("LogisticReg.","SVM","GNB","Dec.Tree","Ran.Forest","KNN"))

plt.show()

Conclusion: random forest and naive Bayes both achieve relatively high scores.
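The bar chart compares accuracy only, while the earlier sections also print precision, recall, and F1. When the positive class is the minority, these metrics can tell a different story from accuracy. A minimal sketch with made-up labels (1 true positive, 2 false negatives, 1 false positive, 6 true negatives):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 3 actual positives out of 10 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.7 even though only one of the three positives is found, which is why F1 is worth reporting alongside the score.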

5.4 Clustering algorithms

5.4.1 Prepare the data

df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')

df = df.rename(columns={'Chance of Admit ': 'Chance of Admit'})
serialNo = df['Serial No.']
df.drop(['Serial No.'], axis=1, inplace=True)

# Min-max normalize every column to the range [0, 1]
df = (df - df.min()) / (df.max() - df.min())

y= df['Chance of Admit']

x= df.drop(['Chance of Admit'],axis=1)

5.4.2 Dimensionality reduction

from sklearn.decomposition import PCA

# Project the features onto a single principal component
pca = PCA(n_components=1, whiten=True)
pca.fit(x)
x_pca = pca.transform(x)
x_pca = x_pca.reshape(400)

dictionary = {'x': x_pca, 'y': y}
data = pd.DataFrame(dictionary)
print('pca data:', data.head())
print()
print('original data:', df.head())
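Reducing several features to one component discards some variance, and `explained_variance_ratio_` reports how much is kept. A minimal sketch on synthetic correlated columns (not the admissions features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Three strongly correlated synthetic columns plus a little independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(400, 1))
X = np.hstack([base, 0.8 * base, 0.5 * base]) + 0.1 * rng.normal(size=(400, 3))

pca = PCA(n_components=1, whiten=True)
x_pca = pca.fit_transform(X).reshape(-1)   # flatten (400, 1) to (400,)
retained = pca.explained_variance_ratio_[0]

print(f"variance retained by 1 component: {retained:.3f}")
```

With `whiten=True` the projected values are rescaled to unit variance, so the x axis of the later scatter plots is in standardized units rather than original feature units.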

5.4.3 K-means clustering

from sklearn.cluster import KMeans

# Elbow method: record the within-cluster sum of squares (WCSS) for k = 1..14
wcss = []
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,15),wcss)

plt.xlabel('Kmeans')

plt.ylabel('WCSS')

plt.show()

df["Serial No."] =serialNo

kmeans= KMeans(n_clusters=3)

clusters_knn=kmeans.fit_predict(x)

df['label_kmeans'] =clusters_knn

plt.scatter(df[df.label_kmeans== 0 ]["Serial No."],df[df.label_kmeans == 0]['Chance of Admit'],color = "red")

plt.scatter(df[df.label_kmeans== 1 ]["Serial No."],df[df.label_kmeans == 1]['Chance of Admit'],color = "blue")

plt.scatter(df[df.label_kmeans== 2 ]["Serial No."],df[df.label_kmeans == 2]['Chance of Admit'],color = "green")

plt.title("K-means Clustering")

plt.xlabel("Candidates")

plt.ylabel("Chance of Admit")

plt.show()

plt.scatter(data.x[df.label_kmeans== 0 ],data[df.label_kmeans == 0].y,color = "red")

plt.scatter(data.x[df.label_kmeans== 1 ],data[df.label_kmeans == 1].y,color = "blue")

plt.scatter(data.x[df.label_kmeans== 2 ],data[df.label_kmeans == 2].y,color = "green")

plt.title("K-means Clustering")

plt.xlabel("X")

plt.ylabel("Chance of Admit")

plt.show()

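Conclusion: the dataset splits into three groups. Some students have decided to continue to a master's degree, some give up, and some are hesitant but still fairly likely to pursue further study.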
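The elbow in the WCSS plot is read by eye. A complementary check, swapped in here and not used in the original post, is the silhouette score, which is highest when clusters are compact and well separated. A minimal sketch on three synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (not the admissions data).
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 0), (0, 10)],
                  cluster_std=0.6, random_state=0)

wcss, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    wcss[k] = km.inertia_                        # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

On real data the two criteria do not always agree, so it is reasonable to look at both before fixing k.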

5.4.4 Hierarchical clustering

from scipy.cluster.hierarchy import linkage, dendrogram

merg= linkage(x,method='ward')

dendrogram(merg,leaf_rotation=90)

plt.xlabel('data points')

plt.ylabel('euclidean distance')

plt.show()

from sklearn.cluster import AgglomerativeClustering

# Ward linkage always uses Euclidean distance, so no separate distance argument is needed
hiyerartical_cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')

clusters_hiyerartical=hiyerartical_cluster.fit_predict(x)

df['label_hiyerartical'] =clusters_hiyerartical

plt.scatter(df[df.label_hiyerartical== 0 ]["Serial No."],df[df.label_hiyerartical == 0]['Chance of Admit'],color = "red")

plt.scatter(df[df.label_hiyerartical== 1 ]["Serial No."],df[df.label_hiyerartical == 1]['Chance of Admit'],color = "blue")

plt.scatter(df[df.label_hiyerartical== 2 ]["Serial No."],df[df.label_hiyerartical == 2]['Chance of Admit'],color = "green")

plt.title('Hierarchical Clustering')

plt.xlabel('Candidates')

plt.ylabel('Chance of Admit')

plt.show()

plt.scatter(data[df.label_hiyerartical== 0].x,data.y[df.label_hiyerartical==0],color='red')

plt.scatter(data[df.label_hiyerartical== 1].x,data.y[df.label_hiyerartical==1],color='blue')

plt.scatter(data[df.label_hiyerartical== 2].x,data.y[df.label_hiyerartical==2],color='green')

plt.title('Hierarchical Clustering')

plt.xlabel('X')

plt.ylabel('Chance of Admit')

plt.show()

Conclusion: the hierarchical clustering result agrees with the K-means result; it simply confirms the choice of k = 3 clusters.
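Agreement between two clusterings can be quantified rather than judged from the plots. The adjusted Rand index compares two label assignments while ignoring how the cluster ids are numbered: 1.0 means identical groupings and values near 0 mean chance-level agreement. A minimal sketch on synthetic blobs (not the admissions data):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic blobs standing in for the feature matrix x.
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 0), (0, 10)],
                  cluster_std=0.6, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Label numbering may differ between the two methods; the index is invariant to that.
agreement = adjusted_rand_score(kmeans_labels, ward_labels)
print(f"adjusted Rand index: {agreement:.3f}")
```

Applied to the post's data, the presumed call would be `adjusted_rand_score(df['label_kmeans'], df['label_hiyerartical'])`.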

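Summary: working through this introductory dataset teaches the following:

1. Several ways to visualize features

2. How to call the sklearn API

3. How to compare the quality of different models

Code + dataset: https://github.com/Mounment/python-data-analyze/tree/master/kaggle/score

If you find it useful, please remember to give it a star. Thanks!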
