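Objective: Gain a basic understanding of big data.
Task: Perform principal component analysis (PCA) on the provided dataset, 农村居民人均可支配收入来源2016 (sources of per capita disposable income of rural residents, 2016), and analyze the results.
Code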
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
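def zhuchengfenfenxi(X_T):
    # PCA here expects each dimension (attribute) as a row vector, so the
    # input X_T (a samples-by-attributes matrix) is transposed first.
    X = np.array(X_T, dtype=float).T  # transposed float copy; the caller's array is left unchanged
    for i in range(len(X)):  # zero-center each row (each dimension) of X
        X[i] = X[i] - np.mean(X[i])  # subtract the row mean
    C = X.dot(X.T) / (len(X[0]) - 1)  # covariance matrix: Cov = (X X^T) / (n - 1)
    l1, l2 = np.linalg.eigh(C)  # eigenvalues and eigenvectors (columns of l2); eigh because C is symmetric
    l = [(l1[i], l2.T[i]) for i in range(len(l1))]  # pair each eigenvalue with its eigenvector
    l.sort(key=lambda x: x[0], reverse=True)  # sort by eigenvalue, largest first
    U = np.array([list(i[1]) for i in l])  # drop the eigenvalues; rows of U are the sorted eigenvectors
    matrix = U.dot(X)  # coordinates in the new basis: row i of U projects X onto component i
    corr = np.round(np.corrcoef(X, matrix), 4)  # correlation coefficients, rounded to four decimals
    return matrix.T, corr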
df = pd.read_excel("C:\\Users\\hp\\Desktop\\农村居民人均可支配收入来源2016.xlsx", header=0, index_col=0)  # read the file into df
np.set_printoptions(suppress=True)  # print without scientific notation
print(df)  # show the raw data
df_np = np.array(df)
data, corr = zhuchengfenfenxi(df_np)  # pass all values to the PCA function
print('\nPrincipal components:===========================================\n', data)
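# Correlation between the original attributes and the principal components
chart = {'yuanshuxing': []}  # dict used to build the table ('yuanshuxing' = original attribute)
for i in range(len(data[0])):  # 4 component columns plus 1 column of original attribute names
    chart['yuanshuxing'].append(df.columns[i])  # add the name of original attribute i
    chart['F' + str(i)] = corr[len(data[0]) + i, :len(data[0])]  # correlation of component i with each attribute
df_corr = pd.DataFrame(chart)
# Draw the correlation table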
fig, ax = plt.subplots(figsize=(10, 6))
ax.axis('off')
table = ax.table(cellText=df_corr.values, colLabels=df_corr.columns, cellLoc='center', loc='center', edges='horizontal')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.2)
plt.title('Correlation between original attributes and principal components')
plt.show()
# Heatmap
plt.figure(figsize=(10, 8))
heatmap = plt.pcolor(df_corr.iloc[:, 1:].values, cmap='coolwarm')
plt.xticks(np.arange(0.5, len(df_corr.columns) - 1), df_corr.columns[1:], rotation=90)
plt.yticks(np.arange(0.5, len(df_corr['yuanshuxing'])), df_corr['yuanshuxing'])
plt.colorbar(heatmap, label='Correlation')
plt.title('Correlation Heatmap')
plt.show()
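# Print the original attributes that are highly correlated with each principal component
print('Principal component 0 is highly correlated with wage income and net property income')
print('Principal component 1 is highly correlated with net business income')
print('Principal component 2 is highly correlated with net transfer income')
print('Principal component 3 is highly correlated with wage income and net property income')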
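As a sanity check on the procedure above, the following is a minimal sketch rather than the script used in this experiment. For a covariance-based PCA, the component scores should be mutually uncorrelated and their variances should equal the sorted eigenvalues. The function name `pca_scores` and the synthetic array `demo` are illustrative assumptions; the real spreadsheet is not read here.

```python
import numpy as np

def pca_scores(data):
    """Return (scores, eigenvalues) for a samples-by-attributes array."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)                   # zero-center each attribute (column)
    C = Xc.T @ Xc / (Xc.shape[0] - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh because C is symmetric
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                     # column j holds the scores of component j
    return scores, eigvals

# Synthetic stand-in data (assumption): 31 samples, 4 attributes on different scales.
rng = np.random.default_rng(0)
demo = rng.normal(size=(31, 4)) * np.array([4000.0, 1500.0, 300.0, 600.0])

scores, eigvals = pca_scores(demo)
score_cov = np.cov(scores, rowvar=False)      # should be diagonal, equal to diag(eigvals)
explained = eigvals / eigvals.sum()           # share of total variance per component
print(np.round(explained, 4))
```

If the off-diagonal entries of `score_cov` are far from zero, the projection step is the first place to look. The explained-variance ratio also shows how many components are worth interpreting.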
Run results:
E:\Anaconda\envs\pythonProject1\python.exe E:\pythonProject1\dashuju.py
gongzishouru jingyingshouru caichanshouru zhuanyijingshouru
address
北京 16637.5 2061.9 1350.1 2260.0
天津 12048.1 5309.4 893.7 1824.4
河北 6263.2 3970.0 257.5 1428.6
山西 5204.4 2729.9 149.0 1999.1
内蒙古 2448.9 6215.7 452.6 2491.7
辽宁 5071.2 5635.5 257.6 1916.4
吉林 2363.1 7558.9 231.8 1969.1
黑龙江 2430.5 6425.9 572.7 2402.6
上海 18947.9 1387.9 859.6 4325.0
江苏 8731.7 5283.1 606.0 2984.8
浙江 14204.3 5621.9 661.8 2378.1
安徽 4291.4 4596.1 186.7 2646.2
福建 6785.2 5821.5 255.7 2136.9
江西 4954.7 4692.3 204.4 2286.4
山东 5569.1 6266.6 358.7 1759.7
河南 4228.0 4643.2 168.0 2657.6
湖北 4023.0 5534.0 158.6 3009.3
湖南 4946.2 4138.6 143.1 2702.5
广东 7255.3 3883.6 365.8 3007.5
广西 2848.1 4759.2 149.2 2603.0
海南 4764.9 5315.7 139.1 1623.1
重庆 3965.6 4150.1 295.8 3137.3
四川 3737.6 4525.2 268.5 2671.8
贵州 3211.0 3115.8 67.1 1696.3
云南 2553.9 5043.7 152.2 1270.1
西藏 2204.9 5237.9 148.7 1502.3
陕西 3916.0 3057.9 159.0 2263.6
甘肃 2125.0 3261.4 128.4 1942.0
青海 2464.3 3197.0 325.2 2677.8
宁夏 3906.1 3937.5 291.8 1716.3
新疆 2527.1 5642.0 222.8 1791.3
Principal components:===========================================
[[11175.17747796 1145.31212344 642.42710542 1422.08187054]
[ 6241.20017059 -1473.51117445 -153.33283631 1008.2032531 ]
[ 765.25352128 564.78815198 -804.61833565 -112.3500915 ]
[ -137.69308575 1900.48909434 -245.9833908 -403.3232274 ]
[-3382.80191948 -1143.82227841 -31.66130276 80.32186517]
[ -670.25349718 -929.05941403 -442.55045685 -15.65986121]
[-3612.05120033 -2464.02231119 -605.12565627 -14.89659893]
[-3430.32994147 -1333.0129521 -130.13569903 218.65548144]
[13452.02666602 1388.7190569 2855.13886238 1025.02150966]
[ 2914.98390346 -1065.88474492 829.98666818 526.84175662]
[ 8306.00730144 -2110.89098324 505.07456239 944.68806537]
[-1336.38923149 171.11790561 282.38586963 -231.10271063]
[ 984.08516096 -1348.59877703 -137.44220208 107.12100035]
[ -674.36038211 -0.64675653 -44.74511439 -166.36722799]
[ -261.09855152 -1602.15690734 -596.03622886 176.54272133]
[-1405.30134944 130.92330895 288.78487097 -248.77153006]
[-1750.76169073 -729.66321542 596.29465814 -179.36974703]
[ -627.85087129 527.54773367 390.68944126 -279.49120279]
[ 1659.70914981 484.08973861 826.42129838 59.48726439]
[-2780.06785977 201.36901678 156.25683507 -340.45235706]
[ -906.01698684 -577.48261984 -739.07180973 -186.08078721]
[-1630.39166109 651.94896615 770.44151553 -183.04639871]
[-1878.84481668 323.27886353 280.22560126 -190.45588752]
[-2138.1707698 1787.67345363 -668.43993355 -570.20628923]
[-3032.11951723 -1.65349167 -1199.97656924 -337.61051391]
[-3416.9971888 -152.94653823 -994.0394551 -341.24740907]
[-1471.31816661 1744.02408916 -63.1456558 -437.26462557]
[-3248.01843791 1789.68452271 -487.24149817 -559.12584687]
[-2956.15375851 1808.00309747 266.04305426 -343.61354606]
[-1576.60275873 908.68350606 -642.9235171 -222.91398654]
[-3174.84970877 -594.30046461 -703.70068118 -205.61494266]]
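Principal component 0 is highly correlated with wage income and net property income
Principal component 1 is highly correlated with net business income
Principal component 2 is highly correlated with net transfer income
Principal component 3 is highly correlated with wage income and net property income
Process finished with exit code 0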
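
The four closing statements are hard-coded strings in the script. A minimal sketch of deriving such statements from a correlation table is shown below. It assumes a hypothetical cutoff of 0.7 for "highly correlated", and the numbers in `corr_table` are illustrative only, not values from this experiment. The helper name `highly_correlated` is also an assumption.

```python
import pandas as pd

def highly_correlated(corr_table, threshold=0.7):
    """Map each component column to the attributes with |r| >= threshold."""
    result = {}
    for comp in corr_table.columns:
        col = corr_table[comp]
        result[comp] = col[col.abs() >= threshold].index.tolist()
    return result

# Illustrative numbers only (assumption); rows are attributes, columns are components.
attrs = ["gongzishouru", "jingyingshouru", "caichanshouru", "zhuanyijingshouru"]
corr_table = pd.DataFrame(
    [[0.99, 0.05, 0.10, 0.02],
     [-0.60, 0.79, 0.05, 0.01],
     [0.75, 0.20, 0.30, 0.55],
     [0.30, 0.10, 0.93, 0.15]],
    index=attrs,
    columns=["F0", "F1", "F2", "F3"],
)

summary = highly_correlated(corr_table)
for comp, names in summary.items():
    print(comp, "->", names if names else "no attribute above the threshold")
```

With the script's own `df_corr`, the same helper could be applied after setting `yuanshuxing` as the index, so the printed conclusions follow the computed correlations.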