【数据挖掘】主成分分析Python实现

前言

是对一个数据挖掘作业的记录,数据集是老师提供的几种癌症的数据,我是直接在Jupyter中写的,中间会输出一些内容验证之类的

主要是参照这位写的:参考的大佬

数据预处理

导入库

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

读取数据

# 读取数据,查看数据的规模
BLCA_data = pd.read_csv(r'数据集/BLCA/rna.csv')
print('BLCA',BLCA_data.shape)
BRCA_data = pd.read_csv(r'数据集/BRCA/rna.csv')
print('BRCA',BRCA_data.shape)
KIRC_data = pd.read_csv(r'数据集/KIRC/rna.csv')
print('KIRC',KIRC_data.shape)
LUAD_data = pd.read_csv(r'数据集/LUAD/rna.csv')
print('LUAD',LUAD_data.shape)
PAAD_data = pd.read_csv(r'数据集/PAAD/rna.csv')
print('PAAD',PAAD_data.shape)
BLCA (3217, 400)
BRCA (3217, 1032)
KIRC (3217, 489)
LUAD (3217, 491)
PAAD (3217, 177)
# 查看其中一种类型癌症数据集的前6行
BLCA_data.head(6)
gene_idTCGA-HQ-A2OFTCGA-GU-A767TCGA-ZF-AA4RTCGA-DK-A1ACTCGA-DK-A3ITTCGA-GC-A3RDTCGA-BT-A0YXTCGA-FD-A6TETCGA-E7-A5KE...TCGA-BT-A0S7TCGA-K4-A6FZTCGA-E5-A2PCTCGA-DK-AA6QTCGA-XF-AAMYTCGA-CU-A0YOTCGA-E7-A7PWTCGA-S5-A6DXTCGA-GD-A76BTCGA-ZF-AA4V
0A2BP1|54715-0.402918-1.236959-1.453230-1.211589-1.288438-1.262086-1.033565-1.183245-1.076254...-1.282740-1.023637-0.366722-1.161285-1.380649-1.205136-0.875586-1.059148-1.393928-1.215641
1A2ML1|1445680.7175020.7378911.5846431.4719320.4140151.8210842.5158980.7238300.668703...0.1639031.3182290.8791350.5136031.5715571.6791991.222708-0.0798021.4803931.947340
2ACTL6B|51412-1.185185-1.403906-1.453230-1.211589-1.288438-1.262086-1.196075-1.183245-0.937518...-1.282740-1.291842-1.036778-1.161285-1.380649-1.368904-1.567590-1.425863-1.393928-1.215641
3ADAM6|87550.3738261.4112792.0922823.1864382.9674571.4371132.7237752.3200120.736963...2.3754131.9643691.2020001.8244192.1850512.1758491.4463523.5918723.2668182.514385
4ADAMDEC1|27299-0.489828-0.894004-0.2051691.0151790.0895400.1149030.6644650.138955-1.076254...-0.408256-0.063113-0.1471730.177490-0.2540070.173888-0.9517120.5927500.2603820.758897
5ALDH1A3|2200.4201700.5634041.0642570.4608611.5182041.8093201.2309121.4882750.809302...1.1800981.7519190.0348351.5120290.8514591.6452710.3982170.599149-0.0642561.339531

6 rows × 400 columns

# 查看另一种癌症数据集的前6行
BRCA_data.head(6)
gene_idTCGA-D8-A1XJTCGA-EW-A6SATCGA-D8-A27HTCGA-E9-A247TCGA-AC-A2FGTCGA-B6-A0RPTCGA-BH-A0DHTCGA-AN-A0AMTCGA-D8-A1XZ...TCGA-BH-A18GTCGA-E2-A15JTCGA-BH-A1F6TCGA-EW-A1P8TCGA-AO-A1KTTCGA-A2-A0D0TCGA-3C-AALKTCGA-AO-A1KSTCGA-EW-A1OWTCGA-OL-A5RU
0A2BP1|54715-1.069946-1.075002-1.377040-1.307689-1.292405-1.231926-1.231272-1.396638-1.229635...-1.079587-1.018372-1.355431-1.156045-1.337455-1.189837-1.467466-1.339815-1.315667-1.421373
1A2ML1|1445680.255748-0.8772501.623317-1.307689-0.718106-1.135720-1.2312720.2535690.136528...-0.496468-1.1358401.8704740.995603-0.732074-0.500924-1.075625-0.7083170.481910-0.781841
2ACTL6B|51412-1.253764-1.075002-1.377040-1.307689-1.458347-1.231926-1.231272-1.396638-1.229635...-1.079587-1.135840-1.200037-1.277135-1.337455-1.189837-1.467466-1.339815-1.162777-1.421373
3ADAM6|87551.9493212.6036302.0166461.9368392.8512062.1593153.4946722.6651963.116796...1.5030651.5235133.2383822.4766802.0394491.6424552.5295461.5635752.5194193.503925
4ADAMDEC1|27299-0.219076-0.343607-0.1101690.410611-0.181777-0.254802-0.0005190.2185260.521942...-0.0276350.0736841.2149230.885440-0.1456220.4669230.3306900.4859080.6840710.723861
5ALDH1A3|2201.1385330.2906962.2392960.7558491.3271230.7404711.1014511.0915980.884671...2.2519041.1351051.9782011.8279460.8346571.2521291.1592140.9497911.2150540.805760

6 rows × 1032 columns

经过上面的分析可以知道对于不同的数据集,它们的行数和行标签都是相同的,但列数和列标签都是不同的
所以对于这些数据集来说,行标签可以看作是数据的不同特征,而每一列则对应一个样本的数据

保留第一个数据集的第一列,然后将其与后面几个数据集除第一列之外的列进行合并
则:

BLCA:1-400

BRCA:401-1431

KIRC:1432-1919

LUAD:1920-2409

PAAD:2410-2585

将五种癌症数据集合并

# 将5种癌症数据集的除gene_id列之外的列进行合并
DataSet = pd.concat([BLCA_data,BRCA_data.iloc[:,1:],KIRC_data.iloc[:,1:],LUAD_data.iloc[:,1:],PAAD_data.iloc[:,1:]], axis=1)
DataSet.head(6)
gene_idTCGA-HQ-A2OFTCGA-GU-A767TCGA-ZF-AA4RTCGA-DK-A1ACTCGA-DK-A3ITTCGA-GC-A3RDTCGA-BT-A0YXTCGA-FD-A6TETCGA-E7-A5KE...TCGA-FB-A7DRTCGA-H8-A6C1TCGA-LB-A8F3TCGA-IB-7893TCGA-2J-AABFTCGA-US-A77JTCGA-F2-7273TCGA-HZ-A8P0TCGA-3A-A9IHTCGA-2L-AAQA
0A2BP1|54715-0.402918-1.236959-1.453230-1.211589-1.288438-1.262086-1.033565-1.183245-1.076254...-1.046787-1.561721-1.256692-1.345364-1.279792-1.621289-1.054519-1.455548-1.464518-1.471927
1A2ML1|1445680.7175020.7378911.5846431.4719320.4140151.8210842.5158980.7238300.668703...-1.226226-0.680184-1.0383661.252544-0.752025-1.621289-1.5047790.6809930.666947-0.915410
2ACTL6B|51412-1.185185-1.403906-1.453230-1.211589-1.288438-1.262086-1.196075-1.183245-0.937518...-1.126668-1.141486-0.768676-1.013273-0.542150-0.338250-0.368487-1.455548-0.454760-1.471927
3ADAM6|87550.3738261.4112792.0922823.1864382.9674571.4371132.7237752.3200120.736963...2.5005392.5343271.8954972.5237382.7714873.5085402.5030073.0311952.0188572.171331
4ADAMDEC1|27299-0.489828-0.894004-0.2051691.0151790.0895400.1149030.6644650.138955-1.076254...0.3816450.286921-0.266635-0.1615850.3809470.7720600.254188-0.122049-0.5961580.098888
5ALDH1A3|2200.4201700.5634041.0642570.4608611.5182041.8093201.2309121.4882750.809302...1.5305321.1073930.6186891.2324311.0621131.0352721.2203921.0525040.9858690.812195

6 rows × 2585 columns

对数据集进行转置,之后每一行对应一个样本数据,每一列表示一个样本特征

DataSet = DataSet.T
DataSet.head(6)
0123456789...3207320832093210321132123213321432153216
gene_idA2BP1|54715A2ML1|144568ACTL6B|51412ADAM6|8755ADAMDEC1|27299ALDH1A3|220ATCAY|85300ATP10B|23120BCAS1|8537C10orf105|414152...SLC38A5|92745SLC44A3|126969SSX4|6759TMEM200B|399474TMEM26|219623TNFRSF10D|8793XAF1|54739YJEFN3|374887ZMAT1|84460ZNF415|55786
TCGA-HQ-A2OF-0.4029180.717502-1.1851850.373826-0.4898280.42017-1.1851851.8414921.327818-0.301422...0.2758461.228001-0.959773-0.5305590.6370471.3397060.5203050.235093-0.3839920.795605
TCGA-GU-A767-1.2369590.737891-1.4039061.411279-0.8940040.563404-1.2369591.3099582.0158290.416162...0.3643311.484526-1.4039060.689511-0.4742660.3785060.9946470.7765980.3343630.577015
TCGA-ZF-AA4R-1.453231.584643-1.453232.092282-0.2051691.064257-1.2679460.2074750.764633-0.786075...1.3500111.03154-1.453230.867536-0.1455150.229991.2100680.3030780.912720.70952
TCGA-DK-A1AC-1.2115891.471932-1.2115893.1864381.0151790.460861-0.9958341.5763760.546906-0.855604...0.6376911.190068-0.3486510.0346410.0309730.093971.0895230.5838130.5031980.733853
TCGA-DK-A3IT-1.2884380.414015-1.2884382.9674570.089541.518204-1.159105-0.0351451.5758490.220414...0.5981911.479815-1.2884380.921235-0.30770.4860691.3115230.4831390.2625080.589681

6 rows × 3217 columns

# 此时第一行为特征标签,将非数据的这一行去掉
DataSet = DataSet.iloc[1:,:]
DataSet.iloc[:6,:]
0123456789...3207320832093210321132123213321432153216
TCGA-HQ-A2OF-0.4029180.717502-1.1851850.373826-0.4898280.420170-1.1851851.8414921.327818-0.301422...0.2758461.228001-0.959773-0.5305590.6370471.3397060.5203050.235093-0.3839920.795605
TCGA-GU-A767-1.2369590.737891-1.4039061.411279-0.8940040.563404-1.2369591.3099582.0158290.416162...0.3643311.484526-1.4039060.689511-0.4742660.3785060.9946470.7765980.3343630.577015
TCGA-ZF-AA4R-1.4532301.584643-1.4532302.092282-0.2051691.064257-1.2679460.2074750.764633-0.786075...1.3500111.031541-1.4532300.867536-0.1455150.2299901.2100680.3030780.9127200.709520
TCGA-DK-A1AC-1.2115891.471932-1.2115893.1864391.0151790.460861-0.9958341.5763760.546906-0.855604...0.6376911.190068-0.3486510.0346410.0309730.0939701.0895230.5838130.5031980.733853
TCGA-DK-A3IT-1.2884380.414015-1.2884382.9674570.0895401.518204-1.159105-0.0351451.5758490.220414...0.5981911.479815-1.2884380.921235-0.3077000.4860691.3115220.4831390.2625080.589681
TCGA-GC-A3RD-1.2620871.821084-1.2620871.4371130.1149031.809320-0.9936030.8784512.0582530.036055...1.0695561.383669-0.2155580.234100-0.3244770.3796891.323748-0.0261470.6381230.096817

6 rows × 3217 columns

# 重新设置数据类型,此时DataSet为一个2584x3217的浮点数矩阵
DataSet = DataSet.astype('float32')
print(DataSet.shape)
(2584, 3217)

进行主成分分析

计算样本均值

Mean_vec = np.mean(DataSet, axis=0)
print('样本的均值:\n',Mean_vec)
样本的均值:
 0      -1.161548
1      -0.474133
2      -1.251174
3       2.412248
4       0.266737
          ...   
3212    0.521786
3213    1.303051
3214    0.133573
3215    0.774802
3216    0.601425
Length: 3217, dtype: float32

计算样本协方差矩阵

'''
# 纯数据运算
Cov_mat = (DataSet-Mean_vec).T.dot(DataSet-Mean_vec)/(DataSet.shape[0]-1)
print('协方差矩阵:\n',Cov_mat)
'''
# 使用numpy中计算协方差的函数
'''
协方差矩阵用于衡量两个变量之间相互依赖的程度
由于有3217个feature,故协方差矩阵的规模为3217x3217
'''
Cov_mat = np.cov(DataSet.T)
print('样本协方差矩阵:\n',Cov_mat)
Cov_mat.shape
样本协方差矩阵:
 [[ 0.12676448 -0.02361515  0.02157561 ...  0.00519793  0.00904992
   0.00673015]
 [-0.02361515  1.00118482 -0.0053748  ...  0.07527039 -0.17714864
   0.0090866 ]
 [ 0.02157561 -0.0053748   0.13757093 ...  0.00533166 -0.00134267
   0.00309703]
 ...
 [ 0.00519793  0.07527039  0.00533166 ...  0.22596547  0.02969978
   0.01703536]
 [ 0.00904992 -0.17714864 -0.00134267 ...  0.02969978  0.20364439
   0.04410373]
 [ 0.00673015  0.0090866   0.00309703 ...  0.01703536  0.04410373
   0.13098872]]
(3217, 3217)

计算特征值和特征向量

EigenValues, EigenVector = np.linalg.eig(Cov_mat)
# 数据特征量较大,这里计算结果为复数,统一取其实部
EigenValues = EigenValues.real
EigenVector = EigenVector.real
print('特征值:\n',EigenValues)
print('特征向量:\n',EigenVector)
特征值:
 [ 2.31279350e+02  1.46770122e+02  8.81143724e+01 ...  2.86122668e-17
  2.84073038e-17 -3.12152010e-17]
特征向量:
 [[ 0.00030864 -0.00113702 -0.00123991 ... -0.00242919 -0.00277321
   0.00190791]
 [-0.02456387  0.02533867 -0.05603828 ...  0.00124872  0.00140361
  -0.00075202]
 [-0.00023562 -0.00033786 -0.00340553 ... -0.00034539 -0.00039286
   0.00087827]
 ...
 [-0.00238575  0.00450426 -0.01746374 ... -0.00422026 -0.00422903
   0.00272333]
 [ 0.00781593 -0.02092816  0.00775456 ...  0.01314091  0.01272775
   0.01905482]
 [-0.0030166  -0.01263667 -0.0111954  ...  0.02344654  0.03139416
   0.01099676]]
# EigenPairs[i][0]表示一个特征值,EigenPairs[i][1]表示该特征值所对应的特征向量
EigenPairs = [(np.abs(EigenValues[i]), EigenVector[:,i]) for i in range(len(EigenValues))]

EigenPairs.sort(key=lambda x:x[0], reverse=True)
print('特征值降序排列:')
for i in EigenPairs:
    print(i[0])
特征值降序排列:
231.2793503404339
146.77012152572306
88.11437236590861
......
1.6398199526831266e-18
1.4276145590349296e-18
1.4276145590349296e-18

计算累计方差贡献率

# 计算方差贡献率
tot = sum(EigenValues)
var_exp = [(i/tot)*100 for i in sorted(EigenValues, reverse=True)]
print(var_exp)
# 累计方差贡献率
Cum_var_exp = np.cumsum(var_exp)
Cum_var_exp
[18.501009256397158, 11.740760136661512, ... , -1.3125279499634018e-16, -2.0754934203007648e-16]
array([ 18.50100926,  30.24176939,  37.29040913, ...,  100.,  100. ,  100.])

特征向量选取

## 这里为方便作图,选取特征值较大的前两维的特征向量
Matrix = np.hstack((EigenPairs[0][1].reshape(DataSet.shape[1], 1),
                    EigenPairs[1][1].reshape(DataSet.shape[1], 1)))
print('Matrix:\n',Matrix)
Matrix:
 [[ 0.00030864 -0.00113702]
 [-0.02456387  0.02533867]
 [-0.00023562 -0.00033786]
 ...
 [-0.00238575  0.00450426]
 [ 0.00781593 -0.02092816]
 [-0.0030166  -0.01263667]]

结果可视化

数据降维

# tmp主要是将DaraSet转成array类型
HighDimSet = DataSet.iloc[:,:].values
# 将原数据矩阵乘以选取的特征向量进行降维
LowDimSet = HighDimSet.dot(Matrix)
LowDimSet
array([[ -2.78179211,   7.81840477],
       [-13.39504631,   7.30085955],
       [-16.8183728 ,   3.87839756],
       ...,
       [ -5.81835218,   8.30323684],
       [ -8.52738732,   8.17363152],
       [ -6.97177986,  11.89669388]])

降维前

# LabelSet标识哪一行样本属于哪种癌症
LabelSet = ['BLCA']*399+['BRCA']*1031+['KIRC']*488+['LUAD']*490+['PAAD']*176
LabelSet = np.array(LabelSet)
LabelSet
array(['BLCA', 'BLCA', 'BLCA', ..., 'PAAD', 'PAAD', 'PAAD'], dtype='<U4')
# 选取两种特征对不同的癌症种类进行区分
plt.figure(figsize=(10,6), dpi=80)
plt.xlim(-3,3)
plt.ylim(-3,3)
for lab,color in zip(('BLCA', 'BRCA', 'KIRC', 'LUAD', 'PAAD'),
                     ('red','yellow','green','blue','purple')):
    plt.scatter(HighDimSet[LabelSet==lab,0],
                HighDimSet[LabelSet==lab,1],
                label=lab,c=color)
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.legend(loc='upper right')
plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-rE00qtLi-1666523848432)(output_34_0.png)]

降维后

plt.figure(figsize=(10,6), dpi=80)
plt.xlim(-25,40)
plt.ylim(-30,25)
for lab,color in zip(('BLCA', 'BRCA', 'KIRC', 'LUAD', 'PAAD'),
                     ('red','yellow','green','blue','purple')):
    plt.scatter(LowDimSet[LabelSet==lab,0],
                LowDimSet[LabelSet==lab,1],
               label=lab,c=color)
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.legend(loc='upper right')
plt.show()

在这里插入图片描述

写在最后

文章仅作记录,至于原理还有很多不懂的地方,结果我也不知道该是什么样的,把用这么多维特征来区分的事物降到两维来进行区分,我自己感觉已经很神奇了哈哈!

  • 9
    点赞
  • 28
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

鱼树C

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值