Big Data Analytics Notes 3

1 Why Principal Component Analysis?

PCA is useful to:

  • reduce the number of features (which might reduce overfitting)
  • reduce memory or disk storage required for features
  • speed up execution of subsequent modelling step(s)
  • visualize high-dimensional features in a few dimensions (e.g. 2 or 3)
  • find unknown structure in features via subsequent clustering
  • detect outliers
  • A typical data analysis workflow with PCA:
    • Scale the covariates
    • Split the data into training and test sets
    • Fit PCA on the training set, then apply the same projection to the test set
    • Build a model using the PCA features of the training set
    • Assess prediction accuracy using the PCA features of the test set
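
The workflow above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the notes' own code: the synthetic data, the choice of `k = 2` components, and the least-squares model are all assumptions made for the example. It also assumes that the scaling statistics and the PCA directions are estimated on the training set only and then reused on the test set, so that no test information leaks into the features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): 5 correlated features
# driven by 2 latent factors, plus a response that depends on the factors.
n, p, k = 200, 5, 2
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n, p))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=n)

# Split the data into training and test sets.
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Scale the covariates using training-set statistics only.
mu = X[train].mean(axis=0)
sd = X[train].std(axis=0, ddof=1)
Z_train = (X[train] - mu) / sd
Z_test = (X[test] - mu) / sd

# PCA on the training set: eigenvectors of the sample covariance matrix,
# sorted by decreasing eigenvalue; keep the first k as columns of A.
eigvals, eigvecs = np.linalg.eigh(np.cov(Z_train, rowvar=False))
order = np.argsort(eigvals)[::-1]
A = eigvecs[:, order[:k]]

# Project both sets onto the same training-set directions.
T_train = Z_train @ A
T_test = Z_test @ A


def design(T):
    """Add an intercept column to a matrix of PCA scores."""
    return np.column_stack([np.ones(len(T)), T])


# Build a model (ordinary least squares) on the training PCA features.
beta, *_ = np.linalg.lstsq(design(T_train), y[train], rcond=None)

# Assess prediction accuracy on the test PCA features.
pred = design(T_test) @ beta
test_mse = np.mean((y[test] - pred) ** 2)
print(f"test MSE: {test_mse:.4f}")
```

Any other model could replace the least-squares step; the point of the sketch is only the order of operations.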

2 Principal components

https://blog.csdn.net/MINGRAN_JIA/article/details/122464746

The number of principal components to keep can be chosen by:

  • setting a threshold on the proportion of variance to retain, e.g. 80%
  • Cattell’s or Kaiser’s methods
  • cross-validation
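
Two of these rules are easy to state in code. The sketch below is an illustration under stated assumptions: the function names `n_components_by_variance` and `n_components_by_kaiser` are hypothetical, the eigenvalues are made up, and Kaiser's rule is taken in its usual form (keep components whose eigenvalue exceeds 1), which presumes the eigenvalues come from a correlation matrix, i.e. from standardized features. Cattell's method is a visual inspection of the scree plot and is not coded here.

```python
import numpy as np


def n_components_by_variance(eigvals, threshold=0.8):
    """Smallest k whose leading eigenvalues explain >= threshold of the variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum_ratio = np.cumsum(lam) / lam.sum()
    # First position where the cumulative ratio reaches the threshold.
    return int(np.searchsorted(cum_ratio, threshold) + 1)


def n_components_by_kaiser(eigvals, cutoff=1.0):
    """Number of eigenvalues strictly above the cutoff (1 for a correlation matrix)."""
    lam = np.asarray(eigvals, dtype=float)
    return int(np.sum(lam > cutoff))


# Made-up eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
example_eigvals = [2.2, 1.5, 0.8, 0.3, 0.2]
k_variance = n_components_by_variance(example_eigvals, threshold=0.8)
k_kaiser = n_components_by_kaiser(example_eigvals)
print(k_variance, k_kaiser)
```

The two rules need not agree: here the cumulative proportions are 0.44, 0.74, 0.90, ..., so the 80% rule keeps three components while Kaiser's rule keeps two.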

2.1 Concept of PCA

Suppose $X$ is an $n\times p$ data matrix, where $p$ is the number of features, and $\Sigma=Var(X)=X^{\mathsf{T}}X$ is the $p\times p$ covariance matrix of $X$ (for column-centred $X$, and up to the factor $1/(n-1)$).

For the principal component $\vec{a_i}$, we want to project the data onto the direction $\vec{a_i}$ that maximises the variance $\Longrightarrow$ $\underset{\vec{a_i}}{max}\,Var(X\vec{a_i})=\underset{\vec{a_i}}{max}\,\vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}$. This has no upper bound, because rescaling $\vec{a_i}$ rescales the variance, so we add a constraint:

Constraint: $||\vec{a_i}||^2=\vec{a_i}^{\mathsf{T}}\vec{a_i}=1$. Solve using Lagrange multipliers and the spectral decomposition of $\Sigma$.
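
The Lagrange step can be written out as follows. This is the standard derivation, given here as a sketch; it writes the objective as $Var(X\vec{a_i})=\vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}$ and uses $\lambda$ for the multiplier, which is a notational assumption.

```latex
\begin{aligned}
L(\vec{a_i},\lambda) &= \vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}-\lambda\left(\vec{a_i}^{\mathsf{T}}\vec{a_i}-1\right) \\
\frac{\partial L}{\partial \vec{a_i}} &= 2\Sigma\vec{a_i}-2\lambda\vec{a_i}=0
\;\Longrightarrow\; \Sigma\vec{a_i}=\lambda\vec{a_i} \\
Var(X\vec{a_i}) &= \vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}=\lambda\,\vec{a_i}^{\mathsf{T}}\vec{a_i}=\lambda
\end{aligned}
```

So every stationary point is a unit eigenvector of $\Sigma$, and the variance it attains equals its eigenvalue. The first principal component is therefore the eigenvector with the largest eigenvalue, and the later ones follow in decreasing order of eigenvalue.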

2.2 Useful result of PCA

Corollary:

$\Delta=diag(\lambda_1,...,\lambda_p)$
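
As a numerical illustration of $\Delta$, the NumPy sketch below checks two standard facts on synthetic data: the spectral decomposition $\Sigma=A\Delta A^{\mathsf{T}}$, and that the projected data $XA$ has covariance $\Delta$ (uncorrelated scores with variances $\lambda_i$). The matrix name `A` for the eigenvector matrix and the data are assumptions for the example, and these facts are not necessarily the exact result the notes go on to state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data (assumption for illustration), then centre it.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mixing
Xc = X - X.mean(axis=0)

# Sample covariance matrix of the centred data.
Sigma = Xc.T @ Xc / (len(Xc) - 1)

# Spectral decomposition; eigh returns ascending eigenvalues, so reorder.
eigvals, A = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]
Delta = np.diag(eigvals)

# Covariance of the principal component scores.
scores = Xc @ A
score_cov = np.cov(scores, rowvar=False)

print(np.allclose(A @ Delta @ A.T, Sigma))  # Sigma = A Delta A^T
print(np.allclose(score_cov, Delta))        # scores are uncorrelated
```

The total variance is also preserved: the trace of `Sigma` equals the sum of the eigenvalues.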
