Big Data Analytics Notes, Part 3

1 Why Principal Component Analysis?

PCA is useful to:

  • reduce the number of features (which might reduce overfitting)
  • reduce memory or disk storage required for features
  • speed up execution of subsequent modelling step(s)
  • visualise the features of a high-dimensional model
  • find unknown structure in features via subsequent clustering
  • detect outliers

In a typical data analysis workflow:

  • Scale the covariates
  • Split the data into training and test sets
  • Fit PCA on the training set, then apply the same loadings to the test set
  • Build a model using the PCA features of the training set
  • Assess prediction accuracy using the PCA features of the test set
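
The workflow above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the data are synthetic, the 150/50 split and the choice of 2 components are arbitrary, and it assumes that both the scaling statistics and the PCA loadings are estimated on the training set and then reused for the test set, so that no test information leaks into the features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 observations, 5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Split into training and test sets (arbitrary 150/50 split).
idx = rng.permutation(len(X))
X_train, X_test = X[idx[:150]], X[idx[150:]]

# Scale both sets with statistics estimated on the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0, ddof=1)
Z_train = (X_train - mean) / std
Z_test = (X_test - mean) / std

# Fit PCA on the training set: eigenvectors of its covariance matrix.
cov = np.cov(Z_train, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # re-order to descending
loadings = eigvecs[:, order[:2]]            # keep the first 2 components

# Project both sets onto the training loadings.
scores_train = Z_train @ loadings
scores_test = Z_test @ loadings
```

`scores_train` would then feed the model, and `scores_test` would be used to assess prediction accuracy.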

2 Principal components

The number of principal components to keep can be chosen:

  • by setting a threshold on the proportion of variance to retain, e.g. 80%
  • by Cattell’s scree test or Kaiser’s rule
  • by cross-validation
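
As a rough sketch of the first two criteria, the snippet below picks the number of components from a set of eigenvalues. The eigenvalues are made up for illustration and are assumed to come from a correlation matrix (so they sum to the number of features), which is the setting in which Kaiser's rule of keeping eigenvalues greater than 1 is normally stated. Cattell's scree test is a visual check for an "elbow" in the eigenvalue plot and is not coded here.

```python
import numpy as np

# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
eigvals = np.array([2.5, 1.2, 0.7, 0.4, 0.2])

# Variance threshold: smallest k whose cumulative share of variance reaches 80%.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
k_threshold = int(np.searchsorted(cumulative, 0.80) + 1)

# Kaiser's rule: keep components whose eigenvalue exceeds 1,
# i.e. components that explain more than a single standardised feature.
k_kaiser = int(np.sum(eigvals > 1.0))

print(k_threshold, k_kaiser)  # prints "3 2"
```

The two criteria need not agree; here the 80% threshold keeps one more component than Kaiser's rule.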

2.1 Concept of PCA

Suppose $X$ is an $n\times p$ matrix, where $p$ is the number of features, and $\Sigma=\mathrm{Var}(X)=X^{\mathsf{T}}X$ is the $p\times p$ covariance matrix of $X$ (for column-centred $X$, up to the factor $1/(n-1)$).
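
A small numerical check of this definition, assuming the columns of $X$ are centred first and the usual $1/(n-1)$ normalisation is applied (both are assumptions added here; the note writes the covariance simply as $X^{\mathsf{T}}X$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))        # n = 100 observations, p = 3 features

Xc = X - X.mean(axis=0)              # centre each column
n = X.shape[0]
Sigma = Xc.T @ Xc / (n - 1)          # p x p sample covariance matrix

# Agrees with NumPy's covariance estimator (columns treated as variables).
assert np.allclose(Sigma, np.cov(X, rowvar=False))
```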

For principal component $\vec{a_i}$, we want to project the data onto the direction $\vec{a_i}$ that maximises the variance $\Longrightarrow$ $\underset{\vec{a_i}}{\max}\,\mathrm{Var}(X\vec{a_i})=\underset{\vec{a_i}}{\max}\,\vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}$. This has no upper bound, because rescaling $\vec{a_i}$ rescales the variance, so:

Constraint: $||\vec{a_i}||^2=\vec{a_i}^{\mathsf{T}}\vec{a_i}=1$. Solve using the Lagrange multiplier method and the spectral decomposition of $\Sigma$.
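
The Lagrange step can be written out as follows. This is the standard derivation, sketched under the assumption that $\mathrm{Var}(X\vec{a_i})=\vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}$; the multiplier symbol $\lambda$ is introduced here for illustration.

```latex
\begin{aligned}
L(\vec{a_i},\lambda) &= \vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}
  - \lambda\left(\vec{a_i}^{\mathsf{T}}\vec{a_i}-1\right) \\
\frac{\partial L}{\partial \vec{a_i}} &= 2\Sigma\vec{a_i} - 2\lambda\vec{a_i} = 0
  \;\Longrightarrow\; \Sigma\vec{a_i} = \lambda\vec{a_i} \\
\mathrm{Var}(X\vec{a_i}) &= \vec{a_i}^{\mathsf{T}}\Sigma\vec{a_i}
  = \lambda\,\vec{a_i}^{\mathsf{T}}\vec{a_i} = \lambda
\end{aligned}
```

So each stationary direction is an eigenvector of $\Sigma$, and the variance along it equals the corresponding eigenvalue. The first principal component is therefore the eigenvector with the largest eigenvalue.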

2.2 Useful result of PCA


$\Delta=\mathrm{diag}(\lambda_1,...,\lambda_p)$
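
Assuming $\Delta$ collects the eigenvalues of $\Sigma$ from its spectral decomposition, the relationship can be checked numerically. In this sketch the data are synthetic and the matrix name `A` (eigenvectors as columns) is a label chosen here, not notation from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data: 500 observations, 4 features.
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (len(Xc) - 1)

# Eigendecomposition of the symmetric matrix Sigma, sorted to descending order.
eigvals, A = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]
Delta = np.diag(eigvals)

# Spectral decomposition: Sigma = A Delta A^T.
assert np.allclose(Sigma, A @ Delta @ A.T)

# The principal component scores are uncorrelated, with variances lambda_1..lambda_p.
scores = Xc @ A
assert np.allclose(np.cov(scores, rowvar=False), Delta)
```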



