Program = Algorithm + Data + Structure + Model + Tools + OS
- OS
UNIX:
Shell:
GPU: Parallel Cuda
Simulation Demo not sample:
Network: TCP/IP, HTTPS, sockets, sessions, SSL
- Tools
NoSQL: distributed, real-time, massive concurrent throughput
SQL: MySQL, Oracle
Mini SQL: Excel
Graph SQL: Neo4j, OrientDB
Statistical programming languages: Python, Scala, R, or MATLAB/Octave
Python:
Third-party tools: OpenCV, OpenGL, OpenCL, OpenMP
Third-party ML tools: Caffe, MXNet, TensorFlow, Torch
- Model:
Math Model:
Design Model:
UML:
- Structure:
Basic: Stack, Heap, Tree, Map
Advanced: Thread, JVM
- Data
BI: Power BI, FineBI, Tableau
SQL
NOSQL
GraphSQL: Neo4j, JanusGraph, GCN
- Algorithm:
- Classic Algorithms: Ranking
(Relevance Ranking Models):
(Boolean Model), (Vector Space Model), (Latent Semantic Analysis), BM25, LMIR
(Importance Ranking Models)
PageRank, HITS, HillTop, TrustRank
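The importance ranking models above can be illustrated with a minimal PageRank power iteration; the toy graph, damping factor, and iteration count here are made up for the sketch.

```python
# Minimal PageRank power iteration on a toy link graph (illustrative only).

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping node -> list of outbound neighbours."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if not outs:                      # dangling node: spread evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Since each iteration redistributes all the mass, the ranks stay a probability distribution; here "c" outranks "b" because it receives links from both "a" and "b".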
Learning to Rank (LTR)
Information Retrieval (IR), Natural Language Processing (NLP), and Data Mining (DM)
1) WTA (Winner Takes All): for a given query q, WTA(q) = 1 if the first document in the returned list is relevant, and 0 otherwise.
2) MRR (Mean Reciprocal Rank): for a given query q, if the position of the first relevant document is R(q), then MRR(q) = 1/R(q).
3) MAP (Mean Average Precision): for each truly relevant document d, take its position P(d) in the model's ranking, compute the precision over the documents up to that position, and average these precisions.
4) NDCG (Normalized Discounted Cumulative Gain): a metric that jointly considers the model's ranking and the true ordering; the most commonly used ranking metric (see Wikipedia).
5) RC (Rank Correlation)
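The listwise metrics above can be sketched in a few lines, assuming each query's result is represented as a list of 0/1 relevance labels in ranked order (toy data, not a real IR run):

```python
# WTA, MRR (single-query reciprocal rank), and average precision
# over a ranked list of binary relevance labels.

def wta(rels):
    # 1 if the top-ranked result is relevant, else 0.
    return 1.0 if rels and rels[0] == 1 else 0.0

def mrr(rels):
    # Reciprocal rank of the first relevant document.
    for i, r in enumerate(rels, 1):
        if r == 1:
            return 1.0 / i
    return 0.0

def average_precision(rels):
    # Mean of precision@k taken at each relevant position.
    hits, total = 0, 0.0
    for i, r in enumerate(rels, 1):
        if r == 1:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

ranked = [0, 1, 0, 1]        # relevance of the results at ranks 1..4
```

For `[0, 1, 0, 1]`: WTA is 0, MRR is 1/2, and AP averages precision at ranks 2 and 4.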
-
- The "Big 3": Classification, Clustering, Regression:
NB: Naive Bayes
Classification Algorithm    | Accuracy | F1-Score
Logistic Regression         | 84.60%   | 0.6337
Naive Bayes                 | 80.11%   | 0.6005
Stochastic Gradient Descent | 82.20%   | 0.5780
K-Nearest Neighbours        | 83.56%   | 0.5924
Decision Tree               | 84.23%   | 0.6308
Random Forest               | 84.33%   | 0.6275
Support Vector Machine      | 84.09%   | 0.6145
Regression:
k-nearest neighbors
Linear Regression (LASSO, Ridge, and Elastic-Net) (regularized): L0, L1, L2 norms against overfitting
L1-regularized Logistic Regression (L1 norm)
L2-regularized Logistic Regression (L2 norm)
Regression Tree:
Decision Tree: Regressor, SVR, Bayes
Logistic Regression:
Random forests: (classification tree)
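A minimal sketch of the L2-regularized logistic regression entry above, trained with batch gradient descent; the toy data, learning rate, and regularization strength are made up for illustration:

```python
import math

# L2-regularized logistic regression via batch gradient descent.
# Objective: average log-loss + (lam/2) * ||w||^2.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lam=0.1, lr=0.5, epochs=200):
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        gw = [lam * wj for wj in w]          # gradient of the L2 penalty
        gb = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj / n   # averaged log-loss gradient
            gb += (p - yi) / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Linearly separable toy data: label is 1 when x0 > x1.
X = [[2, 1], [3, 0], [1, 2], [0, 3]]
y = [1, 1, 0, 0]
w, b = train(X, y)
```

The L2 penalty keeps the weights finite even on separable data, which is exactly the overfitting control the list above refers to.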
(CV) Classification
Detection:signal detection
Recognition
-
- Kernel Methods:
[SVM] rankings, clusters, or classifications
Logistic
Softmax
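The softmax entry above generalizes the logistic function to multiple classes; a numerically stable version shifts by the maximum score before exponentiating:

```python
import math

# Numerically stable softmax: subtracting the max score leaves the
# result unchanged but avoids overflow in exp().

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```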
-
- Ensemble: Voting, Averaging, Random Forest,
Bagging, Blending, Boosting, Stacking
Bagging+ Decision Tree= Random Forest (RF)
AdaBoost + Decision Tree = Boosting Tree
Gradient Boosting + Decision Tree = GBDT
Voting:
Averaging:
Random Forest (RF): an alternative to Bagging (m = p)
Bagging: bootstrap sampling; classification -> voting, regression -> averaging.
Blending:
Boosting:
Bootstrap:
AdaBoost: (target recognition, face detection)
GBDT: MART (Multiple Additive Regression Tree),
GBRT (Gradient Boosting Regression Tree)
Loss Function:
XGBoost:
Stacking:
5-Fold Stacking
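The Voting entry above can be sketched with three stand-in "models" (plain threshold functions here; real ones would be trained classifiers) and a hard majority vote:

```python
from collections import Counter

# Hard-voting ensemble: each model predicts a class label and the
# majority label wins. The three models are illustrative stand-ins.

def model_a(x): return 1 if x > 0 else 0
def model_b(x): return 1 if x > 1 else 0
def model_c(x): return 1 if x > -1 else 0

def vote(models, x):
    preds = [m(x) for m in models]
    return Counter(preds).most_common(1)[0][0]

pred = vote([model_a, model_b, model_c], 0.5)
```

At x = 0.5 the votes are [1, 0, 1], so the ensemble predicts 1 even though one model disagrees; averaging predicted probabilities instead gives the soft-voting / Averaging variant.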
-
- Dimensionality reduction :
LDA: Linear Discriminant Analysis [Supervised]
Fisher Linear Discriminant FLD
PCA: Principal component analysis [ Unsupervised ]
SVD: Singular value decomposition
FA: Factor Analysis
ICA: Independent Component Analysis
LPP: An alternative to PCA
LLE: Locally linear embedding
t-SNE: t-distributed Stochastic Neighbor Embedding
LEP:
UV:
Missing Values Ratio
Low Variance Filter
High Correlation Filter
Random Forests
Backward Feature Elimination
Forward Feature Construction
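The Low Variance Filter listed above is the simplest of these reduction techniques; a sketch on toy data (the threshold is arbitrary):

```python
# Low Variance Filter: drop feature columns whose variance falls
# below a threshold, since near-constant features carry little signal.

def variance(col):
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

def low_variance_filter(rows, threshold=0.01):
    cols = list(zip(*rows))
    keep = [j for j, c in enumerate(cols) if variance(c) > threshold]
    return [[row[j] for j in keep] for row in rows], keep

data = [[1.0, 5.0], [1.0, 3.0], [1.0, 4.0]]   # column 0 is constant
filtered, kept = low_variance_filter(data)
```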
-
- Expectation Maximization (EM):
(HMM, GMM, LDA, MLE) non-gradient optimization
EM & GMM:
EM & K-means:
· k-means is the special case of Gaussian mixture clustering in which the component variances are equal and each sample is assigned to exactly one component. The relationship between k-means and EM:
· k-means alternates two steps: fix the centres, then assign each sample to its nearest centre --> the E-step and M-step.
· The E-step assigns each point to the class whose centre is nearest (hard assignment), which can be seen as an approximation of the EM algorithm's E-step (soft assignment).
· The M-step updates each class centre, which maximizes the likelihood under the assumption that each class is a unit-variance Gaussian.
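The hard-EM view above can be sketched directly; toy 1-D points and fixed initial centres are made up for illustration:

```python
# K-means as hard EM: the E-step assigns each point to its nearest
# centre (hard assignment), the M-step recomputes each centre as the
# mean of its assigned points.

def kmeans(points, centres, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for p in points:                                   # E-step
            nearest = min(range(len(centres)),
                          key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]    # M-step
                   for i, c in enumerate(clusters)]
    return centres

centres = kmeans([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 6.0])
```

With these points the centres settle at the two cluster means, 1.0 and 5.0.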
-
- Nearest Neighbors:
K-means: (Clustering)
Affinity Propagation: (Clustering)
Hierarchical / Agglomerative: (Clustering)
DBSCAN: (Clustering)
KNN:
PageRank:
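A minimal k-nearest-neighbours classifier for the KNN entry above, on toy 2-D points with k = 3:

```python
import math
from collections import Counter

# KNN: predict the majority label among the k training points
# nearest to the query (Euclidean distance).

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs."""
    by_dist = sorted(train, key=lambda t: math.dist(t[0], query))
    labels = [label for _, label in by_dist[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
label = knn_predict(train, (1, 1))
```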
-
- Correlation:
Apriori:Data mining
Affinity Propagation
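One level of the Apriori idea above, sketched on made-up transactions: count single items, keep the frequent ones, and build candidate pairs only from frequent items (the support threshold is arbitrary):

```python
from itertools import combinations
from collections import Counter

# Apriori pruning for pairs: a pair can only be frequent if both of
# its items are frequent, so infrequent items are dropped first.

def frequent_pairs(transactions, min_support=2):
    item_counts = Counter(i for t in transactions for i in set(t))
    frequent = {i for i, c in item_counts.items() if c >= min_support}
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t) & frequent)
        for pair in combinations(items, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

tx = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
pairs = frequent_pairs(tx)
```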
-
- Neural networks(NN) -> 9
- Math:
Linear algebra
symmetric matrix
Orthogonal matrix
Probability and statistics
Numerical optimization
Multivariable Calculus
Ordinary Least Squares Regression
Stepwise Regression
Multivariate Adaptive Regression Splines
- Tuning: performance index
Tuning of Bugs:
Tuning of Concurrent
Tuning of Online:
Tuning of ML: PCA
SGD,Adagrad,Adadelta,Adam,Adamax,Nadam
         | Predicted 1         | Predicted 0
Actual 1 | True Positive (TP)  | False Negative (FN)
Actual 0 | False Positive (FP) | True Negative (TN)
-
- Classification Performance index
Statistics: Precision (P), Recall (R), F1
Accuracy: (acc)
Error rate: (1 - acc )
Precision = correctly extracted items / all extracted items
Recall = correctly extracted items / all relevant items in the sample
F-measure: F1 = 2PR / (P + R) ——> 2/F1 = 1/P + 1/R
GooSeeker, Specificity, ROC, AUC
ROC (Receiver Operating Characteristic)
AUC
PSI (Population Stability Index) = sum((actual% - expected%) * ln(actual% / expected%))
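The statistics above follow directly from the TP/FN/FP/TN counts of the confusion matrix; the counts here are toy numbers:

```python
# Precision, recall, F1, and accuracy from confusion-matrix counts.

def classification_metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = classification_metrics(tp=8, fn=2, fp=4, tn=6)
```

With these counts: P = 8/12, R = 8/10, and F1 = 2PR/(P+R) = 8/11, matching the harmonic-mean identity 2/F1 = 1/P + 1/R.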
-
- GD:
SGD: Stochastic Gradient Descent
SAG: Stochastic Average Gradient
MBGD: Mini-Batch Gradient Descent
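A mini-batch SGD sketch fitting a one-parameter linear model y = w*x by squared error; the data (true slope 3), batch size, and learning rate are illustrative:

```python
import random

# Mini-batch SGD: shuffle each epoch, then step on the averaged
# gradient of the squared error over each small batch.

random.seed(0)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]

w = 0.0
for epoch in range(200):
    random.shuffle(data)
    for i in range(0, len(data), 2):          # batches of size 2
        batch = data[i:i + 2]
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= 0.05 * grad
```

Batch size 1 gives plain SGD and batch size len(data) gives full-batch gradient descent; MBGD sits in between.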
-
- Entropy:
Conditional entropy
Information gain
Information gain ratio
Gini index
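The split criteria above, computed on a toy label list:

```python
import math
from collections import Counter

# Entropy, Gini index, and information gain of a candidate split.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, splits):
    # Parent entropy minus the size-weighted entropy of the children.
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

parent = ["y", "y", "n", "n"]
gain = information_gain(parent, [["y", "y"], ["n", "n"]])
```

A 50/50 parent has entropy 1 bit and Gini 0.5; a perfectly separating split yields the maximal gain of 1 bit. Gain ratio divides this gain by the split's own entropy.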
-
- Regression Performance index
SSE, MSE, RMSE, MAE, R^2 (R-squared)
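The regression indices above, computed for a toy set of predictions:

```python
import math

# SSE, MSE, RMSE, MAE, and R^2 for paired true/predicted values.

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(e ** 2 for e in errors)           # sum of squared errors
    mse = sse / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean = sum(y_true) / n
    sst = sum((t - mean) ** 2 for t in y_true)  # total variance around the mean
    r2 = 1.0 - sse / sst
    return sse, mse, rmse, mae, r2

sse, mse, rmse, mae, r2 = regression_metrics([1, 2, 3, 4],
                                             [1.1, 1.9, 3.2, 3.8])
```

R^2 compares the model's squared error against always predicting the mean, so 1 is a perfect fit and 0 is no better than the mean.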
- ANN DNN
AlexNet, GoogLeNet, Fast/Faster R-CNN, SSD, YOLO, SegNet
- DL ML (deep learning)
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Multi-task learning
• Cross-validation
- Digital Signal:
Signal Processing: Short Time Fourier Transform, Moving Median Filtering,
Singular Value Decomposition
Text: NLP
NUM: observing data -> finding features -> designing algorithms -> validating algorithms -> cleaning data -> engineering -> monitoring the online effect -> back to observing data
Image:
Video: Optical Flow Field, Edge Extraction, Feature Point Extraction, SVM, AdaBoost, Neural Network
- Terminal
Oral Textbook
- Open framework
Face Recognize:
Recommendation: DeepFM, Wide & Deep, DIN
Search
Ads
User portrait
DBpedia, Freebase, YAGO, OpenKG