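Spark's ML (Machine Learning) library provides implementations of the mainstream statistical and data mining algorithms. In this article William gives an overview of what is available; detailed walkthroughs of individual algorithms will follow in later articles.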

Classification and regression algorithms

Algorithm | Spark algorithm class | Spark model class |
--- | --- | --- |
Support vector machine (SVM) | SVMWithSGD | SVMModel |
Logistic regression | LogisticRegressionWithLBFGS, LogisticRegressionWithSGD | LogisticRegressionModel |
Linear regression | LinearRegressionWithSGD | LinearRegressionModel |
Streaming linear regression | StreamingLinearRegressionWithSGD | LinearRegressionModel |
Ridge regression | RidgeRegressionWithSGD | RidgeRegressionModel |
Lasso regression | LassoWithSGD | LassoModel |
Naive Bayes | NaiveBayes | NaiveBayesModel |
Decision tree | DecisionTree | DecisionTreeModel |
Random forest | RandomForest | RandomForestModel |
Gradient-boosted trees | GradientBoostedTrees | GradientBoostedTreesModel |
Isotonic regression | IsotonicRegression | IsotonicRegressionModel |
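
To make the split between an "algorithm class" and a "model class" concrete, here is a minimal sketch that trains a binary classifier with LogisticRegressionWithLBFGS and gets back a LogisticRegressionModel. It assumes the RDD-based `org.apache.spark.mllib` API is on the classpath; the app name, the `local[2]` master, and the tiny training set are made up for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders chosen for this sketch.
    val conf = new SparkConf().setAppName("LogisticRegressionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up training data: one feature, label 1.0 for the larger values.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.2)),
      LabeledPoint(0.0, Vectors.dense(0.7)),
      LabeledPoint(0.0, Vectors.dense(1.1)),
      LabeledPoint(1.0, Vectors.dense(2.4)),
      LabeledPoint(1.0, Vectors.dense(2.9)),
      LabeledPoint(1.0, Vectors.dense(3.3))
    )).cache()

    // Algorithm class: configured first, then run on the data.
    val algorithm = new LogisticRegressionWithLBFGS().setNumClasses(2)

    // Model class: run() returns a LogisticRegressionModel.
    val model = algorithm.run(training)

    // predict() on a single vector returns the predicted label as a Double.
    val label = model.predict(Vectors.dense(3.0))
    println(s"Predicted label for feature 3.0: $label")

    sc.stop()
  }
}
```

Most of the other entries in the table follow the same shape: an algorithm object consumes an RDD of LabeledPoint and returns the matching model. The exact entry points (constructors versus static `train` methods) and their availability differ between classes and Spark versions, so check the API docs for the version you run.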

Collaborative filtering algorithms

Algorithm | Spark algorithm class | Spark model class |
--- | --- | --- |
Alternating least squares (ALS) | ALS | MatrixFactorizationModel |
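
As a rough illustration of how ALS produces a MatrixFactorizationModel, the sketch below factorizes a handful of explicit ratings and predicts one missing entry. The user and product ids, the ratings, and the hyperparameters (rank, iteration count, regularization) are invented for this example and are not tuned values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders chosen for this sketch.
    val conf = new SparkConf().setAppName("AlsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up explicit feedback: Rating(user, product, rating).
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 5.0), Rating(1, 11, 1.0),
      Rating(2, 10, 4.0), Rating(2, 12, 2.0),
      Rating(3, 11, 5.0), Rating(3, 12, 4.0)
    ))

    // Illustrative hyperparameters, not recommendations.
    val rank = 2            // number of latent factors
    val numIterations = 10  // ALS sweeps
    val lambda = 0.01       // regularization strength

    // ALS.train returns a MatrixFactorizationModel.
    val model = ALS.train(ratings, rank, numIterations, lambda)

    // Predict a (user, product) pair that was not in the training data.
    // Both the user and the product appear elsewhere in the ratings,
    // so the model has latent factors for each of them.
    val predicted = model.predict(1, 12)
    println(s"Predicted rating of product 12 by user 1: $predicted")

    sc.stop()
  }
}
```

With so little data the predicted value itself is not meaningful; the point is only the flow from an RDD of Rating objects to a model that can score unseen pairs.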

Clustering algorithms

Algorithm | Spark algorithm class | Spark model class |
--- | --- | --- |
k-means | KMeans | KMeansModel |
Gaussian mixture | GaussianMixture | GaussianMixtureModel |
Power iteration clustering (PIC) | PowerIterationClustering | PowerIterationClusteringModel |
Latent Dirichlet allocation (LDA) | LDA | DistributedLDAModel |
Streaming k-means | StreamingKMeans | StreamingKMeansModel (extends KMeansModel) |
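
The following is a small sketch of the KMeans to KMeansModel flow, assuming the RDD-based `org.apache.spark.mllib` API. The points, the choice of k = 2, and the iteration cap are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders chosen for this sketch.
    val conf = new SparkConf().setAppName("KMeansSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up 2-D points forming two loose groups.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1),
      Vectors.dense(0.2, 0.0),
      Vectors.dense(0.1, 0.3),
      Vectors.dense(9.0, 9.2),
      Vectors.dense(9.1, 8.8),
      Vectors.dense(8.9, 9.0)
    )).cache()

    // KMeans.train(data, k, maxIterations) returns a KMeansModel.
    val numClusters = 2
    val maxIterations = 20
    val model = KMeans.train(points, numClusters, maxIterations)

    // The model exposes the learned centers and assigns new points to them.
    model.clusterCenters.foreach(center => println(s"Center: $center"))
    val clusterIndex = model.predict(Vectors.dense(0.1, 0.1))
    println(s"Cluster index for (0.1, 0.1): $clusterIndex")

    sc.stop()
  }
}
```

Cluster indices are arbitrary labels, so which group ends up as cluster 0 can vary between runs unless a seed is fixed.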

Dimensionality reduction algorithms

Algorithm | Spark method |
--- | --- |
Singular value decomposition (SVD) | RowMatrix.computeSVD |
Principal component analysis (PCA) | RowMatrix.computePrincipalComponents |
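
Unlike the other sections, these two entries are methods on a distributed RowMatrix rather than separate algorithm and model classes. A minimal sketch of both calls, using a made-up 4 x 3 matrix and keeping the top 2 components, might look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimensionalityReductionSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders chosen for this sketch.
    val conf = new SparkConf().setAppName("DimensionalityReductionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up data: each Vector is one row of the matrix.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0),
      Vectors.dense(2.0, 1.0, 0.0)
    ))
    val mat = new RowMatrix(rows)

    // SVD: keep the top 2 singular values; computeU = true also returns U.
    val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true)
    val u: RowMatrix = svd.U // left singular vectors, distributed
    val s: Vector = svd.s    // singular values, in descending order
    val v: Matrix = svd.V    // right singular vectors, local
    println(s"Singular values: $s")

    // PCA: the principal components come back as a local matrix (columns = components).
    val components: Matrix = mat.computePrincipalComponents(2)

    // Project the original rows onto the 2-dimensional principal subspace.
    val projected: RowMatrix = mat.multiply(components)
    projected.rows.collect().foreach(row => println(s"Projected row: $row"))

    sc.stop()
  }
}
```

Note that U stays distributed while V and the principal components are local matrices, which reflects the assumption that the number of columns is small enough to fit on one machine.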

Feature extraction and transformation

Algorithm | Spark algorithm class | Spark model class |
--- | --- | --- |
TF-IDF | HashingTF, IDF | IDFModel |
Word2Vec | Word2Vec | Word2VecModel |
Standard scaler | StandardScaler | StandardScalerModel |
Normalizer | Normalizer | |
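
TF-IDF is listed with two classes because it is a two-step process: HashingTF maps each document to a term-frequency vector, and IDF is then fitted on those vectors to rescale them. A minimal sketch under that reading, with made-up documents and the default hashing dimension:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object TfIdfSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders chosen for this sketch.
    val conf = new SparkConf().setAppName("TfIdfSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up corpus: each document is a sequence of whitespace-separated terms.
    val documents: RDD[Seq[String]] = sc.parallelize(Seq(
      "spark provides machine learning algorithms",
      "spark provides streaming",
      "machine learning needs features"
    )).map(_.split(" ").toSeq)

    // Step 1: hash each term to an index and count occurrences per document.
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache() // IDF makes two passes: one to fit, one to transform

    // Step 2: fit inverse document frequencies, then rescale the TF vectors.
    val idfModel = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idfModel.transform(tf)

    tfidf.collect().foreach(vector => println(vector))

    sc.stop()
  }
}
```

Because HashingTF uses the hashing trick, different terms can collide on the same index; a larger feature dimension reduces that risk at the cost of wider vectors.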

Frequent itemset mining

Algorithm | Spark algorithm class |
--- | --- |
FP-growth | FPGrowth |
Association rules | AssociationRules |
PrefixSpan | PrefixSpan |
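
FPGrowth and AssociationRules are often used together: the first finds frequent itemsets, the second derives rules from them. The sketch below chains the two on a made-up set of transactions. The support and confidence thresholds are illustrative, and AssociationRules is assumed to be available (it was added to `org.apache.spark.mllib.fpm` later than FPGrowth).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.{AssociationRules, FPGrowth}
import org.apache.spark.rdd.RDD

object FrequentItemsetSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders chosen for this sketch.
    val conf = new SparkConf().setAppName("FrequentItemsetSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up transactions; items within one transaction must be unique.
    val transactions: RDD[Array[String]] = sc.parallelize(Seq(
      "a b c",
      "a b",
      "a c",
      "b c",
      "a b c d"
    )).map(_.split(" "))

    // Step 1: mine itemsets that appear in at least 40% of the transactions.
    val fpGrowth = new FPGrowth().setMinSupport(0.4).setNumPartitions(2)
    val model = fpGrowth.run(transactions)
    model.freqItemsets.collect().foreach { itemset =>
      println(s"${itemset.items.mkString("[", ",", "]")} appears ${itemset.freq} times")
    }

    // Step 2: derive rules from the frequent itemsets with confidence >= 0.6.
    val rules = new AssociationRules().setMinConfidence(0.6).run(model.freqItemsets)
    rules.collect().foreach { rule =>
      val lhs = rule.antecedent.mkString("[", ",", "]")
      val rhs = rule.consequent.mkString("[", ",", "]")
      println(s"$lhs => $rhs with confidence ${rule.confidence}")
    }

    sc.stop()
  }
}
```

PrefixSpan is used in a similar way but takes sequences of itemsets as input instead of flat transactions, since it mines sequential patterns.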