Step 1: generate the data with Python:
import numpy as np
from sklearn.datasets import make_blobs

# 2000 samples with 100 features each, drawn from 10 Gaussian blobs
features, labels = make_blobs(n_samples=2000, n_features=100, centers=10)
# reshape labels into a column vector so it can be prepended to the feature matrix
labels = labels.reshape((len(labels), -1))
data = np.hstack((labels, features))
np.savetxt('/home/fhqplzj/IdeaProjects/DocumentClustering/target/blobs/features.txt', data, delimiter=' ')
That gives 2,000 sample points with 100-dimensional features, drawn from 10 clusters. Each output row holds the label in the first column followed by the 100 feature values, space-delimited.
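As a quick sanity check on the file format, the saved matrix can be loaded back with `np.loadtxt` and split into labels and features again. This is a minimal sketch, not part of the original workflow; it uses a temporary file and deliberately small sizes (20 samples, 5 features, 3 centers) just for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_blobs

# Small illustrative sizes; the post itself uses 2000 samples, 100 features, 10 centers.
features, labels = make_blobs(n_samples=20, n_features=5, centers=3, random_state=0)
labels = labels.reshape((len(labels), -1))
data = np.hstack((labels, features))

# Write to a temporary location instead of the project path used in the post.
path = os.path.join(tempfile.mkdtemp(), 'features.txt')
np.savetxt(path, data, delimiter=' ')

# Load it back: the first column is the label, the remaining columns are the features.
loaded = np.loadtxt(path)
print(loaded.shape)                                   # (20, 6)
print(np.array_equal(loaded[:, 0], labels.ravel()))   # True
```

The same first-column-is-label layout is what the Spark side has to parse in step 2.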
Step 2: cluster the data with Spark:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.{SparkContext, SparkConf}
/**
 * Created by fhqplzj on 16-8-15 at 6:33 PM.
 */
object Parser {
val path = "/home/fhqplzj/IdeaProjects/DocumentClustering/target/blobs/features.txt"
def main(args: Array[String]): Unit = {