Naive Bayes in spark-mllib (translated)


Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. Naive Bayes can be trained very efficiently: in a single pass over the training data, it computes the conditional probability distribution of each feature given each label, then applies Bayes' theorem to compute the conditional probability distribution of a label given an observation, and uses that distribution for prediction.
spark-mllib supports multinomial naive Bayes and Bernoulli naive Bayes. These models are typically used for document classification. Within that context, each observation is a document and each feature represents a term, whose value is the frequency of the term (in multinomial naive Bayes) or a zero/one indicating whether the term occurs in the document (in Bernoulli naive Bayes). Feature values must be non-negative. The model type is selected with an optional parameter, "multinomial" or "bernoulli", with "multinomial" as the default. Additive smoothing can be configured via the parameter λ (default 1.0). For document classification, the input feature vectors are usually sparse, and supplying sparse vectors as input brings a computational advantage. Since the training data is used only once, there is no need to cache it.
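To make the training procedure concrete, here is a minimal pure-Python sketch of multinomial naive Bayes with additive (Laplace) smoothing. The toy term-count vectors and the `train`/`predict` helpers are illustrative inventions for this sketch, not the MLlib implementation:

```python
import math

def train(X, y, lam=1.0):
    """Fit multinomial naive Bayes on term-count vectors X with labels y.
    Returns per-label log priors and smoothed log conditional probabilities."""
    labels = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_cond = {}, {}
    for c in labels:
        rows = [x for x, yi in zip(X, y) if yi == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Sum term counts per feature, then apply additive smoothing with lambda.
        totals = [sum(r[j] for r in rows) for j in range(n_features)]
        denom = sum(totals) + lam * n_features
        log_cond[c] = [math.log((t + lam) / denom) for t in totals]
    return log_prior, log_cond

def predict(x, log_prior, log_cond):
    """Pick the label maximizing log P(label) + sum_j count_j * log P(term_j | label)."""
    scores = {c: log_prior[c] + sum(xj * lc[j] for j, xj in enumerate(x))
              for c, lc in log_cond.items()}
    return max(scores, key=scores.get)

# Toy corpus: two classes, three terms, term-frequency feature vectors.
X = [[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 2, 2]]
y = [0, 0, 1, 1]
log_prior, log_cond = train(X, y, lam=1.0)
print(predict([4, 0, 0], log_prior, log_cond))  # → 0 (document dominated by term 0)
```

For the Bernoulli variant, the feature vectors would instead hold 0/1 occurrence indicators and the per-feature probabilities would be estimated from document counts rather than term counts.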

Scala

NaiveBayes implements multinomial naive Bayes. It takes an RDD of LabeledPoint and an optional smoothing parameter lambda as input, plus an optional model type parameter (default "multinomial"), and outputs a NaiveBayesModel, which can be used for evaluation and prediction.
For details, refer to the NaiveBayes Scala docs and the NaiveBayesModel Scala docs.

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Each line holds a label, a comma, then space-separated feature values.
val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

// Train with additive smoothing lambda = 1.0 and the multinomial model type.
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

// Compute test accuracy: the fraction of test points whose predicted label matches.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load model
model.save(sc, "target/tmp/myNaiveBayesModel")
val sameModel = NaiveBayesModel.load(sc, "target/tmp/myNaiveBayesModel")  

The complete example can be found at examples/src/main/scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala in the Spark source.

Java

NaiveBayes implements multinomial naive Bayes. It takes a Scala RDD of LabeledPoint and an optional smoothing parameter lambda as input, plus an optional model type parameter (default "multinomial"), and outputs a NaiveBayesModel, which can be used for evaluation and prediction.
For details, refer to the NaiveBayes Java docs and the NaiveBayesModel Java docs.

import scala.Tuple2;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

String path = "data/mllib/sample_naive_bayes_data.txt";
JavaRDD<LabeledPoint> inputData = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
JavaRDD<LabeledPoint>[] tmp = inputData.randomSplit(new double[]{0.6, 0.4}, 12345);
JavaRDD<LabeledPoint> training = tmp[0]; // training set
JavaRDD<LabeledPoint> test = tmp[1]; // test set
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaPairRDD<Double, Double> predictionAndLabel =
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override
    public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    }
  });
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
  @Override
  public Boolean call(Tuple2<Double, Double> pl) {
    return pl._1().equals(pl._2());
  }
}).count() / (double) test.count();

// Save and load model
model.save(jsc.sc(), "target/tmp/myNaiveBayesModel");
NaiveBayesModel sameModel = NaiveBayesModel.load(jsc.sc(), "target/tmp/myNaiveBayesModel");

The complete example can be found at examples/src/main/java/org/apache/spark/examples/mllib/JavaNaiveBayesExample.java in the Spark source.

Python

NaiveBayes implements multinomial naive Bayes. It takes an RDD of LabeledPoint and an optional smoothing parameter lambda as input, plus an optional model type parameter (default "multinomial"), and outputs a NaiveBayesModel, which can be used for evaluation and prediction.
For details, refer to the NaiveBayes Python docs and the NaiveBayesModel Python docs.

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint


def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

# Make prediction and test accuracy.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

# Save and load model
model.save(sc, "target/tmp/myNaiveBayesModel")
sameModel = NaiveBayesModel.load(sc, "target/tmp/myNaiveBayesModel")

The complete example can be found at examples/src/main/python/mllib/naive_bayes_example.py in the Spark source.
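The `parseLine` helper above assumes each input line holds a label, a comma, and then space-separated feature values. A quick standalone check of that parsing logic, using a made-up input line and no Spark dependency:

```python
def parse_line(line):
    """Split 'label,f1 f2 f3' into a (label, features) pair, mirroring parseLine."""
    label_part, feature_part = line.split(',')
    return float(label_part), [float(x) for x in feature_part.split(' ')]

label, features = parse_line("1,0 2 3")  # hypothetical input line
print(label, features)  # → 1.0 [0.0, 2.0, 3.0]
```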
