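As is well known, word count plays the same role in big data that hello world plays in programming languages. This post does not analyze how word count is computed. It simply presents the code, in order to compare the Java, Python, and Scala APIs in Spark.
As the code shows, the Java version is the most verbose, the Python version is simple and easy to follow, and the Scala version is extremely concise, since Scala is the language Spark itself is written in.
Complete Java code: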
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

public class wordcount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("wc");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read the text file, one RDD element per line
        JavaRDD<String> text = sc.textFile("/home/vagrant/speech.txt");
        // Split each line on a single space and flatten into words
        JavaRDD<String> words = text.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });
        // word => (word, 1)
        JavaPairRDD<String, Integer> counts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(String s) throws Exception {
                    return new Tuple2<>(s, 1);
                }
            }
        );
        // Sum the counts of each word
        JavaPairRDD<String, Integer> results = counts.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer v1, Integer v2) throws Exception {
                    return v1 + v2;
                }
            }
        );
        // Print each (word:count) pair
        results.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println("(" + t._1() + ":" + t._2() + ")");
            }
        });
        // Release the Spark context
        sc.close();
    }
}
Complete PySpark code:
# Import the PySpark entry points
from pyspark import SparkConf, SparkContext

# Configure the Spark context and give the application a name
sparkConf = SparkConf().setAppName("MyWordCounts")
sc = SparkContext(conf=sparkConf)

# The text file containing the words to count
textFile = sc.textFile('/home/vagrant/speech.txt')

# Build the word count pipeline (transformations are lazy, so nothing runs yet)
# Same map/reduce paradigm as Hadoop, but Spark keeps intermediate data in memory where possible
wordCounts = (textFile.flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))

# collect() is an action: it executes the DAG (Directed Acyclic Graph) and returns the results to the driver
for wc in wordCounts.collect():
    print(wc)

# Release the Spark context
sc.stop()
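To make the three stages of the PySpark chain above easier to follow, here is a minimal plain-Python sketch that imitates `flatMap`, `map`, and `reduceByKey` on an in-memory list. It does not use Spark at all: the helpers `flat_map` and `reduce_by_key` and the sample `lines` are hypothetical stand-ins introduced only for illustration, not part of the PySpark API or of speech.txt.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the results (imitates RDD.flatMap)."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge the values of equal keys with func (imitates RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical sample lines standing in for the contents of speech.txt
lines = ["to be or not to be", "that is the question"]

# Stage 1: line -> words
words = flat_map(lambda line: line.split(), lines)
# Stage 2: word -> (word, 1)
pairs = [(word, 1) for word in words]
# Stage 3: sum the 1s of each word
word_counts = reduce_by_key(lambda a, b: a + b, pairs)

for wc in sorted(word_counts.items()):
    print(wc)
```

Unlike the Spark version, this sketch runs eagerly on a single machine. It is only meant to show what each stage produces.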
Complete Scala code:
import org.apache.spark.{SparkConf, SparkContext}
object test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("MyWordCounts")
    val sc = new SparkContext(sparkConf)
    // Read the file, split lines into words, pair each word with 1, sum per word, and print
    sc.textFile("/home/vagrant/speech.txt")
      .flatMap(_.split(' '))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreach(println)
    sc.stop()
  }
}
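One detail worth noting when comparing the three versions: the Java and Scala code split each line on a single space, while the PySpark code calls `split()` with no arguments, which splits on runs of whitespace. On text with consecutive spaces the results may therefore differ, because splitting on a single space can yield empty-string tokens that then get counted as words. The short Python sketch below illustrates the difference on a hypothetical input line. Python's `split(" ")` is used here as a stand-in for the Java and Scala behavior, which it matches for repeated spaces inside a line.

```python
# Hypothetical input with two spaces after "hello"
line = "hello  spark world"

# Splitting on a single space keeps an empty token between the two spaces
by_single_space = line.split(" ")
# Splitting on whitespace runs (what the PySpark version does) drops it
by_whitespace = line.split()

print(by_single_space)  # ['hello', '', 'spark', 'world']
print(by_whitespace)    # ['hello', 'spark', 'world']
```

If identical counts across the three versions matter, one option is to filter out empty tokens after splitting.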
That concludes this post. Comments and discussion are welcome.