For the principles of divide-and-conquer and MapReduce, see the earlier post 「算法实验2:分而治之——修身齐家编算法」.
The full official Spark programming guide can be found at https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds
This post is continuously being updated...
Installing Hadoop
Reference: https://www.cnblogs.com/wuxun1997/p/6847950.html
Download the Hadoop bin utilities for Windows: https://github.com/Qinzixin/winutils
Once the installation is complete, open localhost:8080 and you should see:
Pitfall: jps does not show the DataNode, and localhost:50070 cannot be opened.
The following error appears: java.lang.UnsatisfiedLinkError…
Cause: the Java installation must be 64-bit.
Fix: reinstall the Java environment and update the JAVA_HOME path in hadoop-env.cmd.
After reinstalling:
Installing Scala
Install the Scala plugin:
https://www.jetbrains.com/help/idea/2017.1/creating-and-running-your-scala-application.html
https://www.jetbrains.com/help/idea/2017.1/enabling-and-disabling-plugins.html
Installing it directly from the plugin marketplace fails with a network error, so download it manually:
https://plugins.jetbrains.com/plugin/1347-scala/versions/stable
Open IDEA first to find out which Scala plugin version matches your IDE, then import the plugin manually following this tutorial:
https://www.cnblogs.com/zhaojinyan/p/9524296.html
Installing Spark
Install in the order Hadoop, Scala, Spark.
A 64-bit JDK is required.
If you use IDEA, the paid (Ultimate) edition is required.
Installation tutorial: https://blog.csdn.net/haijiege/article/details/80775792
To run in standalone (local) mode on Windows, the command-line arguments and the definition of the NativeIO class in the source code need to be modified.
Configuring the project
Import Spark's jar packages directly; use the same JDK version everywhere; if org.apache.spark.examples cannot be resolved, copy the text into Notepad and paste it back (it is an encoding problem).
Modify the log4j.properties file and place it in the src directory to suppress the INFO output.
Why Spark?
Spark is a fast, general-purpose, scalable in-memory engine for big-data analysis and computation.
| | Hadoop | Spark |
| --- | --- | --- |
| Origin | Yahoo | Berkeley |
| Language | Java | Scala |
| Components | MapReduce, HBase, HDFS | Spark Core, Spark SQL, Spark Streaming |
| Use case | Based on MapReduce, used for iterative, loop-style data processing | Computation optimized specifically for machine-learning workloads; the unit of computation is narrowed down to the RDD model |
| Data communication between jobs | Disk-based | Memory-based |
In a real production environment memory is limited, and a job may fail because there is not enough memory; in that situation MapReduce is actually the better choice, so Spark cannot completely replace MapReduce.
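To make the last table row concrete, here is a minimal, hypothetical sketch (not from the original post) of what "memory-based data communication between jobs" looks like in Spark; it assumes an existing local JavaSparkContext named sc and an input file data.txt:

```java
// Hypothetical example: the filtered RDD is cached once and reused by two
// separate actions (two jobs), instead of being written to disk in between.
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache(); // kept in memory after first use

long totalErrors = errors.count();                                  // job 1: computes and caches `errors`
long sparkErrors = errors.filter(l -> l.contains("spark")).count(); // job 2: reuses the cached RDD
```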
Initializing Spark
The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster.
To create a SparkContext, you first build a SparkConf object, which holds the information (configuration) about your application.
https://blog.csdn.net/tanggao1314/article/details/51570452/
```java
package org.apache.spark.examples;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public final class JavaSparkPi {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();                // create the Spark configuration
        // Work around the JVM not getting enough memory; alternatively add
        // '-Xms256m -Xmx1024m' to the VM options.
        conf.set("spark.testing.memory", "2147480000");
        // Create a Spark context running in the local environment.
        JavaSparkContext sc = new JavaSparkContext("local", "First Spark App", conf);
        System.out.println(sc);
    }
}
```
The core of Spark: RDDs (Resilient Distributed Datasets)
An RDD is Spark's abstraction of a distributed dataset: it is partitioned across the nodes of the cluster and can be computed on in parallel through functional operations.
- Read-only: an RDD cannot be modified, but it can be transformed into a new RDD
- Distributed
- Resilient: if memory is insufficient, data is exchanged with (spilled to) disk
Types of RDD operations
- Transformations: turn one dataset into another dataset
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
- Actions: signal that the transformations should actually be computed
Basic operations
Parallelizing a collection
```java
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data, 10); // split into 10 partitions
```
Reading a file
```java
JavaRDD<String> distFile = sc.textFile("data.txt"); // wildcard paths are supported
```
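For example, a wildcard (the directory name here is hypothetical) reads every matching file into a single RDD:

```java
JavaRDD<String> allLogs = sc.textFile("logs/*.txt"); // hypothetical path; all matching files end up in one RDD
```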
A simple map and reduce example
Lambda
```java
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length()); // map is lazy: nothing is computed or stored yet, so it cannot be printed
lineLengths.persist(StorageLevel.MEMORY_ONLY());           // persist the RDD in memory
int totalLength = lineLengths.reduce((a, b) -> a + b);     // reduce is an action and triggers the computation
```
If this fails to compile, the project language level is wrong and lambda expressions are not supported: https://blog.csdn.net/HCZ_hhh/article/details/115536405
mapToPair
```java
JavaRDD<String> lines = sc.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
```
Suppressing the extra log output
Put the modified log4j.properties file under the src directory, then mark src as a source root in IDEA (Mark Directory as → Sources Root).
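For reference, a minimal log4j.properties along these lines (adapted from the template shipped with log4j 1.x-based Spark versions; adjust if your Spark uses log4j 2) keeps only ERROR-level console output:

```properties
# Show only ERROR messages on the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```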
Example: word count
Chinese text is not handled yet…
```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();                // create the Spark configuration
        conf.set("spark.testing.memory", "2147480000");  // work around the JVM not getting enough memory
        // Create a Spark context running in the local environment.
        JavaSparkContext sc = new JavaSparkContext("local", "First Spark App", conf);

        JavaRDD<String> lines = sc.textFile("data.txt");
        // Split each line on spaces and flatten into a stream of words.
        JavaRDD<String> words = lines.flatMap(t -> Arrays.asList(t.split(" ")).iterator());
        // Pair every word with the count 1, then sum the counts per word.
        JavaPairRDD<String, Integer> wordAndOne = words.mapToPair(word -> new Tuple2<>(word, 1));
        JavaPairRDD<String, Integer> result = wordAndOne.reduceByKey((a, b) -> a + b);
        // Swap to (count, word), sort by count in descending order, then swap back.
        JavaPairRDD<Integer, String> beforeSwap = result.mapToPair(tp -> tp.swap());
        JavaPairRDD<Integer, String> sorted = beforeSwap.sortByKey(false);
        JavaPairRDD<String, Integer> finalRes = sorted.mapToPair(tp -> tp.swap());

        List<Tuple2<String, Integer>> list = finalRes.collect();
        for (Tuple2<String, Integer> l : list)
            System.out.println(l);
    }
}
```
RDD operations: a quick reference
Map
- Splitting: flatMap(line -> Arrays.asList(line.split(",")).iterator())
- Adding to each value: map(x -> x + y), for some fixed y
- Adding a field: mapToPair(item -> new Tuple2<>(item._1, item._2)); mapToPair can also map everything to a constant key and then reduce according to the nature of the data
- Broadcasting an external variable: broadcast() (a sketch follows below)
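Since broadcast() is only named above, here is a hedged sketch of how it is typically used (it assumes an existing JavaSparkContext named sc; the lookup table is made-up illustrative data):

```java
// Needs: import org.apache.spark.broadcast.Broadcast; import java.util.*;
Map<String, Integer> lookup = new HashMap<>();              // illustrative data, not from the post
lookup.put("spark", 1);
lookup.put("hadoop", 2);
Broadcast<Map<String, Integer>> bc = sc.broadcast(lookup);  // shipped to each executor once

JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "flink"));
// Tasks read the broadcast value instead of capturing a copy of the map in every closure.
JavaRDD<Integer> ids = words.map(w -> bc.value().getOrDefault(w, -1));
System.out.println(ids.collect()); // [1, 2, -1]
```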
Reduce
- Filtering: filter(num -> num % 2 == 0)
- Grouping values by key: groupByKey()
- Operating on values that share a key: reduceByKey((a, b) -> a + b)
- Collecting everything: collect().forEach(System.out::println)
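To tie these together, a small hypothetical sketch (again assuming an existing JavaSparkContext named sc) that filters, groups, and collects:

```java
JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
JavaRDD<Integer> evens = nums.filter(n -> n % 2 == 0);                              // keep even numbers
JavaPairRDD<Integer, Integer> keyed = evens.mapToPair(n -> new Tuple2<>(n % 3, n)); // key by remainder mod 3
JavaPairRDD<Integer, Iterable<Integer>> grouped = keyed.groupByKey();               // group values per key
grouped.collect().forEach(System.out::println);                                     // action: brings results to the driver
```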