Limitations of MapReduce:
1) Verbose code: even simple jobs require a lot of boilerplate.
2) Only the map and reduce operations are supported.
3) Low execution efficiency: intermediate results are written to disk between stages.
4) Poorly suited to iterative, interactive, and streaming processing (see the sketch after this list).
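The iteration pain point is the one Spark attacks most directly. Below is a minimal sketch (the data path and update rule are hypothetical, for illustration only) of an iterative job in Spark: the working set is loaded once and cached in memory, whereas MapReduce would re-read it from disk on every pass.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))
    // Load once and cache; MapReduce would rescan HDFS on every iteration.
    val points = sc.textFile("hdfs:///path/to/points")   // hypothetical path
      .map(_.toDouble)
      .cache()
    var w = 0.0
    for (_ <- 1 to 10) {
      // Each pass reuses the in-memory RDD instead of doing a fresh disk scan.
      w += points.map(p => p - w).mean()
    }
    println(s"result: $w")
    sc.stop()
  }
}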
Framework proliferation:
1) Batch (offline) processing: MapReduce, Hive, Pig
2) Stream (real-time) processing: Storm, JStorm
3) Interactive computing: Impala
Running a separate framework for each workload quietly drives up both the learning curve and the operations cost.
Prerequisites:
1) Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.
2) export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
mvn build command:
Prerequisite: some working knowledge of Maven, in particular pom.xml.
<properties>
  <hadoop.version>2.2.0</hadoop.version>
  <protobuf.version>2.5.0</protobuf.version>
  <yarn.version>${hadoop.version}</yarn.version>
</properties>
<profile>
  <id>hadoop-2.6</id>
  <properties>
    <hadoop.version>2.6.4</hadoop.version>
    <jets3t.version>0.9.3</jets3t.version>
    <zookeeper.version>3.4.6</zookeeper.version>
    <curator.version>2.6.0</curator.version>
  </properties>
</profile>
./build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package
# Recommended:
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0
After the build completes, the distribution tarball is named:
spark-$VERSION-bin-$NAME.tgz
e.g. spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz
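A quick way to verify the build is to unpack the tarball and run a trivial job in spark-shell (a sanity-check sketch; sc is the shell's built-in SparkContext):

// Inside spark-shell started from the new distribution:
sc.version                       // should report 2.1.0
sc.parallelize(1 to 100).sum()   // trivial job; should return 5050.0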
Common problems:
If the error output during compilation is too terse to diagnose, append -X to the build command to get full debug-level detail.
Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Could not find artifact org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.0 in central (https://repo1.maven.org/maven2) -> [Help 1]
Pitfall 1:
The CDH artifact lives in Cloudera's repository, not Maven Central, so add this entry to the <repositories> section of pom.xml:
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
Pitfall 2: the build runs out of JVM memory; make sure MAVEN_OPTS is set:
export MAVEN_OPTS="${MAVEN_OPTS:--Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m}"
Note:
Some people build on Aliyun machines where memory is limited; give the VM at least 2-4 GB, ideally 8 GB.
Pitfall 3:
Spark 2.1.0 builds against Scala 2.11 by default; to build for Scala 2.10, switch versions first (the second command switches back):
./dev/change-scala-version.sh 2.10
./dev/change-scala-version.sh 2.11
Pitfall 4:
"... was cached in the local repository, resolution will not be reattempted until the update interval of nexus has elapsed ..."
Two fixes:
1. Delete every *.lastUpdated file from the local repository and rerun the Maven command.
2. Append -U to the build command to force dependency updates.
The architecture of Spark Standalone mode is very similar to Hadoop HDFS/YARN:
1 master + n workers
spark-env.sh:
SPARK_MASTER_HOST=hadoop001
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=1
Assume:
hadoop1 : master
hadoop2 : worker
hadoop3 : worker
...
hadoop10 : worker
slaves (conf/slaves):
hadoop2
hadoop3
hadoop4
....
hadoop10
==> start-all.sh starts the Master process on hadoop1 and a Worker process on every hostname listed in the slaves file.
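To point an application at this cluster, the master URL has the form spark://<SPARK_MASTER_HOST>:7077 (7077 is the default master port). A minimal sketch, assuming the hostnames above:

import org.apache.spark.{SparkConf, SparkContext}

// Connects to the standalone master started on hadoop1 by start-all.sh.
val conf = new SparkConf()
  .setAppName("standalone-demo")
  .setMaster("spark://hadoop1:7077")
val sc = new SparkContext(conf)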
Spark WordCount:
// Read a local text file of comma-separated words.
val file = spark.sparkContext.textFile("file:///home/hadoop/data/wc.txt")
// Split lines into words, pair each word with 1, and sum the counts per word.
val wordCounts = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
// Return the (word, count) pairs to the driver.
wordCounts.collect
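A common follow-up is ordering the result by count; a small sketch extending the same pipeline (illustrative, not from the original notes):

// Sort the (word, count) pairs by count, descending, and print the top 10.
wordCounts.sortBy(_._2, ascending = false).take(10).foreach(println)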