spark
Average article quality score: 66
leibnitz09
- [spark-src-core] 2.1 relationships b/t misc spark shells
  Similar to other open source projects, Spark ships several shell scripts, listed here. sbin (server-side shells): start-all.sh starts the whole set of Spark daemons (i.e. start-master.sh, start-slav... · Original · 2016-06-01 16:01:36 · 114 views · 0 comments
- [spark-src-core] 6. checkpoint in spark
  As in other big data technologies, checkpointing is a well-known solution for keeping a snapshot of data to speed up failover, i.e. restoring data to its most recent checkpointed state, so you will not need t... · Original · 2016-10-19 17:14:46 · 125 views · 0 comments
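The snapshot-and-resume idea behind checkpointing can be sketched in plain Python. This is not Spark's RDD checkpointing; `compute`, `recover`, and the pickle-to-disk snapshot are illustrative assumptions showing why a restart only has to replay the steps after the last snapshot.

```python
import os
import pickle
import tempfile

def compute(data, steps, checkpoint_every=2, ckpt_dir=None):
    """Apply `steps` (a list of functions) to `data`, snapshotting the
    intermediate result every `checkpoint_every` steps so a restart can
    resume from the last snapshot instead of from step 0."""
    ckpt_dir = ckpt_dir or tempfile.mkdtemp()
    ckpt_path, ckpt_step = None, 0
    for i, step in enumerate(steps, start=1):
        data = step(data)
        if i % checkpoint_every == 0:
            ckpt_path = os.path.join(ckpt_dir, f"ckpt_{i}.pkl")
            with open(ckpt_path, "wb") as f:
                pickle.dump(data, f)          # persist the snapshot
            ckpt_step = i
    return data, ckpt_path, ckpt_step

def recover(ckpt_path, steps, ckpt_step):
    """Failover: reload the most recent snapshot and replay only the
    steps that came after it."""
    with open(ckpt_path, "rb") as f:
        data = pickle.load(f)
    for step in steps[ckpt_step:]:
        data = step(data)
    return data
```

The point of the sketch is the `steps[ckpt_step:]` slice in `recover`: everything before the checkpoint never has to be recomputed.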
- [spark-src-core] 7.1 application in spark-PageRank
  The code below is all from Spark's examples, with some comments added by me. val lines = ctx.textFile(args(0), 1) //-1 generate links of <src,targets> pairs var links = li... · 2016-11-03 15:59:12 · 131 views · 0 comments
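The rank-update loop of that example can be sketched in plain Python. The `pagerank` helper and the dict representation of the <src,targets> pairs are mine, but the 0.15 + 0.85 × contributions damping formula is the one the Spark example uses.

```python
def pagerank(links, iterations=10):
    """links: dict mapping src page -> list of target pages, i.e. the
    <src,targets> pairs the Spark example builds from the input file."""
    ranks = {page: 1.0 for page in links}          # initial rank 1.0 per page
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for src, targets in links.items():
            share = ranks[src] / len(targets)      # each neighbour gets rank/|targets|
            for t in targets:
                contribs[t] = contribs.get(t, 0.0) + share
        # damping formula from the Spark example: 0.15 + 0.85 * sum(contribs)
        ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}
    return ranks
```

Note that the update conserves total rank: summing 0.15 + 0.85·c over all pages gives back the previous total when every page has out-links.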
- spark-spawn a app via spark-shell VS spark-submit
  Yep, you can submit an app to a Spark cluster with the spark-submit command, e.g. spark-submit --master spark://gzsw-02:7077 --class org.apache.spark.examples.JavaWordCount --verbose --deploy-mode client ~... · 2015-11-25 12:30:36 · 102 views · 0 comments
- spark-run apps on yarn mode
  Running on a YARN cluster is straightforward: 1. set up HADOOP_CONF_DIR, either with export HADOOP_CONF_DIR=xx or by adding it to spark-env.sh; 2. spark-submit --master yarn --class org.... · 2015-11-25 17:37:18 · 141 views · 0 comments
- yarn-similar logs when starting up container
  15/12/09 16:47:52 INFO yarn.ExecutorRunnable: Setting up executor with environment: Map(CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HO... · 2015-12-09 17:17:49 · 91 views · 0 comments
- [spark-src-core] 8. trivial bug in spark standalone executor assignment
  From [1] we know that Spark executes a job in two steps: a. launch executors, and b. have the driver assign tasks to those executors. So how are executors assigned to workers ... · Original · 2016-11-22 17:24:48 · 105 views · 0 comments
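A rough plain-Python sketch of how a standalone master might spread executor cores across workers. `assign_executors` is a hypothetical helper, not Spark's code; it only approximates the spread-out versus consolidate behaviour that the article's analysis concerns.

```python
def assign_executors(workers_free_cores, cores_requested, spread_out=True):
    """workers_free_cores: list of free core counts, one per worker.
    Returns the number of cores assigned on each worker. With spread_out
    the master hands out one core per worker in round-robin fashion;
    otherwise it packs cores onto the first workers with capacity."""
    assigned = [0] * len(workers_free_cores)
    free = list(workers_free_cores)
    to_assign = min(cores_requested, sum(free))   # can't exceed capacity
    pos = 0
    while to_assign > 0:
        if spread_out:
            if free[pos] > 0:                     # one core at a time, round-robin
                free[pos] -= 1
                assigned[pos] += 1
                to_assign -= 1
            pos = (pos + 1) % len(free)
        else:
            # consolidate: fill the first worker that still has free cores
            pos = next(i for i, f in enumerate(free) if f > 0)
            take = min(free[pos], to_assign)
            free[pos] -= take
            assigned[pos] += take
            to_assign -= take
    return assigned
```

Spreading yields an even `[2, 2, 2]` for a 6-core request on three 4-core workers, while consolidating yields `[4, 2, 0]`; which is better depends on data locality and memory pressure.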
- spark-RDD vs DataFrame vs DataSet
  In sum, the choice of when to use RDDs versus DataFrames and/or Datasets seems obvious. While the former offer you low-level functionality and control, the latter allow custom views and structure... · Original · 2016-11-29 15:38:24 · 131 views · 0 comments
- spark-hive on spark
  Overall design: the idea of Hive on Spark is to reuse Hive's logic layer as much as possible and, starting from physical-plan generation, to provide a complete Spark-targeted implementation (e.g. SparkCompiler, SparkTask), so that Hive queries can run as Spark jobs. The main design principles: modify Hive's original code as little as possible. This is the biggest difference from the earlier Shark approach; Shark changed Hive too much... · Original · 2016-12-06 15:04:03 · 156 views · 0 comments
- spark-storage/memory used in spark
  Access patterns in Spark storage [1]. So far we have seen how Spark uses JVM memory and what execution slots on a cluster are; we have not yet gone into the details of tasks, which will be covered in another article. A task is basically Spark's unit of work, executed as a thread inside the executor's JVM process, which is why Spark jobs start quickly: starting a thread inside a JVM... · 2016-12-12 16:31:20 · 391 views · 0 comments
- spark-broadcast in spark
  Going through the block of code below, we can draw some conclusions: val barr1 = sc.broadcast(arr1) //-broadcast an array with 1M int elements //-this is an embedded broadcast wrapped b... · 2016-12-22 15:54:25 · 182 views · 0 comments
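One conclusion such an experiment points at is that a broadcast variable travels once per executor, rather than once per task inside every closure. That can be illustrated with a back-of-the-envelope cost model; the two helpers below are my own framing, not Spark's accounting.

```python
import pickle

def closure_cost_per_task(tasks, data):
    """Without broadcast: the data is serialized into every task's
    closure, so the shipping cost scales with the number of tasks."""
    return len(tasks) * len(pickle.dumps(data))

def broadcast_cost(executors, data):
    """With broadcast: each executor fetches the data once and every
    task scheduled on it reuses the local copy."""
    return len(executors) * len(pickle.dumps(data))
```

With hundreds of tasks but only a handful of executors, the broadcast cost is smaller by roughly the tasks-per-executor factor.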
- spark stream-Spark Streaming: 大规模流式数据处理的新贵 (the rising star of large-scale stream processing)
  Spark Streaming lineage. Ref: Spark Streaming: 大规模流式数据处理的新贵 · Original · 2016-02-24 17:10:06 · 475 views · 0 comments
- [spark-src]-source reading
  Based on: spark-1.4.1, hadoop-2.5.2. Going from simple to complex and following the working-flow principle, we proceed in these steps: 1. [spark-src] spark overview; 2. [spark-src] core from ... · 2016-03-20 15:06:25 · 115 views · 0 comments
- [spark-src-core] 5. big data techniques in spark
  There are several nice techniques in Spark, e.g. on the user-API side. Here we dive in and check how Spark implements them. 1. abstraction (functions in RDD): group / function / feature / principl... · Original · 2016-10-12 17:48:38 · 91 views · 0 comments
- spark-per partition operations
  Per-partition versions of map() and foreach(). Ref: Learning Spark. · Original · 2015-11-06 23:46:41 · 93 views · 0 comments
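The idea can be sketched without Spark: a mapPartitions-style function receives one iterator per partition, so per-partition setup (a DB connection, a parser) happens once per partition instead of once per element. The helpers below are a plain-Python analogue, not the Spark API.

```python
def map_partitions(partitions, f):
    """Plain-Python analogue of RDD.mapPartitions: `f` is called once
    per partition with an iterator over that partition's elements,
    instead of once per element as with map()."""
    return [list(f(iter(part))) for part in partitions]

def avg_per_partition(it):
    """Example per-partition function: accumulate across the whole
    iterator and emit a single result per partition. Any expensive
    setup placed here would run once per partition, not per element."""
    total, count = 0, 0
    for x in it:
        total += x
        count += 1
    return iter([total / count]) if count else iter([])
```

Averaging is a natural fit because it needs state across a whole partition, which element-wise map() cannot hold.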
- spark-compile spark 1.6.1
  In short, Spark can be compiled with Maven, sbt, or IntelliJ IDEA. Ref: Spark 1.0.0 源码编译和部署包生成 (source compilation and deployment-package generation). Also, if you want to load the Spark project into Eclipse, it is necessary to make an 'eclipse project'... · Original · 2015-09-14 18:05:08 · 94 views · 0 comments
- spark-basic demo from book 'learning spark'
  After a heavy time cost (mostly spent downloading a huge number of jars), the first example from the book 'Learning Spark' runs through. The source code is very simple: /** * Illustrates flatMap + co... · 2015-09-22 23:35:30 · 145 views · 0 comments
- [spark-src-core] 2.2 job submitted flow for local mode-part I
  Now we dive into Spark internals via the simple example below (wordcount; later articles will reference it by default): sparkConf.setMaster("local[2]") //-local[*] by default //leib-c... · Original · 2016-08-24 17:36:23 · 157 views · 0 comments
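For reference, the logic of that running wordcount example in plain Python (a stand-in for the Scala/Spark code, not the code itself): flatMap over lines to get words, then reduce by key with addition.

```python
from collections import Counter

def word_count(lines):
    """Plain-Python version of the wordcount used as the running
    example: flatMap(split) followed by reduceByKey(+)."""
    counts = Counter()
    for line in lines:
        for word in line.split():   # the flatMap step
            counts[word] += 1       # the reduceByKey step, collapsed
    return dict(counts)
```

Keeping this tiny model in mind helps when tracing how the same two steps become stages and tasks inside Spark.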
- [spark-src-core] 2.2 job submitted flow for local mode-part II
  In this section we verify how Spark collects data from the previous stage into the next stage (result task) after the ShuffleMapTask computation finishes (i.e. post-processing). Note: the l... · 2016-08-25 11:23:42 · 177 views · 0 comments
- [spark-src-core] 2.3 shuffle in spark
  1. flow: 1.1 shuffle abstract; 1.2 shuffle flow; 1.3 sort flow in shuffle; 1.4 data structures in memory. 2. core code paths: //SortShuffleWriter override def write(records: Iterat... · 2016-08-25 16:31:09 · 120 views · 0 comments
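The sort-based write path can be sketched logically in plain Python. `sort_shuffle_write` is a toy model of what a SortShuffleWriter achieves, bucketing records by reduce partition and sorting within each bucket, not the real spill-and-merge implementation.

```python
def sort_shuffle_write(records, num_partitions, key_func=hash):
    """Toy model of a sort-based shuffle write: route each (key, value)
    record to its reduce partition (key hash modulo partition count),
    then sort within each partition so the reduce side can merge runs."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[key_func(key) % num_partitions].append((key, value))
    return [sorted(b) for b in buckets]             # sorted per partition
```

The real writer additionally spills sorted runs to disk when memory fills and merges them into one file per map task, but the partition-then-sort shape is the same.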
- [spark-src-core] 2.4 communications b/t certain kernel components
  1. data flow overview. Note: an arrow here means: a bold line is a data line, 'without sender and receiver meanings', carrying only the data's 'from-to'. Two ways to retrieve a task result: direct result and i... · 2016-08-25 17:36:14 · 148 views · 0 comments
- [spark-src-core] 2.5 core concepts in Spark
  1. overview of wordcount in memory. Tips: Job > Stage > RDD > Dependency; RDDs are linked by Dependencies. 2. terms: an RDD is associated via a Dependency, i.e. a Dependency is a wrapper of an RDD.... · 2016-08-25 17:38:41 · 105 views · 0 comments
- [spark-src-core] 3. run spark in cluster(local) mode
  Yep, just as you'd guess, there are many deploy modes in Spark, e.g. standalone, YARN, Mesos, etc. Going a step further, standalone mode can be divided into standalone and cluster (local) modes. The form... · 2016-09-02 17:53:54 · 201 views · 0 comments
- spark-common RDD transformations and actions
  All figures below are from 'Learning Spark'. · Original · 2015-10-20 16:33:27 · 105 views · 0 comments
- [spark-src-core] 3.2 run spark in standalone(client) mode
  1. startup command: ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode client --master spark://gzsw-02:7077 lib/spark-examples-1.4.1-hadoop2.4.0.jar hdfs://host02:/user... · 2016-09-19 11:55:38 · 125 views · 0 comments
- [spark-src-core] 3.3 run spark in standalone(cluster) mode
  Similar to the previous article, this one focuses on cluster mode. 1. issue the command: ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master spark://gzsw... · 2016-09-19 12:30:17 · 275 views · 0 comments
- [spark-src-core] 4.2 communications b/t certain kernel components
  Several component entities run as daemons in Spark (standalone); knowing what they do and how they work is necessary indeed. Akka message flow, similar to TCP. Note: register driver = R... · 2016-09-27 12:26:41 · 111 views · 0 comments
- [spark-src] 1-overview
  What is it? "Apache Spark™ is a fast and general engine for large-scale data processing.... Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk," as stated on the Apache Spa... · 2016-03-20 16:20:31 · 141 views · 0 comments