Spark
商俊超
A big-data "code ape", working hard not to be the one trimmed away when the company cuts costs!
Spark's Five Join Strategies
JOIN is one of the most common data processing operations, and Spark, as a unified big data engine, supports a rich set of JOIN scenarios. This post introduces the five JOIN strategies Spark provides and covers: the factors that influence a JOIN, the five JOIN execution strategies in Spark, and how Spark chooses between them. Factors that influence a JOIN: dataset size — the size of the datasets taking part in the JOIN directly affects its execution efficiency, and likewise influences which JOIN mechanism is chosen; JOIN condition — the JOIN condition concerns the logical… (Reposted, 2021-01-07)
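As a rough illustration of how dataset size drives the strategy choice, here is a minimal local sketch (two made-up DataFrames, not code from the post): hinting broadcast on the small side produces a broadcast hash join, while the plain equi-join typically resolves to a sort-merge join.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinStrategyDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "shop1", 100.0), (2, "shop2", 50.0)).toDF("oid", "sid", "money")
    val shops  = Seq(("shop1", "Beijing"), ("shop2", "Shanghai")).toDF("sid", "city")

    // small dimension table: hint a broadcast hash join so the large side is not shuffled
    orders.join(broadcast(shops), "sid").explain()   // plan shows BroadcastHashJoin

    // without the hint (and above spark.sql.autoBroadcastJoinThreshold),
    // an equi-join usually falls back to a sort-merge join
    orders.join(shops, "sid").explain()

    spark.stop()
  }
}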
Spark Deployment Modes (Standalone, YARN)
1. Standalone mode. 1.1 Client mode flow: in client mode the Driver runs on the client; when a job is submitted from the Spark shell, the Driver runs on the Master. 1) The SparkContext connects to the Master and registers with it to request resources. 2) Based on that request the Master checks the Workers' heartbeats, finds Workers with free resources, and launches Executors on them. 3) The Workers that launched Executors register back with the SparkContext. 4) The SparkContext hands the application's tasks to the Executors. (Original post, 2021-01-07)
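As a minimal sketch of step 1 (assuming a standalone Master at spark://linux01:7077, the host name used elsewhere in these posts), creating the SparkContext is what registers the application with the Master and requests executor resources; whether the Driver runs on the client or on a Worker is normally chosen with spark-submit's --deploy-mode flag.

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneClientDemo {
  def main(args: Array[String]): Unit = {
    // connecting to the standalone Master; building the SparkContext performs
    // the registration and resource request described in steps 1-2
    val conf = new SparkConf()
      .setAppName("StandaloneClientDemo")
      .setMaster("spark://linux01:7077")      // standalone Master URL (assumed)
      .set("spark.executor.memory", "1g")     // resources the Master tries to satisfy
    val sc = new SparkContext(conf)

    // once Executors have registered back (step 3), tasks are sent to them (step 4)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}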
Spark ShuffleManager
A short history of Spark shuffle: Spark 0.8 and earlier used hash-based shuffle; Spark 0.8.1 added the file-consolidation mechanism to hash-based shuffle; Spark 0.9 introduced ExternalAppendOnlyMap; Spark 1.1 introduced sort-based shuffle, though hash-based shuffle remained the default; Spark 1.2 made sort-based shuffle the default; Spark 1.4 introduced Tu… (Original post, 2021-01-07)
The Three Types of Spark ShuffleWriter
Choosing a ShuffleWriter: 1) if there is no map-side aggregation and the number of partitions is at most spark.shuffle.sort.bypassMergeThreshold (default 200), BypassMergeSortShuffleWriter is used; 2) if there is no map-side aggregation, the number of partitions is below 16,777,216 and the Serializer supports relocation, UnsafeShuffleWriter is used; 3) if map-side aggregation or sorting is required, SortShuffleWriter is used. Implementation details of the different shuffle writers: 1) Bypas… (Original post, 2021-01-07)
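The three rules read like a short decision chain; the sketch below paraphrases them in plain Scala (it mirrors the conditions listed above, it is not the actual SortShuffleManager source).

object ShuffleWriterChoice {
  sealed trait Writer
  case object BypassMergeSortShuffleWriter extends Writer
  case object UnsafeShuffleWriter extends Writer
  case object SortShuffleWriter extends Writer

  def chooseWriter(mapSideCombine: Boolean,
                   numPartitions: Int,
                   serializerSupportsRelocation: Boolean,
                   bypassMergeThreshold: Int = 200): Writer =
    if (!mapSideCombine && numPartitions <= bypassMergeThreshold)
      BypassMergeSortShuffleWriter     // no map-side aggregation, few partitions
    else if (!mapSideCombine && serializerSupportsRelocation && numPartitions < 16777216)
      UnsafeShuffleWriter              // serialized ("tungsten") sort path
    else
      SortShuffleWriter                // general case: aggregation and/or sorting

  def main(args: Array[String]): Unit =
    println(chooseWriter(mapSideCombine = false, numPartitions = 100, serializerSupportsRelocation = true))
}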
Spark-SQL: Aggregating User Network Traffic with SQL and the DSL (Case Study)
Requirement analysis: aggregate each user's network traffic; if the gap between two sessions is less than 10 minutes, they can be rolled up into one record. Sample data (uid,start_time,end_time,flow):
1,2020-02-18 14:20:30,2020-02-18 14:46:30,20
1,2020-02-18 14:47:20,2020-02-18 15:20:30,30
1,2020-02-18 15:37:23,2020-02-18 16:05:26,40
1,2020-02-18 16:06:27,2020-02-18 17:20:49,50
…
(Original post, 2021-01-06)
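One way to express the 10-minute rollup with the DataFrame DSL is a lag window plus a running session id. The sketch below assumes the sample rows sit in data/flow.csv with the header shown (the path is a placeholder).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object FlowRollupDSL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlowRollupDSL").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = spark.read.option("header", "true").csv("data/flow.csv")  // uid,start_time,end_time,flow

    val w = Window.partitionBy("uid").orderBy("start_time")
    val rolled = df
      // seconds between this row's start and the previous row's end
      .withColumn("gap",
        unix_timestamp($"start_time") - unix_timestamp(lag($"end_time", 1).over(w)))
      // open a new session when there is no previous row or the gap is 10 minutes or more
      .withColumn("newSession", when($"gap".isNull || $"gap" >= 600, 1).otherwise(0))
      .withColumn("sessionId", sum($"newSession").over(w))
      .groupBy("uid", "sessionId")
      .agg(min("start_time").as("start_time"),
           max("end_time").as("end_time"),
           sum("flow").as("flow"))

    rolled.show(false)
    spark.stop()
  }
}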
Spark-SQL: Computing Each Shop's Monthly Sales and the Cumulative Total up to the Current Month with SQL and the DSL
Compute each shop's monthly sales and the cumulative sales up to the current month. Data (sid,dt,money):
shop1,2019-01-18,500
shop1,2019-02-10,500
shop1,2019-02-10,200
shop1,2019-02-11,600
shop1,2019-02-12,400
shop1,2019-02-13,200
shop1,2019-02-15,100
shop1,2019-03-05,180
shop1,2019-04-05,280
shop1,2019-04-06,220
shop2,…
(Original post, 2021-01-05)
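A sketch of the SQL variant, assuming the file keeps the sid,dt,money header line and is loaded into a temporary view shop_sales (the CSV path is a placeholder): aggregate to month level first, then a window SUM gives the running total up to the current month.

import org.apache.spark.sql.SparkSession

object ShopSalesSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ShopSalesSQL").master("local[*]").getOrCreate()

    spark.read.option("header", "true").csv("data/shop.csv")   // columns: sid, dt, money
      .createOrReplaceTempView("shop_sales")

    // monthly sales per shop, plus the cumulative total up to the current month
    spark.sql(
      """
        |SELECT sid, month, month_sales,
        |       SUM(month_sales) OVER (PARTITION BY sid ORDER BY month) AS total_to_month
        |FROM (
        |  SELECT sid, substr(dt, 1, 7) AS month, SUM(CAST(money AS DOUBLE)) AS month_sales
        |  FROM shop_sales
        |  GROUP BY sid, substr(dt, 1, 7)
        |) t
      """.stripMargin).show(false)

    spark.stop()
  }
}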
Spark-SQL: Consecutive User Logins with SQL and the DSL (Case Study)
guid01,2018-02-28
guid01,2018-02-28
guid01,2018-03-01
guid01,2018-03-02
guid01,2018-03-05
guid01,2018-03-05
guid01,2018-03-04
guid01,2018-03-06
guid01,2018-03-07
guid02,2018-03-01
guid02,2018-03-02
guid02,2018-03-03
guid02,2018-03-06
guid02,2018-03-02
gu…
(Original post, 2021-01-05)
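A common SQL formulation of "consecutive days" subtracts a row number from the login date: the difference stays constant within one unbroken run. The sketch below assumes the rows sit in data/login.csv (a placeholder path, no header) and keeps runs of at least three days.

import org.apache.spark.sql.SparkSession

object ContinuousLoginSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ContinuousLoginSQL").master("local[*]").getOrCreate()

    spark.read.csv("data/login.csv").toDF("guid", "dt")
      .distinct()                                  // duplicate logins on the same day count once
      .createOrReplaceTempView("v_login")

    // dt minus its row number is constant inside one consecutive run of days
    spark.sql(
      """
        |SELECT guid, MIN(dt) AS start_date, MAX(dt) AS end_date, COUNT(*) AS days
        |FROM (
        |  SELECT guid, dt,
        |         date_sub(dt, ROW_NUMBER() OVER (PARTITION BY guid ORDER BY dt)) AS grp
        |  FROM v_login
        |) t
        |GROUP BY guid, grp
        |HAVING COUNT(*) >= 3
      """.stripMargin).show(false)

    spark.stop()
  }
}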
Spark-SQL: Reading and Writing ORC Files
Reading an ORC file:
import org.apache.spark.sql.{DataFrame, SparkSession}
// create a DataFrame from an ORC file
object CreateDataFrameFromOrc {
  def main(args: Array[String]): Unit = {
    // create a SparkSession (a wrapper around and an enhancement of SparkContext)
    val spark: SparkSession = SparkSession.builder()…
(Original post, 2021-01-04)
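For context, a minimal ORC read-and-write round trip might look like the following (input and output paths are placeholders, not from the post).

import org.apache.spark.sql.{DataFrame, SparkSession}

object OrcReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("OrcReadWriteDemo").master("local[*]").getOrCreate()

    // ORC is self-describing, so the schema comes straight from the file footer
    val df: DataFrame = spark.read.orc("data/users.orc")
    df.printSchema()

    // write the DataFrame back out as ORC
    df.write.mode("overwrite").orc("out/users_orc")

    spark.stop()
  }
}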
Spark-SQL: Reading and Writing Parquet Files
Reading a Parquet file:
import org.apache.spark.sql.{DataFrame, SparkSession}
object CreateDataFrameFromParquet {
  def main(args: Array[String]): Unit = {
    // create a SparkSession (a wrapper around and an enhancement of SparkContext)
    val spark: SparkSession = SparkSession.builder()…
(Original post, 2021-01-04)
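Likewise for Parquet, a minimal round trip could look like this (paths and the compression choice are illustrative only).

import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("ParquetReadWriteDemo").master("local[*]").getOrCreate()

    // the schema is read from the Parquet footer; no header or inferSchema options needed
    val df: DataFrame = spark.read.parquet("data/users.parquet")

    // write back, choosing the compression codec explicitly
    df.write.mode("overwrite")
      .option("compression", "snappy")
      .parquet("out/users_parquet")

    spark.stop()
  }
}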
Spark-SQL: Reading and Writing via JDBC
Reading data over JDBC:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
object CreateDataFrameFromJDBC {
  def main(args: Array[String]): Unit = {
    // create a SparkSession
    val spark = SparkSession.builder()
      .appName(this.…
(Original post, 2021-01-04)
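A fuller sketch of reading and writing over JDBC might look like the following; the MySQL URL, table names and credentials are placeholders.

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JdbcReadWriteDemo").master("local[*]").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "root")            // placeholder credentials
    props.setProperty("password", "123456")
    props.setProperty("driver", "com.mysql.jdbc.Driver")

    // read a whole table into a DataFrame
    val df: DataFrame = spark.read.jdbc(
      "jdbc:mysql://linux01:3306/test?characterEncoding=utf8", "tb_user", props)

    // write the result to another table
    df.write.mode("append").jdbc(
      "jdbc:mysql://linux01:3306/test?characterEncoding=utf8", "tb_user_copy", props)

    spark.stop()
  }
}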
Spark-SQL: Reading and Writing CSV Files
name,age,fv_value
libai,18,9999.99
xuance,30,99.99
diaochan,28,99.99
libai,18,9999.99
xuance,30,99.99
diaochan,28,99.99
Reading the CSV file:
import org.apache.spark.sql.{DataFrame, SparkSession}
object CreateDataFrameFromCsv {
  def main(args: Array[Strin…
(Original post, 2021-01-04)
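A complete version of the CSV example could look roughly like this (input and output paths are placeholders).

import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvReadWriteDemo").master("local[*]").getOrCreate()

    // the sample file has a header line; let Spark infer the column types
    val df: DataFrame = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/user.csv")

    df.printSchema()

    // emit a header so the output is readable on its own
    df.write.mode("overwrite").option("header", "true").csv("out/user_csv")
    spark.stop()
  }
}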
Spark-SQL: Reading and Writing JSON Files
{"name": "libai", "age": 30, "fv": 99.99}
{"name": "xiaoqiao", "age": 28, "fv": 9.99}
{"name": "yasuo", "age": 18, "fv": 80.99, "gender": "male"}
{"name": "banzang", "age": 18, "fv": 9999.99}
{"name": "saisi", "fv": 9999.98, "gender": "female"}
{"name": …
(Original post, 2021-01-04)
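A minimal sketch for the JSON case (the input path is a placeholder); note that fields missing from a record, such as gender above, simply come back as null.

import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonReadWriteDemo").master("local[*]").getOrCreate()

    // one JSON object per line; the schema is merged across records
    val df: DataFrame = spark.read.json("data/user.json")
    df.printSchema()
    df.show(false)

    // write the DataFrame back out as line-delimited JSON
    df.write.mode("overwrite").json("out/user_json")
    spark.stop()
  }
}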
Spark-sql: Creating a DataFrame from a case class, a Regular Class, and a StructType
1. Creating a DataFrame with a case class. Sample data:
laozhao,18,9999.99
laoduan,30,99.99
xuance,28,99.99
yeqing,25,99.0
dezhi,24,99.9
libai,88,50.0
banzang,29,50.6
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame,…
(Original post, 2021-01-04)
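A sketch of the case-class route, assuming the sample lines above sit in data/user.txt (a placeholder path): parse each line into the case class, then convert the RDD with toDF.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// the case class must be defined outside the method that uses it
case class User(name: String, age: Int, fv: Double)

object CreateDataFrameFromCaseClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateDataFrameFromCaseClass").master("local[*]").getOrCreate()
    import spark.implicits._

    // parse each "name,age,fv" line into a User, then convert the RDD to a DataFrame
    val lines: RDD[String] = spark.sparkContext.textFile("data/user.txt")
    val users: RDD[User] = lines.map { line =>
      val Array(name, age, fv) = line.split(",")
      User(name, age.toInt, fv.toDouble)
    }
    val df: DataFrame = users.toDF()
    df.show()
    spark.stop()
  }
}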
Spark On Yarn
1. Configure Hadoop. ① Set HADOOP_CONF_DIR in /etc/profile so that Spark can find core-site.xml, hdfs-site.xml and yarn-site.xml (i.e. so Spark knows where the NameNode and ResourceManager are); otherwise the following error is thrown: Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YA… (Original post, 2021-01-04)
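A minimal sketch of a YARN session started from code (the HADOOP_CONF_DIR path in the comment is an assumption; in practice the master is usually passed with spark-submit --master yarn rather than hardcoded).

import org.apache.spark.sql.SparkSession

object SparkOnYarnDemo {
  def main(args: Array[String]): Unit = {
    // requires HADOOP_CONF_DIR to be exported first, e.g. in /etc/profile:
    //   export HADOOP_CONF_DIR=/opt/hadoop-3.2.1/etc/hadoop   (path assumed)
    // so Spark can load core-site.xml, hdfs-site.xml and yarn-site.xml
    val spark = SparkSession.builder()
      .appName("SparkOnYarnDemo")
      .master("yarn")
      .getOrCreate()

    spark.range(0, 100).selectExpr("sum(id)").show()
    spark.stop()
  }
}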
Spark-RDD: Merging Users' Continuous Online Sessions (Case Study)
Sample data (uid,start_time,end_time,flow):
1,2020-02-18 14:20:30,2020-02-18 14:46:30,20
1,2020-02-18 14:47:20,2020-02-18 15:20:30,30
1,2020-02-18 15:37:23,2020-02-18 16:05:26,40
1,2020-02-18 16:06:27,2020-02-18 17:20:49,50
1,2020-02-18 17:21:50,2020-02-18 18:03:27,60
2,2020-02-18 14:18:24,2020-02-…
(Original post, 2021-01-04)
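An RDD-only sketch of the same 10-minute merge, assuming the rows sit in data/flow.txt (a placeholder path): group by uid, sort each user's records by start time, then fold neighbouring records together whenever the gap is under ten minutes.

import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}

object FlowRollupRDD {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FlowRollupRDD").setMaster("local[*]"))

    val merged = sc.textFile("data/flow.txt")                  // lines: uid,start,end,flow
      .map(_.split(","))
      .map(a => (a(0), (a(1), a(2), a(3).toLong)))
      .groupByKey()
      .flatMapValues { it =>
        val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
        it.toList.sortBy(_._1)
          .foldLeft(List.empty[(String, String, Long)]) {
            // gap under 10 minutes: extend the open session and add the flow
            case ((s, e, f) :: rest, (start, end, flow))
              if sdf.parse(start).getTime - sdf.parse(e).getTime < 10 * 60 * 1000 =>
              (s, end, f + flow) :: rest
            // otherwise open a new session
            case (acc, (start, end, flow)) =>
              (start, end, flow) :: acc
          }
          .reverse
      }

    merged.collect().foreach(println)
    sc.stop()
  }
}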
Spark-RDD: Cumulative Shop Sales (Case Study)
shop1,2019-01-18,500
shop1,2019-02-10,500
shop1,2019-02-10,200
shop1,2019-02-11,600
shop1,2019-02-12,400
shop1,2019-02-13,200
shop1,2019-02-15,100
shop1,2019-03-05,180
shop1,2019-04-05,280
shop1,2019-04-06,220
shop2,2019-02-10,100
shop2,2019-02-11,100
sho…
(Original post, 2021-01-04)
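An RDD sketch of the running total, assuming the rows sit in data/shop.txt (a placeholder path): reduce to monthly sales per shop, then sort each shop's months and scan to accumulate.

import org.apache.spark.{SparkConf, SparkContext}

object ShopSalesRDD {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShopSalesRDD").setMaster("local[*]"))

    val result = sc.textFile("data/shop.txt")                    // lines: sid,dt,money
      .map(_.split(","))
      .map(a => ((a(0), a(1).substring(0, 7)), a(2).toDouble))   // ((sid, month), money)
      .reduceByKey(_ + _)                                        // monthly sales per shop
      .map { case ((sid, month), monthSales) => (sid, (month, monthSales)) }
      .groupByKey()
      .flatMapValues { it =>
        // sort the months, then accumulate the running total
        var total = 0.0
        it.toList.sortBy(_._1).map { case (month, sales) =>
          total += sales
          (month, sales, total)
        }
      }

    result.collect().foreach(println)
    sc.stop()
  }
}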
Spark-RDD: Consecutive Login Days (Case Study)
guid01,2018-02-28
guid01,2018-02-28
guid01,2018-03-01
guid01,2018-03-02
guid01,2018-03-05
guid01,2018-03-05
guid01,2018-03-04
guid01,2018-03-06
guid01,2018-03-07
guid02,2018-03-01
guid02,2018-03-02
guid02,2018-03-03
guid02,2018-03-06
guid02,2018-03-02
gu…
(Original post, 2021-01-04)
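An RDD version of the consecutive-login trick, assuming the rows sit in data/login.txt (a placeholder path): subtracting each date's rank from the date itself yields a key that is constant within one unbroken run.

import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.{SparkConf, SparkContext}

object ContinuousLoginRDD {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ContinuousLoginRDD").setMaster("local[*]"))

    val result = sc.textFile("data/login.txt")        // lines: guid,date
      .map(_.split(","))
      .map(a => (a(0), a(1)))
      .distinct()                                     // same-day duplicates count once
      .groupByKey()
      .flatMapValues { it =>
        val sdf = new SimpleDateFormat("yyyy-MM-dd")
        val cal = Calendar.getInstance()
        // date minus its rank is constant inside one consecutive run
        it.toList.sorted.zipWithIndex.map { case (dt, idx) =>
          cal.setTime(sdf.parse(dt))
          cal.add(Calendar.DATE, -idx)
          (sdf.format(cal.getTime), dt)
        }
      }
      .map { case (guid, (grp, dt)) => ((guid, grp), dt) }
      .groupByKey()
      .filter(_._2.size >= 3)                         // keep runs of at least three days
      .map { case ((guid, _), dts) => (guid, dts.min, dts.max, dts.size) }

    result.collect().foreach(println)
    sc.stop()
  }
}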
Spark: Writing the WordCount Example in Scala and Java and Testing It Locally
1. Using spark-shell. 1) Start spark-shell:
[root@linux01 spark-3.0.1-bin-hadoop3.2]# ./bin/spark-shell --master spark://linux01:7077
2) Run the statement in the shell:
scala> sc.textFile("hdfs://linux01:8020/data").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._2…
(Original post, 2020-12-21)
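For the "test locally" part, a standalone Scala version of the same pipeline might look like this (the input path is a placeholder; local[*] lets it run in an IDE without a cluster).

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountLocal").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("data/words.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)    // most frequent words first
      .collect()
      .foreach(println)

    sc.stop()
  }
}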