spark
adream307
[spark]Spark UDT with Codegen UDF
This post defines a custom data type Point, implements an Add operation for it, and has that Add operation generated through codegen. build.sbt name := "PointUdt" version := "0.1" scalaVersion := "2.12.11" libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0-pr… (original post, 2020-05-06)
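For orientation, a minimal sketch of the shape such a UDT takes, assuming a two-field Point; names are illustrative, not the post's exact code. The UDT API is private[spark] in Spark 3.x, so the file has to sit under an org.apache.spark.sql.* package, the same placement trick the post's other examples use:

```scala
// Illustrative sketch only; must compile inside org.apache.spark.sql.*
// because UserDefinedType is private[spark] in Spark 3.x.
package org.apache.spark.sql.pointudt

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[PointUDT])
case class Point(x: Double, y: Double)

class PointUDT extends UserDefinedType[Point] {
  // Store a Point internally as a two-element double array.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: Point): ArrayData =
    new GenericArrayData(Array(p.x, p.y))

  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
```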
[spark]Rewriting a Plan with injectOptimizerRule
Register a custom UDF: spark.udf.register("inc", (x: Long) => x + 1) Test statement: val df = spark.sql("select sum(inc(vals)) from data") df.explain(true) df.show() The LogicalPlan this statement produces: == Optimized Logical Plan =… (original post, 2020-01-22)
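For reference, a hedged sketch of the injection mechanism itself; the rule body is a placeholder, and a real rule would pattern-match and rewrite plan nodes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: returns the plan unchanged.
object MyRewriteRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object InjectDemo {
  def main(args: Array[String]): Unit = {
    // Extensions must be registered before the session is first created.
    val spark = SparkSession.builder()
      .master("local")
      .appName("injectOptimizerRule demo")
      .withExtensions(_.injectOptimizerRule(_ => MyRewriteRule))
      .getOrCreate()
    spark.udf.register("inc", (x: Long) => x + 1)
  }
}
```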
[spark]RewriteDistinctAggregates
If an Aggregate contains both distinct and non-distinct aggregate functions, the optimizer can rewrite it into two Aggregates, neither of which contains a distinct. Suppose the schema is create table animal(gkey varchar(128), cat varchar(128), dog varchar(128… (original post, 2020-01-21)
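A query shape that exercises this rewrite, sketched against the post's animal table (columns beyond the visible prefix are assumed); in spark-shell, explain(true) shows the Expand-based double aggregation:

```scala
// Distinct aggregates over different columns plus a plain count in one
// Aggregate; RewriteDistinctAggregates expands this into two aggregations.
val df = spark.sql(
  """select gkey, count(distinct cat), count(distinct dog), count(*)
     from animal group by gkey""")
df.explain(true)
```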
[spark]Rewrite SparkSQL Plan
OptPlanTest.scala import org.apache.spark.sql.SparkSession import org.apache.log4j.Logger import org.apache.log4j.Level package org.apache.spark.sql.optplan { import org.apache.spark.rdd.RDD imp… (original post, 2020-01-20)
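In outline, such a manual rewrite looks roughly like this; a sketch only, with an illustrative rule that just strips Filter nodes. Dataset.ofRows is private[sql], which is exactly what the org.apache.spark.sql.optplan package placement makes reachable:

```scala
package org.apache.spark.sql.optplan

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

object PlanRewriter {
  // Example rewrite: remove every Filter node from the logical plan.
  def dropFilters(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case Filter(_, child) => child
  }

  // Dataset.ofRows is private[sql]; reachable thanks to the package above.
  def rewrite(spark: SparkSession, df: DataFrame): DataFrame =
    Dataset.ofRows(spark, dropFilters(df.queryExecution.analyzed))
}
```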
[spark]Custom functions without UDF
Modeled on Spark's built-in functions, this post implements a custom function that is not a UDF. MyAdd.scala import org.apache.spark.sql.SparkSession import org.apache.log4j.Logger import org.apache.log4j.Level package org.apache.spark.sql.myfunctions { import org.ap… (original post, 2020-01-10)
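The approach, sketched under assumptions (Spark 3.0-era Expression API; the post's real MyAdd may differ): a Catalyst expression with a typed eval and generated code, exposed through a Column wrapper rather than a UDF:

```scala
package org.apache.spark.sql.myfunctions

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, LongType}

// A native expression in the style of Spark's built-ins: interpreted eval
// plus codegen, no UDF wrapper. Illustrative, not the post's exact code.
case class MyAdd(left: Expression, right: Expression) extends BinaryExpression {
  override def dataType: DataType = LongType

  override def nullSafeEval(l: Any, r: Any): Any =
    l.asInstanceOf[Long] + r.asInstanceOf[Long]

  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, (a, b) => s"$a + $b")
}

object MyFunctions {
  def my_add(a: Column, b: Column): Column = new Column(MyAdd(a.expr, b.expr))
}
```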
[spark]Running cluster mode on a single machine
There is one server, configured as follows. cpu: lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Thread(s) per c… (original post, 2020-01-09)
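One way to use such a box is to start a standalone master and worker locally and point a session at them; everything below (URL, ports, resource carve-up) is illustrative, not from the post:

```scala
// Assumes a standalone master and worker already running on this machine:
//   sbin/start-master.sh
//   sbin/start-slave.sh spark://localhost:7077   (start-worker.sh on newer Spark)
// The numbers below just carve the 48 cores into 4-core executors.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://localhost:7077")
  .appName("single-machine cluster")
  .config("spark.executor.cores", "4")
  .config("spark.cores.max", "48")
  .getOrCreate()
```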
[Spark]Filtering a csv file by calling filter on RDD[InternalRow]
import org.apache.spark.sql.SparkSession object SqlExample { def main(args: Array[String]): Unit = { val spark = SparkSession .builder() .master("local") .appName("Spark sql … (original post, 2019-12-30)
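The core move, sketched; csv path, schema, and predicate are assumptions:

```scala
// Drop from the DataFrame API to RDD[InternalRow] via queryExecution.toRdd
// and filter the internal rows directly.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local").appName("internal row filter").getOrCreate()

val df = spark.read
  .schema("id INT, name STRING")        // assumed csv layout
  .csv("/path/to/data.csv")             // assumed path

val internal = df.queryExecution.toRdd  // RDD[InternalRow]
val kept = internal.filter(row => row.getInt(0) > 10)
println(kept.count())
```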
[Scala]Overriding a method on a specific object instance
Scala lets you override a particular method for a single object at the point of instantiation; the override affects only that instance and leaves all other instances untouched. Test code: object OverrideTest { class A { def print(): Unit = { println("in A.print") } } def main(args: Array[String]): Unit = { v… (original post, 2019-12-28)
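Since main is cut off above, here is a runnable reconstruction of the idea using an anonymous subclass:

```scala
object OverrideTest {
  class A {
    def print(): Unit = println("in A.print")
  }

  def main(args: Array[String]): Unit = {
    // Override print for this one instance via an anonymous subclass.
    val a = new A {
      override def print(): Unit = println("in overridden print")
    }
    val b = new A
    a.print() // in overridden print
    b.print() // in A.print -- other instances are unaffected
  }
}
```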
[Spark]Implementing SparkSQL's Filter operation by calling the RDD directly
Filtering the data with SQL: import org.apache.spark.sql.SparkSession object SqlExample { def main(args: Array[String]): Unit = { val spark = SparkSession .builder() .appName("Spark sql whole stage … (original post, 2019-12-27)
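The contrast the post draws, in sketch form; the data temp view and column name are assumed, with spark as in spark-shell:

```scala
// Same predicate two ways: through SQL, and by calling filter on the
// DataFrame's underlying RDD[Row] directly.
val viaSql = spark.sql("select * from data where vals > 10")
val viaRdd = spark.table("data").rdd
  .filter(row => row.getAs[Long]("vals") > 10)
assert(viaSql.count() == viaRdd.count())
```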
[Spark]Customizing an RDD's compute function
MyRDDTest.scala package org.apache.spark.myrdd { import org.apache.spark.{Partition, SparkContext, TaskContext} import scala.reflect.ClassTag import org.apache.spark.rdd._ private[myrdd] cl… (original post, 2019-12-27)
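The skeleton such an RDD follows, as an illustrative sketch (the post's private[myrdd] class is truncated above); the org.apache.spark.* placement matches the post and keeps protected[spark] members like firstParent reachable:

```scala
package org.apache.spark.myrdd

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// An RDD whose compute function maps over its parent's iterator; the *2 body
// stands in for whatever computation the post plugs in.
private[myrdd] class MyMapRDD(prev: RDD[Int]) extends RDD[Int](prev) {
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    firstParent[Int].iterator(split, context).map(_ * 2)

  override protected def getPartitions: Array[Partition] =
    firstParent[Int].partitions
}
```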
[spark]Merging RDDs
Merge two Spark RDDs into one: scala> val rdd1 = sc.parallelize(1 to 10) rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24 scala> rdd1.collect res0: Arra… (original post, 2019-12-26)
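The merge itself is RDD.union (also spelled ++); a quick sketch for spark-shell:

```scala
// union concatenates two RDDs without deduplication; partitions are combined.
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(11 to 20)
val merged = rdd1.union(rdd2)            // equivalently: rdd1 ++ rdd2
println(merged.collect().mkString(","))  // elements 1 through 20
```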
[Spark]Writing a custom RDD
Scala source //MyRDDTest.scala package org.apache.spark.myrdd { import org.apache.spark.{Partition, SparkContext, TaskContext} import scala.reflect.ClassTag import org.apache.spark.rdd._ private… (original post, 2019-12-27)
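For the from-scratch case with no parent RDD, the minimum surface is roughly the following; all names are illustrative:

```scala
package org.apache.spark.myrdd

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

private[myrdd] class MyPartition(override val index: Int) extends Partition

// An RDD built from scratch: it declares its own partitions and fabricates
// its data in compute, with no parent dependency.
private[myrdd] class MyRDD(sc: SparkContext, numParts: Int)
    extends RDD[Int](sc, Nil) {

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.single(split.index)

  override protected def getPartitions: Array[Partition] =
    (0 until numParts).map(i => new MyPartition(i): Partition).toArray
}
```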