A Simple SparkSQL Test

A quick local-mode test on OS X with IntelliJ IDEA 15.

Environment:
JDK (1.8.0_40 in this setup, per the run output below)
Scala 2.10.6 (note: a 2.11.x release will not work here; the prebuilt spark-assembly-1.5.0-hadoop2.6.0.jar is built against Scala 2.10 and is incompatible with Scala 2.11.x)
Project Libraries: import spark-assembly-1.5.0-hadoop2.6.0.jar
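
If you would rather have a build tool manage the dependency instead of importing the assembly jar by hand, a minimal build.sbt sketch would look like the following (standard Maven Central coordinates; the versions are assumptions chosen to match this setup):

// build.sbt -- a minimal sketch for this demo
scalaVersion := "2.10.6"

// spark-sql pulls in spark-core transitively, which is all this demo needs
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.0"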


Usage Example

Write a simple Scala program that loads user data from a text file and creates a DataFrame from the dataset, then runs DataFrame operations to execute a specific data-selection query.

The customer data file, customers.txt (saved in this run as /Users/urey/data/input2.txt, which is the path the code loads), contains:

Tom,12
Mike,13
Tony,34
Lili,8
David,21
Nike,18
Bush,29
Candy,42

The Scala code:

import org.apache.spark._

object Demo {

    // A case class representing one customer record
    case class Person(name: String, age: Int)

    def main(args: Array[String]) {

        val conf = new SparkConf().setAppName("SparkSQL Demo")
        val sc = new SparkContext(conf)

        // First, create a SQLContext from the existing SparkContext
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)

        // This import enables implicit conversion of RDDs to DataFrames (toDF below)
        import sqlContext.implicits._

        // Build a DataFrame of Person objects from the text file
        val people = sc.textFile("/Users/urey/data/input2.txt")
            .map(_.split(","))
            .map(p => Person(p(0), p(1).trim.toInt))
            .toDF()

        // Register the DataFrame as a temporary table
        people.registerTempTable("people")

        // Run a SQL query against the temp table
        val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

        // Print the results, accessing each result row's columns by position
        teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

        sc.stop()
    }
}
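
For comparison, the same teenager query can be written with the DataFrame API directly, with no temp table or SQL string. A minimal sketch using the standard Spark 1.5 DataFrame methods (teenagers2 is a hypothetical name for this variant):

// The same teenager query via the DataFrame API instead of SQL
val teenagers2 = people
    .filter(people("age") >= 13 && people("age") <= 19)
    .select("name", "age")
teenagers2.collect().foreach(row => println("Name: " + row(0)))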

Edit Configuration (add this as a VM option in the IDEA run configuration):

-Dspark.master=local
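
This property tells Spark to run the driver and executor inside the single local JVM. Alternatively, the master can be hard-coded when building the SparkConf, which is convenient for IDE debugging but should be removed from code you later submit to a cluster. A minimal sketch:

// Alternative: set the master in code instead of via -Dspark.master
val conf = new SparkConf()
    .setAppName("SparkSQL Demo")
    .setMaster("local[*]")  // "local[*]" uses all local cores; "local" uses one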

Run Result (the two "Name:" lines in the middle of the log are the query output: Mike, age 13, and Nike, age 18, are the only rows in the 13-19 range):

/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/java -Dspark.master=local -Didea.launcher.port=7532 "-Didea.launcher.bin.path=/Applications/IntelliJ IDEA 15.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath "/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/packager.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/lib/tools.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/deploy.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/javaws.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/jfxswt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/plugin.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Users/urey/data/SparkDemo/out/production/SparkDemo:/usr/local/share/scala-2.10.6/lib/scala-actors-migration.jar:/usr/local/share/scala-2.10.6/lib/scala-actors.jar:/usr/local/share/scala-2.10.6/lib/scala-library.jar:/usr/local/share/scala-2.10.6/lib/scala-reflect.jar:/usr/local/share/scala-2.10.6/lib/scala-swing.jar:/Users/urey/Downloads/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar:/Applications/IntelliJ IDEA 15.app/Contents/lib/idea_rt.jar" com.intellij.rt.execution.application.AppMain Demo /hadoopLearning/spark-1.5.0-bin-hadoop2.4/README.md
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/15 17:08:51 INFO SparkContext: Running Spark version 1.5.0
15/11/15 17:08:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/15 17:08:52 INFO SecurityManager: Changing view acls to: urey
15/11/15 17:08:52 INFO SecurityManager: Changing modify acls to: urey
15/11/15 17:08:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(urey); users with modify permissions: Set(urey)
15/11/15 17:08:53 INFO Slf4jLogger: Slf4jLogger started
15/11/15 17:08:53 INFO Remoting: Starting remoting
15/11/15 17:08:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.12.24.154:49984]
15/11/15 17:08:53 INFO Utils: Successfully started service 'sparkDriver' on port 49984.
15/11/15 17:08:53 INFO SparkEnv: Registering MapOutputTracker
15/11/15 17:08:53 INFO SparkEnv: Registering BlockManagerMaster
15/11/15 17:08:53 INFO DiskBlockManager: Created local directory at /private/var/folders/24/mfkwkygj31vbpnsfws1063f80000gn/T/blockmgr-bfe8e8b1-abc9-4827-959b-462d4b8211cf
15/11/15 17:08:53 INFO MemoryStore: MemoryStore started with capacity 1966.1 MB
15/11/15 17:08:53 INFO HttpFileServer: HTTP File server directory is /private/var/folders/24/mfkwkygj31vbpnsfws1063f80000gn/T/spark-efbbbd67-b6d4-4175-a0ca-1b744e31af2e/httpd-4c90da26-08ed-4ce6-a422-253690f96177
15/11/15 17:08:53 INFO HttpServer: Starting HTTP Server
15/11/15 17:08:53 INFO Utils: Successfully started service 'HTTP file server' on port 49985.
15/11/15 17:08:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/11/15 17:08:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/11/15 17:08:53 INFO SparkUI: Started SparkUI at http://10.12.24.154:4040
15/11/15 17:08:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/11/15 17:08:53 INFO Executor: Starting executor ID driver on host localhost
15/11/15 17:08:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49986.
15/11/15 17:08:54 INFO NettyBlockTransferService: Server created on 49986
15/11/15 17:08:54 INFO BlockManagerMaster: Trying to register BlockManager
15/11/15 17:08:54 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49986 with 1966.1 MB RAM, BlockManagerId(driver, localhost, 49986)
15/11/15 17:08:54 INFO BlockManagerMaster: Registered BlockManager
15/11/15 17:08:55 INFO MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=2061647216
15/11/15 17:08:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 1966.0 MB)
15/11/15 17:08:55 INFO MemoryStore: ensureFreeSpace(14276) called with curMem=130448, maxMem=2061647216
15/11/15 17:08:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 1966.0 MB)
15/11/15 17:08:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49986 (size: 13.9 KB, free: 1966.1 MB)
15/11/15 17:08:55 INFO SparkContext: Created broadcast 0 from textFile at Demo.scala:21
15/11/15 17:08:57 INFO FileInputFormat: Total input paths to process : 1
15/11/15 17:08:57 INFO SparkContext: Starting job: collect at Demo.scala:29
15/11/15 17:08:57 INFO DAGScheduler: Got job 0 (collect at Demo.scala:29) with 1 output partitions
15/11/15 17:08:57 INFO DAGScheduler: Final stage: ResultStage 0(collect at Demo.scala:29)
15/11/15 17:08:57 INFO DAGScheduler: Parents of final stage: List()
15/11/15 17:08:57 INFO DAGScheduler: Missing parents: List()
15/11/15 17:08:57 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at map at Demo.scala:29), which has no missing parents
15/11/15 17:08:57 INFO MemoryStore: ensureFreeSpace(8152) called with curMem=144724, maxMem=2061647216
15/11/15 17:08:57 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.0 KB, free 1966.0 MB)
15/11/15 17:08:57 INFO MemoryStore: ensureFreeSpace(4226) called with curMem=152876, maxMem=2061647216
15/11/15 17:08:57 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KB, free 1966.0 MB)
15/11/15 17:08:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49986 (size: 4.1 KB, free: 1966.1 MB)
15/11/15 17:08:57 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/11/15 17:08:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at map at Demo.scala:29)
15/11/15 17:08:57 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/11/15 17:08:57 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2141 bytes)
15/11/15 17:08:57 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/11/15 17:08:57 INFO HadoopRDD: Input split: file:/Users/urey/data/input2.txt:0+63
15/11/15 17:08:57 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/11/15 17:08:57 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/11/15 17:08:57 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/11/15 17:08:57 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/11/15 17:08:57 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/11/15 17:08:57 INFO GeneratePredicate: Code generated in 117.020103 ms
15/11/15 17:08:57 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2284 bytes result sent to driver
15/11/15 17:08:57 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 210 ms on localhost (1/1)
15/11/15 17:08:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/11/15 17:08:57 INFO DAGScheduler: ResultStage 0 (collect at Demo.scala:29) finished in 0.221 s
15/11/15 17:08:57 INFO DAGScheduler: Job 0 finished: collect at Demo.scala:29, took 0.276326 s
Name: Mike
Name: Nike
15/11/15 17:08:57 INFO SparkUI: Stopped Spark web UI at http://10.12.24.154:4040
15/11/15 17:08:57 INFO DAGScheduler: Stopping DAGScheduler
15/11/15 17:08:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/11/15 17:08:57 INFO MemoryStore: MemoryStore cleared
15/11/15 17:08:57 INFO BlockManager: BlockManager stopped
15/11/15 17:08:58 INFO BlockManagerMaster: BlockManagerMaster stopped
15/11/15 17:08:58 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/11/15 17:08:58 INFO SparkContext: Successfully stopped SparkContext
15/11/15 17:08:58 INFO ShutdownHookManager: Shutdown hook called
15/11/15 17:08:58 INFO ShutdownHookManager: Deleting directory /private/var/folders/24/mfkwkygj31vbpnsfws1063f80000gn/T/spark-efbbbd67-b6d4-4175-a0ca-1b744e31af2e

Process finished with exit code 0

SparkSQL vs. Hive on Spark

Differences:

  1. Spark SQL is the official Spark project, i.e. a Databricks project; Databricks originally promoted Shark and later replaced it with Spark SQL.

  2. Hive on Spark belongs to the Apache Hive product line. It evolved from Hive on MapReduce: it generates Spark jobs instead of MR jobs, using Spark's fast execution to shorten HiveQL response times. Hive on Spark has shipped as part of Hive since the Hive 1.1 release. Cloudera initiated the Hive on Spark effort, and the project is backed by IBM, Intel, and MapR (but not Databricks).

Similarities:

  1. Both products act as the same kind of "translation layer": each translates a SQL statement into distributed, executable Spark jobs, as the Hive-side sketch below illustrates.
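
On the Hive side, switching between the MapReduce and Spark back ends is a single session setting. A minimal sketch, assuming a Hive 1.1+ deployment that has been configured with Spark support (the query reuses this article's people table purely as an illustration):

-- In the Hive CLI or Beeline: route HiveQL to Spark instead of MapReduce
set hive.execution.engine=spark;
SELECT name, age FROM people WHERE age BETWEEN 13 AND 19;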