Spark SQL and DataFrame Guide (Excerpts)

There is a reference link:

sql-programming-guide


A few key points, quoted from the guide:

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
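
For reference, the entry point into this functionality in Spark 1.x is SQLContext, built from an existing SparkContext (a minimal sketch; `sc` is assumed to already exist):

// sc: an existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)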


A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

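For the "structured data files" case, a minimal sketch using the Spark 1.3-era jsonFile API (the path and file contents are hypothetical):

// people.json holds one JSON object per line, e.g. {"name":"Alice","age":30}
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.printSchema()  // the schema (name, age, ...) is inferred from the data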


With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

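A minimal sketch of the "existing RDD" case, using the reflection-based approach from the guide (the Person case class and its sample rows are made up for illustration):

// A case class defines the schema: field names become column names
case class Person(name: String, age: Int)

import sqlContext.implicits._  // brings in the rdd.toDF() conversion
val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25))).toDF()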


DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, and Python.

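A short sketch of that DSL in Scala, continuing with the hypothetical people DataFrame from above:

people.select("name").show()              // project a single column
people.filter(people("age") > 21).show()  // filter rows with a column expression
people.groupBy("age").count().show()      // group and aggregate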


The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

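A sketch of that pattern, again with the hypothetical people DataFrame (it must first be registered as a temporary table):

people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()  // adults is itself a DataFrame, so further operations can be chained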


Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
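
A round-trip sketch with the Spark 1.3-era Parquet API, which is also what the test code below uses (the output path is hypothetical):

people.saveAsParquetFile("people.parquet")                // the schema is written into the file
val parquetDF = sqlContext.parquetFile("people.parquet")  // the schema is restored on read
parquetDF.printSchema()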


Next, a test program using the AMPCamp sample data:

import org.apache.spark.{SparkConf, SparkContext}

object TestDataFrameAndSql {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TestDataFrameAndSql").setMaster("local")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
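    // Parquet files written by older tools (like this AMPCamp dataset) store
    // strings as plain binary; this flag tells Spark SQL to read them as strings.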
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    val wikiData = sqlContext.parquetFile("/home/hadoop/AMPCamp/ampcamp/data/wiki_parquet")

    val count = wikiData.count()
    println("count is " + count)  // count is 39365

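    // Register the DataFrame as a temporary table so it can be queried with SQL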
    wikiData.registerTempTable("wikiData")
    val countResult = sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect()
    println("countResult is " + countResult)  // countResult is [Lorg.apache.spark.sql.Row;@7fdacda0

    val sqlCount = countResult.head.getLong(0)
    println("sqlCount is " + sqlCount)  // sqlCount is 39365

    sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
//    [Waacstats,2003]
//    [Cydebot,949]
//    [BattyBot,939]
//    [Yobot,890]
//    [Addbot,853]
//    [Monkbot,668]
//    [ChrisGualtieri,438]
//    [RjwilmsiBot,387]
//    [OccultZone,377]
//    [ClueBot NG,353]

    sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE text LIKE '%california%' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
//    [,179]
//    [BattyBot,42]
//    [Waacstats,37]
//    [RjwilmsiBot,30]
//    [Monkbot,26]
//    [Yobot,22]
//    [Bender235,20]
//    [Cydebot,19]
//    [ClueBot NG,16]
//    [Bgwhite,11]
  }
}


Here wiki_parquet is a directory with the following layout:

[hadoop@localhost wiki_parquet]$ ls -rlth
total 239M
-rwxrwxrwx. 1 hadoop hadoop  24M Jun 19  2014 part-r-4.parquet
-rwxrwxrwx. 1 hadoop hadoop  25M Jun 19  2014 part-r-2.parquet
-rwxrwxrwx. 1 hadoop hadoop  24M Jun 19  2014 part-r-1.parquet
-rwxrwxrwx. 1 hadoop hadoop  24M Jun 19  2014 part-r-3.parquet
-rwxrwxrwx. 1 hadoop hadoop  24M Jun 19  2014 part-r-5.parquet
-rwxrwxrwx. 1 hadoop hadoop  24M Jun 19  2014 part-r-6.parquet
-rwxrwxrwx. 1 hadoop hadoop  25M Jun 19  2014 part-r-7.parquet
-rwxrwxrwx. 1 hadoop hadoop  24M Jun 19  2014 part-r-8.parquet
-rwxrwxrwx. 1 hadoop hadoop  25M Jun 19  2014 part-r-9.parquet
-rwxrwxrwx. 1 hadoop hadoop    0 Jun 19  2014 _SUCCESS
-rwxrwxrwx. 1 hadoop hadoop  25M Jun 19  2014 part-r-10.parquet
-rwxrwxrwx. 1 hadoop hadoop 3.1K Jun 19  2014 _metadata

The execution results are given in the code comments above.
