Spark SQL, DataFrames and Datasets Guide

https://spark.apache.org/docs/1.6.3/sql-programming-guide.html#sql

The entry point into all functionality in Spark SQL is the SQLContext class, or one of its subclasses. A SQLContext is created as follows:

JavaSparkContext sc = ...; // An existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

Creating a DataFrame with a SQLContext:

JavaSparkContext sc = ...; // An existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");

// Displays the content of the DataFrame to stdout
df.show();

For the full set of DataFrame operations, refer to the API Documentation. In addition to simple column references and expressions, DataFrames also provide a rich library of functions, including string manipulation, date arithmetic, common math operations, and more; see the DataFrame Function Reference.
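As a brief sketch of that DSL (reusing the df loaded from people.json above, and assuming its schema has name and age columns), typical operations look like this:

// Print the schema in a tree format
df.printSchema();

// Select only the "name" column
df.select("name").show();

// Select everybody, but increment the age by 1
df.select(df.col("name"), df.col("age").plus(1)).show();

// Select people older than 21
df.filter(df.col("age").gt(21)).show();

// Count people by age
df.groupBy("age").count().show();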

Creating a DataFrame via SQL:

SQLContext sqlContext = ... // An existing SQLContext
String table = "((SELECT * FROM table WHERE ...) AS table)"; // When a filter condition is used, the surrounding parentheses are required; omitting them causes an error
DataFrame df = sqlContext.sql(table);

Datasets are similar to RDDs; however, instead of using Java or Kryo serialization, they use a specialized Encoder to serialize objects. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into an object.
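A minimal sketch of creating and transforming a Dataset with an encoder, along the lines of the Spark 1.6 Java examples (it assumes the sqlContext created above):

import java.util.Arrays;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Encoders for most common types are provided by the Encoders factory class.
Encoder<Integer> intEncoder = Encoders.INT();
Dataset<Integer> primitiveDS = sqlContext.createDataset(Arrays.asList(1, 2, 3), intEncoder);

// Transformations on a Dataset take an explicit encoder for the result type.
Dataset<Integer> incremented = primitiveDS.map(new MapFunction<Integer, Integer>() {
  @Override
  public Integer call(Integer value) throws Exception {
    return value + 1;
  }
}, intEncoder);

incremented.collect(); // Returns [2, 3, 4]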

There are two ways to create a DataFrame from an RDD: using reflection, or using a programmatic interface.

Reflection-based approach:

The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that contain nested or complex types such as Lists or Arrays. You can create a JavaBean by writing a class that implements Serializable and has getters and setters for all of its fields.

public static class Person implements Serializable {
  private String name;
  private int age;

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public int getAge() {
    return age;
  }

  public void setAge(int age) {
    this.age = age;
  }
}

A schema can be applied to an existing RDD by calling createDataFrame and providing the Class object of the JavaBean.

// sc is an existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Load a text file and convert each line to a JavaBean.
JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) throws Exception {
      String[] parts = line.split(",");

      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));

      return person;
    }
  });

// Apply a schema to an RDD of JavaBeans and register it as a table.
DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
schemaPeople.registerTempTable("people");

// SQL can be run over RDDs that have been registered as tables.
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();

Programmatic approach:

When JavaBean classes cannot be defined ahead of time, a DataFrame can be created programmatically in three steps:

1. Create an RDD of Rows from the original RDD.

2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD from step 1.

3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.Function;
// Import factory methods provided by DataTypes.
import org.apache.spark.sql.types.DataTypes;
// Import StructType and StructField
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructField;
// Import Row.
import org.apache.spark.sql.Row;
// Import RowFactory.
import org.apache.spark.sql.RowFactory;

// sc is an existing JavaSparkContext.
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Load a text file as an RDD of lines; each line is converted to a Row below.
JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt");

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName: schemaString.split(" ")) {
  fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows.
JavaRDD<Row> rowRDD = people.map(
  new Function<String, Row>() {
    public Row call(String record) throws Exception {
      String[] fields = record.split(",");
      return RowFactory.create(fields[0], fields[1].trim());
    }
  });

// Apply the schema to the RDD.
DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people");

// SQL can be run over RDDs that have been registered as tables.
DataFrame results = sqlContext.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
List<String> names = results.javaRDD().map(new Function<Row, String>() {
  public String call(Row row) {
    return "Name: " + row.getString(0);
  }
}).collect();

 

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as a normal RDD and can also be registered as a temporary table; registering a DataFrame as a table allows SQL queries to be run over its data. The following shows the general methods for loading and saving data with the Spark data source API, followed by the specific options available for the built-in data sources.

// Read a file with the default options of the data source API
DataFrame df = sqlContext.read().format("json").load("examples/src/main/resources/people.json");
// Save selected columns in Parquet format
df.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// SQL can also be run directly on a file, without loading it into a DataFrame first
DataFrame usersDF = sqlContext.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`");
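As a small sketch of the temporary-table workflow mentioned above (the table name and the query are illustrative, reusing the namesAndAges.parquet file written in the previous snippet):

// Load the Parquet file written above and register it as a temporary table
DataFrame namesAndAges = sqlContext.read().parquet("namesAndAges.parquet");
namesAndAges.registerTempTable("namesAndAges");

// SQL queries can now be run over the registered table
DataFrame adults = sqlContext.sql("SELECT name FROM namesAndAges WHERE age >= 18");
adults.show();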
