[2.3] Spark DataFrame Operations (II): Converting an RDD to a DataFrame by Programmatically Specifying the Schema

References

The official Spark documentation
DT大数据梦工厂

Scenario

I. The previous post mapped the data to be analysed onto the fields of a JavaBean and then converted the RDD to a DataFrame via def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame (i.e. "Inferring the Schema Using Reflection"). This post takes the other route, "Programmatically Specifying the Schema": build a StructType and convert the RDD to a DataFrame via def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame.
II. Programmatically Specifying the Schema
“When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

1. Create an RDD of Rows from the original RDD;
2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.”
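To make the "structure of records is encoded in a string" case concrete, here is a minimal Scala sketch; the schema string "id name age" and the choice of StringType for every column are illustrative assumptions, not something used by the code later in this post:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Suppose the column layout only becomes known at runtime, delivered as a plain string
val schemaString = "id name age"
// Build one StructField per token; every column is treated as a nullable String here
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))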

Experiment

Java version
package cool.pengych.spark.sql;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
public class RDD2DataFrameProgrammatically
{
    public static void main(String[] args) 
    {
        /*
         * 1. Create the SQLContext
         */
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("RDD2DataFrameProgrammatically");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        /*
         * 2. Build an RDD of type Row on top of the original RDD
         */
        JavaRDD<String> lines = sc.textFile("file:///home/pengyucheng/java/rdd2dfram.txt");
        JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Row call(String line)
            {
                String[] splited = line.split(",");
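                // The order and Java types of these values must line up with the StructType built in step 3: Integer, String, Integer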
                return RowFactory.create(Integer.valueOf(splited[0]),splited[1],Integer.valueOf(splited[2]));
            }
        });
        /*
         * 3. Dynamically construct the DataFrame metadata; in practice the number of columns
         * and the concrete type of each column may come from a JSON file or from a database
         */
        List<StructField> structFields = new ArrayList<StructField>();
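        // createStructField(name, dataType, nullable): the last argument marks the column as nullable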
        structFields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        StructType structType = DataTypes.createStructType(structFields);
        /*
         * 4. Construct the DataFrame from the metadata above together with the RDD<Row>
         */
        DataFrame personDF =  sqlContext.createDataFrame(rows, structType);
        /*
         * 5. Register a temporary table for subsequent SQL queries
         */
        personDF.registerTempTable("persons");
        /*
         * 6. Run multi-dimensional analysis on the data
         */
        DataFrame results = sqlContext.sql("select * from persons where age > 8");
        /*
         * 7. Convert the DataFrame back into an RDD
         */
        List<Row> listRows = results.javaRDD().collect();
        for (Row row : listRows)
        {
            System.out.println(row);
        }
    }
}
Scala version
package main.scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.types.StructField
import java.util.ArrayList
import org.apache.spark.sql.types.DataTypes

object RDD2DataFrameProgrammatically
{
    def main(args: Array[String]): Unit = {
       val conf = new SparkConf().setMaster("local[*]").setAppName("DataFrame Ops")
       val sc = new SparkContext(conf)
       val sqlContext = new SQLContext(sc)

       val lines = sc.textFile("file:///home/pengyucheng/java/rdd2dfram.txt")
       val rowsRDD = lines.map(line => {
         val splited = line.split(",")
         val row = RowFactory.create(Integer.valueOf(splited(0)),splited(1),Integer.valueOf(splited(2)))
         row
       })     
       val structFields = new ArrayList[StructField]()
       structFields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
          structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
       val structType = DataTypes.createStructType(structFields)

       val personDf = sqlContext.createDataFrame(rowsRDD, structType)
       personDf.registerTempTable("persons")

       sqlContext.sql("select * from persons where age > 8").rdd.collect.foreach(println)
    }  
}
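One caveat worth adding (based on my understanding of the Spark 1.x API, so treat it as an assumption): createDataFrame(rowRDD, schema) does not eagerly validate the rows against the StructType, so a type mismatch only shows up when an action runs. A small sketch, reusing sqlContext, sc and structType from the Scala code above:

// Hypothetical bad row: id is passed as a String although the schema declares IntegerType
val badRow = RowFactory.create("1", "hadoop", Integer.valueOf(11))
val badDF = sqlContext.createDataFrame(sc.parallelize(Seq(badRow)), structType)
// badDF.collect() // the mismatch only fails here, at action time, with a runtime error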
Execution output
16/05/26 21:28:20 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
[1,hadoop,11]
[4,ivy,27]
16/05/26 21:28:20 INFO SparkContext: Invoking stop() from shutdown hook

Summary

Spark DataFrame data-processing flow diagram (purely a personal summary; its accuracy still needs to be verified)

(flow diagram image)
