Spark in Half an Hour a Day (23): Spark SQL Overview and Getting Started

Latest recommended article published on 2023-03-03 16:43:00

DK_ing

Article link: https://blog.csdn.net/DK_ing/article/details/93307055

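The previous posts covered a few points about structured data in Spark. Today we formally start working through the official Spark documentation. Because of version changes, the older books really are no longer usable. The Spark SQL section of the Spark programming guide is a good place to begin, so that is where we start.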

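Overview

Spark SQL is the Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used, regardless of which API or language expresses the computation.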

SQL

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. In most cases, we process data through a Dataset/DataFrame.

Datasets and DataFrames

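A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that combines the benefits of RDDs with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide range of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.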

Getting Started

Starting Point: SparkSession

The entry point to all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark SQL basic example")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();

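In Spark 2.0, SparkSession provides built-in support for Hive features, including the ability to write queries in HiveQL, access Hive UDFs, and read data from Hive tables. To use these features, you do not need an existing Hive setup.

Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. The following example creates a DataFrame based on the content of a JSON file: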

        // $example on:create_df$
        Dataset<Row> df = spark.read().json("people.json");

        // Displays the content of the DataFrame to stdout
        df.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+
        // $example off:create_df$

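Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in Java (as well as in Scala, Python, and R).

As mentioned above, in Spark 2.0 a DataFrame is just a Dataset of Rows in the Java API. These operations are also called "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Java Datasets.

Here is an example: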

        // $example on:untyped_ops$
        // Print the schema in a tree format
        df.printSchema();
        // root
        // |-- age: long (nullable = true)
        // |-- name: string (nullable = true)

        // Select only the "name" column
        df.select("name").show();
        // +-------+
        // |   name|
        // +-------+
        // |Michael|
        // |   Andy|
        // | Justin|
        // +-------+

        // Select everybody, but increment the age by 1
        // (col comes from: import static org.apache.spark.sql.functions.col;)
        df.select(col("name"), col("age").plus(1)).show();
        // +-------+---------+
        // |   name|(age + 1)|
        // +-------+---------+
        // |Michael|     null|
        // |   Andy|       31|
        // | Justin|       20|
        // +-------+---------+

        // Select people older than 21
        df.filter(col("age").gt(21)).show();
        // +---+----+
        // |age|name|
        // +---+----+
        // | 30|Andy|
        // +---+----+

        // Count people by age
        df.groupBy("age").count().show();
        // +----+-----+
        // | age|count|
        // +----+-----+
        // |  19|    1|
        // |null|    1|
        // |  30|    1|
        // +----+-----+
        // $example off:untyped_ops$

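For the complete list of operations that can be performed on a Dataset, refer to the official API documentation.

In addition to simple column references and expressions, Datasets also have a rich library of functions, including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the DataFrame function reference.

Running SQL Queries Programmatically

The sql function on a SparkSession lets applications run SQL queries programmatically and returns the result as a Dataset<Row>.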

        // $example on:run_sql$
        // Register the DataFrame as a SQL temporary view
        df.createOrReplaceTempView("people");

        Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
        sqlDF.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+
        // $example off:run_sql$

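Global Temporary View

Temporary views in Spark SQL are session-scoped and disappear if the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. A global temporary view is tied to the system-preserved database global_temp, and you must use the qualified name to refer to it, e.g. "SELECT * FROM global_temp.view1".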

        // $example on:global_temp_view$
        // Register the DataFrame as a global temporary view
        df.createGlobalTempView("people");

        // Global temporary view is tied to a system preserved database `global_temp`
        spark.sql("SELECT * FROM global_temp.people").show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+

        // Global temporary view is cross-session
        spark.newSession().sql("SELECT * FROM global_temp.people").show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+
        // $example off:global_temp_view$
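
The scoping rule in this section can be pictured with a toy model in plain Java. This is only a sketch of the name-resolution behaviour described above, not Spark's catalog implementation; the classes App and Session and the method resolve are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class ViewScopeSketch {

    // Stands in for one Spark application: global views live here
    static final class App {
        final Map<String, String> globalViews = new HashMap<>();

        Session newSession() {
            return new Session(this);
        }
    }

    // Stands in for one session: temporary views are private to it
    static final class Session {
        private final App app;
        private final Map<String, String> localViews = new HashMap<>();

        Session(App app) {
            this.app = app;
        }

        void createTempView(String name, String definition) {
            localViews.put(name, definition);
        }

        void createGlobalTempView(String name, String definition) {
            app.globalViews.put(name, definition);
        }

        // Returns the view definition, or null when the name is not visible here
        String resolve(String name) {
            String prefix = "global_temp.";
            if (name.startsWith(prefix)) {
                return app.globalViews.get(name.substring(prefix.length()));
            }
            return localViews.get(name);
        }
    }

    public static void main(String[] args) {
        App app = new App();
        Session first = app.newSession();
        Session second = app.newSession();
        first.createTempView("people", "SELECT * FROM src");
        first.createGlobalTempView("people", "SELECT * FROM src");

        System.out.println(second.resolve("people"));             // prints null
        System.out.println(second.resolve("global_temp.people")); // prints SELECT * FROM src
    }
}
```

The second session cannot see the first session's temporary view, but both reach the global view through the global_temp qualifier.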

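Creating Datasets

Datasets are similar to RDDs. However, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically, and they use a format that allows Spark to perform many operations such as filtering, sorting, and hashing without deserializing the bytes back into an object.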

    public static class Person implements Serializable {
        private String name;
        private int age;

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public int getAge() {
            return age;
        }

        public void setAge(int age) {
            this.age = age;
        }
    }
    private static void runDatasetCreationExample(SparkSession spark) {
        // $example on:create_ds$
        // Create an instance of a Bean class
        Person person = new Person();
        person.setName("Andy");
        person.setAge(32);

        // Encoders are created for Java beans
        Encoder<Person> personEncoder = Encoders.bean(Person.class);
        Dataset<Person> javaBeanDS = spark.createDataset(
                Collections.singletonList(person),
                personEncoder
        );
        javaBeanDS.show();
        // +---+----+
        // |age|name|
        // +---+----+
        // | 32|Andy|
        // +---+----+

        // Encoders for most common types are provided in class Encoders
        Encoder<Integer> integerEncoder = Encoders.INT();
        Dataset<Integer> primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
        Dataset<Integer> transformedDS = primitiveDS.map(
                (MapFunction<Integer, Integer>) value -> value + 1,
                integerEncoder);
        transformedDS.collect(); // Returns [2, 3, 4]

        // DataFrames can be converted to a Dataset by providing a class. Mapping based on name
        String path = "examples/src/main/resources/people.json";
        Dataset<Person> peopleDS = spark.read().json(path).as(personEncoder);
        peopleDS.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+
        // $example off:create_ds$
    }
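
To make the point about encoders more concrete, here is a small plain-Java sketch of why a known byte layout helps. It assumes a made-up layout (a 4-byte age followed by the UTF-8 name bytes); this is not Spark's actual internal format, and the helper names encode, ageOf, nameOf, and olderThan are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EncoderLayoutSketch {

    // Assumed layout for this sketch: [4-byte age][UTF-8 name bytes]
    static byte[] encode(String name, int age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + nameBytes.length)
                .putInt(age)
                .put(nameBytes)
                .array();
    }

    // Reads the age field straight from the bytes; no Person object is built
    static int ageOf(byte[] record) {
        return ByteBuffer.wrap(record).getInt(0);
    }

    // Decodes the name only when a caller actually needs it
    static String nameOf(byte[] record) {
        return new String(record, 4, record.length - 4, StandardCharsets.UTF_8);
    }

    // Filters on the encoded form by looking at the age bytes only
    static List<byte[]> olderThan(List<byte[]> records, int minAge) {
        List<byte[]> kept = new ArrayList<>();
        for (byte[] record : records) {
            if (ageOf(record) > minAge) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(encode("Andy", 30), encode("Justin", 19));
        for (byte[] record : olderThan(records, 21)) {
            System.out.println(nameOf(record)); // prints Andy
        }
    }
}
```

With opaque Java serialization, the filter would first have to rebuild every object; with a fixed layout it can skip that work for the rows it discards.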

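Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that lets you construct a schema and then apply it to an existing RDD. While this method is more verbose, it lets you construct Datasets when the columns and their types are not known until runtime.

Inferring the Schema Using Reflection

Spark SQL supports automatically converting an RDD of JavaBeans into a DataFrame. The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that contain Map fields, although nested JavaBeans and List or Array fields are supported. You can create a JavaBean by creating a class that implements Serializable and has getters and setters for all of its fields.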

    private static void runInferSchemaExample(SparkSession spark) {
        // $example on:schema_inferring$
        // Create an RDD of Person objects from a text file
        JavaRDD<Person> peopleRDD = spark.read()
                .textFile("examples/src/main/resources/people.txt")
                .javaRDD()
                .map(line -> {
                    String[] parts = line.split(",");
                    Person person = new Person();
                    person.setName(parts[0]);
                    person.setAge(Integer.parseInt(parts[1].trim()));
                    return person;
                });

        // Apply a schema to an RDD of JavaBeans to get a DataFrame
        Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
        // Register the DataFrame as a temporary view
        peopleDF.createOrReplaceTempView("people");

        // SQL statements can be run by using the sql methods provided by spark
        Dataset<Row> teenagersDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");

        // The columns of a row in the result can be accessed by field index
        Encoder<String> stringEncoder = Encoders.STRING();
        Dataset<String> teenagerNamesByIndexDF = teenagersDF.map(
                (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
                stringEncoder);
        teenagerNamesByIndexDF.show();
        // +------------+
        // |       value|
        // +------------+
        // |Name: Justin|
        // +------------+

        // or by field name
        Dataset<String> teenagerNamesByFieldDF = teenagersDF.map(
                (MapFunction<Row, String>) row -> "Name: " + row.<String>getAs("name"),
                stringEncoder);
        teenagerNamesByFieldDF.show();
        // +------------+
        // |       value|
        // +------------+
        // |Name: Justin|
        // +------------+
        // $example off:schema_inferring$
    }
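
To see what "BeanInfo obtained using reflection" means in practice, the following plain-Java sketch uses the standard java.beans.Introspector to list the properties of a Person bean. It only illustrates the reflection step; how Spark maps these properties to SQL types is not shown, and the helper inferSchema is a hypothetical name.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.Map;
import java.util.TreeMap;

class BeanSchemaSketch {

    // A JavaBean shaped like the Person class used in this post
    public static class Person implements Serializable {
        private String name;
        private int age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Hypothetical helper: property name -> Java type name, sorted by name
    static Map<String, String> inferSchema(Class<?> beanClass) {
        try {
            // Stop at Object so the built-in "class" property is excluded
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            Map<String, String> schema = new TreeMap<>();
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                // In this sketch a column needs both a getter and a setter
                if (pd.getReadMethod() != null && pd.getWriteMethod() != null) {
                    schema.put(pd.getName(), pd.getPropertyType().getSimpleName());
                }
            }
            return schema;
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Cannot introspect " + beanClass, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inferSchema(Person.class)); // prints {age=int, name=String}
    }
}
```

This also shows why the getters and setters matter: a field without accessors is invisible to bean introspection.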

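Programmatically Specifying the Schema

When JavaBean classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a Dataset<Row> can be created programmatically in three steps:

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.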

    private static void runProgrammaticSchemaExample(SparkSession spark) {
        // $example on:programmatic_schema$
        // Create an RDD
        JavaRDD<String> peopleRDD = spark.sparkContext()
                .textFile("examples/src/main/resources/people.txt", 1)
                .toJavaRDD();

        // The schema is encoded in a string
        String schemaString = "name age";

        // Generate the schema based on the string of schema
        List<StructField> fields = new ArrayList<>();
        for (String fieldName : schemaString.split(" ")) {
            StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
            fields.add(field);
        }
        StructType schema = DataTypes.createStructType(fields);

        // Convert records of the RDD (people) to Rows
        JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
            String[] attributes = record.split(",");
            return RowFactory.create(attributes[0], attributes[1].trim());
        });

        // Apply the schema to the RDD
        Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

        // Creates a temporary view using the DataFrame
        peopleDataFrame.createOrReplaceTempView("people");

        // SQL can be run over a temporary view created using DataFrames
        Dataset<Row> results = spark.sql("SELECT name FROM people");

        // The results of SQL queries are DataFrames and support all the normal RDD operations
        // The columns of a row in the result can be accessed by field index or by field name
        Dataset<String> namesDS = results.map(
                (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
                Encoders.STRING());
        namesDS.show();
        // +-------------+
        // |        value|
        // +-------------+
        // |Name: Michael|
        // |   Name: Andy|
        // | Name: Justin|
        // +-------------+
        // $example off:programmatic_schema$
    }
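
The same three steps can be modelled without Spark to show what "schema known only at runtime" amounts to. The sketch below is plain Java under simplifying assumptions: every column is a string, rows are String arrays, and the sample lines are made up. The names toRow, buildSchema, and applySchema are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RuntimeSchemaSketch {

    // Step 1: split a raw text line into an untyped row of trimmed strings
    static String[] toRow(String line) {
        String[] parts = line.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Step 2: build the column list from a string that is only known at runtime
    static List<String> buildSchema(String schemaString) {
        return Arrays.asList(schemaString.trim().split("\\s+"));
    }

    // Step 3: pair each row with the schema, rejecting rows of the wrong width
    static List<Map<String, String>> applySchema(List<String[]> rows, List<String> schema) {
        List<Map<String, String>> table = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length != schema.size()) {
                throw new IllegalArgumentException(
                        "Row width " + row.length + " does not match schema " + schema);
            }
            Map<String, String> named = new LinkedHashMap<>();
            for (int i = 0; i < row.length; i++) {
                named.put(schema.get(i), row[i]);
            }
            table.add(named);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        for (String line : Arrays.asList("Michael, 29", "Andy, 30")) {
            rows.add(toRow(line));
        }
        System.out.println(applySchema(rows, buildSchema("name age")));
        // prints [{name=Michael, age=29}, {name=Andy, age=30}]
    }
}
```

The width check in step 3 is a design choice of this sketch: it surfaces a mismatch between the data and the runtime schema early rather than producing misaligned columns.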

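Aggregations

DataFrames provide a number of built-in aggregations such as count(), countDistinct(), avg(), max(), min(), and so on. While those functions are designed for DataFrames, Spark SQL also provides type-safe versions of some of them in Java for working with strongly typed Datasets. Moreover, users are not limited to the predefined aggregate functions and can create their own.

Untyped User-Defined Aggregate Functions

Users have to extend the UserDefinedAggregateFunction abstract class to implement a custom untyped aggregate function.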

/**
 * @author DKing
 * @description
 * @date 2019/6/22
 */
public class AverageTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        // Register the function to access it
        spark.udf().register("myAverage", new MyAverage());

        Dataset<Row> df = spark.read().json("examples/src/main/resources/employees.json");
        df.createOrReplaceTempView("employees");
        df.show();
        // +-------+------+
        // |   name|salary|
        // +-------+------+
        // |Michael|  3000|
        // |   Andy|  4500|
        // | Justin|  3500|
        // |  Berta|  4000|
        // +-------+------+

        Dataset<Row> result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees");
        result.show();
        // +--------------+
        // |average_salary|
        // +--------------+
        // |        3750.0|
        // +--------------+
    }

    static class MyAverage extends UserDefinedAggregateFunction {

        private StructType inputSchema;
        private StructType bufferSchema;

        public MyAverage() {
            List<StructField> inputFields = new ArrayList<>();
            inputFields.add(DataTypes.createStructField("inputColumn", DataTypes.LongType, true));
            inputSchema = DataTypes.createStructType(inputFields);

            List<StructField> bufferFields = new ArrayList<>();
            bufferFields.add(DataTypes.createStructField("sum", DataTypes.LongType, true));
            bufferFields.add(DataTypes.createStructField("count", DataTypes.LongType, true));
            bufferSchema = DataTypes.createStructType(bufferFields);
        }

        // Data types of input arguments of this aggregate function
        @Override
        public StructType inputSchema() {
            return inputSchema;
        }

        // Data types of values in the aggregation buffer
        @Override
        public StructType bufferSchema() {
            return bufferSchema;
        }

        // The data type of the returned value
        @Override
        public DataType dataType() {
            return DataTypes.DoubleType;
        }

        // Whether this function always returns the same output on the identical input
        @Override
        public boolean deterministic() {
            return true;
        }

        // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
        // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
        // the opportunity to update its values. Note that arrays and maps inside the buffer are still
        // immutable.
        @Override
        public void initialize(MutableAggregationBuffer buffer) {
            buffer.update(0, 0L);
            buffer.update(1, 0L);
        }

        // Updates the given aggregation buffer `buffer` with new input data from `input`
        @Override
        public void update(MutableAggregationBuffer buffer, Row input) {
            if (!input.isNullAt(0)) {
                long updatedSum = buffer.getLong(0) + input.getLong(0);
                long updatedCount = buffer.getLong(1) + 1;
                buffer.update(0, updatedSum);
                buffer.update(1, updatedCount);
            }
        }

        // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
        @Override
        public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
            long mergedSum = buffer1.getLong(0) + buffer2.getLong(0);
            long mergedCount = buffer1.getLong(1) + buffer2.getLong(1);
            buffer1.update(0, mergedSum);
            buffer1.update(1, mergedCount);
        }

        // Calculates the final result
        @Override
        public Double evaluate(Row buffer) {
            return ((double) buffer.getLong(0)) / buffer.getLong(1);
        }
    }
}

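Type-Safe User-Defined Aggregate Functions

User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. For example, a type-safe user-defined average can look like this: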

/**
 * @author DKing
 * @description
 * @date 2019/6/22
 */
public class SafeTypeAverageTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();
        Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
        String path = "examples/src/main/resources/employees.json";
        Dataset<Employee> ds = spark.read().json(path).as(employeeEncoder);
        MyAverage myAverage = new MyAverage();
        // Convert the function to a `TypedColumn` and give it a name
        TypedColumn<Employee, Double> averageSalary = myAverage.toColumn().name("average_salary");
        Dataset<Double> result = ds.select(averageSalary);
        ds.show();
        // +-------+------+
        // |   name|salary|
        // +-------+------+
        // |Michael|  3000|
        // |   Andy|  4500|
        // | Justin|  3500|
        // |  Berta|  4000|
        // +-------+------+

        result.show();
        // +--------------+
        // |average_salary|
        // +--------------+
        // |        3750.0|
        // +--------------+
    }


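    public static class Employee implements Serializable {
        private String name;
        private long salary;

        // Encoders.bean needs a public no-arg constructor plus getters and setters
        public Employee() {
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public long getSalary() {
            return salary;
        }

        public void setSalary(long salary) {
            this.salary = salary;
        }
    }

    public static class Average implements Serializable {
        private long sum;
        private long count;

        // No-arg constructor, also required by Encoders.bean
        public Average() {
        }

        public Average(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        public long getSum() {
            return sum;
        }

        public void setSum(long sum) {
            this.sum = sum;
        }

        public long getCount() {
            return count;
        }

        public void setCount(long count) {
            this.count = count;
        }
    }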

    public static class MyAverage extends Aggregator<Employee, Average, Double> {
        // A zero value for this aggregation. Should satisfy the property that any b + zero = b
        @Override
        public Average zero() {
            return new Average(0L, 0L);
        }

        // Combine two values to produce a new value. For performance, the function may modify `buffer`
        // and return it instead of constructing a new object
        @Override
        public Average reduce(Average buffer, Employee employee) {
            long newSum = buffer.getSum() + employee.getSalary();
            long newCount = buffer.getCount() + 1;
            buffer.setSum(newSum);
            buffer.setCount(newCount);
            return buffer;
        }

        // Merge two intermediate values
        @Override
        public Average merge(Average b1, Average b2) {
            long mergedSum = b1.getSum() + b2.getSum();
            long mergedCount = b1.getCount() + b2.getCount();
            b1.setSum(mergedSum);
            b1.setCount(mergedCount);
            return b1;
        }

        // Transform the output of the reduction
        @Override
        public Double finish(Average reduction) {
            return ((double) reduction.getSum()) / reduction.getCount();
        }

        // Specifies the Encoder for the intermediate value type
        @Override
        public Encoder<Average> bufferEncoder() {
            return Encoders.bean(Average.class);
        }

        // Specifies the Encoder for the final output value type
        @Override
        public Encoder<Double> outputEncoder() {
            return Encoders.DOUBLE();
        }
    }
}
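
The zero / reduce / merge / finish contract is easier to follow when it is run by hand. The plain-Java sketch below is a model of that contract rather than Spark's scheduler: it splits the four salaries from the example into two assumed partitions, reduces each one separately, then merges the partial buffers. The class Avg and the helpers reducePartition and averageAcrossPartitions are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class AverageContractSketch {

    // Intermediate buffer: running sum and count
    static final class Avg {
        long sum;
        long count;

        Avg(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Neutral element: merging it into any buffer leaves that buffer unchanged
    static Avg zero() {
        return new Avg(0L, 0L);
    }

    // Folds one input value into the buffer
    static Avg reduce(Avg buffer, long salary) {
        buffer.sum += salary;
        buffer.count += 1;
        return buffer;
    }

    // Combines two partial buffers, e.g. from two partitions
    static Avg merge(Avg left, Avg right) {
        left.sum += right.sum;
        left.count += right.count;
        return left;
    }

    // Turns the final buffer into the result
    static double finish(Avg buffer) {
        return (double) buffer.sum / buffer.count;
    }

    static Avg reducePartition(List<Long> salaries) {
        Avg buffer = zero();
        for (long salary : salaries) {
            buffer = reduce(buffer, salary);
        }
        return buffer;
    }

    static double averageAcrossPartitions(List<List<Long>> partitions) {
        Avg total = zero();
        for (List<Long> partition : partitions) {
            total = merge(total, reducePartition(partition));
        }
        return finish(total);
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(3000L, 4500L),
                Arrays.asList(3500L, 4000L));
        System.out.println(averageAcrossPartitions(partitions)); // prints 3750.0
    }
}
```

Keeping sum and count in the buffer, instead of a running average, is what makes merge correct: partial averages cannot be combined without knowing how many rows each one covers.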

Today's post mainly covered how to use SparkSession. Tomorrow we start looking at data sources.