sql 去重并统计总数_使用SparkSql进行表的分析与统计

最新推荐文章于 2024-03-27 16:52:33 发布

weixin_39788382

最新推荐文章于 2024-03-27 16:52:33 发布

阅读量2.1k

点赞数 1

文章标签： sql 去重并统计总数 sql按小时统计总数分析dmp文件使用表空间

背景

我们的数据挖掘平台对数据统计有比较迫切的需求，而Spark本身对数据统计已经做了一些工作，希望梳理一下Spark已经支持的数据统计功能，后期再进行扩展。

准备数据

在参考文献6中下载https://archive.ics.uci.edu/ml/machine-learning-databases/iris/，此处格式为iris.data格式，先将data后缀改为csv后缀(不影响使用，只是为了保证后续操作不需要修改)。

数据格式如下：

鸢尾花数据格式

数据说明见附录中的鸢尾花数据。

我们先把数据放到Spark sql数仓中

CREATE TABLE IF NOT EXISTS iris ( SepalLength FLOAT , SepalWidth FLOAT   , PetalLength FLOAT , PetalWidth FLOAT   , Species VARCHAR(100) )ROW FORMAT DELIMITEDFIELDS TERMINATED BY ','STORED AS TEXTFILELOCATION '/mnt/disk1/starqiu/iris';

表的分析与统计

Analyze Table语法如下：

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]

Collect statistics about the table that can be used by the query optimizer to find a better plan.

可以看到Spark表的分析可以为spark sql做查询优化，以便得到更好的查询性能。Spark Sql默认使用CBO(基于代价的优化)，这在多表join查询时尤其有用。

此处的analyze_option参数主要分为两类，表统计和列统计。

表统计

表的基本统计信息一般包括记录总数和所占空间。

Table statistics用法如下：

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]

Collect only basic statistics for the table (number of rows, size in bytes).

NOSCAN

Collect only statistics that do not require scanning the whole table (that is, size in bytes).

运行命令ANALYZE TABLE iris COMPUTE STATISTICS;可以得到表的记录总数和所占空间大小。如果不想全表扫描，加上NOSCAN关键字，不会全表扫描，但只能得到所占空间大小。

表统计信息的描述命令语法如下：

DESCRIBE [EXTENDED] [db_name.]table_name

Return the metadata of an existing table (column names, data types, and comments). If the table does not exist, an exception is thrown.

EXTENDED

Display detailed information about the table, including parent database, table type, storage information, and properties.

Describe Partition

运行DESCRIBE EXTENDED iris;，结果如下：

spark-sql> DESCRIBE EXTENDED iris;SepalLength    float    NULLSepalWidth    float    NULLPetalLength    float    NULLPetalWidth    float    NULLSpecies    string    NULL        # Detailed Table Information    CatalogTable(    Table: `default`.`iris`    Owner: root    Created: Sat Feb 16 17:24:32 CST 2019    Last Access: Thu Jan 01 08:00:00 CST 1970    Type: EXTERNAL    Schema: [StructField(SepalLength,FloatType,true), StructField(SepalWidth,FloatType,true), StructField(PetalLength,FloatType,true), StructField(PetalWidth,FloatType,true), StructField(Species,StringType,true)]    Provider: hive    Properties: [rawDataSize=-1, numFiles=0, transient_lastDdlTime=1550311815, totalSize=0, COLUMN_STATS_ACCURATE=false, numRows=-1]    Statistics: sizeInBytes=3808, rowCount=150, isBroadcastable=false    Storage(Location: hdfs://data126:8020/mnt/disk1/starqiu/iris, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [field.delim=,, serialization.format=,])    Partition Provider: Catalog)    Time taken: 0.112 seconds, Fetched 7 row(s)

通过Statistics:可以看到表的记录总数是150条，所占空间3808B，约4KB。

列统计

Column statistics用法如下：

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]

Collect column statistics for the specified columns in addition to table statistics.

Tip

Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure to collect statistics for all columns used by the query.

列统计的描述命令语法如下：

DESCRIBE [EXTENDED][db_name.]table_name column_nameNew in version runtime-3.3.

EXTENDED

Display detailed information about the specified columns, including the column statistics collected by the command ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS columnname [column_name, ...].

需要注意的是这个功能在runtime-3.3版本才有的特性，而runtime-3.3封装的是Spark 2.2，会详见文末附录的databricks Runtime版本与Spark版本的对应关系

运行命令ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;计算指定多列的统计信息，

运行DESCRIBE EXTENDED iris SepalLength;获取指定一列的统计信息，结果如下：

spark-sql> ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;Time taken: 4.45 secondsspark-sql> DESCRIBE EXTENDED iris PetalWidth;col_name    PetalWidthdata_type    floatcomment    NULLmin    0.10000000149011612max    2.5num_nulls    0distinct_count    21avg_col_len    4max_col_len    4histogram    NULLTime taken: 0.104 seconds, Fetched 10 row(s)

目前测试Spark2.2.2不支持该语句，但是Spark2.4.0支持。如果不支持，则可以通过访问hive的元数据库也可以得到这些信息，sql语句如下：

select param_key, param_value from TABLE_PARAMS tp, TBLS t where tp.tbl_id=t.tbl_id and tbl_name = 'iris' and param_key like 'spark.sql.stat%';

以下是PetalWidth列的统计结果，可以看到包含不重复的记录数，空值数，最大值、最小值，平均长度以及最大长度

PetalWidth统计结果

总结

可以看到这些统计信息不仅对了解数据质量非常有用，对使用Spark sql进行查询也能得到优化，进一步提升速度。后续再写一篇CBO如何利用这些信息进行优化。

目前还不清楚Runtime中的Spark功能和开源版的有无差异，但Spark2.4支持表的分析统计操作，建议平台后续项目升级到Spark2.4 。

附录

鸢尾花数据说明

Iris数据集是常用的分类实验数据集，由Fisher, 1936收集整理。Iris也称鸢尾花卉数据集，是一类多重变量分析的数据集。数据集包含150个数据集，分为3类，每类50个数据，每个数据包含4个属性。iris以鸢尾花的特征作为数据来源，常用在分类操作中。该数据集由3种不同类型的鸢尾花的50个样本数据构成。其中的一个种类与另外两个种类是线性可分离的，后两个种类是非线性可分离的。

四个属性：

Sepal.Length(花萼长度)，单位是cm;

Sepal.Width(花萼宽度)，单位是cm;

Petal.Length(花瓣长度)，单位是cm;

Petal.Width(花瓣宽度)，单位是cm;

三个种类：

Iris Setosa(山鸢尾)；

Iris Versicolour(杂色鸢尾)；

Iris Virginica(维吉尼亚鸢尾)。

databricks Runtime

Runtime是databricks 统一分析平台的一部分，官网描述如下：

Accelerate innovation by unifying data science, engineering and business, with the Databricks Unified Analytics Platform, from the original creators of Apache Spark™.

Runtime的描述如下：

Simplify operations and get up to 50x better performance with cloud-optimized Apache Spark™.

可以看到主要是基于云优化来简化操作并提升50倍以上的性能。

id6)

Current Releases

当前发布版本对应关系

Marked for Deprecation

标记废弃的发布版本对应关系

Deprecated Releases

已废弃的发布版本对应关系

参考文献

https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html
https://docs.databricks.com/spark/latest/spark-sql/language-manual/describe-table.html
https://docs.databricks.com/spark/latest/spark-sql/cbo.html
https://docs.databricks.com/release-notes/runtime/databricks-runtime-ver.html#versioning
https://blog.csdn.net/Albert201605/article/details/82313139
https://archive.ics.uci.edu/ml/datasets/Iris