10.1 spark-sql: feasibility of interactive, second-level queries on 1 billion rows

Author: 我的海_

Article link: https://blog.csdn.net/kk25114/article/details/97137002

Environment: Spark 2.4 on CDH. The demo dataset has 1 billion rows and 41 columns.

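Spark SQL provides a SQL-like standard and supports math, aggregate, date/time, and string functions. Its function coverage is already quite complete.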

Reference: https://spark.apache.org/docs/2.4.0/api/sql/index.html

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.

CDH does not ship the Spark Thrift Server, so you either have to integrate it from the Apache distribution or use Livy (covered later).

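This setup currently uses the REST API provided by Livy (Hue uses Livy under the hood). Livy was built for Spark: it supports session kinds such as sql and scala, lets you configure the number of sessions and their timeouts, handles multiple users better than the Spark Thrift Server, accepts additional Spark properties, and exposes settings for tuning Spark SQL performance.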
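As a rough illustration of the Livy REST workflow described above, the following Python sketch builds the JSON bodies for creating a session and submitting a statement. The base URL (Livy's default port 8998 on localhost), the table name `demo_table`, and the `spark.executor.memory` value are assumptions for illustration, not values from this article's cluster.

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumption: Livy on its default port, local host


def session_payload(kind="sql", conf=None):
    """Body for POST /sessions: the session kind plus optional Spark properties."""
    body = {"kind": kind}
    if conf:
        body["conf"] = dict(conf)
    return body


def statement_payload(code):
    """Body for POST /sessions/{id}/statements."""
    return {"code": code}


def post(path, body, base_url=LIVY_URL):
    """Send a JSON POST to Livy and return the decoded JSON response."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo():
    """Needs a reachable Livy server; not called automatically."""
    session = post("/sessions", session_payload("sql", {"spark.executor.memory": "4g"}))
    # A real client would poll GET /sessions/{id} until the state is "idle"
    # before submitting statements.
    return post(
        "/sessions/%d/statements" % session["id"],
        statement_payload("SELECT count(*) FROM demo_table"),  # hypothetical table
    )
```

Calling `demo()` against a running Livy server returns the statement object; its output has to be fetched later with a GET on the same statement URL, because Livy runs statements asynchronously.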

In a test under the same environment, reading Hive Parquet tables, Spark SQL ran count queries more slowly than Impala.

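Impala is memory-based, and Impala 3 has good SQL support (it is no longer limited to simple functions as before). However, Impala's first query is slow (it does not respond immediately, whether the table is large or small), while repeated queries are fast thanks to caching. Statements that contain, for example, GROUP BY are not very fast either. For persistence, Impala relies on HDFS caching (hdfs cacheadmin), which is enabled per table. See the CDH 6.1.1 Impala tuning documentation: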

https://www.cloudera.com/documentation/enterprise/6/6.1/topics/impala_perf_hdfs_caching.html#hdfs_caching

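Spark SQL supports persisting data in memory (including caching temporary tables); when memory is insufficient it spills to disk, and it supports compression. Impala is memory-based and pre-caches data into an HDFS cache pool.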
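A minimal sketch of how this caching could be switched on from PySpark, assuming a `SparkSession`-like object that exposes `.sql()`. The helper names are hypothetical; `CACHE TABLE` and `spark.sql.inMemoryColumnarStorage.compressed` are standard Spark SQL features.

```python
def cache_statements(table, compress=True):
    """Return the Spark SQL statements that cache `table` with columnar compression."""
    flag = "true" if compress else "false"
    return [
        # Compress the in-memory columnar cache (Spark's default is already true).
        "SET spark.sql.inMemoryColumnarStorage.compressed=" + flag,
        # Materialize the table in the cache eagerly.
        "CACHE TABLE " + table,
    ]


def cache_table(spark, table, compress=True):
    """Run the statements on any object with a .sql() method, e.g. a SparkSession."""
    for stmt in cache_statements(table, compress):
        spark.sql(stmt)
```

With a real session this would be called as `cache_table(spark, "events")`, where `events` is a hypothetical table or temporary view name.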

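Spark SQL is stable, and its save modes include Append and Overwrite. With parallel computation, interactive second-level queries can already be achieved quite reliably.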

Saving to Persistent Tables

DataFrames can also be saved as persistent tables into Hive metastore using the saveAsTable command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a default local Hive metastore (using Derby) for you. Unlike the createOrReplaceTempView command, saveAsTable will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.

For file-based data source, e.g. text, parquet, json, etc. you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped, the custom table path will not be removed and the table data is still there. If no custom table path is specified, Spark will write data to a default table path under the warehouse directory. When the table is dropped, the default table path will be removed too.
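The save modes and the custom `path` option can be combined in one small helper. This is a sketch under assumptions: `save_table` is a hypothetical wrapper, and `df` is expected to be a PySpark DataFrame (or anything whose `.write` offers the same `mode`, `option`, and `saveAsTable` methods).

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def save_table(df, table, mode="append", path=None):
    """Save `df` as a persistent table, optionally at a custom table path."""
    if mode not in VALID_MODES:
        raise ValueError("unsupported save mode: %r" % (mode,))
    writer = df.write.mode(mode)
    if path is not None:
        # With a custom path, the data files remain after DROP TABLE.
        writer = writer.option("path", path)
    writer.saveAsTable(table)
```

For example, `save_table(df, "t", mode="overwrite", path="/some/path")` corresponds to the `df.write.option("path", "/some/path").saveAsTable("t")` call quoted above, with the save mode set explicitly.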

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:

Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are now available for tables created with the Datasource API.

Note that partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.

Quoted

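These frameworks are all common in OLAP big-data analysis. Their characteristics are as follows:
Presto: a distributed query framework written in Java and open-sourced by Facebook, with native integration for Hive, HBase, and relational databases. Presto's execution model is fundamentally different from Hive's: it does not use MapReduce, and in most scenarios it is an order of magnitude faster than Hive. The key is that all processing happens in memory.
Druid: an OLAP database for real-time processing of time-series data. Its index is sharded by time first, and queries are routed to index segments along the timeline as well.
Spark SQL: an OLAP framework on the Spark platform. In essence it is also a DAG-based MPP; the basic idea is to add machines for parallel computation and thereby speed up queries.
Kylin: its core is the Cube, a precomputation technique. The basic idea is to build a multidimensional index over the data in advance, so that queries scan only the index and never touch the raw data.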
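The cube idea from the list above can be illustrated with a toy example. This is only a sketch of precomputation in plain Python, not Kylin's actual implementation; the sample rows and the names `build_cube` and `query` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate sum(measure) for every subset of dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                cuboid[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, dimensions, filters):
    """Answer sum(measure) for equality filters by lookup only; rows are not scanned."""
    dims = tuple(d for d in dimensions if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


DIMS = ["city", "year"]
ROWS = [
    {"city": "Beijing", "year": 2018, "sales": 10},
    {"city": "Beijing", "year": 2019, "sales": 20},
    {"city": "Shanghai", "year": 2019, "sales": 5},
]
CUBE = build_cube(ROWS, DIMS, "sales")

print(query(CUBE, DIMS, {"city": "Beijing"}))  # 30
print(query(CUBE, DIMS, {"year": 2019}))       # 25
print(query(CUBE, DIMS, {}))                   # 35
```

The trade-off is visible even at this scale: the number of cuboids grows as 2^n with the number of dimensions, which is why this approach fits fixed reports over a small set of dimensions.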

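Each of these frameworks has its strengths and weaknesses, and each exists for a reason. My personal view on how to choose between them: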

Maturity: Kylin > Spark SQL > Druid > Presto

Query efficiency on very large data: Druid > Kylin > Presto > Spark SQL

Variety of supported data sources: Presto > Spark SQL > Kylin > Druid

Big-data query systems can currently be divided into roughly three categories:

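1. Pre-aggregation on top of HBase, e.g. OpenTSDB, Kylin, Druid. You must specify the metrics to pre-aggregate, and the aggregation is computed on those metrics at ingestion time. This suits relatively fixed business-report requirements where aggregating over a small number of dimensions is enough.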

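2. Parquet columnar storage, e.g. Presto, Drill, Impala. These are essentially fully in-memory parallel computation. Parquet reduces storage space and improves IO efficiency. They are mainly for offline processing: it is hard to make writes real-time, and join support for very large tables may be inadequate. Spark SQL is similar, but when memory is insufficient it can spill to disk, which lets it support queries and joins over very large data.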

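3. External Lucene indexes, e.g. Elasticsearch and Solr. They cover far more query scenarios than traditional database storage, but for time-series data such as logs and behavioral data, every search request must still hit all shards. Support for aggregation analysis is also a weak point.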