Spark has quite a few limitations when reading data from ClickHouse, for example:
- Array column types cannot be read
- Some ClickHouse tables need to be queried with the FINAL modifier
- Filtering by ClickHouse partition
- ……

Without reinventing the wheel, we can use a trick of Spark's JDBC source to run native ClickHouse SQL.

Below is the Spark SQL JDBC source code that resolves a table's schema:
```scala
/**
 * Get the SQL query that should be used to find if the given table exists. Dialects can
 * override this method to return a query that works best in a particular database.
 * @param table The name of the table.
 * @return The SQL query to use for checking the table.
 */
def getTableExistsQuery(table: String): String = {
  s"SELECT * FROM $table WHERE 1=0"
}

/**
 * The SQL query that should be used to discover the schema of a table. It only needs to
 * ensure that the result set has the same schema as the table, such as by calling
 * "SELECT * ...". Dialects can override this method to return a query that works best in a
 * particular database.
 * @param table The name of the table.
 * @return The SQL query to use for discovering the schema.
 */
@Since("2.1.0")
def getSchemaQuery(table: String): String = {
  s"SELECT * FROM $table WHERE 1=0"
}
```
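As the comments above note, dialects can override these queries. A minimal sketch of that extension point, assuming the standard `org.apache.spark.sql.jdbc` API (the `LIMIT 0` form is a hypothetical alternative to `WHERE 1=0`; both are valid in ClickHouse):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch of a custom dialect for ClickHouse JDBC URLs.
object ClickHouseDialect extends JdbcDialect {
  // Claim responsibility for ClickHouse connection strings.
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:clickhouse")

  // Return an empty result set with the table's schema.
  override def getSchemaQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 0"
}

// Register before any spark.read.format("jdbc") call.
JdbcDialects.registerDialect(ClickHouseDialect)
```

Registering a dialect changes behavior globally for matching URLs; the `dbtable` trick below needs no such registration.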
Spark JDBC resolves the schema with a plain SQL query, and `$table` is taken verbatim from the `dbtable` option. So by passing our own SQL as an aliased subquery in `dbtable`, we can make Spark execute arbitrary ClickHouse SQL. For example:
```scala
spark.read
  .format("jdbc")
  .option("driver", "ru.yandex.clickhouse.ClickHouseDriver")
  .option("url", "jdbc:clickhouse://test:8123/")
  .options(Map("user" -> "test", "password" -> "test"))
  .option("dbtable", "(select name from dw.test final) a")
  .load()
  .show()
```
The SQL Spark generates to fetch the table schema is then:

```sql
SELECT * FROM (select name from dw.test final) a
```
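The wrapping can be factored into a small helper so any ClickHouse SQL (FINAL queries, partition filters, etc.) drops into `dbtable` in the aliased-subquery form Spark expects. The helper name and the sample query below are hypothetical, not from the original post:

```scala
// Wrap arbitrary ClickHouse SQL into "(sql) alias" for the dbtable option.
def asDbTable(sql: String, alias: String = "a"): String = s"($sql) $alias"

// Combines two of the limitations listed above: FINAL plus a partition filter.
val dbtable = asDbTable("select name from dw.test final where dt = '2020-01-01'")
// spark.read.format("jdbc")... .option("dbtable", dbtable).load()
```

Because the subquery is sent to ClickHouse as-is, the partition predicate is evaluated server-side, so ClickHouse can prune partitions instead of relying on Spark's filter pushdown.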