Kylin Overview
Definition
Kylin is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop/Spark, supporting extremely large datasets. Originally developed by eBay and contributed to the open-source community, it can query huge Hive tables with sub-second latency.
Architecture
REST Server: the entry point for applications developed against the Kylin platform. Such applications can issue queries, fetch results, trigger cube build jobs, retrieve metadata, obtain user permissions, and so on. SQL queries can also be issued through the RESTful interface.
Query Engine: once a cube is ready, the query engine receives and parses user queries. It then interacts with the other components in the system to return the corresponding results to the user.
Router: the original design considered routing queries that Kylin could not execute on to Hive. In practice, however, the speed gap between Kylin and Hive turned out to be so large that users could not form consistent expectations of query latency: some queries returned in seconds while others took minutes. The experience was poor, so this routing feature is disabled by default in the released versions.
Metadata Manager: Kylin is a metadata-driven application. The metadata manager is a key component that stores and manages all of Kylin's metadata, the most important being the cube metadata. All other components rely on the metadata manager to work properly. Kylin's metadata is stored in HBase.
Job Engine: this engine is designed to handle all offline jobs, including shell scripts, Java APIs, and MapReduce jobs. It manages and coordinates every job in Kylin, making sure each one actually runs and recovering from any failures along the way.
Kylin Features
SQL interface, support for extremely large datasets, sub-second latency, scalability, high throughput, and BI tool integration.
1. Standard SQL interface: Kylin exposes standard SQL as its external service interface.
2. Support for very large datasets: Kylin's capacity for big data is arguably among the most advanced of current technologies.
3. Sub-second latency: many complex computations (joins, aggregations, and so on) are completed during offline pre-computation, which greatly reduces the amount of work needed at query time.
4. Scalability and high throughput: a single Kylin node can serve about 70 queries per second, and Kylin can also be deployed as a cluster.
5. BI tool integration
Kylin integrates with existing BI tools, for example:
ODBC: for Tableau, Excel, Power BI, and similar tools
JDBC: for Java tools such as Saiku and BIRT
REST API: for JavaScript and web page integration
A Zeppelin plugin can also be used to access the Kylin service.
Environment Setup
Installation
http://kylin.apache.org/cn/download/
Deployment
Make sure HADOOP_HOME, HIVE_HOME, and HBASE_HOME are set in /etc/profile, and run source to make them take effect.
[yyx@hadoop01 kylin]$ bin/kylin.sh start
Username: ADMIN, password: KYLIN (pre-filled by the system)
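As a sketch, the /etc/profile entries might look like the following. The install paths here are assumptions; adjust them to your own layout:

```shell
# Hypothetical install locations; replace with your actual paths
export HADOOP_HOME=/opt/module/hadoop
export HIVE_HOME=/opt/module/hive
export HBASE_HOME=/opt/module/hbase
export KYLIN_HOME=/opt/module/kylin
# Put the client binaries on the PATH so kylin.sh can find them
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin:$HBASE_HOME/bin
```

After editing, run `source /etc/profile` so the variables take effect in the current shell.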
A Small Example
Requirement: aggregate employee information by a dimension (work location).
Create the data in Hive: create external tables for departments and employees, then load data into them.
Create the employee and department tables
hive (default)> create external table if not exists default.dept(
> deptno int,
> dname string,
> loc int
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 1.131 seconds
hive (default)> create external table if not exists default.emp(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.173 seconds
hive (default)>
Prepare the data:
[yyx@hadoop01 datas]$ vim dept.txt
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
[yyx@hadoop01 datas]$ vim emp.txt
7369 SMITH CLERK 7902 1980-12-17 800.00 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.00 300.00 30
7521 WARD SALESMAN 7698 1981-2-22 1250.00 500.00 30
7566 JONES MANAGER 7839 1981-4-2 2975.00 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.00 1400.00 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.00 30
7782 CLARK MANAGER 7839 1981-6-9 2450.00 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.00 20
7839 KING PRESIDENT 1981-11-17 5000.00 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.00 0.00 30
7876 ADAMS CLERK 7788 1987-5-23 1100.00 20
7900 JAMES CLERK 7698 1981-12-3 950.00 30
7902 FORD ANALYST 7566 1981-12-3 3000.00 20
7934 MILLER CLERK 7782 1982-1-23 1300.00 10
Load the data:
hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept;
Loading data to table default.dept
Table default.dept stats: [numFiles=1, totalSize=69]
OK
Time taken: 0.828 seconds
hive (default)> load data local inpath '/opt/module/datas/emp.txt' into table default.emp;
Loading data to table default.emp
Table default.emp stats: [numFiles=1, totalSize=657]
OK
Time taken: 0.558 seconds
hive (default)>
hive (default)> select * from emp;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.206 seconds, Fetched: 14 row(s)
hive (default)> select * from dept;
OK
dept.deptno dept.dname dept.loc
10 ACCOUNTING 1700
20 RESEARCH 1800
30 SALES 1900
40 OPERATIONS 1700
Time taken: 0.055 seconds, Fetched: 4 row(s)
Create the project, then import the data
Create a Model
Set the fact table and the dimension tables
Configure the dimension tables
Dimension attributes
Measure attributes
Partitioning and filtering
Create a Cube
Choose the Model
Select the dimensions
Define the measures: the default count can be deleted; we add sum_sal to sum the salaries
Optimization: not needed for now
Rowkey:
Import cuboids from an earlier Cube
Cube engine (MapReduce by default)
Snapshots and column families
Override properties
Overview
Run a build
Wait for it to reach 100%
Run a query (only the fields added to the cube are available; sal has been aggregated, and the other columns hold no data)
Noticeably faster than Hive
Visualization and export:
In step four:
Configure incremental refresh: choose the incremental column and the start time (the column must be a date/time column), and set the merge intervals. Each Kylin build produces a new HBase table; a merge combines several days of data into a single new HBase table, which speeds up queries. Set the Partition Start Date, i.e. the start time of the source data (1970-01-01 by default), then click Next.
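Incremental builds over a time range can also be triggered from scripts via the REST API. A minimal sketch, assuming the cube from this walkthrough (FirstCube) and the host hadoop01:7070; note that Kylin's rebuild endpoint takes the segment boundaries as epoch milliseconds:

```shell
# Kylin expects segment boundaries in epoch milliseconds (UTC);
# GNU date gives epoch seconds, so append three zeros
START=$(date -u -d '2021-01-01' +%s)000
END=$(date -u -d '2021-01-02' +%s)000
echo "building segment ${START}..${END}"

# Hypothetical cube name and host from this walkthrough; uncomment to actually run
# curl -X PUT --user ADMIN:KYLIN -H 'Content-Type: application/json' \
#     -d "{\"startTime\": ${START}, \"endTime\": ${END}, \"buildType\": \"BUILD\"}" \
#     http://hadoop01:7070/kylin/api/cubes/FirstCube/rebuild
```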
REST API
Query in Insight: select job,sum(sal) from emp join dept on dept.deptno = emp.deptno group by job;
Query via the REST API:
The example from the official site, which we adapt:
curl -X POST -H "Authorization: Basic XXXXXXXXX" -H "Content-Type: application/json" -d '{ "sql":"select count(*) from TEST_KYLIN_FACT", "project":"learn_kylin" }' http://localhost:7070/kylin/api/query
curl -X POST --user ADMIN:KYLIN -H "Content-Type: application/json" -d '{ "sql":"select job,sum(sal) from emp join dept on dept.deptno = emp.deptno group by job;", "project":"learn_kylin" }' http://hadoop01:7070/kylin/api/query
[yyx@hadoop01 datas]$ curl -X POST --user ADMIN:KYLIN -H "Content-Type: application/json" -d '{ "sql":"select job,sum(sal) from emp join dept on dept.deptno = emp.deptno group by job;", "project":"learn_kylin" }' http://hadoop01:7070/kylin/api/query
{"columnMetas":[{"isNullable":1,"displaySize":256,"label":"JOB","name":"JOB","schemaName":"DEFAULT","catelogName":null,"tableName":"EMP","precision":256,"scale":0,"columnType":12,"columnTypeName":"VARCHAR","writable":false,"autoIncrement":false,"caseSensitive":true,"searchable":false,"currency":false,"definitelyWritable":false,"signed":true,"readOnly":true},{"isNullable":1,"displaySize":15,"label":"EXPR$1","name":"EXPR$1","schemaName":null,"catelogName":null,"tableName":null,"precision":15,"scale":0,"columnType":8,"columnTypeName":"DOUBLE","writable":false,"autoIncrement":false,"caseSensitive":true,"searchable":false,"currency":false,"definitelyWritable":false,"signed":true,"readOnly":true}],"results":[["ANALYST","6000.0"],["SALESMAN","5600.0"],["CLERK","4150.0"],["PRESIDENT","5000.0"],["MANAGER","8275.0"]],"cube":"CUBE[name=FirstCube]","affectedRowCount":0,"isException":false,"exceptionMessage":null,"duration":213,"totalScanCount":5,"totalScanBytes":240,"hitExceptionCache":false,"storageCacheUsed":false,"traceUrl":null,"pushDown":false,"partial":false}
JDBC
Add the dependency:
<dependencies>
    <dependency>
        <groupId>org.apache.kylin</groupId>
        <artifactId>kylin-jdbc</artifactId>
        <version>2.5.1</version>
    </dependency>
</dependencies>
JDBC code
package com.yyx.kylinjdcb;

import java.sql.*;

public class KylinJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        String driver = "org.apache.kylin.jdbc.Driver";
        String user = "ADMIN";
        String passWord = "KYLIN";
        String url = "jdbc:kylin://hadoop01:7070/learn_kylin";
        // Register the Kylin JDBC driver
        Class.forName(driver);
        // This is only a test, so there is no need to guard against SQL injection
        String sql = "select job,sum(sal) from emp join dept on dept.deptno = emp.deptno group by job;";
        // try-with-resources closes the connection, statement, and result set automatically
        try (Connection connection = DriverManager.getConnection(url, user, passWord);
             PreparedStatement preparedStatement = connection.prepareStatement(sql);
             ResultSet resultSet = preparedStatement.executeQuery()) {
            while (resultSet.next()) {
                System.out.println(resultSet.getString(1) + "\t" + resultSet.getDouble(2));
            }
        }
    }
}
Result: