Big Data Platform: A Detailed Installation and Usage Tutorial for the Data Warehouse Component Apache Kylin

Preface:

Kylin depends on Hive and HBase, so make sure both are installed and working before you begin.

Hive installation tutorial (link)

HBase installation tutorial (link)

Introduction:

Apache Kylin is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop for extremely large datasets. Originally developed at eBay and contributed to the open-source community, it can query huge Hive tables with sub-second latency. It is also a point of national pride, a big-data component developed in China: open-sourced in October 2014, it became an Apache Incubator project that November, and in December 2015 the Apache Software Foundation promoted Kylin to a top-level project. Its rapid rise rode the wave of big-data OLAP, and it stands as a testament to the value of open source.

Getting started with the installation:

Step 1: Download the package and upload it to the Linux server

Official download page (link), and the site even comes in Chinese, a point of pride

File upload tutorial (link)

Step 2: Extract the archive

tar -zxvf apache-kylin-2.6.3-bin-hbase1x.tar.gz -C ../servers/
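A quick sanity check after extracting (paths here assume the /export/servers install root used throughout this tutorial):

cd /export/servers/apache-kylin-2.6.3-bin-hbase1x
ls bin conf lib           # bin holds the scripts, conf the configuration, lib the jars
export KYLIN_HOME=$(pwd)  # optional: kylin.sh can infer this from its own location, but setting it explicitly avoids surprises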

Step 3: Configure Kylin

1. Link in the config files of Kylin's dependencies

cd /export/servers/apache-kylin-2.6.3-bin-hbase1x/conf
ln -s $HADOOP_HOME/etc/hadoop/hdfs-site.xml hdfs-site.xml
ln -s $HADOOP_HOME/etc/hadoop/core-site.xml core-site.xml
ln -s $HBASE_HOME/conf/hbase-site.xml hbase-site.xml
ln -s $HIVE_HOME/conf/hive-site.xml hive-site.xml
ln -s $SPARK_HOME/conf/spark-defaults.conf spark-defaults.conf
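Before moving on, confirm that every symlink resolves to a real file; a dangling link here (for example, if $SPARK_HOME was unset in the current shell) is a common cause of startup failures:

ls -l *.xml spark-defaults.conf   # each entry should point at an existing file, not a broken link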

2. Configure conf/kylin.properties

Delete the existing contents of conf/kylin.properties and replace them with the following:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#    http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The commented values below take effect as the default settings
# Uncomment and override them if necessary
#### METADATA | ENV ###
## The metadata store in hbase
#kylin.metadata.url=kylin_metadata@hbase
## metadata cache sync retry times
#kylin.metadata.sync-retries=3
## Working folder in HDFS, better be qualified absolute path, make sure user has the right permission to this directory
kylin.env.hdfs-working-dir=/apps/kylin
## DEV|QA|PROD. DEV will turn on some dev features, QA and PROD has no difference in terms of functions.
#kylin.env=QA
## kylin zk base path
kylin.env.zookeeper-base-path=/kylin
#### SERVER | WEB | RESTCLIENT ###
## Kylin server mode, valid value [all, query, job]
#kylin.server.mode=all
## List of web servers in use, this enables one web server instance to sync up with other servers.
#kylin.server.cluster-servers=localhost:7070
## Display timezone on UI,format like[GMT+N or GMT-N]
#kylin.web.timezone=
## Timeout value for the queries submitted through the Web UI, in milliseconds
#kylin.web.query-timeout=300000
#kylin.web.cross-domain-enabled=true
##allow user to export query result
#kylin.web.export-allow-admin=true
#kylin.web.export-allow-other=true
## Hide measures in measure list of cube designer, separate by comma
#kylin.web.hide-measures=RAW
##max connections of one route
#kylin.restclient.connection.default-max-per-route=20
##max connections of one rest-client
#kylin.restclient.connection.max-total=200
#### PUBLIC CONFIG ###
#kylin.engine.default=2
#kylin.storage.default=2
#kylin.web.hive-limit=20
#kylin.web.help.length=4
#kylin.web.help.0=start|Getting Started|http://kylin.apache.org/docs/tutorial/kylin_sample.html
#kylin.web.help.1=odbc|ODBC Driver|http://kylin.apache.org/docs/tutorial/odbc.html
#kylin.web.help.2=tableau|Tableau Guide|http://kylin.apache.org/docs/tutorial/tableau_91.html
#kylin.web.help.3=onboard|Cube Design Tutorial|http://kylin.apache.org/docs/howto/howto_optimize_cubes.html
#kylin.web.link-streaming-guide=http://kylin.apache.org/
#kylin.htrace.show-gui-trace-toggle=false
#kylin.web.link-hadoop=
#kylin.web.link-diagnostic=
#kylin.web.contact-mail=
#kylin.server.external-acl-provider=
#### SOURCE ###
## Hive client, valid value [cli, beeline]
#kylin.source.hive.client=cli
## Absolute path to beeline shell, can be set to spark beeline instead of the default hive beeline on PATH
#kylin.source.hive.beeline-shell=beeline
## Parameters for beeline client, only necessary if hive client is beeline
##kylin.source.hive.beeline-params=-n root --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='mapreduce.job.*|dfs.*' -u jdbc:hive2://localhost:10000
## While hive client uses above settings to read hive table metadata,
## table operations can go through a separate SparkSQL command line, given SparkSQL connects to the same Hive metastore.
#kylin.source.hive.enable-sparksql-for-table-ops=false
#kylin.source.hive.sparksql-beeline-shell=/path/to/spark-client/bin/beeline
#kylin.source.hive.sparksql-beeline-params=-n root --hiveconf hive.security.authorization.sqlstd.confwhitelist.append='mapreduce.job.*|dfs.*' -u jdbc:hive2://localhost:10000
kylin.source.hive.keep-flat-table=false
## Hive database name for putting the intermediate flat tables
kylin.source.hive.database-for-flat-table=default
## Whether redistribute the intermediate flat table before building
kylin.source.hive.redistribute-flat-table=true
#### STORAGE ###
## The storage for final cube file in hbase
kylin.storage.url=hbase
## The prefix of hbase table
kylin.storage.hbase.table-name-prefix=KYLIN_
## The namespace for hbase storage
kylin.storage.hbase.namespace=default
## Compression codec for htable, valid value [none, snappy, lzo, gzip, lz4]
kylin.storage.hbase.compression-codec=none
## HBase cluster FileSystem serving HBase, in the format hdfs://hbase-cluster:8020
## Leave it empty if hbase runs on the same cluster as hive and mapreduce
##kylin.storage.hbase.cluster-fs=
## The cut size for hbase region, in GB.
#kylin.storage.hbase.region-cut-gb=5
## The hfile size in GB; a smaller hfile gives the HFile-conversion MR job more reducers and makes it faster.
## Set 0 to disable this optimization.
#kylin.storage.hbase.hfile-size-gb=2
#kylin.storage.hbase.min-region-count=1
#kylin.storage.hbase.max-region-count=500
## Optional information for the owner of kylin platform, it can be your team's email
## Currently it will be attached to each kylin's htable attribute
#kylin.storage.hbase.owner-tag=whoami@kylin.apache.org
#kylin.storage.hbase.coprocessor-mem-gb=3
## By default kylin can spill query's intermediate results to disks when it's consuming too much memory.
## Set it to false if you want query to abort immediately in such condition.
#kylin.storage.partition.aggr-spill-enabled=true
## The maximum number of bytes each coprocessor is allowed to scan.
## To allow arbitrary large scan, you can set it to 0.
#kylin.storage.partition.max-scan-bytes=3221225472
## The default coprocessor timeout is (hbase.rpc.timeout * 0.9) / 1000 seconds,
## You can set it to a smaller value. 0 means use default.
## kylin.storage.hbase.coprocessor-timeout-seconds=0
## clean real storage after delete operation
## if you want to delete the real storage like htable of deleting segment, you can set it to true
#kylin.storage.clean-after-delete-operation=false
#### JOB ###
## Max job retry on error, default 0: no retry
#kylin.job.retry=0
## Max count of concurrent jobs running
#kylin.job.max-concurrent-jobs=10
## The percentage of the sampling, default 100%
#kylin.job.sampling-percentage=100
## If true, will send email notification on job complete
##kylin.job.notification-enabled=true
##kylin.job.notification-mail-enable-starttls=true
##kylin.job.notification-mail-host=smtp.office365.com
##kylin.job.notification-mail-port=587
##kylin.job.notification-mail-username=kylin@example.com
##kylin.job.notification-mail-password=mypassword
##kylin.job.notification-mail-sender=kylin@example.com
#
#
#### ENGINE ###
#
## Time interval to check hadoop job status
#kylin.engine.mr.yarn-check-interval-seconds=10
#
#kylin.engine.mr.reduce-input-mb=500
#
#kylin.engine.mr.max-reducer-number=500
#
#kylin.engine.mr.mapper-input-rows=1000000
#
## Enable dictionary building in MR reducer
#kylin.engine.mr.build-dict-in-reducer=true
#
## Number of reducers for fetching UHC column distinct values
#kylin.engine.mr.uhc-reducer-count=3
#
## Whether using an additional step to build UHC dictionary
#kylin.engine.mr.build-uhc-dict-in-additional-step=false
#
#
#### CUBE | DICTIONARY ###
#
#kylin.cube.cuboid-scheduler=org.apache.kylin.cube.cuboid.DefaultCuboidScheduler
#kylin.cube.segment-advisor=org.apache.kylin.cube.CubeSegmentAdvisor
#
## 'auto', 'inmem', 'layer' or 'random' for testing
#kylin.cube.algorithm=layer
#
## A smaller threshold prefers layer, a larger threshold prefers in-mem
#kylin.cube.algorithm.layer-or-inmem-threshold=7
#
## auto use inmem algorithm:
## 1, cube planner optimize job
## 2, no source record
#kylin.cube.algorithm.inmem-auto-optimize=true
#
#kylin.cube.aggrgroup.max-combination=32768
#
#kylin.snapshot.max-mb=300
#
#kylin.cube.cubeplanner.enabled=true
#kylin.cube.cubeplanner.enabled-for-existing-cube=true
#kylin.cube.cubeplanner.expansion-threshold=15.0
#kylin.cube.cubeplanner.recommend-cache-max-size=200
#kylin.cube.cubeplanner.mandatory-rollup-threshold=1000
#kylin.cube.cubeplanner.algorithm-threshold-greedy=8
#kylin.cube.cubeplanner.algorithm-threshold-genetic=23
#
#
#### QUERY ###
#
## Controls the maximum number of bytes a query is allowed to scan storage.
## The default value 0 means no limit.
## The counterpart kylin.storage.partition.max-scan-bytes sets the maximum per coprocessor.
#kylin.query.max-scan-bytes=0
#
#kylin.query.cache-enabled=true
#
## Controls extras properties for Calcite jdbc driver
## all extra properties should be under the prefix "kylin.query.calcite.extras-props."
## case sensitive, default: true, to enable case insensitive set it to false
## @see org.apache.calcite.config.CalciteConnectionProperty.CASE_SENSITIVE
#kylin.query.calcite.extras-props.caseSensitive=true
## how to handle unquoted identifiers, default: TO_UPPER, available options: UNCHANGED, TO_UPPER, TO_LOWER
## @see org.apache.calcite.config.CalciteConnectionProperty.UNQUOTED_CASING
#kylin.query.calcite.extras-props.unquotedCasing=TO_UPPER
## quoting method, default: DOUBLE_QUOTE, available options: DOUBLE_QUOTE, BACK_TICK, BRACKET
## @see org.apache.calcite.config.CalciteConnectionProperty.QUOTING
#kylin.query.calcite.extras-props.quoting=DOUBLE_QUOTE
## change SqlConformance from DEFAULT to LENIENT to enable group by ordinal
## @see org.apache.calcite.sql.validate.SqlConformance.SqlConformanceEnum
#kylin.query.calcite.extras-props.conformance=LENIENT
#
## TABLE ACL
#kylin.query.security.table-acl-enabled=true
#
## Usually should not modify this
#kylin.query.interceptors=org.apache.kylin.rest.security.TableInterceptor
#
#kylin.query.escape-default-keyword=false
#
## Usually should not modify this
#kylin.query.transformers=org.apache.kylin.query.util.DefaultQueryTransformer,org.apache.kylin.query.util.KeywordDefaultDirtyHack
#
#### SECURITY ###
#
## Spring security profile, options: testing, ldap, saml
## with "testing" profile, user can use pre-defined name/pwd like KYLIN/ADMIN to login
#kylin.security.profile=testing
#
## Admin roles in LDAP, for ldap and saml
#kylin.security.acl.admin-role=admin
#
## LDAP authentication configuration
#kylin.security.ldap.connection-server=ldap://ldap_server:389
#kylin.security.ldap.connection-username=
#kylin.security.ldap.connection-password=
#
## LDAP user account directory;
#kylin.security.ldap.user-search-base=
#kylin.security.ldap.user-search-pattern=
#kylin.security.ldap.user-group-search-base=
#kylin.security.ldap.user-group-search-filter=(|(member={0})(memberUid={1}))
#
## LDAP service account directory
#kylin.security.ldap.service-search-base=
#kylin.security.ldap.service-search-pattern=
#kylin.security.ldap.service-group-search-base=
#
### SAML configurations for SSO
## SAML IDP metadata file location
#kylin.security.saml.metadata-file=classpath:sso_metadata.xml
#kylin.security.saml.metadata-entity-base-url=https://hostname/kylin
#kylin.security.saml.keystore-file=classpath:samlKeystore.jks
#kylin.security.saml.context-scheme=https
#kylin.security.saml.context-server-name=hostname
#kylin.security.saml.context-server-port=443
#kylin.security.saml.context-path=/kylin
#
#### SPARK ENGINE CONFIGS ###
#
## Hadoop conf folder, will export this as "HADOOP_CONF_DIR" to run spark-submit
## This must contain site xmls of core, yarn, hive, and hbase in one folder
kylin.env.hadoop-conf-dir=/export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop
#
## Estimate the RDD partition numbers
kylin.engine.spark.rdd-partition-cut-mb=10
#
## Minimal partition numbers of rdd
kylin.engine.spark.min-partition=1
#
## Max partition numbers of rdd
kylin.engine.spark.max-partition=1000
#
## Spark conf (default is in spark/conf/spark-defaults.conf)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=512M
kylin.engine.spark-conf.spark.executor.memory=1G
kylin.engine.spark-conf.spark.executor.instances=2
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=512
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs://node01/apps/spark2/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs://node01:8020/apps/spark2/spark-history
kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
#
#### Spark conf for specific job
kylin.engine.spark-conf-mergedict.spark.executor.memory=1G
kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar at runtime
kylin.engine.spark-conf.spark.yarn.archive=hdfs://node01:8020/apps/spark2/lib/spark-libs.jar
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
#
## uncomment for HDP
##kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
##kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
#
#
#### QUERY PUSH DOWN ###
#
##kylin.query.pushdown.runner-class-name=org.apache.kylin.query.adhoc.PushDownRunnerJdbcImpl
#
##kylin.query.pushdown.update-enabled=false
##kylin.query.pushdown.jdbc.url=jdbc:hive2://sandbox:10000/default
##kylin.query.pushdown.jdbc.driver=org.apache.hive.jdbc.HiveDriver
##kylin.query.pushdown.jdbc.username=hive
##kylin.query.pushdown.jdbc.password=
#
##kylin.query.pushdown.jdbc.pool-max-total=8
##kylin.query.pushdown.jdbc.pool-max-idle=8
##kylin.query.pushdown.jdbc.pool-min-idle=0
#
#### JDBC Data Source
##kylin.source.jdbc.connection-url=
##kylin.source.jdbc.driver=
##kylin.source.jdbc.dialect=
##kylin.source.jdbc.user=
##kylin.source.jdbc.pass=
##kylin.source.jdbc.sqoop-home=
##kylin.source.jdbc.filed-delimiter=|
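Most of the file above is commented-out defaults. To review only the settings this tutorial actually activates, filter out the comment lines:

grep -v '^#' /export/servers/apache-kylin-2.6.3-bin-hbase1x/conf/kylin.properties | grep -v '^$'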

3. Configure kylin.sh

cd /export/servers/apache-kylin-2.6.3-bin-hbase1x/bin
vim kylin.sh

Append the following to the end of kylin.sh:

export HADOOP_HOME=${HADOOP_HOME}
export HIVE_HOME=${HIVE_HOME}
export HBASE_HOME=${HBASE_HOME}
export SPARK_HOME=${SPARK_HOME}
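These lines simply re-export variables that must already be defined in your login shell (for example in /etc/profile); if any of them is empty, kylin.sh will not find that dependency. A quick check before continuing:

for v in HADOOP_HOME HIVE_HOME HBASE_HOME SPARK_HOME; do
  echo "$v=${!v}"   # each line should print a real installation path, not an empty value
done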

4. Initialize Kylin's working directory on HDFS

hadoop fs -mkdir -p /apps/kylin
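Verify the directory exists, and optionally run the environment checker that ships in Kylin's bin directory; it validates HDFS access, the working directory, and the Hadoop/Hive/HBase dependencies before the first start:

hadoop fs -ls /apps
/export/servers/apache-kylin-2.6.3-bin-hbase1x/bin/check-env.sh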

Step 4: Preparation before starting Kylin

1. Start ZooKeeper

zkServer.sh start
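Confirm ZooKeeper is actually serving before moving on:

zkServer.sh status   # should report Mode: standalone (single node) or leader/follower (ensemble)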

2. Start HDFS

/export/servers/hadoop-2.6.0-cdh5.14.0/sbin/start-all.sh

3. Start the YARN cluster

No separate step needed: Hadoop's start-all.sh above brings up YARN together with HDFS.
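A quick way to confirm the Hadoop daemons are up is jps; on a small cluster you should see roughly the following (the exact set varies with your topology):

jps
# Expect processes such as NameNode, DataNode, ResourceManager, NodeManager, QuorumPeerMain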

4. Start the HBase cluster

start-hbase.sh

5. Start the Hive metastore

nohup hive --service metastore &

6. Start hiveserver2

nohup hive --service hiveserver2 &
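Both services take a moment to come up. Assuming the default ports (9083 for the metastore, 10000 for hiveserver2), you can confirm they are listening with:

netstat -tlnp 2>/dev/null | grep -E ':9083|:10000'   # use ss -tlnp if netstat is unavailable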

7. Start the MapReduce job history server

/export/servers/hadoop-2.6.0-cdh5.14.0/sbin/mr-jobhistory-daemon.sh start historyserver

Step 5: Start Kylin

/export/servers/apache-kylin-2.6.3-bin-hbase1x/bin/kylin.sh start
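Startup takes a minute or two. Follow the log until the web server reports it is up (the exact wording varies by version); errors about missing jars or an unreachable HBase will also show up here first:

tail -f /export/servers/apache-kylin-2.6.3-bin-hbase1x/logs/kylin.log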

If you see output like the screenshot below, it worked. Congratulations 👍!!!

Open the web UI at http://node1:7070/kylin (username: ADMIN, password: KYLIN)

Step 6: Usage tutorial

1. Prepare the test data

Create the database and tables, then load the data (run the following in Hive):

-- 1. Sales fact table: dw_sales
create table dw_sales(
    id string,
    date1 string,
    channelId string,
    productId string,
    regionId string,
    amount int,
    price double
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile;

-- 2. Channel dimension table: dim_channel
-- channelId: channel ID
-- channelName: channel name
create table dim_channel(channelId string, channelName string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile;

-- 3. Product dimension table: dim_product
create table dim_product(productId string, productName string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile;

-- 4. Region dimension table: dim_region
create table dim_region(regionId string, regionName string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile;

-- Load the data
LOAD DATA LOCAL INPATH '/opt/kylindatas/dw_sales_data.txt' OVERWRITE INTO TABLE dw_sales;
LOAD DATA LOCAL INPATH '/opt/kylindatas/dim_channel_data.txt' OVERWRITE INTO TABLE dim_channel;
LOAD DATA LOCAL INPATH '/opt/kylindatas/dim_product_data.txt' OVERWRITE INTO TABLE dim_product;
LOAD DATA LOCAL INPATH '/opt/kylindatas/dim_region_data.txt' OVERWRITE INTO TABLE dim_region;
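A quick sanity check that the tables were created and the data actually landed (row counts depend on your sample files):

hive -e 'select count(*) from dw_sales; select * from dim_channel limit 5;'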

 

2. Benchmark the query in Hive

Requirement: for each date and channel, report the total transaction amount and the total transaction count.

select
    date1, channelid,
    sum(price) as total_money,
    sum(amount) as total_amount
from dw_sales
group by date1, channelid;
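To get a baseline, time the query from the shell; running on MapReduce, Hive typically needs tens of seconds even for small data like this:

time hive -e 'select date1, channelid, sum(price) as total_money, sum(amount) as total_amount from dw_sales group by date1, channelid;'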

First query result (screenshot):

Second query result (screenshot):

Now reproduce the same requirement with Kylin:

1. Create a project

2. Create a data source (DataSource)

3. Create a model (Model)

Set the model name

Select the fact table

Select the dimension columns from the fact table that may be needed

Select the measure columns

Review the created project

4. Create the cube (Cube)

Select the dimensions the cube will use

Add the measures/metrics to compute

Change the build engine

Select MapReduce as the build engine

The cube is created

5. Trigger a build

Wait for the build to finish

Build finished

6. Run the same SQL through Kylin; the results now come from the cube

First run: 3.88 s

Second run: 0.01 s, a sub-second response
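Besides the web UI, you can run the same query over Kylin's REST API, which is handy for wiring Kylin into other tools. A minimal sketch with curl (ADMIN:KYLIN are the default credentials; replace your_project_name with the project created in step 1 above):

curl -s -X POST -u ADMIN:KYLIN -H 'Content-Type: application/json' \
  -d '{"sql": "select date1, channelid, sum(price) as total_money, sum(amount) as total_amount from dw_sales group by date1, channelid", "project": "your_project_name", "limit": 50}' \
  http://node1:7070/kylin/api/query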

If you have read this far, why not leave a like 🌹🌹🌹

👇 Like after reading 👍 and make it a habit 😘!!!
