How To Configure Elasticsearch on Hadoop with HDP

Reposted 2015-11-18 18:19

Original article: http://www.tuicool.com/articles/Jryyme


Elasticsearch’s engine integrates with Hortonworks Data Platform 2.0 and YARN to provide real-time search and access to information in Hadoop.

See it in action: register for the Hortonworks and Elasticsearch webinar on March 5th, 2014 at 10am PST / 1pm EST to see the demo and an outline of best practices for integrating Elasticsearch and HDP 2.0 to extract maximum insight from your data. Click here to register for this exciting and informative webinar!

Try it yourself: get started with this tutorial using Elasticsearch and the Hortonworks Data Platform (or the Hortonworks Sandbox) to access server logs in Kibana, using Apache Flume for ingestion.

Architecture

The following diagram depicts the proposed architecture for indexing logs into Elasticsearch in near real time while also saving them to Hadoop for long-term batch analytics.

[Figure: proposed log-indexing architecture]

Components

Elasticsearch

Elasticsearch is a search engine that can index new documents in near real time and make them immediately available for querying. Elasticsearch is based on Apache Lucene and allows for setting up clusters of nodes that store any number of indices in a distributed, fault-tolerant way. If a node disappears, the cluster rebalances the (shards of) indices over the remaining nodes. You can configure how many shards make up each index and how many replicas of these shards there should be. If a primary shard goes offline, one of its replicas is promoted to primary and used to repopulate another node.

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into storage destinations such as the Hadoop Distributed File System (HDFS). It has a simple, flexible architecture based on streaming data flows, and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.

Kibana

Kibana is an open source (Apache-licensed), browser-based analytics and search interface for Logstash and other timestamped data sets stored in Elasticsearch. Kibana strives to be easy to get started with, while also being flexible and powerful.

System Requirements

  • Hadoop: Hortonworks Data Platform 2.0 (HDP 2.0) or the Hortonworks Sandbox for HDP 2.0
  • OS: 64-bit RHEL (Red Hat Enterprise Linux) 6, CentOS 6, or Oracle Linux 6
  • Software: yum, rpm, unzip, tar, wget, java
  • JDK: Oracle JDK 1.7 (64-bit), Oracle JDK 1.6 update 31, or OpenJDK 7

Java Installation

Note: Define the JAVA_HOME environment variable and add the Java binaries to your PATH environment variable.

Execute the following commands to set the variables and verify that Java is on the PATH:

export JAVA_HOME=/usr/java/default 
export PATH=$JAVA_HOME/bin:$PATH 
java -version

Flume Installation

Execute the following command to install the Flume binaries and agent scripts:
yum install flume-agent flume
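To confirm that the binaries are in place, you can ask the Flume launcher for its version (this assumes the HDP packages put the flume-ng script on the PATH):

flume-ng version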

Elasticsearch Installation

The latest Elasticsearch release can be downloaded from http://www.elasticsearch.org/download/

The RPM used in this tutorial can be downloaded from https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm

To install Elasticsearch on the data nodes:
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.noarch.rpm

rpm -ivh elasticsearch-0.90.7.noarch.rpm
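If Elasticsearch should come up automatically at boot, the init script installed by the RPM can be registered with chkconfig (standard on RHEL/CentOS 6; adjust for other init systems):

chkconfig --add elasticsearch
chkconfig elasticsearch on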

Setup and configure Elasticsearch

Update the following properties in /etc/elasticsearch/elasticsearch.yml:

  • Set the cluster name: cluster.name: "logsearch"
  • Set the node name: node.name: "node1"
  • By default every node is eligible to be a master and stores data. This can be adjusted with:
    • node.master: true
    • node.data: true
  • The number of shards can be adjusted with index.number_of_shards: 5
  • The number of replicas (additional copies) can be set with index.number_of_replicas: 1
  • Adjust the data paths with path.data: /data1,/data2,/data3,/data4
  • discovery.zen.minimum_master_nodes: 1 ensures a node sees that many other master-eligible nodes before a master is elected; set it based on the size of the cluster.
  • discovery.zen.ping.timeout: 3s is the time to wait for ping responses from other nodes during discovery; use a higher value on slow or congested networks.
  • Disable multicast discovery only if multicast is not supported on the network: discovery.zen.ping.multicast.enabled: false

Note: If multicast is disabled, configure an initial list of master nodes in the cluster with discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]. A consolidated example of these settings follows.
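Putting the settings above together, a minimal /etc/elasticsearch/elasticsearch.yml for the first node in this tutorial might look like the following sketch; hostnames, ports, and data paths are placeholders to adapt to your environment:

cluster.name: "logsearch"
node.name: "node1"
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path.data: /data1,/data2,/data3,/data4
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:9300"]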

Logging properties can be adjusted in /etc/elasticsearch/logging.yml. The default log location is /var/log/elasticsearch.

Starting and Stopping Elasticsearch

  • To start Elasticsearch: /etc/init.d/elasticsearch start
  • To stop Elasticsearch: /etc/init.d/elasticsearch stop
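Once the service is up, a quick way to confirm that the node is running and has joined the expected cluster is to query the REST API (replace localhost with your node's address):

curl 'http://localhost:9200/_cluster/health?pretty'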

Kibana Installation

Download the Kibana binaries from the following URL https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz

wget https://download.elasticsearch.org/kibana/kibana/kibana-3.0.0milestone4.tar.gz

Extract the archive with tar -zxvf kibana-3.0.0milestone4.tar.gz
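Kibana 3 is a purely static, browser-side application, so nothing else needs to be installed; if you prefer serving it from a web server instead of opening it from the filesystem, copying the extracted directory under an existing document root is enough (the path below assumes Apache httpd's default docroot):

cp -r kibana-3.0.0milestone4 /var/www/html/kibana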

Setup and configure Kibana

  • Open the config.js file in the extracted directory.
  • Set the elasticsearch parameter to the fully qualified hostname or IP of your Elasticsearch server, e.g. elasticsearch: "http://<your_es_host>:9200"
  • Open index.html in your browser to access the Kibana UI.
  • Update the logstash index pattern to the pattern produced by the Flume sink: edit app/dashboards/logstash.json and replace all occurrences of [logstash-]YYYY.MM.DD with [logstash-]YYYY-MM-DD (one way to script this is shown below).
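The same replacement can be scripted; the command below assumes GNU sed and that it is run from the extracted Kibana directory:

sed -i 's/\[logstash-\]YYYY\.MM\.DD/[logstash-]YYYY-MM-DD/g' app/dashboards/logstash.json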

Setup and configure Flume

For demonstration purposes, let's set up and configure a Flume agent on the host where the log file to be consumed lives, using the following Flume configuration.

Create the plugins.d directory and copy the Elasticsearch dependencies into it:

mkdir /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/elasticsearch-0.90*.jar /usr/lib/flume/plugins.d
cp $elasticsearch_home/lib/lucene-core-*.jar /usr/lib/flume/plugins.d

Update the Flume configuration to consume a local file and index it into Elasticsearch in logstash format. Note: in real-world use cases, the Flume Log4j Appender, Syslog TCP Source, Flume Client SDK, or Spooling Directory Source are preferred over tailing logs.

agent.sources = tail
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sources.tail.channels = memoryChannel
agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /tmp/es_log.log
agent.sources.tail.interceptors = i1 i2 i3
agent.sources.tail.interceptors.i1.type = regex_extractor
agent.sources.tail.interceptors.i1.regex = (\\w.*):(\\w.*):(\\w.*)\\s
agent.sources.tail.interceptors.i1.serializers = s1 s2 s3
agent.sources.tail.interceptors.i1.serializers.s1.name = source
agent.sources.tail.interceptors.i1.serializers.s2.name = type
agent.sources.tail.interceptors.i1.serializers.s3.name = src_path
agent.sources.tail.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent.sources.tail.interceptors.i3.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent.sources.tail.interceptors.i3.hostHeader = host
agent.sinks = elasticsearch
agent.sinks.elasticsearch.channel = memoryChannel
agent.sinks.elasticsearch.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.elasticsearch.batchSize = 100
agent.sinks.elasticsearch.hostNames = 172.16.55.129:9300
agent.sinks.elasticsearch.indexName = logstash
agent.sinks.elasticsearch.clusterName = logsearch
agent.sinks.elasticsearch.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

Prepare sample data for a simple test

Create a file /tmp/es_log.log with the following data:

website:weblog:login_page weblog data1
website:weblog:profile_page weblog data2
website:weblog:transaction_page weblog data3
website:weblog:docs_page weblog data4
syslog:syslog:sysloggroup syslog data1
syslog:syslog:sysloggroup syslog data2
syslog:syslog:sysloggroup syslog data3
syslog:syslog:sysloggroup syslog data4

Restart Flume

/etc/init.d/flume-agent restart
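After the agent has been running for a few seconds, you can verify that events are being indexed before opening Kibana; the query below assumes Elasticsearch is reachable on localhost and that the Flume sink has already created a logstash-* index:

curl 'http://localhost:9200/logstash-*/_search?q=weblog&pretty'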

Searching and Dashboarding with Kibana

Open $KIBANA_HOME/index.html in a browser. By default, the welcome page is shown.

[Screenshot: Kibana welcome page]
Click on “Logstash Dashboard” and select an appropriate time range to look at the charts based on the time-stamped fields.

[Screenshot: Logstash dashboard]

These screenshots show the various charts available on search fields, e.g. pie, bar, and table charts.

[Screenshots: pie, bar, and table charts]

Content can be searched with custom filters, and graphs can be plotted from the search results as shown below.

[Screenshot: custom filter search with plotted results]

Batch Indexing using MapReduce/Hive/Pig

Elasticsearch’s real-time search and analytics are natively integrated with Hadoop and support MapReduce, Cascading, Hive, and Pig.

  • MR2/YARN: ESInputFormat / ESOutputFormat (MapReduce input and output formats provided by the library)
  • Hive: org.elasticsearch.hadoop.hive.ESStorageHandler (Hive SerDe implementation)
  • Pig: org.elasticsearch.hadoop.pig.ESStorage (Pig storage handler)

Detailed documentation and examples for the Elasticsearch Hadoop integration can be found at https://github.com/elasticsearch/elasticsearch-hadoop
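As a flavour of the Hive integration, the sketch below maps an external Hive table onto the logstash index used in this tutorial. The storage handler class matches the list above, but the column names and the es.resource table property are illustrative assumptions; check the documentation linked above for the exact property names supported by your elasticsearch-hadoop version.

CREATE EXTERNAL TABLE es_logs (event_source STRING, event_type STRING, src_path STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
TBLPROPERTIES('es.resource' = 'logstash/logs');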

Thoughts on Best Practices

  1. Set discovery.zen.minimum_master_nodes to N/2 + 1, where N is the number of master-eligible nodes, to avoid split-brain situations.
  2. Set action.disable_delete_all_indices to prevent accidental deletion of all indices.
  3. Set gateway.recover_after_nodes to the number of nodes that need to be up before the recovery process starts replicating data around the cluster.
  4. Relax the near-real-time refresh from 1 second to something a bit higher (index.engine.robin.refresh_interval).
  5. Increase the memory allocated to the Elasticsearch node; by default it is 1g (see the sketch after this list).
  6. Use Java 7 if possible for better Elasticsearch performance.
  7. Set index.fielddata.cache: soft to avoid OutOfMemory errors.
  8. Use larger batch sizes in the Flume sink, e.g. 1000, for higher throughput.
  9. Increase the open file limits for Elasticsearch (see the sketch after this list).
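For items 5 and 9 above, on the RPM install used in this tutorial both the heap size and the open file limit can be raised in /etc/sysconfig/elasticsearch; the variable names below are the ones shipped with 0.90-era packages, so treat them as an assumption and verify against your own file:

# /etc/sysconfig/elasticsearch
ES_HEAP_SIZE=4g        # give Elasticsearch more than the default 1g heap
MAX_OPEN_FILES=65535   # raise the open file limit for the elasticsearch user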
