ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch

Reposted 2015-11-18 18:16:29

Original article: http://www.javacodegeeks.com/2014/05/elasticsearch-hadoop-indexing-product-views-count-and-customer-top-search-query-from-hadoop-to-elasticsearch.html


This post covers how to use ElasticSearch-Hadoop to read data from a Hadoop system and index it in ElasticSearch. The functionality covered is indexing the product views count and the top search query per customer over the last n days. The analyzed data can then be used on the website to display a customer's recently viewed items, product view counts and top search query strings.

This post continues the previous posts in this series. They covered how customer search click data is gathered using Flume and stored in Hadoop HDFS and ElasticSearch, and how the same data is analyzed using Hive to generate statistical data. Here we will see how to use the analyzed data to enhance the customer experience on the website and make it relevant for the end customers.

Recently Viewed Items

We already covered in the first part how the Flume ElasticSearch sink can index the recently viewed items directly to an ElasticSearch instance, and how that data can be used to display the items a customer clicked in real time.

ElasticSearch-Hadoop

Elasticsearch for Apache Hadoop allows Hadoop jobs to interact with ElasticSearch using a small library and an easy setup.

elasticsearch-hadoop-hive allows access to ElasticSearch from Hive. As shared in the previous post, the product views count and the customer top search query data have already been extracted into Hive tables. We will read that data and index it into ElasticSearch so that it can be displayed on the website.

elasticsearch-hadoop-hive

Product views count functionality

Take a scenario where the website displays each product's total views by customers in the last n days. For a better user experience, the same functionality can show the end customer how other customers perceive the same product.

Hive Data for product views

Select sample data from hive table:

# search.search_productviews : id, productid, viewcount
61, 61, 15
48, 48, 8
16, 16, 40
85, 85, 7

Product Views Count Indexing

Create the Hive external table “search_productviews_to_es” to index data to the ElasticSearch instance.

USE search;
DROP TABLE IF EXISTS search_productviews_to_es;
-- External table backed by the ElasticSearch index/type productviews/productview
CREATE EXTERNAL TABLE search_productviews_to_es (id STRING, productid BIGINT, viewcount INT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'productviews/productview', 'es.nodes' = 'localhost', 'es.port' = '9210', 'es.input.json' = 'false', 'es.write.operation' = 'index', 'es.mapping.id' = 'id', 'es.index.auto.create' = 'yes');
-- Writing to the external table sends every selected row to ElasticSearch
INSERT OVERWRITE TABLE search_productviews_to_es SELECT qcust.id, qcust.productid, qcust.viewcount FROM search_productviews qcust;
  •  The external table search_productviews_to_es is created and points to the ES instance.
  •  The ElasticSearch instance configuration used is localhost:9210.
  •  The index “productviews” and the document type “productview” are used to index the data.
  •  The index and mappings are created automatically if they do not exist.
  •  Insert overwrite replaces a document that already exists, matched on the id field.
  •  The data is inserted by selecting it from another Hive table, “search_productviews”, which stores the analytic/statistical data.
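
The effect of 'es.write.operation' = 'index' together with 'es.mapping.id' = 'id' can be pictured with a small in-memory model. This is only a sketch of the behaviour described in the bullets above, not the elasticsearch-hadoop API: the class IndexByIdSketch, the record ProductViewRow and the map standing in for the index are all hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy in-memory model (NOT the elasticsearch-hadoop API) of the "index"
 * write operation with 'es.mapping.id' = 'id': every row becomes a document
 * keyed by its id, and a later row with the same id replaces the earlier one.
 */
final class IndexByIdSketch {

    /** One row of search_productviews: id, productid, viewcount. */
    record ProductViewRow(String id, long productId, int viewCount) {}

    // Stands in for the productviews index: document id -> document.
    private final Map<String, ProductViewRow> index = new LinkedHashMap<>();

    /** Adds the row, or replaces the document that already has this id. */
    void index(ProductViewRow row) {
        index.put(row.id(), row);
    }

    int size() {
        return index.size();
    }

    ProductViewRow get(String id) {
        return index.get(id);
    }

    public static void main(String[] args) {
        IndexByIdSketch es = new IndexByIdSketch();
        es.index(new ProductViewRow("48", 48L, 8));
        es.index(new ProductViewRow("61", 61L, 15));
        // A later run of the Hive job sends product 48 again with a new count.
        es.index(new ProductViewRow("48", 48L, 10));
        System.out.println(es.size() + " documents, product 48 -> " + es.get("48"));
    }
}
```

Running the Hive script repeatedly therefore keeps one document per id and refreshes its view count, rather than accumulating duplicates.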

Execute the Hive script from Java to index the product views data (HiveSearchClicksServiceImpl.java):

// Load the Hive script from the classpath and run it through the injected HiveRunner
Collection<HiveScript> scripts = new ArrayList<>();
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_productviews_to_es.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
hiveRunner.call();

productviews index sample data

The sample data in ElasticSearch index is stored as below:

{id=48, productid=48, viewcount=10}
{id=49, productid=49, viewcount=20}
{id=5, productid=5, viewcount=18}
{id=6, productid=6, viewcount=9}
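
To display a view count on the website, a document like the ones above has to be fetched by its id. The sketch below only builds the document URL; it assumes the typed REST endpoint form /{index}/{type}/{id} of the ElasticSearch versions that still use document types, and it reuses localhost:9210 from the table properties. The class name ProductViewUri and its method are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Builds the REST URL of one indexed document (assumed typed endpoint form). */
final class ProductViewUri {

    // Host and port as configured in the TBLPROPERTIES of the Hive table.
    static final String HOST = "localhost";
    static final int PORT = 9210;

    /** Returns http://HOST:PORT/{index}/{type}/{id}. */
    static URI documentUri(String index, String type, String id) {
        try {
            // The multi-argument constructor quotes characters that are illegal in a path.
            return new URI("http", null, HOST, PORT, "/" + index + "/" + type + "/" + id, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build document URI", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(documentUri("productviews", "productview", "48"));
    }
}
```

An HTTP GET on that URL would return the stored document; the HTTP call itself is left out here so the example stays independent of a running ElasticSearch instance.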

Customer top search query string functionality

Take a scenario where you want to display the top search query strings of a single customer, or of all customers, on the website. The same data can be used to display a top search query cloud on the website.

Hive Data for customer top search queries

Select sample data from hive table:

# search.search_customerquery : id, querystring, count, customerid
61_queryString59, queryString59, 5, 61
298_queryString48, queryString48, 3, 298
440_queryString16, queryString16, 1, 440
47_queryString85, queryString85, 1, 47
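
In the sample rows the id looks like the customer id and the query string joined by an underscore. The post does not say how the id is generated, so the following is a sketch under that assumption; CustomerQueryId and its methods are hypothetical helper names.

```java
/**
 * Helper for ids of the assumed form customerid_querystring,
 * as suggested by the sample rows (for example 61_queryString59).
 */
final class CustomerQueryId {

    /** Joins customer id and query string with an underscore. */
    static String build(long customerId, String queryString) {
        return customerId + "_" + queryString;
    }

    /** Extracts the customer id, which is everything before the first underscore. */
    static long customerIdOf(String id) {
        return Long.parseLong(id.substring(0, separator(id)));
    }

    /** Extracts the query string, which may itself contain underscores. */
    static String queryStringOf(String id) {
        return id.substring(separator(id) + 1);
    }

    private static int separator(String id) {
        int sep = id.indexOf('_');
        if (sep <= 0) {
            throw new IllegalArgumentException("not a customerid_querystring id: " + id);
        }
        return sep;
    }

    public static void main(String[] args) {
        String id = build(61L, "queryString59");
        System.out.println(id + " -> customer " + customerIdOf(id) + ", query " + queryStringOf(id));
    }
}
```

Because such an id is unique per customer and query, re-indexing the same pair updates its count instead of adding a second document.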

Customer Top search queries Indexing

Create the Hive external table “search_customerquery_to_es” to index data to the ElasticSearch instance.

USE search;
DROP TABLE IF EXISTS search_customerquery_to_es;
-- External table backed by the ElasticSearch index/type topqueries/custquery
CREATE EXTERNAL TABLE search_customerquery_to_es (id STRING, customerid BIGINT, querystring STRING, querycount INT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'topqueries/custquery', 'es.nodes' = 'localhost', 'es.port' = '9210', 'es.input.json' = 'false', 'es.write.operation' = 'index', 'es.mapping.id' = 'id', 'es.index.auto.create' = 'yes');
-- Writing to the external table sends every selected row to ElasticSearch
INSERT OVERWRITE TABLE search_customerquery_to_es SELECT qcust.id, qcust.customerid, qcust.queryString, qcust.querycount FROM search_customerquery qcust;
  •  The external table search_customerquery_to_es is created and points to the ES instance.
  •  The ElasticSearch instance configuration used is localhost:9210.
  •  The index “topqueries” and the document type “custquery” are used to index the data.
  •  The index and mappings are created automatically if they do not exist.
  •  Insert overwrite replaces a document that already exists, matched on the id field.
  •  The data is inserted by selecting it from another Hive table, “search_customerquery”, which stores the analytic/statistical data.

Execute the Hive script from Java to index the data (HiveSearchClicksServiceImpl.java):

// Load the Hive script from the classpath and run it through the injected HiveRunner
Collection<HiveScript> scripts = new ArrayList<>();
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_customerquery_to_es.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
hiveRunner.call();

topqueries index sample data

The topqueries index data on ElasticSearch instance is as shown below:

{id=474_queryString95, querystring=queryString95, querycount=10, customerid=474}
{id=482_queryString43, querystring=queryString43, querycount=5, customerid=482}
{id=482_queryString64, querystring=queryString64, querycount=7, customerid=482}
{id=483_queryString6, querystring=queryString6, querycount=2, customerid=483}
{id=487_queryString86, querystring=queryString86, querycount=111, customerid=487}
{id=494_queryString67, querystring=queryString67, querycount=1, customerid=494}

The functionality described above is only a sample and of course needs to be extended to map to a specific business scenario. It may cover the business scenario of displaying a search query cloud to customers on the website, or feed further Business Intelligence analytics.
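
As an illustration of how the topqueries documents could drive a per-customer display, the sketch below picks the most frequent queries of one customer from a list of documents. It is a minimal in-memory version with hypothetical names (TopQueriesSketch, CustomerQuery, topQueries); on a real site the filtering and sorting would more likely be expressed as an ElasticSearch query.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Selects the top search queries of one customer from topqueries documents. */
final class TopQueriesSketch {

    /** One document of the topqueries index. */
    record CustomerQuery(String id, String queryString, int queryCount, long customerId) {}

    /** Returns up to {@code limit} query strings of the customer, most frequent first. */
    static List<String> topQueries(List<CustomerQuery> docs, long customerId, int limit) {
        return docs.stream()
                .filter(d -> d.customerId() == customerId)
                .sorted(Comparator.comparingInt(CustomerQuery::queryCount).reversed())
                .limit(limit)
                .map(CustomerQuery::queryString)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Documents copied from the topqueries sample data in this post.
        List<CustomerQuery> docs = List.of(
                new CustomerQuery("474_queryString95", "queryString95", 10, 474L),
                new CustomerQuery("482_queryString43", "queryString43", 5, 482L),
                new CustomerQuery("482_queryString64", "queryString64", 7, 482L),
                new CustomerQuery("483_queryString6", "queryString6", 2, 483L));
        System.out.println(topQueries(docs, 482L, 5));
    }
}
```

Dropping the customer filter and summing counts per query string would give the input for a site-wide query cloud.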

Spring Data

Spring Data Elasticsearch has also been included, for testing purposes, to create ES repositories that count the total records and delete all of them.
Check the service for details: ElasticSearchRepoServiceImpl.java

Total product views:

@Document(indexName = "productviews", type = "productview", indexStoreType = "fs", shards = 1, replicas = 0, refreshInterval = "-1")
public class ProductView {
    @Id
    private String id;
    @Version
    private Long version;
    private Long productId;
    private int viewCount;
    ...
    ...
}

public interface ProductViewElasticsearchRepository extends ElasticsearchCrudRepository<ProductView, String> { }

long count = productViewElasticsearchRepository.count();

Customer top search queries:

@Document(indexName = "topqueries", type = "custquery", indexStoreType = "fs", shards = 1, replicas = 0, refreshInterval = "-1")
public class CustomerTopQuery {
    @Id
    private String id;
    @Version
    private Long version;
    private Long customerId;
    private String queryString;
    private int count;
    ...
    ...
}

public interface TopQueryElasticsearchRepository extends ElasticsearchCrudRepository<CustomerTopQuery, String> { }

long count = topQueryElasticsearchRepository.count();

In later posts we will cover analyzing the data further using scheduled jobs:

  • Using Oozie to schedule coordinated jobs for Hive partitions and a bundle job to index data to ElasticSearch.
  • Using Pig to count the total number of unique customers, etc.
