Flume: Gathering customer product search clicks data using Apache Flume

repost from: http://www.javacodegeeks.com/2014/05/flume-gathering-customer-product-search-clicks-data-using-apache-flume.html

This post covers how to use Apache Flume to gather customer product search clicks and store the information using Hadoop and ElasticSearch sinks. The data may consist of different product search events, such as filtering based on different facets, sorting information, pagination information, the products viewed, and products marked as favourite by the customers. In later posts we will analyze the data further and use the same information for display and analytics.

Product Search Functionality

Any eCommerce platform offers different products to customers, and search functionality is one of its basics. Allowing the user guided navigation using different facets/filters, or free-text search of the content, is a standard part of any existing search functionality.

SearchQueryInstruction

Consider a scenario where a customer can search for a product, and we capture the product search behaviour with the following information:

public class SearchQueryInstruction implements Serializable {
    @JsonIgnore
    private final String _eventIdSuffix;
    private String eventId;
    private String hostedMachineName;
    private String pageUrl;
    private Long customerId;
    private String sessionId;
    private String queryString;
    private String sortOrder;
    private Long pageNumber;
    private Long totalHits;
    private Long hitsShown;
    private final Long createdTimeStampInMillis;
    private String clickedDocId;
    private Boolean favourite;
    @JsonIgnore
    private Map<String, Set<String>> filters;
    @JsonProperty(value = "filters")
    private List<FacetFilter> _filters;

    public SearchQueryInstruction() {
        _eventIdSuffix = UUID.randomUUID().toString();
        createdTimeStampInMillis = new Date().getTime();
    }
    ...
    ...

    private static class FacetFilter implements Serializable {
        private String code;
        private String value;

        public FacetFilter(String code, String value) {
            this.code = code;
            this.value = value;
        }
        ...
        ...
    }
}

Further source information is available at SearchQueryInstruction. The data is serialized in JSON format so that it can be used directly with ElasticSearch for further display purposes.
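As a minimal sketch of that serialization step (the converter class below and the use of Jackson's ObjectMapper are assumptions for illustration; the annotations on SearchQueryInstruction suggest Jackson, but the exact setup in the repository may differ), the instruction object can be turned into the JSON shown in the sample data that follows:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper to serialize a SearchQueryInstruction to JSON before
// handing it to the embedded Flume agent.
public class SearchQueryInstructionJsonConverter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static String toJson(final SearchQueryInstruction instruction) throws JsonProcessingException {
        return MAPPER.writeValueAsString(instruction);
    }
}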

Sample data showing how the click information looks based on user clicks. The data is converted to JSON format before being sent to the embedded Flume agent.

1 {"eventid":"629e9b5f-ff4a-4168-8664-6c8df8214aa7-1399386809805-24","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/5","customerid":24,"sessionid":"648a011d-570e-48ef-bccc-84129c9fa400","querystring":null,"sortorder":"desc","pagenumber":3,"totalhits":28,"hitsshown":7,"createdtimestampinmillis":1399386809805,"clickeddocid":"41","favourite":null,"eventidsuffix":"629e9b5f-ff4a-4168-8664-6c8df8214aa7","filters":[{"code":"searchfacettype_color_level_2","value":"Blue"},{"code":"searchfacettype_age_level_2","value":"12-18 years"}]}
2 {"eventid":"648b5cf7-7ca9-4664-915d-23b0d45facc4-1399386809782-298","hostedmachinename":"192.168.182.1333","pageurl":"http://jaibigdata.com/4","customerid":298,"sessionid":"7bf042ea-526a-4633-84cd-55e0984ea2cb","querystring":"queryString48","sortorder":"desc","pagenumber":0,"totalhits":29,"hitsshown":19,"createdtimestampinmillis":1399386809782,"clickeddocid":"9","favourite":null,"eventidsuffix":"648b5cf7-7ca9-4664-915d-23b0d45facc4","filters":[{"code":"searchfacettype_color_level_2","value":"Green"}]}
3 {"eventid":"74bb7cfe-5f8c-4996-9700-0c387249a134-1399386809799-440","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/1","customerid":440,"sessionid":"940c9a0f-a9b2-4f1d-b114-511ac11bf2bb","querystring":"queryString16","sortorder":"asc","pagenumber":3,"totalhits":5,"hitsshown":32,"createdtimestampinmillis":1399386809799,"clickeddocid":null,"favourite":null,"eventidsuffix":"74bb7cfe-5f8c-4996-9700-0c387249a134","filters":[{"code":"searchfacettype_brand_level_2","value":"Apple"}]}
4 {"eventid":"9da05913-84b1-4a74-89ed-5b6ec6389cce-1399386809828-143","hostedmachinename":"192.168.182.1332","pageurl":"http://jaibigdata.com/1","customerid":143,"sessionid":"08a4a36f-2535-4b0e-b86a-cf180202829b","querystring":null,"sortorder":"desc","pagenumber":0,"totalhits":21,"hitsshown":34,"createdtimestampinmillis":1399386809828,"clickeddocid":"38","favourite":true,"eventidsuffix":"9da05913-84b1-4a74-89ed-5b6ec6389cce","filters":[{"code":"searchfacettype_color_level_2","value":"Blue"},{"code":"product_price_range","value":"10.0 - 20.0"}]}

Apache Flume

Apache Flume is used to gather and aggregate the data. Here an embedded Flume agent is used to capture the SearchQueryInstruction events. In a real scenario, based on the usage,

  • you can use an embedded agent to collect the data,
  • or push the data from the page through a REST API to a backend service dedicated to event collection,
  • or use the application logging functionality to log all search events and tail the log file to collect the data.

Consider a scenario where, depending on the application, multiple web/app servers send event data to a collector Flume agent. As depicted in the diagram below, the search click events are collected from the multiple web/app servers by a collector/consolidator agent. The data is further split by a selector using the multiplexing strategy: everything is stored in Hadoop HDFS, and the relevant data, e.g. recently viewed items, is also directed to ElasticSearch.

[Figure: flume-dataflow-agent-sinks — search click events flowing from the web/app server agents to a collector agent and on to the HDFS and ElasticSearch sinks]

Embedded Flume Agent

An embedded Flume agent allows us to include the Flume agent within the application itself, collect the data, and send it on to the collector agent.

private static EmbeddedAgent agent;

    private void createAgent() {
        final Map<String, String> properties = new HashMap<String, String>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "100000");
        properties.put("channel.transactionCapacity", "1000");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "localhost");
        properties.put("sink1.port", "44444");
        properties.put("processor.type", "default");
        try {
            agent = new EmbeddedAgent("searchqueryagent");
            agent.configure(properties);
            agent.start();
        } catch (final Exception ex) {
            LOG.error("Error creating agent!", ex);
        }
    }
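Once the agent is started, the application can hand each serialized search event to it. A minimal sketch of that step is shown below; the sendSearchEvent helper and the State header value are assumptions for illustration (the State header is what the multiplexing channel selector later uses for routing):

// Hypothetical helper; assumes the embedded agent created above and a JSON-serialized event.
private void sendSearchEvent(final String searchEventJson) throws EventDeliveryException {
    final Map<String, String> headers = new HashMap<String, String>();
    // Routing header inspected later by the multiplexing channel selector.
    headers.put("State", "VIEWED");
    final Event event = EventBuilder.withBody(searchEventJson.getBytes(), headers);
    agent.put(event);
}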

Store Search Events Data

Flume provides multiple sink options to store the data for future analysis. As shown in the diagram, we will take the scenario of storing the data in Apache Hadoop and also in ElasticSearch for the recently viewed items functionality.

Hadoop Sink

The HDFS sink allows us to store the data permanently in HDFS so that it can be analyzed later for analytics.
Based on the incoming event data, let's say we want to store it on an hourly basis. The directory "/searchevents/2014/05/15/16" will then store all incoming events for hour 16.

private HDFSEventSink sink;

        sink = new HDFSEventSink();
        sink.setName("HDFSEventSink-" + UUID.randomUUID());
        channel = new MemoryChannel();
        Map<String, String> channelParamters = new HashMap<>();
        channelParamters.put("capacity", "100000");
        channelParamters.put("transactionCapacity", "1000");
        Context channelContext = new Context(channelParamters);
        Configurables.configure(channel, channelContext);
        channel.setName("HDFSEventSinkChannel-" + UUID.randomUUID());

        Map<String, String> paramters = new HashMap<>();
        paramters.put("hdfs.type", "hdfs");
        String hdfsBasePath = hadoopClusterService.getHDFSUri()
                + "/searchevents";
        paramters.put("hdfs.path", hdfsBasePath + "/%Y/%m/%d/%H");
        paramters.put("hdfs.filePrefix", "searchevents");
        paramters.put("hdfs.fileType", "DataStream");
        paramters.put("hdfs.rollInterval", "0");
        paramters.put("hdfs.rollSize", "0");
        paramters.put("hdfs.idleTimeout", "1");
        paramters.put("hdfs.rollCount", "0");
        paramters.put("hdfs.batchSize", "1000");
        paramters.put("hdfs.useLocalTimeStamp", "true");

        Context sinkContext = new Context(paramters);
        sink.configure(sinkContext);
        sink.setChannel(channel);

        sink.start();
        channel.start();

Check FlumeHDFSSinkServiceImpl.java for the detailed start/stop handling of the HDFS sink.
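For completeness, a minimal sketch of the shutdown side, using the sink and channel fields configured above (the actual service implementation may do more, e.g. additional error handling):

// Stop the sink first so it stops draining the channel, then stop the channel itself.
public void shutdown() {
    if (sink != null) {
        sink.stop();
    }
    if (channel != null) {
        channel.stop();
    }
}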

The sample data below shows how it is stored in Hadoop:

body is:{"eventid":"e8470a00-c869-4a90-89f2-f550522f8f52-1399386809212-72","hostedmachinename":"192.168.182.1334","pageurl":"http://jaibigdata.com/0","customerid":72,"sessionid":"7871a55c-a950-4394-bf5f-d2179a553575","querystring":null,"sortorder":"desc","pagenumber":0,"totalhits":8,"hitsshown":44,"createdtimestampinmillis":1399386809212,"clickeddocid":"23","favourite":null,"eventidsuffix":"e8470a00-c869-4a90-89f2-f550522f8f52","filters":[{"code":"searchfacettype_brand_level_2","value":"Apple"},{"code":"searchfacettype_color_level_2","value":"Blue"}]}
body is:{"eventid":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0-1399386809743-61","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/0","customerid":61,"sessionid":"78286f6d-cc1e-489c-85ce-a7de8419d628","querystring":"queryString59","sortorder":"asc","pagenumber":3,"totalhits":32,"hitsshown":9,"createdtimestampinmillis":1399386809743,"clickeddocid":null,"favourite":null,"eventidsuffix":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0","filters":[{"code":"searchfacettype_age_level_2","value":"0-12 years"}]}

ElasticSearch Sink

The ElasticSearch sink is used for view purposes, to display recently viewed items to the end user. It automatically creates daily indices of recently viewed items, and this functionality can be used to display a customer's recently viewed items.
Let's say you already have an ES instance running at localhost:9310.

private ElasticSearchSink sink;

        sink = new ElasticSearchSink();
        sink.setName("ElasticSearchSink-" + UUID.randomUUID());
        channel = new MemoryChannel();
        Map<String, String> channelParamters = new HashMap<>();
        channelParamters.put("capacity", "100000");
        channelParamters.put("transactionCapacity", "1000");
        Context channelContext = new Context(channelParamters);
        Configurables.configure(channel, channelContext);
        channel.setName("ElasticSearchSinkChannel-" + UUID.randomUUID());

        Map<String, String> paramters = new HashMap<>();
        paramters.put(ElasticSearchSinkConstants.HOSTNAMES, "127.0.0.1:9310");
        String indexNamePrefix = "recentlyviewed";
        paramters.put(ElasticSearchSinkConstants.INDEX_NAME, indexNamePrefix);
        paramters.put(ElasticSearchSinkConstants.INDEX_TYPE, "clickevent");
        paramters.put(ElasticSearchSinkConstants.CLUSTER_NAME,
                "jai-testclusterName");
        paramters.put(ElasticSearchSinkConstants.BATCH_SIZE, "10");
        paramters.put(ElasticSearchSinkConstants.SERIALIZER,
                ElasticSearchJsonBodyEventSerializer.class.getName());

        Context sinkContext = new Context(paramters);
        sink.configure(sinkContext);
        sink.setChannel(channel);

        sink.start();
        channel.start();

Check FlumeESSinkServiceImpl.java for details on how to start/stop the ElasticSearch sink.

Sample data in ElasticSearch is stored as follows:

{timestamp=1399386809743, body={pageurl=http://jaibigdata.com/0, querystring=queryString59, pagenumber=3, hitsshown=9, hostedmachinename=192.168.182.1330, createdtimestampinmillis=1399386809743, sessionid=78286f6d-cc1e-489c-85ce-a7de8419d628, eventid=2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0-1399386809743-61, totalhits=32, clickeddocid=null, customerid=61, sortorder=asc, favourite=null, eventidsuffix=2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0, filters=[{value=0-12 years, code=searchfacettype_age_level_2}]}, eventId=2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0}
{timestamp=1399386809757, body={pageurl=http://jaibigdata.com/1, querystring=null, pagenumber=1, hitsshown=34, hostedmachinename=192.168.182.1330, createdtimestampinmillis=1399386809757, sessionid=e6a3fd51-fe07-4e21-8574-ce5ab8bfbd68, eventid=fe5279b7-0bce-4e2b-ad15-8b94107aa792-1399386809757-134, totalhits=9, clickeddocid=22, customerid=134, sortorder=desc, favourite=null, eventidsuffix=fe5279b7-0bce-4e2b-ad15-8b94107aa792, filters=[{value=Blue, code=searchfacettype_color_level_2}]}, State=VIEWED, eventId=fe5279b7-0bce-4e2b-ad15-8b94107aa792}
{timestamp=1399386809765, body={pageurl=http://jaibigdata.com/0, querystring=null, pagenumber=4, hitsshown=2, hostedmachinename=192.168.182.1331, createdtimestampinmillis=1399386809765, sessionid=29864de8-5708-40ab-a78b-4fae55698b01, eventid=886e9a28-4c8c-4e8c-a866-e86f685ecc54-1399386809765-317, totalhits=2, clickeddocid=null, customerid=317, sortorder=asc, favourite=null, eventidsuffix=886e9a28-4c8c-4e8c-a866-e86f685ecc54, filters=[{value=0-12 years, code=searchfacettype_age_level_2}, {value=0.0 - 10.0, code=product_price_range}]}, eventId=886e9a28-4c8c-4e8c-a866-e86f685ecc54}
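As an illustration of the display side, a lookup of a customer's recently viewed items could look roughly like the sketch below. It assumes the ElasticSearch 1.x Java TransportClient and the cluster/index/type names used above; since the sink writes to daily indices, a wildcard index pattern is used, and the field paths depend on how the serializer lays out the document, so adjust them to your mapping:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.sort.SortOrder;

// Hypothetical lookup; not part of the original source.
Client client = new TransportClient(ImmutableSettings.settingsBuilder()
        .put("cluster.name", "jai-testclusterName").build())
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9310));

// Last 10 click events for a given customer, newest first.
SearchResponse response = client.prepareSearch("recentlyviewed*")
        .setTypes("clickevent")
        .setQuery(QueryBuilders.termQuery("body.customerid", 61))
        .addSort("body.createdtimestampinmillis", SortOrder.DESC)
        .setSize(10)
        .execute().actionGet();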

ElasticSearchJsonBodyEventSerializer

This serializer controls how the data will be indexed in ElasticSearch. Update the event serializer according to your own strategy to control how the data should be indexed.

public class ElasticSearchJsonBodyEventSerializer implements ElasticSearchEventSerializer {
    @Override
    public BytesStream getContentBuilder(final Event event) throws IOException {
        final XContentBuilder builder = jsonBuilder().startObject();
        appendBody(builder, event);
        appendHeaders(builder, event);
        return builder;
    }
    ...
    ...
}

Check ElasticSearchJsonBodyEventSerializer.java for how the serializer is configured to index the data.
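The elided appendBody/appendHeaders methods are where the indexing strategy lives. A minimal sketch of what they might do is shown below, assuming the event body already carries the JSON click event; the actual implementation in the repository may differ:

// Hypothetical bodies for the elided methods; adapt to your own indexing strategy.
private void appendBody(final XContentBuilder builder, final Event event) throws IOException {
    // Embed the raw JSON click event under a "body" field.
    builder.rawField("body", event.getBody());
}

private void appendHeaders(final XContentBuilder builder, final Event event) throws IOException {
    // Copy Flume event headers (e.g. State, timestamp) as top-level fields.
    for (final Map.Entry<String, String> header : event.getHeaders().entrySet()) {
        builder.field(header.getKey(), header.getValue());
    }
}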

Let's take a Java example of creating a Flume source that processes the above SearchQueryInstruction in test cases and stores the data.

Avro Source with channel selector

For testing purposes, let's create an Avro source that redirects the data to the relevant sinks based on Flume's multiplexing feature.

// Avro source to start at the port below and process incoming data.
        private AvroSource avroSource;
        final Map<String, String> properties = new HashMap<String, String>();
        properties.put("type", "avro");
        properties.put("bind", "localhost");
        properties.put("port", "44444");

        avroSource = new AvroSource();
        avroSource.setName("AvroSource-" + UUID.randomUUID());
        Context sourceContext = new Context(properties);
        avroSource.configure(sourceContext);
        ChannelSelector selector = new MultiplexingChannelSelector();

        // Channels from the above services
        Channel ESChannel = flumeESSinkService.getChannel();
        Channel HDFSChannel = flumeHDFSSinkService.getChannel();
        List<Channel> channels = new ArrayList<>();
        channels.add(ESChannel);
        channels.add(HDFSChannel);
        selector.setChannels(channels);
        final Map<String, String> selectorProperties = new HashMap<String, String>();
        selectorProperties.put("type", "multiplexing");
        selectorProperties.put("header", "State");
        selectorProperties.put("mapping.VIEWED", HDFSChannel.getName() + " "
                + ESChannel.getName());
        selectorProperties.put("mapping.FAVOURITE", HDFSChannel.getName() + " "
                + ESChannel.getName());
        selectorProperties.put("default", HDFSChannel.getName());
        Context selectorContext = new Context(selectorProperties);
        selector.configure(selectorContext);
        ChannelProcessor cp = new ChannelProcessor(selector);
        avroSource.setChannelProcessor(cp);

        avroSource.start();

Check FlumeAgentServiceImpl.java to directly store data to the sinks configured above, or even to log all data to a log file.

Standalone Flume/Hadoop/ElasticSearch environment

The application can be used to generate SearchQueryInstruction data, and you can use your own standalone environment to process the data further. If you already have a running Flume/Hadoop/ElasticSearch environment, use the settings below to process the data further.

The following configuration (flume.conf) can also be used if you already have a Flume instance running:

# Name the components on this agent
searcheventscollectoragent.sources = eventsavrosource
searcheventscollectoragent.sinks = hdfssink essink
searcheventscollectoragent.channels = hdfschannel eschannel

# Bind the source and sinks to the channels
searcheventscollectoragent.sources.eventsavrosource.channels = hdfschannel eschannel
searcheventscollectoragent.sinks.hdfssink.channel = hdfschannel
searcheventscollectoragent.sinks.essink.channel = eschannel

# Avro source. This is where clients will send data to.
searcheventscollectoragent.sources.eventsavrosource.type = avro
searcheventscollectoragent.sources.eventsavrosource.bind = 0.0.0.0
searcheventscollectoragent.sources.eventsavrosource.port = 44444
searcheventscollectoragent.sources.eventsavrosource.selector.type = multiplexing
searcheventscollectoragent.sources.eventsavrosource.selector.header = State
searcheventscollectoragent.sources.eventsavrosource.selector.mapping.VIEWED = hdfschannel eschannel
searcheventscollectoragent.sources.eventsavrosource.selector.mapping.default = hdfschannel

# Use channels which buffer events in memory. This keeps all incoming events in memory; change the type to file etc. if too much data is coming in and memory becomes an issue.
searcheventscollectoragent.channels.hdfschannel.type = memory
searcheventscollectoragent.channels.hdfschannel.capacity = 100000
searcheventscollectoragent.channels.hdfschannel.transactionCapacity = 1000

searcheventscollectoragent.channels.eschannel.type = memory
searcheventscollectoragent.channels.eschannel.capacity = 100000
searcheventscollectoragent.channels.eschannel.transactionCapacity = 1000

# HDFS sink. Store events directly to the Hadoop file system.
searcheventscollectoragent.sinks.hdfssink.type = hdfs
searcheventscollectoragent.sinks.hdfssink.hdfs.path = hdfs://localhost.localdomain:54321/searchevents/%Y/%m/%d/%H
searcheventscollectoragent.sinks.hdfssink.hdfs.filePrefix = searchevents
searcheventscollectoragent.sinks.hdfssink.hdfs.fileType = DataStream
searcheventscollectoragent.sinks.hdfssink.hdfs.rollInterval = 0
searcheventscollectoragent.sinks.hdfssink.hdfs.rollSize = 134217728
searcheventscollectoragent.sinks.hdfssink.hdfs.idleTimeout = 60
searcheventscollectoragent.sinks.hdfssink.hdfs.rollCount = 0
searcheventscollectoragent.sinks.hdfssink.hdfs.batchSize = 10
searcheventscollectoragent.sinks.hdfssink.hdfs.useLocalTimeStamp = true

# ElasticSearch sink
searcheventscollectoragent.sinks.essink.type = elasticsearch
searcheventscollectoragent.sinks.essink.hostNames = 127.0.0.1:9310
searcheventscollectoragent.sinks.essink.indexName = recentlyviewed
searcheventscollectoragent.sinks.essink.indexType = clickevent
searcheventscollectoragent.sinks.essink.clusterName = jai-testclusterName
searcheventscollectoragent.sinks.essink.batchSize = 10
searcheventscollectoragent.sinks.essink.ttl = 5
searcheventscollectoragent.sinks.essink.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer
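To run a standalone collector agent with this configuration, the standard flume-ng launcher can be used; the configuration directory and file location below are assumptions:

flume-ng agent --conf conf --conf-file conf/flume.conf --name searcheventscollectoragent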

To test how the search query instructions behave against your existing Hadoop instance, set up the Hadoop and ElasticSearch instances separately. The application uses the Cloudera Hadoop distribution 5.0 for testing purposes.

In later posts we will cover how to analyze the generated data further:

  • Using Hive to query the data for top customer queries and the number of times a product was viewed.
  • Using ElasticSearch-Hadoop to index customer top queries and product-view data.
  • Using Pig to count the total number of unique customers.
  • Using Oozie to schedule coordinated jobs for the Hive partitions and a bundle job to index the data into ElasticSearch.
