Flume: an event collector, typically used for log collection.
Solr: a search engine based on Lucene.
Function: watch the file /var/log/a1.new.log. When new lines are appended, the Flume source picks them up as events, the sink indexes them, and the resulting documents are sent to Solr. You can then search for the new events through Solr almost immediately.
Download:
flume 1.4: http://archive.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz
solr 4.3: http://archive.apache.org/dist/lucene/solr/4.3.0/
a1.new.log's format looks like the following:
# cat /var/log/a1.new.log
2014-05-29 10:37:56,777 INFO org.apache.hadoop.http.HttpServer: HttpServer.start() threw a non Bind IOException
2014-05-15 19:06:52,373 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-05-29 10:37:56,777 INFO org.apache.hadoop.http.HttpServer: HttpServer.start() threw a non Bind IOException
2014-05-15 19:06:52,373 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
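The grok expression used later in morphline.conf splits each such line into four fields (timestamp, loglevel, classname, msg). The mapping can be previewed with plain shell tools; this is just an illustrative sketch, not part of the setup:

```shell
# Split one sample log line into the four fields the morphline grok
# command will extract. Fields are space-delimited; the trailing colon
# on the class name is stripped.
line='2014-05-15 19:06:52,373 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1'
timestamp=$(echo "$line" | cut -d' ' -f1-2)
loglevel=$(echo "$line" | cut -d' ' -f3)
classname=$(echo "$line" | cut -d' ' -f4 | tr -d ':')
msg=$(echo "$line" | cut -d' ' -f5-)
echo "timestamp=$timestamp loglevel=$loglevel classname=$classname msg=$msg"
```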
Configure solr (server-1941/192.168.100.110)
- extract solr
- cd /usr/lib
- tar zxvf solr-4.3.0.tgz
- configure solr cloud
- cd solr-4.3.0
- cp -r example node1
- cd node1
- vi solr/zoo.cfg and uncomment "clientPort=2181"
- edit solr/collection1/conf/schema.xml
- add following to fields element: <!-- add start because of flume -->
<field name="timestamp" type="string" indexed="true" stored="true"/>
<field name="loglevel" type="string" indexed="true" stored="true"/>
<field name="classname" type="string" indexed="true" stored="true"/>
<field name="msg" type="string" indexed="true" stored="true"/>
<!-- add end because of flume -->
- uncomment the following lines in solr/collection1/conf/solrconfig.xml. Why? Per the SolrCloud wiki (http://wiki.apache.org/solr/SolrCloud): "If you want to use the Near Realtime search support, you will probably want to enable auto soft commits in your solrconfig.xml file before putting it into zookeeper."
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
- start solr: java -DzkRun -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
- browse: http://192.168.100.110:8983/solr/#/
Configure flume (server-1941/192.168.100.110)
1. extract apache-flume-1.4.0-bin.tar.gz
- cd /usr/lib/
- tar zxvf apache-flume-1.4.0-bin.tar.gz
2. edit flume-env.sh
- cp conf/flume-env.sh.template conf/flume-env.sh
- edit conf/flume-env.sh and add the following line:
JAVA_OPTS="-Xms256m -Xmx512m"
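The two steps above can be scripted. A sketch only: /tmp/flume-demo stands in for the real install directory (/usr/lib/apache-flume-1.4.0-bin) so the commands can be tried safely:

```shell
# Create flume-env.sh from its template and append the heap options.
# A scratch directory and an empty stand-in template are used here so the
# commands can be run without a real Flume install.
mkdir -p /tmp/flume-demo/conf
: > /tmp/flume-demo/conf/flume-env.sh.template
cp /tmp/flume-demo/conf/flume-env.sh.template /tmp/flume-demo/conf/flume-env.sh
echo 'JAVA_OPTS="-Xms256m -Xmx512m"' >> /tmp/flume-demo/conf/flume-env.sh
grep JAVA_OPTS /tmp/flume-demo/conf/flume-env.sh
```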
3. edit conf/flume-conf-morphlineSolr.properties
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/a1.new.log
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = /usr/lib/apache-flume-1.4.0-bin/conf/morphline.conf
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
4. edit conf/morphline.conf
morphlines : [
{
# Name used to identify a morphline. E.g. used if there are multiple
# morphlines in a morphline config file
id : morphline1
# Import all morphline commands in these java packages and their
# subpackages. Other commands that may be present on the classpath are
# not visible to this morphline.
importCommands : ["com.cloudera.**", "org.apache.solr.**"]
commands : [
{
# Parse input attachment and emit a record for each input line
readLine {
charset : UTF-8
}
}
{
grok {
# Consume the output record of the previous command and pipe another
# record downstream.
#
# A grok-dictionary is a config file that contains prefabricated
# regular expressions that can be referred to by name. grok patterns
# specify such a regex name, plus an optional output field name.
# The syntax is %{REGEX_NAME:OUTPUT_FIELD_NAME}
# The input line is expected in the "message" input field.
#dictionaryFiles : [src/test/resources/grok-dictionaries]
dictionaryFiles :[/usr/lib/apache-flume-1.4.0-bin/conf/grok-dictionaries]
expressions : {
#message : """%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
#message : """%{TIMESTAMP_LOG:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
message : """%{TIMESTAMP_LOG:timestamp} %{LOGLEVEL:loglevel} %{DATA:classname}: %{GREEDYDATA:msg}"""
}
}
}
# Consume the output record of the previous command, convert
# the timestamp, and pipe another record downstream.
#
# convert timestamp field to native Solr timestamp format
# e.g. 2012-09-06T07:14:34Z to 2012-09-06T07:14:34.000Z
{
convertTimestamp {
field : timestamp
inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd HH:mm:ss,SSS"]
#inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
inputTimezone : America/Los_Angeles
outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
outputTimezone : UTC
}
}
{
generateUUID {
field : id
}
}
# Consume the output record of the previous command, transform it
# and pipe the record downstream.
#
# This command deletes record fields that are unknown to Solr
# schema.xml. Recall that Solr throws an exception on any attempt to
# load a document that contains a field that isn't specified in
# schema.xml.
{
sanitizeUnknownSolrFields {
# Location from which to fetch Solr schema
solrLocator : {
collection : collection1 # Name of solr collection
zkHost : "127.0.0.1:2181/" # ZooKeeper ensemble
}
}
}
# log the record at INFO level to SLF4J
{ logInfo { format : "output record: {}", args : ["@{}"] } }
# load the record into a Solr server or MapReduce Reducer
{
loadSolr {
solrLocator : {
collection : collection1 # Name of solr collection
zkHost : "127.0.0.1:2181/" # ZooKeeper ensemble
}
}
}
]
}
]
5. start flume
- cd /usr/lib/apache-flume-1.4.0-bin/
- ./bin/flume-ng agent --conf conf --conf-file conf/flume-conf-morphlineSolr.properties --name a1 -Dflume.root.logger=INFO,console
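The convertTimestamp settings in morphline.conf above can be sanity-checked from the shell. A sketch using GNU date (the TZ="…" prefix inside -d is GNU-specific); it mirrors the inputTimezone/outputFormat pair used above:

```shell
# Parse a local America/Los_Angeles timestamp and print it in Solr's UTC
# format. date(1) does not carry milliseconds, so the ",373" fraction is
# re-appended literally in the output format for illustration only.
date -u -d 'TZ="America/Los_Angeles" 2014-06-03 10:16:52' +'%Y-%m-%dT%H:%M:%S.373Z'
# prints 2014-06-03T17:16:52.373Z, matching the timestamp Solr returns below
```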
How to test
- curl -g "http://192.168.100.110:8983/solr/collection1/select?q=msg:*hadoop*&wt=xml&indent=true" (quote the URL, otherwise the shell treats "&" as a background operator)
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">msg:*hadoop</str></lst></lst><result name="response" numFound="0" start="0"></result>
</response>
- append a new line to /var/log/a1.new.log
- echo "2014-06-03 10:16:52,373 INFO org.apache.hadoop.util.ExitUtil: hadoop will shutdown">>/var/log/a1.new.log
- curl -g "http://192.168.100.110:8983/solr/collection1/select?q=msg:*hadoop*&wt=xml&indent=true"
<?xml version="1.0" encoding="UTF-8"?><response><lst name="responseHeader"><int name="status">0</int><int name="QTime">5</int><lst name="params"><str name="q">msg:*hadoop*</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">63566aed-7438-4c8e-8b02-7f6fa0be85b3</str><str name="timestamp">2014-06-03T17:16:52.373Z</str><str name="msg"> hadoop will shutdown</str><long name="_version_">1469853891281551360</long></doc></result></response>
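For quick checks, a single field can be pulled out of the XML response without a full XML parser. A sketch with sed; the sample document is inlined here rather than fetched from Solr:

```shell
# Extract the msg field from a Solr XML response snippet (inlined so the
# command can be tried offline).
response='<doc><str name="id">63566aed-7438-4c8e-8b02-7f6fa0be85b3</str><str name="msg"> hadoop will shutdown</str></doc>'
printf '%s\n' "$response" | sed -n 's/.*<str name="msg">\([^<]*\)<\/str>.*/\1/p'
# prints " hadoop will shutdown" (note the leading space captured by grok)
```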