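Elasticsearch can also store large volumes of data, and for selective queries it has a clear advantage over HDFS. Analysis on HDFS typically loads the entire data set and then applies a filter, which consumes a lot of resources even when only a small fraction of the records is actually needed. Elasticsearch instead evaluates the query conditions itself and returns only the matching results, so it uses far fewer resources and is more efficient for this kind of lookup.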
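To make the "filter on the server" idea concrete, here is a minimal sketch of a search request body. It assumes Elasticsearch 2.x or later (where a `bool` query accepts a `filter` clause) and daily indices named `level-one-*`; the values `pickup` and `player42` are made up for illustration, and the helper `build_search_request` is hypothetical. The sketch only builds the request and does not contact a cluster.

```python
import json


def build_search_request(event_type, user, size=10):
    """Return (path, body) for a search whose conditions are evaluated by Elasticsearch."""
    # Wildcard over the daily indices, e.g. level-one-2017.07.14 (assumed naming).
    path = "/level-one-*/_search"
    body = {
        "size": size,  # cap the number of hits returned to the client
        "query": {
            "bool": {
                # Every clause in "filter" must match; no relevance scoring is needed.
                "filter": [
                    {"match": {"event_type": event_type}},
                    {"match": {"user": user}},
                ]
            }
        },
    }
    return path, body


# Hypothetical values, for illustration only.
path, body = build_search_request("pickup", "player42", size=5)
print("POST", path)
print(json.dumps(body, indent=2))
```

Only documents that satisfy both conditions come back over the network, which is the contrast with loading everything and filtering afterwards.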
To load data from Kafka into Elasticsearch with Logstash, write a Logstash configuration file (kafka-es.conf):
input {
  kafka {
    type => "level-one"
    # Start from the earliest offset when the group has no committed offset
    auto_offset_reset => "smallest"
    codec => plain {
      charset => "UTF-8"
    }
    group_id => "es"
    topic_id => "gsTopic03"
    zk_connect => "mini02:2181,mini03:2181,mini04:2181"
  }
}
filter {
  mutate {
    # Split the raw line on spaces, then copy selected positions into named fields
    split => { "message" => " " }
    add_field => {
      "event_type" => "%{message[3]}"
      "current_map" => "%{message[4]}"
      "current_X" => "%{message[5]}"
      "current_y" => "%{message[6]}"
      "user" => "%{message[7]}"
      "item" => "%{message[8]}"
      "item_id" => "%{message[9]}"
      "current_time" => "%{message[12]}"
    }
    # The original message is no longer needed
    remove_field => [ "message" ]
  }
}
output {
  elasticsearch {
    # One index per day
    index => "level-one-%{+YYYY.MM.dd}"
    codec => plain {
      charset => "UTF-8"
    }
    hosts => ["mini02:9200", "mini03:9200", "mini04:9200"]
  }
}
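The effect of the filter block can be approximated in a few lines of Python. This is only a sketch of the logic, not of Logstash itself: the sample log line is invented, the helper `apply_filter` is hypothetical, and the handling of repeated spaces may differ from what Logstash does. It assumes a single-space-delimited line with at least 13 tokens.

```python
# Position of each output field in the space-separated message,
# copied from the add_field section of kafka-es.conf.
FIELD_POSITIONS = {
    "event_type": 3,
    "current_map": 4,
    "current_X": 5,
    "current_y": 6,
    "user": 7,
    "item": 8,
    "item_id": 9,
    "current_time": 12,
}


def apply_filter(message):
    """Split the line on spaces and keep only the named positions."""
    parts = message.split(" ")
    doc = {}
    for name, idx in FIELD_POSITIONS.items():
        if idx < len(parts):
            doc[name] = parts[idx]
        else:
            # Logstash leaves an unresolved reference as literal text.
            doc[name] = "%{message[" + str(idx) + "]}"
    # "message" is never copied into doc, mirroring remove_field.
    return doc


# Invented sample line; the real log layout is not shown in this article.
sample = "f0 f1 f2 pickup map01 120 45 player42 sword 1001 f10 f11 1500000000"
doc = apply_filter(sample)
for key, value in doc.items():
    print(key, "=", value)
```

A line with too few tokens produces fields that contain the literal reference text, which is a useful symptom to look for in Elasticsearch when the log format changes.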
Start Elasticsearch on the cluster:
./elasticsearch/bin/elasticsearch -d
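Because `-d` runs Elasticsearch in the background, nothing is printed to confirm that the node came up. One way to check is to request the cluster health endpoint over HTTP. The sketch below assumes the default HTTP port 9200 and that `mini02` is one of the Elasticsearch hosts; the helper names `health_url` and `fetch_health` are hypothetical. Running the script only prints the URL, and `fetch_health` must be called explicitly against a live node.

```python
import json
from urllib.request import urlopen


def health_url(host, port=9200):
    """Build the URL of the cluster health endpoint for one node."""
    return "http://{}:{}/_cluster/health".format(host, port)


def fetch_health(host, port=9200, timeout=5):
    """Query a running node and return the parsed JSON health document."""
    with urlopen(health_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Printing the URL needs no network access.
print(health_url("mini02"))
```

On a running cluster, `fetch_health("mini02")` returns a dictionary whose `status` key is `green`, `yellow`, or `red` and whose `number_of_nodes` key shows how many nodes have joined.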