Collecting logs with filebeat, sending them through Kafka to Spark, and storing them in ES
Overview: feed the log data collected by filebeat into Spark for further processing, then store it in ES so it can be analyzed.
Main contents:
- filebeat configuration for sending to Kafka
- reading from Kafka with PySpark Structured Streaming
- writing to Elasticsearch with PySpark Structured Streaming
Start ZooKeeper and Kafka
# start ZooKeeper
/software/server/zookeeper/apache-zookeeper-3.5.9-bin/bin/zkServer.sh start
# Kafka in the foreground: watch the logs
/software/server/kafka/kafka_2.11-2.4.1/bin/kafka-server-start.sh /software/server/kafka/kafka_2.11-2.4.1/config/server.properties
# Kafka in the background
nohup /software/server/kafka/kafka_2.11-2.4.1/bin/kafka-server-start.sh /software/server/kafka/kafka_2.11-2.4.1/config/server.properties 2>&1 &
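If you want to confirm the broker is reachable before going further, a tiny check can be done from Python. This is a minimal sketch, assuming the kafka-python package is installed (an extra dependency, not part of the setup above):

from kafka import KafkaConsumer

# Connect to the broker started above and list the topics it currently knows about.
consumer = KafkaConsumer(bootstrap_servers='centos7:9092')
print(consumer.topics())
consumer.close()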
Send filebeat data to Kafka
vim /etc/filebeat/filebeat.yml
==========================
processors:
  - rename:
      fields:
        - from: "message"
          to: "originallog"
      ignore_missing: false
      fail_on_error: true

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/messages*

output.kafka:
  # initial brokers for reading cluster metadata
  hosts: ["centos7:9092"]
  # message topic selection + partitioning
  topic: "test01"
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
========================
filebeat -e -c /etc/filebeat/filebeat.yml
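Once filebeat is running, you can check that its messages really land on the test01 topic. A minimal sketch, again assuming the kafka-python package is installed (it is not used anywhere else in this post):

from kafka import KafkaConsumer

# Debug-only consumer: read the topic from the beginning and print the raw
# JSON documents that filebeat produced.
consumer = KafkaConsumer(
    'test01',
    bootstrap_servers='centos7:9092',
    auto_offset_reset='earliest',   # start from the oldest available message
    consumer_timeout_ms=10000       # stop iterating after 10s of silence
)
for msg in consumer:
    print(msg.value.decode('utf-8'))
consumer.close()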
Structured Streaming: read the data from Kafka
import os

from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

os.environ['SPARK_HOME'] = '/software/server/spark/spark-2.4.5-bin-hadoop2.7'
os.environ['PYSPARK_PYTHON'] = '/software/server/miniconda3/bin/python3.7'
os.environ['JAVA_HOME'] = '/software/server/java/jdk1.8.0_221'

""" Connect to Kafka """
spark = SparkSession \
    .builder \
    .appName("test0218") \
    .getOrCreate()

df = spark.readStream \
    .format('kafka') \
    .option('subscribe', 'test01') \
    .option('kafka.bootstrap.servers', 'centos7:9092') \
    .load()
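Note that the Kafka source needs the spark-sql-kafka connector jar on the Spark classpath (for example via spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5), and the ES sink used later needs the matching elasticsearch-hadoop jar for your ES version. To see what the stream carries before wiring it to ES, here is a minimal console-sink sketch (debug only, not part of the pipeline):

# Print the raw filebeat JSON from Kafka to the console for inspection.
debug_query = df.selectExpr("cast(value as string) as value") \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .outputMode("append") \
    .start()
debug_query.awaitTermination(30)  # watch the output for ~30 seconds
debug_query.stop()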
Structured Streaming: write to Elasticsearch
def process(df: DataFrame, batch_id: int):
    """foreachBatch processing: parse the filebeat JSON and write the batch to ES"""
    df.select(F.json_tuple(F.expr("cast(value as string)"), "host", "log", "originallog")) \
        .withColumn("c0", F.json_tuple(F.expr("cast(c0 as string)"), "name")) \
        .withColumn("c1", F.json_tuple(F.expr("cast(c1 as string)"), "file")) \
        .withColumn("c1", F.json_tuple(F.expr("cast(c1 as string)"), "path")) \
        .withColumnRenamed("c0", "hostname") \
        .withColumnRenamed("c1", "path") \
        .withColumnRenamed("c2", "originallog") \
        .withColumn("date", F.current_timestamp().cast("string")) \
        .write.format("org.elasticsearch.spark.sql") \
        .option('es.resource', "test-0002/_doc") \
        .option("es.nodes", "192.168.174.133") \
        .option("es.port", "9200") \
        .option("es.nodes.wan.only", "true") \
        .option("es.write.operation", "create") \
        .option("es.index.auto.create", "true") \
        .mode("append") \
        .save()
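The nested json_tuple calls above are compact but hard to read. An equivalent, more explicit way to pull out the same three fields is from_json with a partial schema. This is only a sketch of an alternative; process_with_schema and log_schema are names introduced here, and the schema lists just the fields we care about (filebeat sends many more, which from_json simply ignores):

from pyspark.sql.types import StructType, StructField, StringType

# Partial schema covering only host.name, log.file.path and originallog.
log_schema = StructType([
    StructField("host", StructType([StructField("name", StringType())])),
    StructField("log", StructType([
        StructField("file", StructType([StructField("path", StringType())]))
    ])),
    StructField("originallog", StringType()),
])

def process_with_schema(df: DataFrame, batch_id: int):
    out = df.select(F.from_json(F.col("value").cast("string"), log_schema).alias("j")) \
        .select(
            F.col("j.host.name").alias("hostname"),
            F.col("j.log.file.path").alias("path"),
            F.col("j.originallog").alias("originallog"),
            F.current_timestamp().cast("string").alias("date"),
        )
    # 'out' would then be written to ES exactly as in process() above.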
ES options
.option("es.mapping.id", "logid") # 可以自定义唯一标识、选择更新方式,这里选择不设置,ES会自动生成
.option("es.write.operation", "create")
# 应该执行弹性搜索-hadoop的写操作-可以是以下任意一种:
index (默认):添加新数据,同时替换(重新索引)现有数据(基于其ID)。
create:添加新数据-如果数据已经存在(基于其ID),则会引发异常。
update:更新现有数据(基于其ID)。如果找不到数据,则引发异常。
upsert:如果数据不存在,则 称为合并或插入;如果数据存在(根据其ID),则更新
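For example, if each log line carried its own unique id, the write could be switched to upsert so that replays do not create duplicate documents. This is a sketch only: parsed_df stands for the parsed batch DataFrame built inside process(), plus a hypothetical logid column that does not exist in the pipeline above.

# Sketch: upsert keyed on a hypothetical 'logid' column.
parsed_df.write.format("org.elasticsearch.spark.sql") \
    .option("es.resource", "test-0002/_doc") \
    .option("es.nodes", "192.168.174.133") \
    .option("es.port", "9200") \
    .option("es.nodes.wan.only", "true") \
    .option("es.mapping.id", "logid") \
    .option("es.write.operation", "upsert") \
    .mode("append") \
    .save()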
Set the trigger interval
query = df.writeStream \
    .outputMode('append') \
    .foreachBatch(process) \
    .trigger(processingTime='5 seconds') \
    .start() \
    .awaitTermination()
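processingTime='5 seconds' starts a new micro-batch every five seconds. A short sketch of the other common choices; each line is an alternative to the query above, not meant to run together with it:

# Alternative 1: process whatever is currently in the topic as one batch, then stop.
df.writeStream.outputMode('append').foreachBatch(process).trigger(once=True).start().awaitTermination()

# Alternative 2: no trigger at all - the next micro-batch starts as soon as the
# previous one finishes (the default behaviour).
df.writeStream.outputMode('append').foreachBatch(process).start().awaitTermination()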
Final result
Start ZooKeeper and Kafka, run the Python program, and then start filebeat. After processing, a log document looks like this:
{
  "_index": "test-0002",
  "_type": "_doc",
  "_id": "x-wAZYYBDqFO3CbSSKO8",
  "_version": 1,
  "_score": 1,
  "_source": {
    "hostname": "centos7",
    "path": "/var/log/messages-20230212",
    "originallog": "Feb 8 21:20:56 localhost kernel: SRAT: PXM 0 -> APIC 0x6b -> Node 0",
    "date": "2023-02-18 22:49:20.235"
  }
}
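To double-check on the Elasticsearch side, you can query the index directly. A small sketch using the requests library (an extra dependency, not used elsewhere in this post):

import requests

# _count is a standard ES REST endpoint; it returns the number of documents in the index.
resp = requests.get("http://192.168.174.133:9200/test-0002/_count")
print(resp.json())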