PySpark study notes: Filebeat collects logs and sends them through Kafka to Spark, which stores them in ES - 2023-2-18

Filebeat collects logs and sends them through Kafka to Spark, which stores them in ES

Introduction: feed the log data collected by Filebeat into Spark for further processing, then store it in ES to make analysis easier.
Main contents:

  1. Filebeat configuration for sending to Kafka
  2. Reading Kafka with PySpark Structured Streaming
  3. Writing to Elasticsearch with PySpark Structured Streaming

Start ZooKeeper and Kafka

# start zookeeper
/software/server/zookeeper/apache-zookeeper-3.5.9-bin/bin/zkServer.sh start

# kafka in the foreground: watch the logs
/software/server/kafka/kafka_2.11-2.4.1/bin/kafka-server-start.sh /software/server/kafka/kafka_2.11-2.4.1/config/server.properties

# kafka in the background
nohup /software/server/kafka/kafka_2.11-2.4.1/bin/kafka-server-start.sh /software/server/kafka/kafka_2.11-2.4.1/config/server.properties 2>&1 &
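
With the broker up, you can optionally confirm from Python that it is reachable and that the topic exists (Filebeat can auto-create it on first publish if topic auto-creation is enabled). A minimal sketch, assuming the kafka-python package is installed; it is not used anywhere else in this post.

from kafka import KafkaConsumer  # assumption: pip install kafka-python

# Connect to the broker used throughout this post and list the topics it knows about.
consumer = KafkaConsumer(bootstrap_servers="centos7:9092")
print("test01" in consumer.topics())  # True once the topic exists
consumer.close()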

Send Filebeat data to Kafka

vim /etc/filebeat/filebeat.yml
==========================
# rename Filebeat's default "message" field to "originallog" before shipping to Kafka
processors:
  - rename:
      fields:
        - from: "message"
          to: "originallog"
      ignore_missing: false
      fail_on_error: true
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/messages*
output.kafka:
  # initial brokers for reading cluster metadata
  hosts: ["centos7:9092"]
  # message topic selection + partitioning
  topic: "test01"
  partition.round_robin:
    reachable_only: false
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
========================
filebeat -e -c /etc/filebeat/filebeat.yml
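
The rename processor above turns Filebeat's default "message" field into "originallog", while the rest of the event keeps Filebeat's usual nested layout. The sketch below parses one trimmed-down, illustrative event the way it would arrive in the Kafka message value; the nested host.name, log.file.path and originallog fields are exactly what the Spark job extracts later.

import json

# Illustrative Filebeat event as it appears in the Kafka message value
# (trimmed down; real events also carry @timestamp, agent, ecs, input and more).
sample_value = """
{
  "host": {"name": "centos7"},
  "log": {"file": {"path": "/var/log/messages-20230212"}},
  "originallog": "Feb 8 21:20:56 localhost kernel: SRAT: PXM 0 -> APIC 0x6b -> Node 0"
}
"""

event = json.loads(sample_value)
print(event["host"]["name"])         # centos7
print(event["log"]["file"]["path"])  # /var/log/messages-20230212
print(event["originallog"])          # the renamed "message" field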

Structured Streaming: reading data from Kafka

import os
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

os.environ['SPARK_HOME'] = '/software/server/spark/spark-2.4.5-bin-hadoop2.7'
os.environ['PYSPARK_PYTHON'] = '/software/server/miniconda3/bin/python3.7'
os.environ['JAVA_HOME'] = '/software/server/java/jdk1.8.0_221'

""" 连接kafka """
spark = SparkSession \
    .builder \
    .appName("test0218") \
    .getOrCreate()

df = spark.readStream \
    .format('kafka') \
    .option('subscribe', 'test01') \
    .option('kafka.bootstrap.servers', 'centos7:9092') \
    .load()
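
The Kafka source is not bundled with Spark itself, so format('kafka') only resolves if the spark-sql-kafka connector jar is on the classpath, and the ES write later needs the elasticsearch-hadoop jar as well. A hedged sketch of one way to pull them in; the Maven versions are assumptions for Spark 2.4.5 (Scala 2.11) and an ES 7.x cluster, so adjust them to your environment or pass the same coordinates with spark-submit --packages. A console sink is also shown, which is handy for inspecting the raw records before wiring in the ES writer.

spark = (
    SparkSession.builder
    .appName("test0218")
    # Kafka source + elasticsearch-hadoop writer; versions are assumptions, adjust as needed.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,"
        "org.elasticsearch:elasticsearch-spark-20_2.11:7.12.0",
    )
    .getOrCreate()
)

# Quick sanity check: print the raw Kafka records to the console.
debug = df.selectExpr("cast(key as string) as key", "cast(value as string) as value") \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()
# debug.awaitTermination()  # blocks; call debug.stop() once a few batches have been printed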

Structured Streaming: writing to Elasticsearch

def process(df: DataFrame, batch_id: int):
    """foreachBatch handler: flatten the Filebeat JSON and write it to ES."""
    # json_tuple extracts the top-level "host", "log" and "originallog" fields from the
    # Kafka value into columns c0, c1, c2; the nested objects are then unpacked in place.
    df.select(F.json_tuple(F.expr("cast(value as string)"), "host", "log", "originallog")) \
        .withColumn("c0", F.json_tuple(F.expr("cast(c0 as string)"), "name")) \
        .withColumn("c1", F.json_tuple(F.expr("cast(c1 as string)"), "file")) \
        .withColumn("c1", F.json_tuple(F.expr("cast(c1 as string)"), "path")) \
        .withColumnRenamed("c0", "hostname") \
        .withColumnRenamed("c1", "path") \
        .withColumnRenamed("c2", "originallog") \
        .withColumn("date", F.current_timestamp().cast("string")) \
        .write.format("org.elasticsearch.spark.sql")\
        .option('es.resource', "test-0002/_doc")\
        .option("es.nodes", "192.168.174.133")\
        .option("es.port", "9200") \
        .option("es.nodes.wan.only", "true")\
        .option("es.write.operation", "create")\
        .option("es.index.auto.create", "true")\
        .mode("Append") \
        .save()

ES parameters

.option("es.mapping.id", "logid")  # 可以自定义唯一标识、选择更新方式,这里选择不设置,ES会自动生成
.option("es.write.operation", "create")
# The write operation elasticsearch-hadoop should perform; one of the following (a sketch combining es.mapping.id with upsert follows this list):
index (default): adds new data and replaces (reindexes) existing data based on its id.
create: adds new data; throws an exception if data with the same id already exists.
update: updates existing data based on its id; throws an exception if the data is not found.
upsert: also known as merge-or-insert; inserts the data if it does not exist, updates it (based on its id) if it does.
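
As an illustration of how these options combine (not part of the original job), the sketch below derives a deterministic logid for each event and upserts on it, so a replayed micro-batch updates existing documents instead of duplicating them. The logid column, the hash recipe and the use of get_json_object instead of json_tuple are all illustrative choices; the imports and connection settings are reused from the job above.

def process_with_id(df: DataFrame, batch_id: int):
    """Variant of process(): derive a deterministic document id and upsert on it."""
    out = (df.selectExpr("cast(value as string) as value")
           # pull the same three fields out of the Filebeat JSON as the original job
           .withColumn("hostname", F.get_json_object("value", "$.host.name"))
           .withColumn("path", F.get_json_object("value", "$.log.file.path"))
           .withColumn("originallog", F.get_json_object("value", "$.originallog"))
           .withColumn("date", F.current_timestamp().cast("string"))
           # illustrative id: hash of the fields that identify a log line
           .withColumn("logid", F.sha2(F.concat_ws("|", "hostname", "path", "originallog"), 256))
           .drop("value"))

    (out.write.format("org.elasticsearch.spark.sql")
        .option("es.resource", "test-0002/_doc")
        .option("es.nodes", "192.168.174.133")
        .option("es.port", "9200")
        .option("es.nodes.wan.only", "true")
        .option("es.mapping.id", "logid")        # use logid as the ES _id
        .option("es.write.operation", "upsert")  # insert new ids, update existing ones
        .mode("append")
        .save())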

Set the trigger interval

query = df.writeStream \
    .outputMode('append') \
    .foreachBatch(process) \
    .trigger(processingTime='5 seconds') \
    .start()

query.awaitTermination()
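
One optional hardening step, shown here as a sketch: give the query a checkpoint location so it can recover its Kafka offsets after a restart. The path is just an example; everything else matches the query above.

query = df.writeStream \
    .outputMode('append') \
    .foreachBatch(process) \
    .trigger(processingTime='5 seconds') \
    .option('checkpointLocation', '/tmp/checkpoints/test0218') \
    .start()

query.awaitTermination()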

Final result

Start ZooKeeper and Kafka, run the Python program, and then start Filebeat. After processing, a log document looks like this.

{
  "_index": "test-0002",
  "_type": "_doc",
  "_id": "x-wAZYYBDqFO3CbSSKO8",
  "_version": 1,
  "_score": 1,
  "_source": {
    "hostname": "centos7",
    "path": "/var/log/messages-20230212",
    "originallog": "Feb 8 21:20:56 localhost kernel: SRAT: PXM 0 -> APIC 0x6b -> Node 0",
    "date": "2023-02-18 22:49:20.235"
  }
}
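
To check the result without Kibana, you can query the index over the ES REST API; a minimal sketch using only the standard library, reusing the node and index from the job above.

import json
import urllib.request

# Fetch a couple of documents from the index the streaming job writes to.
url = "http://192.168.174.133:9200/test-0002/_search?size=2"
with urllib.request.urlopen(url) as resp:
    body = json.load(resp)

print("total:", body["hits"]["total"])
for hit in body["hits"]["hits"]:
    print(hit["_source"]["hostname"], hit["_source"]["path"])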