ELK: Complete Process for Handling 150 Million Records

#filebeat configuration
  filebeat.yml
    -source file type, path, encoding (can be omitted if the file is UTF-8)
    -output destination: Logstash or ES
  filebeat startup command
    -filebeat.exe -e -c filebeat.yml
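  A minimal filebeat.yml sketch covering the points above (the path is a placeholder; the output host/port must match the beats input in logstash.conf below):
    filebeat.inputs:
      - type: log
        enabled: true
        paths:
          - E:\FGQ\ELK\poi.csv      # placeholder path; point this at the real source file
        # encoding: gbk             # only needed when the source file is not UTF-8
    output.logstash:
      hosts: ["127.0.0.1:5044"]     # must match the beats port in logstash.conf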
#logstash configuration
  logstash.conf file
    -the conf file passed at startup; see logstash-geonames.conf in the #config section below
  logstash startup command
    -logstash.bat -f ../config/logstash-geonames.conf
#es configuration
  elasticsearch.yml
    -to fix the cross-origin (CORS) problem:
      http.cors.enabled: true
      http.cors.allow-origin: "*"
  es startup
    -simply double-click elasticsearch.bat
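  Once it is running, a quick sanity check (assuming the default port) is:
    curl http://localhost:9200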
#es-head configuration
  see the es-head setup instructions on GitHub; the usual steps are sketched below
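  A sketch of the typical steps from the mobz/elasticsearch-head README (check the repo for the current instructions; the CORS settings above are what let es-head connect):
    git clone https://github.com/mobz/elasticsearch-head.git
    cd elasticsearch-head
    npm install
    npm run start
    # then open http://localhost:9100 and connect to http://localhost:9200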
#kibana configuration
  the defaults are fine
#Note: the fields must include lon and lat so that Kibana can recognize geo_point data
#_mapping
  POST http://localhost:9200/geochina/infors/_mapping?include_type_name=true
  {
      "properties": {
          "id": {
            "type": "keyword"
          },
          "lon": {
            "type": "float"
          },
          "lat": {
            "type": "float"
          },
          "name": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
          },
          "address": {
            "type": "keyword"
          },
          "telephone": {
            "type": "keyword"
          },
          "type": {
            "type": "keyword"
          },
          "areaid": {
            "type": "keyword"
          },
          "wgslng": {
            "type": "float"
          },
          "wgslat": {
            "type": "float"
          },
          "bdlng": {
            "type": "float"
          },
          "bdlat": {
            "type": "float"
          },
          "updatetime": {
            "type": "keyword"
          },
          "isdelete": {
            "type": "keyword"
          },
          "areaname": {
            "type": "keyword"
          },
          "parentname": {
            "type": "keyword"
          },
          "location": {
            "type": "geo_point"
          }
      }
  }
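  The mapping above can be applied over plain HTTP; a sketch using curl (mapping.json is a hypothetical file holding the JSON body above, and the geochina index must already exist, e.g. created via curl -X PUT "http://localhost:9200/geochina"):
    curl -X POST "http://localhost:9200/geochina/infors/_mapping?include_type_name=true" -H "Content-Type: application/json" -d @mapping.json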

#config  note: the columns must include lon and lat, otherwise Kibana cannot recognize the geo_point data
  # Sample Logstash configuration for creating a simple
  # Beats -> Logstash -> Elasticsearch pipeline.

  input {
    beats {
      port => 5044
    }
  }
  filter {
    csv {
      skip_header => "true"
      separator => ","
      columns => ["id","lon","lat","name","address","telephone","type","areaid","wgslng","wgslat","bdlng","bdlat","updatetime","isdelete","areaname","parentname"]
      # copy the WGS-84 coordinates into a nested location field for geo_point;
      # both keys go in one add_field hash (repeating the option key can misbehave on newer Logstash versions)
      add_field => {
        "[location][lon]" => "%{wgslng}"
        "[location][lat]" => "%{wgslat}"
      }
      remove_field => ["message","headers","@version","version","ecs","@timestamp","tags","agent","input","host","log","offset"]
    }
    mutate {
      convert => {
        # type conversions
        "id"=>"string"
        "lon" => "float"
        "lat" => "float"
        "name" => "string"
        "address" => "string"
        "telephone" => "string"
        "type" => "string"
        "areaid" => "string"
        "wgslng" => "float"
        "wgslat" => "float"
        "bdlng" => "float"
        "bdlat" => "float"
        "updatetime" => "string"
        "isdelete" => "string"
        "areaname" => "string"
        "parentname" => "string"
        "[location][lon]" => "float"
        "[location][lat]" => "float"
      }
    }
  }
  output {
    elasticsearch {
      hosts => ["127.0.0.1:9200"]
      index => "geochina3"
      document_type => "infors3"   # deprecated from ES 7.x; can be dropped on newer stacks
    }
  }
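  For reference, one input line matching the 16 columns above might look like the following (all values made up for illustration); the csv filter splits it into fields, and add_field copies wgslng/wgslat into location, which the mapping types as geo_point:
    1001,116.2756,39.9902,玉泉山,北京市海淀区,010-00000000,scenic,110108,116.2698,39.9915,116.2821,39.9960,2020-01-01,0,海淀区,北京市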
#Other miscellaneous notes
  A full-text match query:
  {
    "query": {
      "match": {
        "name": "玉泉山"
      }
    }
  }
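  To run it (a sketch against the geochina index on the default port):
    curl -X GET "http://localhost:9200/geochina/_search" -H "Content-Type: application/json" -d "{\"query\":{\"match\":{\"name\":\"玉泉山\"}}}"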

#Troubleshooting
  1. The data file is UTF-8 encoded. How do you import it through filebeat + logstash + es without garbled characters?
    filebeat.yml: the config defaults to UTF-8 and filebeat reads data as UTF-8 by default, so no encoding setting is needed
    logstash.conf: logstash also reads data as UTF-8 by default, so no special configuration is needed
  2. The data file is GB2312/GBK encoded. How do you import it through filebeat + logstash + es without garbled characters?
    filebeat.yml: filebeat reads data as UTF-8 by default, but the source data is GB2312, so an encoding setting must be added
        - type: log
          # Change to true to enable this input configuration.
          enabled: true
          # Paths that should be crawled and fetched. Glob based paths.
          paths:
            #- /var/log/*.log
            - E:\FGQ\ELK\areanamePareaname.txt
          encoding: gbk    # gbk covers GB2312 text; try gb18030 if characters are still wrong
    logstash.conf: logstash reads data as UTF-8 by default; since filebeat has already decoded the GB2312 input to UTF-8, declare the charset explicitly so the pipeline does not garble it, as in the following configuration
        input {
          beats {
            codec => plain {
              charset => "UTF-8"
            }
            port => 5044
          }
        }
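    Alternatively (a sketch, untested here), if filebeat ships the raw bytes undecoded, the charset can be declared on the logstash side instead:
        input {
          beats {
            codec => plain {
              charset => "GBK"    # decode GBK-encoded events as they arrive
            }
            port => 5044
          }
        }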
  3. If the source file's encoding is unknown and the approaches above (gbk, gb2312, plain, utf-8, etc.) have all been tried, convert the source file to an encoding you know, as follows
    Method 1: no limit on source file size (I used this on a 47 GB csv)
      Get-Content "E:\FGQ\ELK\poi.csv" | Out-File "E:\FGQ\ELK\poiutf-8\poiutf8.csv" -Encoding utf8
      Get-Content E:\FGQ\ELK\poi.csv | Out-File E:\FGQ\ELK\poi_6.csv -Encoding utf8
    Method 2: the source file must be no larger than 2 GB
      Batch-converting text file encodings with PowerShell (GBK to UTF-8)
        I had a batch of SQL files that a program bulk-loaded into a local DB, but the Chinese text came out garbled after the import (a wall of ????), and the insert statements in the log were already garbled too, so it had to be an encoding problem. The SQL files turned out to be GBK (the system default) encoded, so I decided to convert them all to UTF-8. Rather than hunting for yet another tool, I just did it in PowerShell.

    At first the built-in get-content and set-content commands seemed convenient enough, but the result was GBK converted to UTF-8 with BOM, which was not quite what I wanted. Still, here it is for reference.

    @echo off
    powershell.exe -command "dir *.sql -R|foreach-object{(Get-Content $_.FullName -Encoding Default) | Set-Content $_.FullName -Encoding UTF8 };Write-Host 'Conversion complete...'"
    pause

    So I turned to another approach: .NET. This produced the plain UTF-8 encoding I wanted.

    @echo off
    powershell.exe -command "dir *.sql -R|foreach-object{[void][System.IO.File]::WriteAllBytes($_.FullName,[System.Text.Encoding]::Convert([System.Text.Encoding]::GetEncoding('GBK'),[System.Text.Encoding]::UTF8,[System.IO.File]::ReadAllBytes($_.FullName)))};Write-Host 'Conversion complete...'"
    pause

    Script usage notes

    The script was tested under PowerShell 5.1.
    The PowerShell command is wrapped in a CMD script, so save it as a .bat and double-click to run (it recurses into all subdirectories; if that is not wanted, change dir *.sql -R to dir *.sql).
    For other text formats such as csv or txt, change dir *.sql in the script to dir *.csv or dir *.txt.
    Other encoding conversions can follow the same pattern with minor changes.
    If the script runs successfully (no errors), do NOT run it again! A second pass can genuinely corrupt the text.
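    The .NET approach above reads each file into memory in one go, hence the 2 GB limit. For very large files, a streaming variant along the same lines (a sketch; the paths are placeholders) avoids loading the whole file:

    @echo off
    powershell.exe -command "$in = New-Object System.IO.StreamReader('E:\FGQ\ELK\poi.csv', [System.Text.Encoding]::GetEncoding('GBK')); $out = New-Object System.IO.StreamWriter('E:\FGQ\ELK\poi_utf8.csv', $false, (New-Object System.Text.UTF8Encoding($false))); while (($line = $in.ReadLine()) -ne $null) { $out.WriteLine($line) }; $in.Close(); $out.Close(); Write-Host 'Conversion complete...'"
    pause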
  4. Every field value in the source data was wrapped in double quotes, and after importing into es, geo_point did not take effect in Kibana
      the cause was that the index _mapping in es and the fields in the logstash config did not include lon and lat
      or
      recreate the index and its _mapping and try again
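      If the quotes themselves are the problem, one possible fix (a sketch, not part of the original setup) is stripping them in the filter before the convert step:
        mutate {
          # remove literal double quotes from the coordinate fields
          gsub => [
            "wgslng", "\"", "",
            "wgslat", "\"", ""
          ]
        }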
#Scratch notes
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "geochina"
    document_type => "infors"
  }

  stdout {
    codec => rubydebug
  }

  codec => json

  codec => plain {
    charset => "UTF-8"
  }

  codec => plain {
    charset => "GBK"
  }

  codec => json_lines