Bulk-loading data into Elasticsearch

bug工厂

Article link: https://blog.csdn.net/qq_41100433/article/details/80698871

Official documentation: https://www.elastic.co/guide/en/kibana/6.1/tutorial-load-dataset.html

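This tutorial requires three datasets. The compressed files must be extracted before loading.

1. The complete works of William Shakespeare, suitably parsed into fields. Download:

https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json

2. A set of fictitious accounts with randomly generated data. Download:

https://download.elastic.co/demos/kibana/gettingstarted/accounts.zip

3. A set of randomly generated log files. Download:

https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz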
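The two compressed downloads (accounts.zip and logs.jsonl.gz) have to be extracted before they can be loaded. Below is a minimal sketch of doing that with the Python standard library; the helper names `extract_zip` and `extract_gzip` are hypothetical, and it assumes both archives sit in a directory you can write to.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def extract_zip(archive: Path, dest_dir: Path) -> list[str]:
    """Extract every member of a .zip archive into dest_dir; return the member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


def extract_gzip(archive: Path, dest_dir: Path) -> Path:
    """Decompress a single .gz file into dest_dir, dropping the .gz suffix."""
    target = dest_dir / archive.with_suffix("").name  # logs.jsonl.gz -> logs.jsonl
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `extract_zip(Path("accounts.zip"), Path("."))` followed by `extract_gzip(Path("logs.jsonl.gz"), Path("."))` would leave the extracted files in the current directory.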

The Shakespeare dataset has this structure:

{
    "line_id": INT,
    "play_name": "String",
    "speech_number": INT,
    "line_number": "String",
    "speaker": "String",
    "text_entry": "String"
}

The accounts dataset has this structure:

{
    "account_number": INT,
    "balance": INT,
    "firstname": "String",
    "lastname": "String",
    "age": INT,
    "gender": "M or F",
    "address": "String",
    "employer": "String",
    "email": "String",
    "city": "String",
    "state": "String"
}
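For context, the bulk API used later in this tutorial expects newline-delimited JSON: each document is preceded by an action line, and the body must end with a newline. The sketch below builds such a body from records shaped like the account structure above. The function name `to_bulk_ndjson`, the choice of `account_number` as the document ID, and the sample record are all assumptions made for illustration.

```python
import json


def to_bulk_ndjson(docs, id_field="account_number"):
    """Serialize documents as bulk NDJSON: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        # Action line: index this document, using id_field as the document ID (an assumption).
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample record that follows the account structure.
sample = [{"account_number": 1, "balance": 100, "firstname": "Ann", "lastname": "Lee",
           "age": 30, "gender": "F", "address": "1 Example St", "employer": "Acme",
           "email": "ann@example.com", "city": "Springfield", "state": "IL"}]
body = to_bulk_ndjson(sample)
print(body)
```

The downloaded files are already in this format, so a helper like this is only needed when preparing your own data.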

The logs dataset has dozens of different fields. The important ones for this tutorial are:

{
    "memory": INT,
    "geo.coordinates": "geo_point",
    "@timestamp": "date"
}

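Before loading the Shakespeare and logs datasets, you must set up mappings for their fields. A mapping divides the documents in an index into logical groups and specifies the characteristics of each field, such as whether the field is searchable and whether it is tokenized, that is, broken up into separate words.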

In Kibana, under Dev Tools > Console, set up a mapping for the Shakespeare dataset:

PUT /shakespeare
{
 "mappings": {
  "doc": {
   "properties": {
    "speaker": {"type": "keyword"},
    "play_name": {"type": "keyword"},
    "line_id": {"type": "integer"},
    "speech_number": {"type": "integer"}
   }
  }
 }
}

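This mapping specifies the following field characteristics for the dataset: the speaker and play_name fields are keyword fields. These fields are not analyzed, so each string is treated as a single unit even if it contains multiple words. The line_id and speech_number fields are integers.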
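To make the keyword distinction concrete, here is a toy illustration of how a string ends up as index terms in each case. It is only a rough stand-in: Elasticsearch's real analyzers do more than lowercase and split, and the function names and the sample speaker value are made up for this sketch.

```python
import re


def keyword_terms(value: str) -> list[str]:
    """A keyword field indexes the whole string as a single term."""
    return [value]


def analyzed_terms(value: str) -> list[str]:
    """Rough stand-in for an analyzed text field: lowercase, then split into words."""
    return re.findall(r"\w+", value.lower())


speaker = "KING HENRY IV"  # illustrative value
print(keyword_terms(speaker))   # ['KING HENRY IV']
print(analyzed_terms(speaker))  # ['king', 'henry', 'iv']
```

With the keyword mapping, an aggregation on speaker groups by the full name rather than by the individual words.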

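The logs dataset requires a mapping that applies the geo_point type to the latitude/longitude pairs so that they are treated as geographic locations.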

PUT /logstash-2015.05.18
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}

PUT /logstash-2015.05.19
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}

PUT /logstash-2015.05.20
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
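The three requests above differ only in the index name, so they can also be generated in a loop. The following is a minimal sketch using Python's standard library; it assumes Elasticsearch is reachable at http://localhost:9200 (the address used by the load commands in this tutorial), and the function names are hypothetical.

```python
import json
import urllib.request

# Same mapping body as in the three Console requests.
LOG_MAPPING = {
    "mappings": {
        "log": {
            "properties": {
                "geo": {"properties": {"coordinates": {"type": "geo_point"}}}
            }
        }
    }
}
INDICES = ["logstash-2015.05.18", "logstash-2015.05.19", "logstash-2015.05.20"]


def build_mapping_requests(base_url="http://localhost:9200"):
    """Build (but do not send) one PUT request per log index."""
    body = json.dumps(LOG_MAPPING).encode("utf-8")
    return [
        urllib.request.Request(
            f"{base_url}/{index}",
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        for index in INDICES
    ]


def create_indices(base_url="http://localhost:9200"):
    """Send the requests; needs a running Elasticsearch node at base_url."""
    for request in build_mapping_requests(base_url):
        with urllib.request.urlopen(request) as response:
            print(request.full_url, response.status)
```

Building the requests separately from sending them makes it easy to inspect the URLs and body before anything touches the cluster.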

The accounts dataset does not require any mappings.

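At this point you are ready to load the datasets with the Elasticsearch bulk API. Note that the file names in the commands are resolved relative to your current working directory. If your shell is at the root of drive D, for example, the files must be there too; otherwise they cannot be read: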

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/doc/_bulk?pretty' --data-binary @shakespeare_6.0.json
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl

Or, for Windows users, in PowerShell:

Invoke-RestMethod "http://localhost:9200/bank/account/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "accounts.json"
Invoke-RestMethod "http://localhost:9200/shakespeare/doc/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "shakespeare_6.0.json"
Invoke-RestMethod "http://localhost:9200/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "logs.jsonl"
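A rough Python equivalent of the commands above is sketched below, using only the standard library. It reads the whole file into memory, which is fine for datasets of this size. The function names are hypothetical; the check in `failed_items` relies on the bulk response carrying a top-level `errors` flag and a per-document `items` list.

```python
import json
import urllib.request
from pathlib import Path


def build_bulk_request(path, endpoint, base_url="http://localhost:9200"):
    """Build a POST request that sends the file at `path` to a _bulk endpoint."""
    return urllib.request.Request(
        f"{base_url}/{endpoint}",
        data=Path(path).read_bytes(),  # raw bytes, like curl's --data-binary
        method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )


def failed_items(bulk_response: dict) -> list:
    """Return the per-item results that carry an "error" entry."""
    if not bulk_response.get("errors"):
        return []
    return [
        item
        for item in bulk_response.get("items", [])
        if any("error" in result for result in item.values())
    ]


def load(path, endpoint):
    """Send one file and report failures; needs a running Elasticsearch node."""
    with urllib.request.urlopen(build_bulk_request(path, endpoint)) as response:
        result = json.load(response)
    print(path, "failed items:", len(failed_items(result)))
```

For instance, `load("accounts.json", "bank/account/_bulk")` would mirror the first command, again with the file path resolved relative to the current working directory.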

These commands may take some time to run, depending on the performance of your machine.

Verify that the data loaded successfully:

GET /_cat/indices?v

Your output should look similar to this:

health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank                  5   1       1000            0    418.2kb        418.2kb
yellow open   shakespeare           5   1     111396            0     17.6mb         17.6mb
yellow open   logstash-2015.05.18   5   1       4631            0     15.6mb         15.6mb
yellow open   logstash-2015.05.19   5   1       4624            0     15.7mb         15.7mb
yellow open   logstash-2015.05.20   5   1       4750            0     16.4mb
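If you would rather check the document counts programmatically than by eye, the table can be parsed by its header row. This is a small sketch with a hypothetical function name, fed with the first rows of the sample output above; it assumes the whitespace-separated layout shown there.

```python
def parse_cat_indices(text: str) -> dict:
    """Map index name -> docs.count from the whitespace-separated _cat/indices table."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    counts = {}
    for line in lines[1:]:
        # Pair each column name with the value in the same position.
        row = dict(zip(header, line.split()))
        counts[row["index"]] = int(row["docs.count"])
    return counts


# First rows of the sample output.
SAMPLE = """\
health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank                  5   1       1000            0    418.2kb        418.2kb
yellow open   shakespeare           5   1     111396            0     17.6mb         17.6mb
"""

counts = parse_cat_indices(SAMPLE)
print(counts)  # {'bank': 1000, 'shakespeare': 111396}
```

Because the columns are looked up by header name, the parser keeps working if the table has extra columns or a different column order.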