Document System: HDFS + Elasticsearch

四月天03

Last modified: 2024-02-29 10:12:22

Category: temp. Tags: big data, hadoop

First published: 2019-02-28 14:40:39

Article link: https://blog.csdn.net/qq_22473611/article/details/88027945

To write documents to HDFS and index each file in Elasticsearch, follow these steps:

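Step 1: Write the documents to HDFS

First, write the documents to HDFS. This usually means copying files from the local file system into a target directory in HDFS. You can use Hadoop's command-line tool hdfs dfs -put, or write a Java program that uses the Hadoop API.

Using the command-line tool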

hdfs dfs -mkdir -p /path/to/hdfs/documents
hdfs dfs -put /path/to/local/documents/* /path/to/hdfs/documents/

Using a Java program

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteDocumentsToHDFS {
    public static void main(String[] args) throws IOException {
        // HDFS configuration; adjust the NameNode address for your cluster
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // Local directory that holds the documents
        File localDir = new File("/path/to/local/documents");
        File[] files = localDir.listFiles();

        // try-with-resources closes the FileSystem even if a copy fails
        try (FileSystem fs = FileSystem.get(conf)) {
            if (files != null) {
                for (File file : files) {
                    if (file.isFile()) {
                        // Target path in HDFS
                        Path hdfsPath = new Path("/path/to/hdfs/documents/" + file.getName());

                        // copyFromLocalFile takes Hadoop Path arguments, so wrap the local path
                        fs.copyFromLocalFile(new Path(file.getAbsolutePath()), hdfsPath);
                        System.out.println("Copied " + file.getName() + " to HDFS.");
                    }
                }
            }
        }
    }
}

Either way, the result is the same: the documents end up in a directory in HDFS, which the following steps read from.
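Before running either upload, it can help to preview which local files would be copied and where they would land. The following is a minimal sketch that uses only the JDK and makes no Hadoop calls. The class name UploadPlan and the method plan are hypothetical names introduced for illustration, and the sketch assumes the same flat, non-recursive directory layout as the program above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class UploadPlan {
    /**
     * Maps each regular file directly under localDir to the HDFS path it
     * would be copied to. Subdirectories are skipped.
     */
    static Map<String, String> plan(Path localDir, String hdfsDir) throws IOException {
        String base = hdfsDir.endsWith("/") ? hdfsDir : hdfsDir + "/";
        Map<String, String> result = new TreeMap<>();
        try (Stream<Path> entries = Files.list(localDir)) {
            entries.filter(Files::isRegularFile)
                   .forEach(p -> result.put(p.toString(), base + p.getFileName()));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given
        Path localDir = Paths.get(args.length > 0 ? args[0] : ".");
        plan(localDir, "/path/to/hdfs/documents")
            .forEach((src, dst) -> System.out.println(src + " -> " + dst));
    }
}
```

The preview only reads the local directory, so it is safe to run repeatedly; nothing is written to HDFS.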

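Step 2: Define the mapping and analyzer for Elasticsearch

Before indexing documents into Elasticsearch, define a mapping for the index and specify an analyzer in it. The analyzer determines how a text field is broken into individual terms.

For example, Elasticsearch's default analyzer (the standard analyzer) splits text at word boundaries as defined by Unicode text segmentation, which works well for most languages that separate words with spaces and punctuation. Chinese text has no spaces between words, and the standard analyzer breaks it into single characters, so for Chinese you will usually want a dedicated analyzer such as the IK analyzer plugin.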
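To make the problem with Chinese text concrete, here is a small sketch that splits text on whitespace and ASCII punctuation only. It is not Elasticsearch's standard analyzer, just a deliberately naive stand-in written with the JDK; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NaiveTokenizerDemo {
    /** Splits on runs of whitespace and ASCII punctuation; drops empty tokens. */
    static List<String> naiveTokens(String text) {
        return Arrays.stream(text.split("[\\s\\p{Punct}]+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // English words are separated by spaces, so splitting finds them
        System.out.println(naiveTokens("Hello, big data world!")); // [Hello, big, data, world]
        // Chinese has no spaces between words, so the sentence stays one token
        System.out.println(naiveTokens("我爱北京天安门"));           // [我爱北京天安门]
    }
}
```

Neither a single undivided sentence nor a stream of isolated characters makes a good search term, which is why a dictionary-based analyzer is the usual choice for Chinese.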

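Example mapping and analyzer definition (using the IK analyzer; the IK plugin must be installed on the cluster):

PUT /your_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_ik_analyzer"
      }
    }
  }
}

This example defines an analyzer named my_ik_analyzer that is built on the IK plugin's ik_max_word tokenizer, which splits text into as many terms as possible. If you prefer coarser-grained segmentation, use the ik_smart tokenizer instead. Any other fields are defined alongside content under properties.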

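Step 3: Read the files from HDFS and index each one in Elasticsearch

Once the documents are in HDFS and the mapping and analyzer are defined, read the files through the Hadoop API and use the Elasticsearch Java client to index the content of each file as one document. The analyzer configured in the mapping is applied at index time.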
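The mapping in step 2 defines a content text field. If the files in HDFS hold plain text rather than JSON, the text has to be wrapped in a JSON object before it is sent. The following is a minimal JDK-only sketch of that wrapping; ContentDocument, escapeJson, and toContentJson are hypothetical names, and the single-field document shape is an assumption based on the mapping above.

```java
public class ContentDocument {
    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex escape form
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Wraps raw text in a JSON object with a single "content" field. */
    static String toContentJson(String rawText) {
        return "{\"content\":\"" + escapeJson(rawText) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toContentJson("第一行\nsay \"hi\""));
    }
}
```

In practice a JSON library is a safer choice than hand-written escaping. The client's IndexRequest.source method also accepts a Map, which leaves the escaping to the library.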

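import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.http.HttpHost;
import org.elasticsearch.action.DocWriteResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
// In client versions 7.16 and later this class lives in org.elasticsearch.xcontent
import org.elasticsearch.common.xcontent.XContentType;

public class IndexDocumentsFromHDFSToElasticsearch {
    public static void main(String[] args) throws Exception {
        // HDFS configuration; adjust the NameNode address for your cluster
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // try-with-resources closes both the Elasticsearch client and the HDFS handle
        try (RestHighLevelClient client = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")));
             FileSystem fs = FileSystem.get(conf)) {

            // Directory in HDFS that holds the documents
            Path hdfsDirPath = new Path("/path/to/hdfs/documents");
            FileStatus[] fileStatuses = fs.listStatus(hdfsDirPath);

            for (FileStatus fileStatus : fileStatuses) {
                if (!fileStatus.isFile()) {
                    continue;
                }
                Path filePath = fileStatus.getPath();

                // Read the whole file as UTF-8 text
                StringBuilder sb = new StringBuilder();
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(fs.open(filePath), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        sb.append(line).append("\n");
                    }
                }

                // Assumes the whole file content is a single JSON document
                String jsonDocument = sb.toString();

                // Build the index request: index name, document ID, and document body
                IndexRequest indexRequest = new IndexRequest("your_index_name")
                        .id(filePath.getName()) // file name as document ID; any unique ID works
                        .source(jsonDocument, XContentType.JSON);

                // Send the request and inspect the response
                IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
                if (indexResponse.getResult() == DocWriteResponse.Result.CREATED) {
                    System.out.println("Created document " + filePath.getName() + " in Elasticsearch.");
                } else {
                    // For example UPDATED, when a document with this ID already existed
                    System.out.println("Document " + filePath.getName() + ": " + indexResponse.getResult());
                }
            }
        }
    }
}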
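The program above uses the file name as the document ID, which is unique only within a single directory. If files from several HDFS directories go into the same index, one option is to derive the ID from the full path instead. The sketch below hashes the path with SHA-256; that choice, and the names DocumentIds and idForPath, are assumptions made for illustration and are not part of the original program.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocumentIds {
    /** Returns the SHA-256 digest of the full path as 64 lowercase hex characters. */
    static String idForPath(String fullPath) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(fullPath.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same file name in two directories yields two different IDs
        System.out.println(idForPath("/path/to/hdfs/documents/a/report.txt"));
        System.out.println(idForPath("/path/to/hdfs/documents/b/report.txt"));
    }
}
```

Because the ID depends only on the path, re-indexing the same file overwrites the existing document rather than creating a duplicate. In the indexing loop this could be used as .id(DocumentIds.idForPath(filePath.toString())).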
