记导入2000W+数据到elasticsearch

最新推荐文章于 2024-05-11 17:23:05 发布

heirenyagao112

最新推荐文章于 2024-05-11 17:23:05 发布

阅读量356

点赞数

分类专栏： es 文章标签： elasticsearch java 大数据

本文链接：https://blog.csdn.net/heirenyagao112/article/details/129257450

版权

es 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

记导入2000W+数据到elasticsearch

公司用户表经过长时间的积累数据已经到达了2000W+，还在迅速的增长，运营的需求又是需要可以通过昵称等信息模糊搜索，在这种需求下只能入es的方案了

第一次尝试

cloudcanal 增加任务，读取mysql的binlog日志，并且上发到kafka（由于我们的es版本为8.4.3，无法直接同步）

此时出现了问题：

cloudcanal 全量同步到一半就会自动挂掉，jvm老是出现oom等情况
fetch rsocket async result timeout. current timeout: 10000 ms. route name:taskReceiveHeartBeat, request id:09ceb9e4-b411-11ed-8604-f1bf65ec6f83
java.util.concurrent.TimeoutException: Waited 10000 milliseconds for

解决方案：

increRingBufferSize、increBatchSize值稍微调小

之后我就写了个入es的go脚本，代码如下：

在这里插入图片描述

我用的是github.com/elastic/go-elasticsearch/v8/esapi 这个库

运行了几个小时候后，我发现数据量增长只有几十万，此时想到是否是消费太慢了？？

第二次尝试

我自己新创建了一个topic，partition则创建了20个（cloudcanal默认为4个），

开启了15个进程进行同时消费，这个时候数据量增长的已经比较快了

过了两天数据我看到增长到1000W+，按照之前看到的速度不应该才这么点数据，迅速查看了一下日志，发现日志上面消费的特别慢

开始查看问题：

kafka卡住了？然后新开了一个进行，什么操作都不做，直接进行消费，发现速度特别快，这个排除
上发es卡住？本地断点查看后确实是这样，此时我决定用channel开goroutine进行协成消费，代码如下：

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-sIAEK5z7-1677553118297)(/Users/lihuanjie/Library/Application Support/typora-user-images/image-20230228104006581.png)]

此时消费出现了一顿一顿的现象，明显感觉没有彻底解决这个问题，并且当goroutine 过多以后出现了一个 open file to many 应该是socket文件描述符打开的过多

第三次尝试

es导入批量模式，代码如下：

		if canalData.Action == "INSERT" || canalData.Action == "UPDATE" {
			for _, v := range canalData.Data {
				item := v

				var obj dto.Item
				build(&obj, item)

				month := obj.CreatedAt.Format("200601")
				num++

				meta := []byte(fmt.Sprintf(`{ "index" : { "_id" : "%d" } }%s`, obj.Id, "\n"))
				data, err := json.Marshal(obj)
				if err != nil {
					fmt.Printf("Cannot encode article %d: %s", obj.Id, err)
				}
				data = append(data, "\n"...) // <-- Comment out to trigger failure for batch

				buf.Grow(len(meta) + len(data))
				buf.Write(meta)
				buf.Write(data)
			}
		}


    // 一旦有一个到达临界值则直接进行插入
		if num > 200 {
			for k := range numMap {
				fmt.Println("正在执行")
				esClient := es.GetClient()
					res, err := esClient.Bulk(bytes.NewReader(buf.Bytes()), esClient.Bulk.WithIndex("open_users_"+k))
					if err != nil {
						fmt.Printf("Failure indexing batch %s", err)
					}

					if res.IsError() {
						if err := json.NewDecoder(res.Body).Decode(&raw); err != nil {
							fmt.Printf("Failure to to parse response body: %s", err)
						} else {
							fmt.Printf("  Error: [%d] %s: %s",
								res.StatusCode,
								raw["error"].(map[string]interface{})["type"],
								raw["error"].(map[string]interface{})["reason"],
							)
						}
						// A successful response might still contain errors for particular documents...
						//
					} else {
						if err := json.NewDecoder(res.Body).Decode(&blk); err != nil {
							fmt.Printf("Failure to to parse response body: %s", err)
						} else {
							for _, d := range blk.Items {
								// ... so for any HTTP status above 201 ...
								//
								if d.Index.Status > 201 {
									// ... increment the error counter ...
									//

									// ... and print the response status and error information ...
									fmt.Printf("  Error: [%d]: %s: %s: %s: %s",
										d.Index.Status,
										d.Index.Error.Type,
										d.Index.Error.Reason,
										d.Index.Error.Cause.Type,
										d.Index.Error.Cause.Reason,
									)
								} else {
									// ... otherwise increase the success counter.
									//
								}
							}
						}
					}

					// Close the response body, to prevent reaching the limit for goroutines or file handles
					//
					res.Body.Close()

					// Reset the buffer and items counter
					//
					buf.Reset()
					num = 0
				}
		}

我设置了到200个以后才请求es的批量接口，此时发现消费有显著的提高但是遇到请求es接口的时候还是会卡