InfluxDB source code analysis: data write details

MatrixYg

Categories: Go, Time-Series Databases. Tags: influxdb, data write, source code analysis

Article link: https://blog.csdn.net/weixin_41863129/article/details/124363011

Preface

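This series analyzes the InfluxDB source code. Earlier posts covered the basic data model and the write path. The previous chapter walked through the basic write flow: parsing the data out of an HTTP request, computing metadata such as the shard group and shard, and finally writing to a specific shard. This chapter looks at the details of that write.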

Note

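This chapter supplements the previous one, "InfluxDB source code analysis: data write". If you have not read it, please do so first; this chapter is hard to follow on its own.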

Data parsing

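In the previous chapter, the parsing step read the request body and returned a []Point slice. We also noted that points are parsed according to the InfluxDB line protocol and that the parser is a state machine. This section looks at the actual logic, starting again from ParsePointsWithPrecision.
The core of that function is parsePoint, so we focus on it. Trimmed down, it looks like this (for brevity, most of the error handling is omitted):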

func parsePoint(buf []byte, defaultTime time.Time, precision string) (Point, error) {
	// scan the first block which is measurement[,tag1=value1,tag2=value2...]
	pos, key, err := scanKey(buf, 0)
	pos, fields, err := scanFields(buf, pos)

	var maxKeyErr error
	err = walkFields(fields, func(k, v []byte) bool {
		if sz := seriesKeySize(key, k); sz > MaxKeyLength {
			maxKeyErr = fmt.Errorf("max key length exceeded: %v > %v", sz, MaxKeyLength)
			return false
		}
		return true
	})

	// scan the last block which is an optional integer timestamp
	pos, ts, err := scanTime(buf, pos)
	if err != nil {
		return nil, err
	}
}

In short, parsing a point consists of three steps: scanKey, scanFields, and scanTime.
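To make the three sections concrete, here is a minimal sketch that cuts one line-protocol line at its first two unescaped spaces. The function name splitLine is made up for this illustration; it is not part of InfluxDB, and unlike the real scanners it ignores quoted string field values.

```go
package main

import "fmt"

// splitLine cuts one line-protocol line into its three sections at the first
// two unescaped spaces. It is a simplified stand-in for the
// scanKey / scanFields / scanTime sequence (hypothetical helper).
func splitLine(line string) (key, fields, ts string) {
	parts := make([]string, 0, 3)
	start := 0
	for i := 0; i < len(line) && len(parts) < 2; i++ {
		// A space preceded by a backslash is escaped and belongs to the token.
		if line[i] == ' ' && (i == 0 || line[i-1] != '\\') {
			parts = append(parts, line[start:i])
			start = i + 1
		}
	}
	parts = append(parts, line[start:])
	// The timestamp is optional, so pad missing sections with "".
	for len(parts) < 3 {
		parts = append(parts, "")
	}
	return parts[0], parts[1], parts[2]
}

func main() {
	key, fields, ts := splitLine(`cpu,host=server\ a,region=us value=1.5,idle=90i 1650000000000000000`)
	fmt.Printf("key=%q\nfields=%q\nts=%q\n", key, fields, ts)
}
```

The real parser does not allocate substrings like this; it only records positions inside the original buffer, as the following sections show.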

scan key

In InfluxDB, the key is the measurement plus the tag keys and tag values. scanKey therefore has two main steps: scanning the measurement and scanning the tags.

scan measurement

// scanKey scans buf starting at i for the measurement and tag portion of the point.
// It returns the ending position and the byte slice of key within buf.  If there
// are tags, they will be sorted if they are not already.
func scanKey(buf []byte, i int) (int, []byte, error) {
	start := skipWhitespace(buf, i)

	i = start

	// Determines whether the tags are sort, assume they are
	sorted := true

	// indices holds the indexes within buf of the start of each tag.  For example,
	// a buf of 'cpu,host=a,region=b,zone=c' would have indices slice of [4,11,20]
	// which indicates that the first tag starts at buf[4], seconds at buf[11], and
	// last at buf[20]
	indices := make([]int, 100)

	// tracks how many commas we've seen so we know how many values are indices.
	// Since indices is an arbitrarily large slice,
	// we need to know how many values in the buffer are in use.
	commas := 0

	// First scan the Point's measurement.
	state, i, err := scanMeasurement(buf, i)
	if err != nil {
		return i, buf[start:i], err
	}
}

scanKey starts by handling the measurement. The logic of scanMeasurement:

func scanMeasurement(buf []byte, i int) (int, int, error) {
	// Check first byte of measurement, anything except a comma is fine.
	// It can't be a space, since whitespace is stripped prior to this
	// function call.
	if i >= len(buf) || buf[i] == ',' {
		return -1, i, fmt.Errorf("missing measurement")
	}

	for {
		i++
		if i >= len(buf) {
			// cpu
			return -1, i, fmt.Errorf("missing fields")
		}

		if buf[i-1] == '\\' {
			// Skip character (it's escaped).
			continue
		}

		// Unescaped comma; move onto scanning the tags.
		if buf[i] == ',' {
			return tagKeyState, i + 1, nil
		}

		// Unescaped space; move onto scanning the fields.
		if buf[i] == ' ' {
			// cpu value=1.0
			return fieldsState, i, nil
		}
	}
}

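scanMeasurement looks for the first unescaped comma or the first unescaped space. Under the line protocol, a comma means tags follow, and a space means the fields follow. The first return value is the next state and the second is the index at which to continue; the rest of the parse is driven by that state. At this point the measurement has been parsed, and the returned state is either tagKeyState or fieldsState. scanKey then continues: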

	// Optionally scan tags if needed.
	if state == tagKeyState {
		i, commas, indices, err = scanTags(buf, i, indices)
		if err != nil {
			return i, buf[start:i], err
		}
	}

If a tag follows, scanTags is called.

scan Tags

scanTags works much like scanMeasurement.

func scanTags(buf []byte, i int, indices []int) (int, int, []int, error) {
	var (
		err    error
		commas int
		state  = tagKeyState
	)

	for {
		switch state {
		case tagKeyState:
			// Grow our indices slice if we have too many tags.
			if commas >= len(indices) {
				newIndics := make([]int, cap(indices)*2)
				copy(newIndics, indices)
				indices = newIndics
			}
			indices[commas] = i
			commas++

			i, err = scanTagsKey(buf, i)
			state = tagValueState // tag value always follows a tag key
		case tagValueState:
			state, i, err = scanTagsValue(buf, i)
		case fieldsState:
			indices[commas] = i + 1
			return i, commas, indices, nil
		}

		if err != nil {
			return i, commas, indices, err
		}
	}
}

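The work is split between scanTagsKey and scanTagsValue. Each scan step hands back the index at which to continue, and scanTagsValue additionally returns the next state, which is exactly the state machine mentioned earlier. One very useful piece of data here is the indices slice.
Whenever the state moves to tagKeyState, the current index is recorded. Note that what is recorded is the position of the first byte of the tag key, that is, the byte right after the comma (the comment in scanKey gives [4,11,20] for 'cpu,host=a,region=b,zone=c'). The counter is called commas because it counts how many commas, and therefore how many tags, have been seen. On reaching fieldsState, one extra entry is stored so that indices[j+1]-1 marks the end of tag j. With this slice we know where every tag starts and ends.
At this point the key has been scanned. The scan is very lightweight: it only locates the important boundaries.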
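The bookkeeping behind indices can be reproduced in a few lines. The sketch below is an illustration only: tagStarts is a hypothetical helper, not an InfluxDB function. It works on a key that has already been cut off at the space, so the sentinel is len(key)+1 rather than the i+1 used by scanTags.

```go
package main

import "fmt"

// tagStarts returns the offset of the first byte of every tag key in key,
// where key has the form measurement[,tag=value...]. A comma preceded by a
// backslash is escaped and is not treated as a separator.
func tagStarts(key string) []int {
	var starts []int
	for i := 1; i < len(key); i++ {
		if key[i] == ',' && key[i-1] != '\\' {
			starts = append(starts, i+1)
		}
	}
	return starts
}

func main() {
	key := "cpu,host=a,region=b,zone=c"
	starts := tagStarts(key)
	fmt.Println(starts) // [4 11 20]

	// With a sentinel one past the end of the key, tag j spans
	// key[bounds[j] : bounds[j+1]-1], mirroring buf[indices[j]:indices[j+1]-1].
	bounds := append(starts, len(key)+1)
	for j := 0; j < len(starts); j++ {
		fmt.Println(key[bounds[j] : bounds[j+1]-1])
	}
}
```

The output matches the example in the scanKey comment: the offsets point at h, r, and z, not at the commas in front of them.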

Duplicate-tag check and tag sorting

After scanning the key, scanKey validates the tags. First comes a coarse check of whether they are already sorted:

	for j := 0; j < commas-1; j++ {
		// get the left and right tags
		_, left := scanTo(buf[indices[j]:indices[j+1]-1], 0, '=')
		_, right := scanTo(buf[indices[j+1]:indices[j+2]-1], 0, '=')

		// If left is greater than right, the tags are not sorted. We do not have to
		// continue because the short path no longer works.
		// If the tags are equal, then there are duplicate tags, and we should abort.
		// If the tags are not sorted, this pass may not find duplicate tags and we
		// need to do a more exhaustive search later.
		if cmp := bytes.Compare(left, right); cmp > 0 {
			sorted = false
			break
		} else if cmp == 0 {
			return i, buf[start:i], fmt.Errorf("duplicate tags")
		}
	}

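This walks the tag list and compares each pair of adjacent tag keys. If a pair is not in ascending order, the tags are not sorted. If a pair compares equal, the same tag key appears twice and an error is returned. There is a catch: the sortedness check is sound, but the duplicate check is not, because duplicate tags need not be adjacent. So if the tags turn out to be unsorted, they are sorted next and the duplicate check is run again.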

	if !sorted && commas > 0 {
		// Get the measurement name for later
		measurement := buf[start : indices[0]-1]

		// Sort the indices
		indices := indices[:commas]
		insertionSort(0, commas, buf, indices)

		// Create a new key using the measurement and sorted indices
		b := make([]byte, len(buf[start:i]))
		pos := copy(b, measurement)
		for _, i := range indices {
			b[pos] = ','
			pos++
			_, v := scanToSpaceOr(buf, i, ',')
			pos += copy(b[pos:], v)
		}

		// Check again for duplicate tags now that the tags are sorted.
		for j := 0; j < commas-1; j++ {
			// get the left and right tags
			_, left := scanTo(buf[indices[j]:], 0, '=')
			_, right := scanTo(buf[indices[j+1]:], 0, '=')

			// If the tags are equal, then there are duplicate tags, and we should abort.
			// If the tags are not sorted, this pass may not find duplicate tags and we
			// need to do a more exhaustive search later.
			if bytes.Equal(left, right) {
				return i, b, fmt.Errorf("duplicate tags")
			}
		}

		return i, b, nil
	}

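The indices are sorted with an insertion sort (insertionSort). This may raise a question: indices is already in ascending order, and it only holds offsets, so how can sorting it sort the tags? The answer is that the sort does not order the offsets by value; it orders them by the tags they point to.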

func insertionSort(l, r int, buf []byte, indices []int) {
	for i := l + 1; i < r; i++ {
		for j := i; j > l && less(buf, indices, j, j-1); j-- {
			indices[j], indices[j-1] = indices[j-1], indices[j]
		}
	}
}

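The ordering comes from the less helper, which compares the tag keys behind two neighbouring indices lexicographically. Once the tags are sorted, duplicates can be detected reliably.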
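The following sketch shows why the duplicate check has to be repeated after sorting. The helper names tagKey, sortIndices, and hasDuplicate are invented for this illustration, and escape handling is left out; only the idea of sorting offsets by the bytes they point at is taken from the source.

```go
package main

import (
	"bytes"
	"fmt"
)

// tagKey returns the tag key that starts at offset i in buf, i.e. the bytes
// up to (but not including) the next '='. Escapes are ignored for brevity.
func tagKey(buf []byte, i int) []byte {
	end := bytes.IndexByte(buf[i:], '=')
	if end < 0 {
		return buf[i:]
	}
	return buf[i : i+end]
}

// sortIndices reorders indices so that the tag keys they point at are in
// ascending byte order. buf itself is never modified.
func sortIndices(buf []byte, indices []int) {
	for i := 1; i < len(indices); i++ {
		for j := i; j > 0 && bytes.Compare(tagKey(buf, indices[j]), tagKey(buf, indices[j-1])) < 0; j-- {
			indices[j], indices[j-1] = indices[j-1], indices[j]
		}
	}
}

// hasDuplicate reports whether two neighbouring entries point at the same
// tag key. It is only reliable once the indices are sorted by tag key.
func hasDuplicate(buf []byte, indices []int) bool {
	for j := 0; j+1 < len(indices); j++ {
		if bytes.Equal(tagKey(buf, indices[j]), tagKey(buf, indices[j+1])) {
			return true
		}
	}
	return false
}

func main() {
	buf := []byte("cpu,zone=c,host=a,zone=d")
	indices := []int{4, 11, 18}

	fmt.Println(hasDuplicate(buf, indices)) // false: the two zone tags are not adjacent
	sortIndices(buf, indices)
	fmt.Println(indices, hasDuplicate(buf, indices)) // [11 4 18] true
}
```

Insertion sort is a reasonable choice here because a point usually carries only a handful of tags and they are often nearly sorted already.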

Return value

When scanKey finishes, it returns an index and a slice. The slice holds everything in measurement+tags.

return i, buf[start:i], nil

Summary

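The design of scanKey is quite elegant: it avoids copying data and avoids unnecessary computation. For example, it could sort the tags unconditionally, but it first checks whether they are already sorted and sorts only if needed. Sorting tags is comparatively wasteful work that can be moved upstream to wherever the data is reported; if clients send sorted tags, the server skips a lot of computation.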

scanFields and scanTime

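With scanKey covered, what remains is scanFields and scanTime. They are not analyzed one by one here because the logic is similar; interested readers can read the implementation. In essence they parse the byte slice according to the line protocol.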

Returning the point

Once the scans are done, the result has to be returned. A single point was parsed, so the function returns:

	pt := &point{
		key:    key,
		fields: fields,
		ts:     ts,
	}

Only three fields of the point are set; all others keep their zero values.

Data parsing summary

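ParsePointsWithPrecision splits the input into lines at newline characters, and parsePoint turns each line into a Point holding only three values: key, fields, and ts. The Points are collected into a slice and returned, which completes parsing. Note that this is only a partial parse: the tags inside the key are not decoded, that is, tag keys and tag values have not been extracted yet.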

Data validation and metadata updates

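After parsing, the next step is the mapping: each point is mapped to a shard within a shard group, and then the shard's WritePointsWithContext function is called to write the points. The mapping was covered in detail in the previous chapter and is not repeated here. Let's look at WritePointsWithContext: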

	s.mu.RLock()
	defer s.mu.RUnlock()

	engine, err := s.engineNoLock()
	if err != nil {
		return err
	}

	var writeError error
	atomic.AddInt64(&s.stats.WriteReq, 1)

	points, fieldsToCreate, err := s.validateSeriesAndFields(points)
	if err != nil {
		if _, ok := err.(PartialWriteError); !ok {
			return err
		}
		// There was a partial write (points dropped), hold onto the error to return
		// to the caller, but continue on writing the remaining points.
		writeError = err
	}

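Before writing, the shard takes its read lock (RLock). A read lock does not serialize writers; presumably it keeps the shard's state, such as its engine, from being changed or closed while a write is in flight. Next comes validateSeriesAndFields. What does it do? Judging from its signature and return values, it checks whether the points are valid and returns the fields that need to be created.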

validateSeriesAndFields

// validateSeriesAndFields checks which series and fields are new and whose metadata should be saved and indexed.
func (s *Shard) validateSeriesAndFields(points []models.Point) ([]models.Point, []*FieldCreate, error) {
}

As its comment says, validateSeriesAndFields checks which series and fields are new and which metadata needs to be saved and indexed. The function begins like this:

	// Create all series against the index in bulk.
	keys := make([][]byte, len(points))
	names := make([][]byte, len(points))
	tagsSlice := make([]models.Tags, len(points))

	// Check if keys should be unicode validated.
	validateKeys := s.options.Config.ValidateKeys

	var j int
	for i, p := range points {
		tags := p.Tags()

		// Drop any series w/ a "time" tag, these are illegal
		if v := tags.Get(timeBytes); v != nil {
			dropped++
			if reason == "" {
				reason = fmt.Sprintf(
					"invalid tag key: input tag \"%s\" on measurement \"%s\" is invalid",
					"time", string(p.Name()))
			}
			continue
		}

		// Drop any series with invalid unicode characters in the key.
		if validateKeys && !models.ValidKeyTokens(string(p.Name()), tags) {
			dropped++
			if reason == "" {
				reason = fmt.Sprintf("key contains invalid unicode: \"%s\"", string(p.Key()))
			}
			continue
		}

		keys[j] = p.Key()
		names[j] = p.Name()
		tagsSlice[j] = tags
		points[j] = points[i]
		j++
	}
	points, keys, names, tagsSlice = points[:j], keys[:j], names[:j], tagsSlice[:j]

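keys, names, and tagsSlice hold the results. The loop iterates over the Point slice produced by the parser.
Note the first line of the loop body, the call to Tags: the parser only produced the key and did not decode any tags.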

// Tags returns the tag set for the point.
func (p *point) Tags() Tags {
	if p.cachedTags != nil {
		return p.cachedTags
	}
	p.cachedTags = parseTags(p.key, nil)
	return p.cachedTags
}

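This is lazy loading: tags are parsed on first use and then cached. The loop then checks for a tag named time. time cannot be used as a tag key because the name is reserved for the point's timestamp, which is stored separately.
Next the key is validated (only when ValidateKeys is enabled); the details are skipped here. Points that pass are kept and compacted to the front of the slices, and the slices are re-sliced at the end. That completes the first half of the function, whose main job is to drop points with illegal tags.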
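Two idioms from this first half are worth isolating: decoding tags lazily with a cache, and filtering a slice in place by compacting the kept elements to the front. The sketch below uses a simplified, hypothetical point type with a string key and a map for tags; the real models.Point works on byte slices and a sorted Tags slice, and it handles escaping.

```go
package main

import (
	"fmt"
	"strings"
)

// point keeps the raw key and decodes tags only on first use.
// It is a simplified, hypothetical stand-in for models.Point.
type point struct {
	key        string            // e.g. "cpu,host=a,region=b"
	cachedTags map[string]string // nil until Tags is called
}

// Tags parses the key on the first call and returns the cached map afterwards.
func (p *point) Tags() map[string]string {
	if p.cachedTags != nil {
		return p.cachedTags
	}
	tags := make(map[string]string)
	parts := strings.Split(p.key, ",") // ignores escaping for brevity
	for _, kv := range parts[1:] {
		if eq := strings.IndexByte(kv, '='); eq >= 0 {
			tags[kv[:eq]] = kv[eq+1:]
		}
	}
	p.cachedTags = tags
	return tags
}

// dropTimeTagged filters points in place, reusing the backing array: kept
// points are compacted to the front and the slice is truncated to length j.
func dropTimeTagged(points []*point) (kept []*point, dropped int) {
	j := 0
	for i, p := range points {
		if _, bad := p.Tags()["time"]; bad {
			dropped++
			continue
		}
		points[j] = points[i]
		j++
	}
	return points[:j], dropped
}

func main() {
	points := []*point{{key: "cpu,host=a"}, {key: "cpu,time=now"}, {key: "mem,host=b"}}
	kept, dropped := dropTimeTagged(points)
	fmt.Println(len(kept), dropped)
	for _, p := range kept {
		fmt.Println(p.key)
	}
}
```

The in-place filter avoids allocating a second slice, at the cost of overwriting the caller's slice contents, which is acceptable here because the dropped points are never used again.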

CreateSeriesListIfNotExists: creating new series

After the checks we have the new slices. The logic continues:

	// Add new series. Check for partial writes.
	var droppedKeys [][]byte
	if err := engine.CreateSeriesListIfNotExists(keys, names, tagsSlice); err != nil {
		switch err := err.(type) {
		// TODO(jmw): why is this a *PartialWriteError when everything else is not a pointer?
		// Maybe we can just change it to be consistent if we change it also in all
		// the places that construct it.
		case *PartialWriteError:
			reason = err.Reason
			dropped += err.Dropped
			droppedKeys = err.DroppedKeys
			atomic.AddInt64(&s.stats.WritePointsDropped, int64(err.Dropped))
		default:
			return nil, nil, err
		}
	}

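CreateSeriesListIfNotExists is called to create any series that do not exist yet. The logic goes deep, so for now we only look at the surface.
A series is metadata, and this metadata is indexed, that is, managed by the Index module. InfluxDB has two Index implementations: inmem, a purely in-memory index, and tsi (Time Series Index), an inverted index that can be persisted to disk. CreateSeriesListIfNotExists updates the index. With the in-memory index only in-memory state is updated; with tsi, the tsi cache is updated and a tsi compaction may be triggered as well. The next chapter covers this in full.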

func (e *Engine) CreateSeriesListIfNotExists(keys, names [][]byte, tagsSlice []models.Tags) error {
	return e.index.CreateSeriesListIfNotExists(keys, names, tagsSlice)
}

As shown, the engine's CreateSeriesListIfNotExists delegates to the index's CreateSeriesListIfNotExists, which creates and indexes the series that do not exist yet.

Creating new fields

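The comment on validateSeriesAndFields already said that it checks which series and fields need to be created. So after the series are created, the missing fields come next. The logic is fairly long, so here is a simplified version: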

j := 0
for i, p := range points {
		name := p.Name()
		// Look up all known fields of this measurement.
		mf := engine.MeasurementFields(name)
		// (Many checks omitted here; they drop malformed points.)
		points[j] = points[i]
		j++
		// Create any fields that are missing.
		iter := p.FieldIterator()
		for iter.Next() {
			fieldKey := iter.FieldKey()
			// Skip fields named "time". They are illegal.
			if bytes.Equal(fieldKey, timeBytes) {
				continue
			}
			// Check whether the field already exists.
			if mf.FieldBytes(fieldKey) != nil {
				continue
			}

			dataType := dataTypeFromModelsFieldType(iter.Type())
			if dataType == influxql.Unknown {
				continue
			}

			fieldsToCreate = append(fieldsToCreate, &FieldCreate{
				Measurement: name,
				Field: &Field{
					Name: string(fieldKey),
					Type: dataType,
				},
			})
		}
}

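Simplified like this, the logic is clear: iterate over every field of every point, fetch the fields of the point's measurement, and use the FieldBytes method to check whether the current field already exists. If it does not, it is added to the slice of fields to create.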
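Stripped of the InfluxDB types, the loop is a plain "collect what is missing" pass. The sketch below is an assumption-laden illustration: fieldCreate, field, pt, and fieldsToCreate are invented names, and the known fields are modelled as a nested map instead of the engine's MeasurementFields.

```go
package main

import "fmt"

// fieldCreate is a hypothetical, trimmed-down version of tsdb.FieldCreate.
type fieldCreate struct {
	Measurement string
	Field       string
	Type        string
}

// field and pt are hypothetical stand-ins for a point and its fields.
type field struct{ name, typ string }

type pt struct {
	measurement string
	fields      []field
}

// fieldsToCreate returns every (measurement, field) pair that is not yet in
// known. known maps measurement name -> field name -> field type.
func fieldsToCreate(points []pt, known map[string]map[string]string) []fieldCreate {
	var out []fieldCreate
	for _, p := range points {
		mf := known[p.measurement] // reading from a nil map is safe in Go
		for _, f := range p.fields {
			if f.name == "time" {
				continue // fields named "time" are illegal and skipped
			}
			if _, ok := mf[f.name]; ok {
				continue // already exists
			}
			out = append(out, fieldCreate{p.measurement, f.name, f.typ})
		}
	}
	return out
}

func main() {
	known := map[string]map[string]string{"cpu": {"value": "float"}}
	points := []pt{{
		measurement: "cpu",
		fields:      []field{{"value", "float"}, {"idle", "integer"}, {"time", "integer"}},
	}}
	fmt.Println(fieldsToCreate(points, known))
}
```

Like the loop it imitates, this sketch does not deduplicate: if two points in the same batch carry the same new field, it is listed twice, and it is left to the later create step to treat the second occurrence as already existing.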

Summary

validateSeriesAndFields does three main things:

Drops illegal points.
Creates series that do not exist yet.
Finds the fields that need to be created.
When done, it returns:

	return points[:j], fieldsToCreate, err

The first return value is the slice of valid points; the second is the list of fields to create.

createFieldsAndMeasurements

validateSeriesAndFields yields the list of fields to create, so the next step is to create them.

	atomic.AddInt64(&s.stats.FieldsCreated, int64(len(fieldsToCreate)))

	// add any new fields and keep track of what needs to be saved
	if err := s.createFieldsAndMeasurements(fieldsToCreate); err != nil {
		return err
	}

The shard's createFieldsAndMeasurements delegates to the CreateFieldIfNotExists method of MeasurementFields. The method is fairly simple:

func (m *MeasurementFields) CreateFieldIfNotExists(name []byte, typ influxql.DataType) error {
	fields := m.fields.Load().(map[string]*Field)

	// Ignore if the field already exists.
	if f := fields[string(name)]; f != nil {
		if f.Type != typ {
			return ErrFieldTypeConflict
		}
		return nil
	}

	m.mu.Lock()
	defer m.mu.Unlock()

	fields = m.fields.Load().(map[string]*Field)
	// Re-check field and type under write lock.
	if f := fields[string(name)]; f != nil {
		if f.Type != typ {
			return ErrFieldTypeConflict
		}
		return nil
	}

	fieldsUpdate := make(map[string]*Field, len(fields)+1)
	for k, v := range fields {
		fieldsUpdate[k] = v
	}
	// Create and append a new field.
	f := &Field{
		ID:   uint8(len(fields) + 1),
		Name: string(name),
		Type: typ,
	}
	fieldsUpdate[string(name)] = f
	m.fields.Store(fieldsUpdate)

	return nil
}

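It first checks whether the field exists and, if so, whether its type matches. Otherwise it takes the write lock, checks again, builds a new map containing the old entries plus the new field, and swaps it in for the old map. That completes the flow.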
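The pattern used here is copy-on-write combined with a double check: readers load an immutable snapshot without locking, and writers serialize on a mutex, re-check, and publish a fresh copy. Below is a minimal sketch of that pattern using sync/atomic's Value. The type fieldSet and its methods are invented for this illustration, and field types are plain strings rather than influxql.DataType.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var errTypeConflict = errors.New("field type conflict")

// fieldSet is a hypothetical, simplified analogue of MeasurementFields.
type fieldSet struct {
	mu     sync.Mutex   // serializes writers only
	fields atomic.Value // always holds a map[string]string (name -> type)
}

func newFieldSet() *fieldSet {
	fs := &fieldSet{}
	fs.fields.Store(map[string]string{})
	return fs
}

// lookup reports whether name exists in m and, if so, whether its type
// conflicts with typ.
func lookup(m map[string]string, name, typ string) (found bool, err error) {
	have, ok := m[name]
	if !ok {
		return false, nil
	}
	if have != typ {
		return true, errTypeConflict
	}
	return true, nil
}

func (fs *fieldSet) createIfNotExists(name, typ string) error {
	// Fast path: lock-free read of the current snapshot.
	if found, err := lookup(fs.fields.Load().(map[string]string), name, typ); found {
		return err
	}

	fs.mu.Lock()
	defer fs.mu.Unlock()

	// Re-check under the lock: another writer may have added the field.
	cur := fs.fields.Load().(map[string]string)
	if found, err := lookup(cur, name, typ); found {
		return err
	}

	// Copy, extend, and publish. Published maps are never mutated again.
	next := make(map[string]string, len(cur)+1)
	for k, v := range cur {
		next[k] = v
	}
	next[name] = typ
	fs.fields.Store(next)
	return nil
}

func main() {
	fs := newFieldSet()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = fs.createIfNotExists("value", "float")
		}()
	}
	wg.Wait()
	fmt.Println(len(fs.fields.Load().(map[string]string))) // 1
	fmt.Println(fs.createIfNotExists("value", "integer"))  // field type conflict
}
```

Copying the whole map on every new field looks expensive, but new fields are rare compared with writes to existing fields, so the common path stays lock-free.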

Data write

With the metadata updated, the data itself is written next.

	type contextWriter interface {
		WritePointsWithContext(context.Context, []models.Point) error
	}
	switch eng := engine.(type) {
	case contextWriter:
		if err := eng.WritePointsWithContext(ctx, points); err != nil {
			atomic.AddInt64(&s.stats.WritePointsErr, int64(len(points)))
			atomic.AddInt64(&s.stats.WriteReqErr, 1)
			return fmt.Errorf("engine: %s", err)
		}
	default:
		// Write to the engine.
		if err := engine.WritePoints(points); err != nil {
			atomic.AddInt64(&s.stats.WritePointsErr, int64(len(points)))
			atomic.AddInt64(&s.stats.WriteReqErr, 1)
			return fmt.Errorf("engine: %s", err)
		}
	}

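WritePointsWithContext here is implemented by the Engine; in other words, the Shard's WritePointsWithContext ends up in the Engine held by the Shard. This Engine is what is usually called the TSM (Time-Structured Merge tree) engine.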
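The type switch above is Go's usual way of probing for an optional interface: the caller holds a narrow interface and upgrades to a richer one only if the concrete type supports it. Here is a self-contained sketch of that idiom; all type and function names in it (writer, plainEngine, ctxEngine, write, writeBackground) are made up, and points are plain strings.

```go
package main

import (
	"context"
	"fmt"
)

// writer is the narrow interface every engine satisfies.
type writer interface {
	WritePoints(points []string) error
}

// contextWriter is the optional, richer interface.
type contextWriter interface {
	WritePointsWithContext(ctx context.Context, points []string) error
}

// plainEngine only implements writer. last records which method ran.
type plainEngine struct{ last string }

func (e *plainEngine) WritePoints(points []string) error {
	e.last = "WritePoints"
	return nil
}

// ctxEngine implements both interfaces (WritePoints is promoted).
type ctxEngine struct{ plainEngine }

func (e *ctxEngine) WritePointsWithContext(ctx context.Context, points []string) error {
	if err := ctx.Err(); err != nil {
		return err // honor cancellation before doing any work
	}
	e.last = "WritePointsWithContext"
	return nil
}

// write prefers the context-aware method and falls back to the plain one.
func write(ctx context.Context, w writer, points []string) error {
	switch eng := w.(type) {
	case contextWriter:
		return eng.WritePointsWithContext(ctx, points)
	default:
		return w.WritePoints(points)
	}
}

// writeBackground is a convenience wrapper using a background context.
func writeBackground(w writer, points []string) error {
	return write(context.Background(), w, points)
}

func main() {
	p, c := &plainEngine{}, &ctxEngine{}
	_ = writeBackground(p, []string{"cpu value=1"})
	_ = writeBackground(c, []string{"cpu value=1"})
	fmt.Println(p.last) // WritePoints
	fmt.Println(c.last) // WritePointsWithContext
}
```

This lets new engines opt into context support without forcing every existing implementation of the narrow interface to change.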

The engine's WritePointsWithContext

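WritePointsWithContext first iterates over all points and splits every multi-field point into single-field values, turning multi-value rows into single values. This is a very important piece of logic. It is fairly long, so here is a simplified version: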

	for _, p := range points {
		// Beforehand we should measure the performance impact.
		// key=measurement+tags
		keyBuf = append(keyBuf[:0], p.Key()...)
		keyBuf = append(keyBuf, keyFieldSeparator...)
		baseLen = len(keyBuf)
		iter := p.FieldIterator()
		t := p.Time().UnixNano()
		npoints++
		for iter.Next() {
			// Skip fields name "time", they are illegal
			if bytes.Equal(iter.FieldKey(), timeBytes) {
				continue
			}
			// key=key+fieldKey
			keyBuf = append(keyBuf[:baseLen], iter.FieldKey()...)
			if e.seriesTypeMap != nil {
				// Fast-path check to see if the field for the series already exists.
				if v, ok := e.seriesTypeMap.Get(keyBuf); !ok {
					if typ, err := e.Type(keyBuf); err != nil {
						// Field type is unknown, we can try to add it.
					} else if typ != iter.Type() {
						// Existing type is different from what was passed in, we need to drop
						// this write and refresh the series type map.
						seriesErr = tsdb.ErrFieldTypeConflict
						e.seriesTypeMap.Insert(keyBuf, int(typ))
						continue
					}

					// Doesn't exist, so try to insert
					vv, ok := e.seriesTypeMap.Insert(keyBuf, int(iter.Type()))

					// We didn't insert and the type that exists isn't what we tried to insert, so
					// we have a conflict and must drop this field/series.
					if !ok || vv != int(iter.Type()) {
						seriesErr = tsdb.ErrFieldTypeConflict
						continue
					}
				} else if v != int(iter.Type()) {
					// The series already exists, but with a different type.  This is also a type conflict
					// and we need to drop this field/series.
					seriesErr = tsdb.ErrFieldTypeConflict
					continue
				}
			}
			var v Value
			// The switch on iter.Type() is omitted here; each case decodes the
			// field value and builds v from it, roughly:
			//   v = NewValue(t, fieldValue)
			nvalues++
			values[string(keyBuf)] = append(values[string(keyBuf)], v)
		}
}

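keyBuf and baseLen are the two key pieces of state. keyBuf starts out holding the point's key followed by the separator, and baseLen is the length of that prefix. Then comes this line: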

keyBuf = append(keyBuf[:baseLen], iter.FieldKey()...)

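Appending the field key to keyBuf produces the composite key (series key + separator + field key) under which the engine stores the values of one field of one series. Next comes the check against seriesTypeMap: if the composite key is not there yet, it is inserted together with the field's type; if it is there with a different type, the value is dropped with a type conflict. Entries are also created in validateSeriesAndFields above, but nothing is created twice here. After that, a Value is built. A Value has two parts: the time and the field value. It is then appended to the values collected for that key.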
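The splitting step on its own, without the type checks, can be sketched as follows. The types point, field, and value and the function explode are hypothetical simplifications. The separator string "#!~#" is what I believe the tsm1 engine uses for keyFieldSeparator, but it is not shown in the excerpt above, so treat it as an assumption.

```go
package main

import "fmt"

// keyFieldSeparator joins a series key and a field key. The concrete value
// is an assumption about tsm1, not something shown in the excerpt.
const keyFieldSeparator = "#!~#"

// value pairs a timestamp with one field value (hypothetical stand-in).
type value struct {
	unixNano int64
	v        interface{}
}

type field struct {
	key string
	val interface{}
}

// point is a simplified point: series key, timestamp, and several fields.
type point struct {
	key    string // measurement + sorted tags
	time   int64
	fields []field
}

// explode turns multi-field points into per-field value lists keyed by
// seriesKey + separator + fieldKey.
func explode(points []point) map[string][]value {
	values := make(map[string][]value)
	var keyBuf []byte
	for _, p := range points {
		// Build the shared prefix once per point and remember its length.
		keyBuf = append(keyBuf[:0], p.key...)
		keyBuf = append(keyBuf, keyFieldSeparator...)
		baseLen := len(keyBuf)
		for _, f := range p.fields {
			if f.key == "time" {
				continue // fields named "time" are illegal
			}
			// Truncate back to the prefix, then append this field's key.
			keyBuf = append(keyBuf[:baseLen], f.key...)
			k := string(keyBuf)
			values[k] = append(values[k], value{p.time, f.val})
		}
	}
	return values
}

func main() {
	vals := explode([]point{
		{key: "cpu,host=a", time: 1, fields: []field{{"value", 1.5}, {"idle", int64(90)}}},
		{key: "cpu,host=a", time: 2, fields: []field{{"value", 2.5}}},
	})
	fmt.Println(len(vals["cpu,host=a#!~#value"]), len(vals["cpu,host=a#!~#idle"])) // 2 1
}
```

Reusing keyBuf and truncating it to baseLen means the prefix is copied once per point rather than once per field, which matters when a batch contains many wide points.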

Writing to the Cache and WAL

After the multi-field points are split into single-field values, the values are written to the Cache and the WAL.

	e.mu.RLock()
	defer e.mu.RUnlock()

	// first try to write to the cache
	if err := e.Cache.WriteMulti(values); err != nil {
		return err
	}

	if e.WALEnabled {
		if _, err := e.WAL.WriteMulti(values); err != nil {
			return err
		}
	}

	// if requested, store points written stats
	if pointsWritten, ok := ctx.Value(tsdb.StatPointsWritten).(*int64); ok {
		*pointsWritten = npoints
	}

	// if requested, store values written stats
	if valuesWritten, ok := ctx.Value(tsdb.StatValuesWritten).(*int64); ok {
		*valuesWritten = nvalues
	}

This completes the write path.

Summary

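This post walked through some details of the write path: the state-machine design of the parser, the validation and updating of metadata, and finally the splitting of multi-field points into single-field values for storage. Several questions remain open: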

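The index module.
The series module.
How the engine's WAL and Cache are written.
Later posts will cover the series module first, then the index, and finally look at InfluxDB's storage layout as a whole.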
