First, some background. We are reworking an old database, which involves migrating data from n (millions of) collections into m (1 <= m <= 100) collections. That raises a question: could the ids collide? Does MongoDB's id algorithm embed any collection information to guarantee that ids are unique across all collections?
The conclusion up front: there will be no collisions, and no collection information is involved. Put differently, even if collisions were possible, any mitigation from collection information would be negligible.
The official algorithm specification
A 4-byte timestamp, representing the ObjectId's creation, measured in seconds since the Unix epoch.
A 5-byte random value generated once per process. This random value is unique to the machine and process.
A 3-byte incrementing counter, initialized to a random value.
In short: a 4-byte timestamp + a 5-byte random value (expected to be unique per machine and process) + a 3-byte counter.
The 4-byte timestamp provides rough monotonicity and uniqueness at second granularity; it will be exhausted in the year 2106.
The 5-byte random value looks like a potential weak spot at first glance, but it is feasible for the algorithm to keep it unique per machine and process.
The 3-byte counter, initialized to a random value, further reduces the probability of collision. It allows 16,777,216 values per second, i.e. about 16,777 per millisecond.
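The 2106 figure follows from reading the 4 timestamp bytes as an unsigned 32-bit second count; a quick stdlib check:

```python
from datetime import datetime, timedelta

# The last representable moment is 2**32 - 1 seconds after the Unix epoch.
last = datetime(1970, 1, 1) + timedelta(seconds=2**32 - 1)
print(last.year)  # 2106
```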
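How risky are 5 random bytes really? A back-of-the-envelope birthday bound (my own sketch, not part of any driver) for n concurrent processes each drawing from 2**40 possible values:

```python
def collision_prob(n, space=2**40):
    """Probability that at least two of n draws from `space` values collide."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (space - k) / space
    return 1.0 - p_distinct

# Even 1000 simultaneously started processes collide with
# probability well under one in a million.
print(collision_prob(1000))
```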
To get a feel for what that counter rate means, here is a quick Python benchmark:
In [8]: import time

In [9]: def func():
   ...:     st = time.time()
   ...:     for i in range(16777):
   ...:         i += 1
   ...:     return (time.time() - st) * 1000

In [12]: func()
Out[12]: 2.1560192108154297
In [13]: from bson import objectid

In [14]: def func():
    ...:     st = time.time()
    ...:     for i in range(16777):
    ...:         objectid.ObjectId()
    ...:     return (time.time() - st) * 1000

In [15]: func()
Out[15]: 49.735069274902344
In other words, Python's generator would have to run about 50 times faster, doing nothing but producing ids, before the counter could wrap within one second and a collision become possible.
Python's _id algorithm
In everyday code we write from bson import objectid, so the relevant file is bson/objectid.py:
class ObjectId(object):
    """A MongoDB ObjectId.
    """

    _inc = random.randint(0, 0xFFFFFF)
    _inc_lock = threading.Lock()

    _machine_bytes = _machine_bytes()

    __slots__ = ('__id')

    _type_marker = 7

    def __init__(self, oid=None):
        """Initialize a new ObjectId.

        An ObjectId is a 12-byte unique identifier consisting of:

          - a 4-byte value representing the seconds since the Unix epoch,
          - a 3-byte machine identifier,
          - a 2-byte process id, and
          - a 3-byte counter, starting with a random value.
        ...
        """
        if oid is None:
            self.__generate()
        else:
            self.__validate(oid)
Start with the docstring: it matches the MongoDB id layout, except that this older version splits the 5 middle bytes into a 3-byte machine identifier plus a 2-byte pid. Next, when no oid is passed in, a new value is generated:
    def __generate(self):
        """Generate a new value for this ObjectId.
        """
        oid = EMPTY

        # 4 bytes current time
        oid += struct.pack(">i", int(time.time()))

        # 3 bytes machine
        oid += ObjectId._machine_bytes

        # 2 bytes pid
        oid += struct.pack(">H", os.getpid() % 0xFFFF)

        # 3 bytes inc
        ObjectId._inc_lock.acquire()
        oid += struct.pack(">i", ObjectId._inc)[1:4]
        ObjectId._inc = (ObjectId._inc + 1) % 0xFFFFFF
        ObjectId._inc_lock.release()

        self.__id = oid
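That packing can be exercised standalone with nothing but the stdlib. A sketch that assembles a 12-byte id with the same 2.8-style fields and decodes it back (all field values here are made up for illustration):

```python
import struct

# Pack: 4-byte time, 3-byte machine hash, 2-byte pid, 3-byte counter.
def make_oid(ts, machine3, pid, counter):
    return (struct.pack(">i", ts)
            + machine3
            + struct.pack(">H", pid % 0xFFFF)
            + struct.pack(">i", counter % 0xFFFFFF)[1:4])

oid = make_oid(1500000000, b"\xaa\xbb\xcc", 4321, 42)
assert len(oid) == 12

# Decode the fields back out of the 12 bytes.
ts = struct.unpack(">i", oid[0:4])[0]
counter = int.from_bytes(oid[9:12], "big")
print(ts, counter)  # 1500000000 42
```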
where _machine_bytes is:
def _machine_bytes():
    """Get the machine portion of an ObjectId.
    """
    machine_hash = _md5func()
    if PY3:
        # gethostname() returns a unicode string in python 3.x
        # while update() requires a byte string.
        machine_hash.update(socket.gethostname().encode())
    else:
        # Calling encode() here will fail with non-ascii hostnames
        machine_hash.update(socket.gethostname())
    return machine_hash.digest()[0:3]
Could the machine hash and pid collide across hosts? There is some probability, but you can compute the machine hashes up front and rule that out. (In the latest drivers, the middle 5 bytes have been replaced by a fully random value anyway.)
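That pre-check needs nothing from bson. A stdlib replica of the 2.8 _machine_bytes above (md5 of the hostname, first 3 bytes) shows the value is deterministic per hostname, so you can hash all your hosts' names ahead of time and compare:

```python
import hashlib
import socket

# Stdlib replica of pymongo 2.8's _machine_bytes.
def machine_bytes(hostname=None):
    name = hostname or socket.gethostname()
    return hashlib.md5(name.encode()).digest()[:3]

# Deterministic per hostname: two hosts only collide if their
# 3-byte md5 prefixes happen to match.
assert machine_bytes("host-a") == machine_bytes("host-a")
print(machine_bytes("host-a").hex())
```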
Who generates the _id?
First, download the pymongo source; to match our production deployment I fetched 2.8:
pip download pymongo==2.8
tar -xf
Open the source tree and start from the collection module, since the insert path carries the relevant logic. Searching there turns up:
class Collection(common.BaseObject):
    """A Mongo collection.
    """

    def insert(self, doc_or_docs, manipulate=True,
               safe=None, check_keys=True, continue_on_error=False, **kwargs):
        """Insert a document(s) into this collection.

        If `manipulate` is ``True``, the document(s) are manipulated using
        any :class:`~pymongo.son_manipulator.SONManipulator` instances
        that have been added to this :class:`~pymongo.database.Database`.
        In this case an ``"_id"`` will be added if the document(s) does
        not already contain one and the ``"_id"`` (or list of ``"_id"``
        values for more than one document) will be returned.
        ...
        """
        client = self.database.connection
        # Batch inserts require us to know the connected primary's
        # max_bson_size, max_message_size, and max_write_batch_size.
        # We have to be connected to the primary to know that.
        client._ensure_connected(True)

        docs = doc_or_docs
        return_one = False
        if isinstance(docs, dict):
            return_one = True
            docs = [docs]

        ids = []

        if manipulate:
            def gen():
                db = self.__database
                for doc in docs:
                    # Apply user-configured SON manipulators. This order of
                    # operations is required for backwards compatibility,
                    # see PYTHON-709.
                    doc = db._apply_incoming_manipulators(doc, self)
                    if '_id' not in doc:
                        doc['_id'] = ObjectId()
                    doc = db._apply_incoming_copying_manipulators(doc, self)
                    ids.append(doc['_id'])
                    yield doc
        else:
            def gen():
                for doc in docs:
                    ids.append(doc.get('_id'))
                    yield doc

        safe, options = self._get_write_mode(safe, **kwargs)

        if client.max_wire_version > 1 and safe:
            ...
So by default the ObjectId is generated on the client; only when the caller passes manipulate=False and the document carries no _id does the server assign one.
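The manipulate=True branch boils down to the following pattern, sketched here with a placeholder id factory so it runs without a server or a bson install:

```python
import os

# Placeholder standing in for bson's ObjectId(); any unique value works
# for demonstrating the control flow.
def new_id():
    return os.urandom(12)

# Minimal sketch of the manipulate=True branch: assign a client-side
# _id to any document that lacks one, and collect the ids to return.
def prepare(docs):
    ids = []
    for doc in docs:
        if '_id' not in doc:
            doc['_id'] = new_id()
        ids.append(doc['_id'])
    return docs, ids

docs, ids = prepare([{'x': 1}, {'_id': b'fixed', 'x': 2}])
print(ids[1])  # b'fixed'
```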
The Go algorithm
var objectIDCounter = readRandomUint32()
var processUnique = processUniqueBytes()

// NewObjectIDFromTimestamp generates a new ObjectID based on the given time.
func NewObjectIDFromTimestamp(timestamp time.Time) ObjectID {
	var b [12]byte

	binary.BigEndian.PutUint32(b[0:4], uint32(timestamp.Unix()))
	copy(b[4:9], processUnique[:])
	putUint24(b[9:12], atomic.AddUint32(&objectIDCounter, 1))

	return b
}

func processUniqueBytes() [5]byte {
	var b [5]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return b
}

func readRandomUint32() uint32 {
	var b [4]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return (uint32(b[0]) << 0) | (uint32(b[1]) << 8) | (uint32(b[2]) << 16) | (uint32(b[3]) << 24)
}
In the latest official Go driver, the middle 5 bytes have become purely random. So how fast is the Go version?
A for loop generating the counter's full range of 16,777,216 ids takes only 264.349586ms.
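For comparison, here is a stdlib-only Python sketch of that same newer layout: 4-byte time, 5 crypto-random bytes fixed per process, and a 3-byte counter starting at a random value.

```python
import itertools
import os
import struct
import time

# Per-process state, mirroring the Go driver's scheme.
_process_unique = os.urandom(5)
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))

def new_oid():
    ts = struct.pack(">I", int(time.time()))
    cnt = (next(_counter) % 0x1000000).to_bytes(3, "big")
    return ts + _process_unique + cnt

a, b = new_oid(), new_oid()
assert len(a) == 12
# Same process -> same middle 5 bytes; only the counter advances.
assert a[4:9] == b[4:9]
```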
Conclusion
ObjectIds are unique by construction, across different collections and even across different databases. Of course, if your workload really produces tens of millions of ids per second, disregard everything I said.