An Analysis of MongoDB ObjectId

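This article analyzes how MongoDB generates ObjectIds and why they are globally unique. An id is a 4-byte timestamp, a 5-byte random value (derived from machine and process in the older Python driver, purely random in Go), and a 3-byte incrementing counter, which together avoid collisions. Even under high concurrency the collision probability is extremely low. In Python, generating 16777 ObjectIds (one millisecond's worth of counter values) took about 50 ms, so generation would have to be roughly 50 times faster before the counter could be exhausted. The conclusion is that ObjectIds are naturally unique and suitable for most scenarios.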

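Some background first. I am preparing to rework an old database, which involves migrating data from n tables (n in the millions) into m tables (1 <= m <= 100). That raises a question: could ids collide? Does MongoDB's id algorithm include any collection information that would guarantee ids are unique across all collections as well?
The conclusion first: there will be no collisions, and the id carries no collection information. Put another way, even if collisions were possible, including the collection would do negligibly little to reduce them.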

The official specification

A 4-byte timestamp, representing the ObjectId's creation, measured in seconds since the Unix epoch.
A 5-byte random value generated once per process. This random value is unique to the machine and process.
A 3-byte incrementing counter, initialized to a random value.

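That is: a 4-byte timestamp + a 5-byte random value (required to be unique per machine and process) + a 3-byte incrementing counter.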

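The 4-byte timestamp makes ids roughly increasing and unique at one-second granularity. Read as an unsigned 32-bit value, it runs out in 2106.
The 5-byte random value may look like a hidden risk, but an implementation can still make it unique per machine and process.
The 3-byte counter, with its random starting point, reduces the collision probability further. Within one process it allows 16777216 values per second, i.e. 16777.216 per millisecond.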
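To make the 4 + 5 + 3 layout concrete, here is a minimal sketch that splits an ObjectId into its three fields using only the standard library. The helper name `split_object_id` and the sample hex string are made up for illustration; they do not come from any driver.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
COUNTER_SPACE = 2 ** 24  # a 3-byte counter has 16777216 distinct values


def split_object_id(hex_id):
    """Split a 24-character hex ObjectId into (seconds, process value, counter)."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    # Bytes 0-3: big-endian seconds since the Unix epoch.
    (seconds,) = struct.unpack(">I", raw[0:4])
    # Bytes 4-8: the per-process value (machine + pid in older drivers).
    process_value = raw[4:9]
    # Bytes 9-11: the big-endian 3-byte counter.
    counter = int.from_bytes(raw[9:12], "big")
    return seconds, process_value, counter


# Hypothetical id, chosen only so that each field is easy to recognize.
seconds, process_value, counter = split_object_id("5f2b1c3da1b2c3d4e5000001")
created = EPOCH + timedelta(seconds=seconds)
# Last second an unsigned 4-byte timestamp can represent.
last_second = EPOCH + timedelta(seconds=2 ** 32 - 1)
print(created.isoformat(), process_value.hex(), counter, last_second.year)
```

Two ids can only be equal if all three fields match, which is why the collection name is not needed for uniqueness.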

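What does that counter rate mean in practice? Here is a Python session to illustrate (time and bson's objectid module are assumed to be imported already):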

In [9]: def func():
   ...:     st = time.time()
   ...:     for i in range(16777):
   ...:         i += 1
   ...:     return (time.time() - st) * 1000
In [12]: func()
Out[12]: 2.1560192108154297
In [14]: def func():
    ...:     st = time.time()
    ...:     for i in range(16777):
    ...:         objectid.ObjectId()
    ...:     return (time.time() - st) * 1000
In [15]: func()
Out[15]: 49.735069274902344

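Generating 16777 ids took about 50 ms, while the counter has room for 16777 values per millisecond. In other words, Python's generator would have to be about 50 times faster, doing nothing but producing ids, before a collision became possible.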
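The factor of 50 follows directly from the measurement. A short sketch of the arithmetic, using only the numbers from the session above (the variable names are mine):

```python
COUNTER_SPACE = 2 ** 24          # 3-byte counter: 16777216 values per second
ids_generated = 16777            # loop size used in the measurement
elapsed_ms = 49.735069274902344  # measured time for the ObjectId() loop

# The counter can only wrap within one second if a single process
# emits all 2**24 values inside that second.
ids_per_ms_needed = COUNTER_SPACE / 1000
ids_per_ms_measured = ids_generated / elapsed_ms
speedup_needed = ids_per_ms_needed / ids_per_ms_measured

print(round(ids_per_ms_measured, 1), round(speedup_needed, 1))
```

The measured rate is a few hundred ids per millisecond on the author's machine, so the counter space is nowhere near exhausted.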

Python

The _id algorithm

In everyday use we write from bson import objectid, so the relevant file is bson/objectid.py


class ObjectId(object):
    """A MongoDB ObjectId.
    """

    _inc = random.randint(0, 0xFFFFFF)
    _inc_lock = threading.Lock()

    _machine_bytes = _machine_bytes()

    __slots__ = ('__id')

    _type_marker = 7

    def __init__(self, oid=None):
        """Initialize a new ObjectId.

        An ObjectId is a 12-byte unique identifier consisting of:

          - a 4-byte value representing the seconds since the Unix epoch,
          - a 3-byte machine identifier,
          - a 2-byte process id, and
          - a 3-byte counter, starting with a random value.
        ...
        """
        if oid is None:
            self.__generate()
        else:
            self.__validate(oid)

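Look at the docstring first: it matches MongoDB's id generation rules, with the middle 5 bytes split into a 3-byte machine identifier and a 2-byte process id. Next, when no oid is passed in, the constructor calls a function to generate one: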

    def __generate(self):
        """Generate a new value for this ObjectId.
        """
        oid = EMPTY

        # 4 bytes current time
        oid += struct.pack(">i", int(time.time()))

        # 3 bytes machine
        oid += ObjectId._machine_bytes

        # 2 bytes pid
        oid += struct.pack(">H", os.getpid() % 0xFFFF)

        # 3 bytes inc
        ObjectId._inc_lock.acquire()
        oid += struct.pack(">i", ObjectId._inc)[1:4]
        ObjectId._inc = (ObjectId._inc + 1) % 0xFFFFFF
        ObjectId._inc_lock.release()

        self.__id = oid
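The same layout can be reproduced with the standard library alone. This is a sketch of the idea, not pymongo's code: the names `legacy_style_id`, `_counter` and `_machine` are mine, the timestamp is packed unsigned (`">I"`), and the pid and counter are wrapped with a bit mask instead of the `%` used in the quoted source.

```python
import hashlib
import os
import random
import socket
import struct
import threading
import time

_counter = random.randint(0, 0xFFFFFF)
_counter_lock = threading.Lock()
# First 3 bytes of the MD5 of the hostname, as in the quoted _machine_bytes.
_machine = hashlib.md5(socket.gethostname().encode()).digest()[:3]


def legacy_style_id():
    """Return 12 bytes laid out as timestamp | machine | pid | counter."""
    global _counter
    # The lock makes read-and-increment atomic across threads.
    with _counter_lock:
        count = _counter
        _counter = (_counter + 1) & 0xFFFFFF  # wrap within 3 bytes
    return (
        struct.pack(">I", int(time.time()))        # 4 bytes, unsigned seconds
        + _machine                                 # 3 bytes
        + struct.pack(">H", os.getpid() & 0xFFFF)  # 2 bytes
        + struct.pack(">I", count)[1:4]            # low 3 bytes of the counter
    )


first, second = legacy_style_id(), legacy_style_id()
print(first.hex())
print(second.hex())
```

Two ids from the same process share the middle 5 bytes and differ in the counter, so they stay distinct even when created within the same second.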

where _machine_bytes is:

def _machine_bytes():
    """Get the machine portion of an ObjectId.
    """
    machine_hash = _md5func()
    if PY3:
        # gethostname() returns a unicode string in python 3.x
        # while update() requires a byte string.
        machine_hash.update(socket.gethostname().encode())
    else:
        # Calling encode() here will fail with non-ascii hostnames
        machine_hash.update(socket.gethostname())
    return machine_hash.digest()[0:3]

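Can the machine and pid parts collide? There is some probability, but it can be avoided entirely by checking the machine hashes in advance. (In the latest versions, though, the middle 5 bytes have been replaced by a fully random value.)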
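One way such a pre-check might look: hash every hostname in the fleet the same way the quoted code does and report any that share the same 3 bytes. The function names and the hostnames below are hypothetical.

```python
import hashlib
from collections import defaultdict


def machine_bytes(hostname):
    """First 3 bytes of the MD5 of a hostname, as in the quoted code."""
    return hashlib.md5(hostname.encode()).digest()[:3]


def find_machine_collisions(hostnames):
    """Group hostnames by machine bytes and return only the clashing groups."""
    groups = defaultdict(list)
    for name in hostnames:
        groups[machine_bytes(name)].append(name)
    return {key.hex(): names for key, names in groups.items() if len(names) > 1}


# Hypothetical fleet; replace with the real list of hostnames.
hosts = ["app-01", "app-02", "worker-01"]
print(find_machine_collisions(hosts))
```

An empty result means no two hosts share a machine identifier. A clash on its own is still harmless unless the pid, the second, and the counter also coincide.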

Who generates _id

First download the pymongo source. I used 2.8 here, to match the version we run in production:

pip download pymongo==2.8
tar -xf 

Open the source directory and start from collection, since the insert operation contains the relevant logic. A quick search leads to:

class Collection(common.BaseObject):
    """A Mongo collection.
    """

    def insert(self, doc_or_docs, manipulate=True,
               safe=None, check_keys=True, continue_on_error=False, **kwargs):
        """Insert a document(s) into this collection.

        If `manipulate` is ``True``, the document(s) are manipulated using
        any :class:`~pymongo.son_manipulator.SONManipulator` instances
        that have been added to this :class:`~pymongo.database.Database`.
        In this case an ``"_id"`` will be added if the document(s) does
        not already contain one and the ``"id"`` (or list of ``"_id"``
        values for more than one document) will be returned.
        ...
        """
        client = self.database.connection
        # Batch inserts require us to know the connected primary's
        # max_bson_size, max_message_size, and max_write_batch_size.
        # We have to be connected to the primary to know that.
        client._ensure_connected(True)

        docs = doc_or_docs
        return_one = False
        if isinstance(docs, dict):
            return_one = True
            docs = [docs]

        ids = []

        if manipulate:
            def gen():
                db = self.__database
                for doc in docs:
                    # Apply user-configured SON manipulators. This order of
                    # operations is required for backwards compatibility,
                    # see PYTHON-709.
                    doc = db._apply_incoming_manipulators(doc, self)
                    if '_id' not in doc:
                        doc['_id'] = ObjectId()

                    doc = db._apply_incoming_copying_manipulators(doc, self)
                    ids.append(doc['_id'])
                    yield doc
        else:
            def gen():
                for doc in docs:
                    ids.append(doc.get('_id'))
                    yield doc

        safe, options = self._get_write_mode(safe, **kwargs)

        if client.max_wire_version > 1 and safe:
        ...

So by default the ObjectId is generated by the client. Only when the user sets manipulate to False and the document has no _id is it left to the server to assign one.
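The core of the manipulate branch is small enough to sketch in isolation. This is an illustration of the pattern, not pymongo code: `with_client_side_ids` and the counter used in place of `ObjectId()` are stand-ins.

```python
import itertools


def with_client_side_ids(docs, make_id):
    """Yield each document, adding an "_id" only when one is missing."""
    for doc in docs:
        if "_id" not in doc:
            doc["_id"] = make_id()
        yield doc


# Stand-in id source for illustration; a real client would call ObjectId().
_fake_ids = itertools.count(1)
docs = [{"name": "a"}, {"_id": "kept", "name": "b"}]
result = list(with_client_side_ids(docs, lambda: next(_fake_ids)))
print(result)
```

Because the id is attached before the document leaves the client, the target collection plays no part in choosing it, which matches the observation that ids carry no collection information.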

Go

var objectIDCounter = readRandomUint32()
var processUnique = processUniqueBytes()

// NewObjectIDFromTimestamp generates a new ObjectID based on the given time.
func NewObjectIDFromTimestamp(timestamp time.Time) ObjectID {
	var b [12]byte

	binary.BigEndian.PutUint32(b[0:4], uint32(timestamp.Unix()))
	copy(b[4:9], processUnique[:])
	putUint24(b[9:12], atomic.AddUint32(&objectIDCounter, 1))

	return b
}

func processUniqueBytes() [5]byte {
	var b [5]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return b
}

func readRandomUint32() uint32 {
	var b [4]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return (uint32(b[0]) << 0) | (uint32(b[1]) << 8) | (uint32(b[2]) << 16) | (uint32(b[3]) << 24)
}

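In the latest official MongoDB driver for Go, the middle 5 bytes are purely random. So how fast is the Go version?
A for loop that generates the full counter range of 16777216 ids takes only 264.349586ms.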
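With a purely random 5-byte value, the chance that two processes draw the same one can be estimated with the birthday approximation, P ≈ 1 - exp(-k(k-1) / (2N)) for k processes and N = 2^40 possible values. The sketch below is a rough estimate under the assumption of uniformly random values; the function name is mine.

```python
import math


def process_value_collision_probability(processes, bits=40):
    """Birthday-bound estimate that any two processes share a random value."""
    space = 2 ** bits
    # P(collision) ~= 1 - exp(-k(k-1) / (2N)); expm1 keeps precision
    # when the probability is tiny.
    return -math.expm1(-processes * (processes - 1) / (2 * space))


for k in (100, 10_000, 1_000_000):
    print(k, process_value_collision_probability(k))
```

This is only an upper bound on id collisions: two processes that share the random value still have to produce the same counter value in the same second before an actual duplicate id appears.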

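Conclusion

ObjectIds are naturally unique, across different collections and different databases alike. Of course, if your business really does generate on the order of ten million ids per second, ignore everything I said.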
