First, some background. We are reworking an old database, which involves migrating data from n (millions of) collections into m (1 <= m <= 100) collections. That raises a question: could the ids collide? Does MongoDB's id algorithm embed any collection information to guarantee that ids are unique across all collections?
The conclusion up front: there will be no collisions, and no collection information is involved. Put differently, even if collisions were possible, any mitigation from collection information would be negligible.
The official algorithm specification
A 4-byte timestamp, representing the ObjectId's creation, measured in seconds since the Unix epoch.
A 5-byte random value generated once per process. This random value is unique to the machine and process.
A 3-byte incrementing counter, initialized to a random value.
In short: a 4-byte timestamp + a 5-byte random value (expected to be unique per machine and process) + a 3-byte counter.
The 4-byte timestamp provides rough monotonicity and uniqueness at second granularity; it will be exhausted in the year 2106.
The 5-byte random value looks like a potential weak spot at first glance, but it is feasible for the algorithm to keep it unique per machine and process.
The 3-byte counter, initialized to a random value, further reduces the probability of collision. It allows 16,777,216 values per second, i.e. about 16,777 per millisecond.
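The 2106 figure follows from reading the 4 timestamp bytes as an unsigned 32-bit second count; a quick stdlib check:

```python
from datetime import datetime, timedelta

# The last representable moment is 2**32 - 1 seconds after the Unix epoch.
last = datetime(1970, 1, 1) + timedelta(seconds=2**32 - 1)
print(last.year)  # 2106
```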
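How risky are 5 random bytes really? A back-of-the-envelope birthday bound (my own sketch, not part of any driver) for n concurrent processes each drawing from 2**40 possible values:

```python
def collision_prob(n, space=2**40):
    """Probability that at least two of n draws from `space` values collide."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (space - k) / space
    return 1.0 - p_distinct

# Even 1000 simultaneously started processes collide with
# probability well under one in a million.
print(collision_prob(1000))
```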
To get a feel for what that counter rate means, here is a quick Python benchmark:
In [8]: import time

In [9]: def func():
   ...:     st = time.time()
   ...:     for i in range(16777):
   ...:         i += 1
   ...:     return (time.time() - st) * 1000

In [12]: func()
Out[12]: 2.1560192108154297
In [13]: from bson import objectid

In [14]: def func():
    ...:     st = time.time()
    ...:     for i in range(16777):
    ...:         objectid.ObjectId()
    ...:     return (time.time() - st) * 1000

In [15]: func()
Out[15]: 49.735069274902344
In other words, Python's generator would have to run about 50 times faster, doing nothing but producing ids, before the counter could wrap within one second and a collision become possible.
Python's _id algorithm
In everyday code we write from bson import objectid, so the relevant file is bson/objectid.py:
class ObjectId(object):
    """A MongoDB ObjectId.
    """

    _inc = random.randint(0, 0xFFFFFF)
    _inc_lock = threading.Lock()

    _machine_bytes = _machine_bytes()

    __slots__ = ('__id')

    _type_marker = 7

    def __init__(self, oid=None):
        """Initialize a new ObjectId.

        An ObjectId is a 12-byte unique identifier consisting of:

          - a 4-byte value representing the seconds since the Unix epoch,
          - a 3-byte machine identifier,
          - a 2-byte process id, and
          - a 3-byte counter, starting with a random value.
        ...
        """
        if oid is None:
            self.__generate()
        else:
            self.__validate(oid)
Start with the docstring: it matches the MongoDB id layout, except that this older version splits the 5 middle bytes into a 3-byte machine identifier plus a 2-byte pid. Next, when no oid is passed in, a new value is generated:
    def __generate(self):
        """Generate a new value for this ObjectId.
        """
        oid = EMPTY

        # 4 bytes current time
        oid += struct.pack(">i", int(time.time()))

        # 3 bytes machine
        oid += ObjectId._machine_bytes

        # 2 bytes pid
        oid += struct.pack(">H", os.getpid() % 0xFFFF)

        # 3 bytes inc
        ObjectId._inc_lock.acquire()
        oid += struct.pack(">i", ObjectId._inc)[1:4]
        ObjectId._inc = (ObjectId._inc + 1) % 0xFFFFFF
        ObjectId._inc_lock.release()

        self.__id = oid
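That packing can be exercised standalone with nothing but the stdlib. A sketch that assembles a 12-byte id with the same 2.8-style fields and decodes it back (all field values here are made up for illustration):

```python
import struct

# Pack: 4-byte time, 3-byte machine hash, 2-byte pid, 3-byte counter.
def make_oid(ts, machine3, pid, counter):
    return (struct.pack(">i", ts)
            + machine3
            + struct.pack(">H", pid % 0xFFFF)
            + struct.pack(">i", counter % 0xFFFFFF)[1:4])

oid = make_oid(1500000000, b"\xaa\xbb\xcc", 4321, 42)
assert len(oid) == 12

# Decode the fields back out of the 12 bytes.
ts = struct.unpack(">i", oid[0:4])[0]
counter = int.from_bytes(oid[9:12], "big")
print(ts, counter)  # 1500000000 42
```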
where _machine_bytes is:
def _machine_bytes():
    """Get the machine portion of an ObjectId.
    """
    machine_hash = _md5func()
    if PY3:
        # gethostname() returns a unicode string in python 3.x
        # while update() requires a byte string.
        machine_hash.update(socket.gethostname().encode())
    else:
        # Calling encode() here will fail with non-ascii hostnames
        machine_hash.update(socket.gethostname())
    return machine_hash.digest()[0:3]
Could the machine hash and pid collide across hosts? There is some probability, but you can compute the machine hashes up front and rule that out. (In the latest drivers, the middle 5 bytes have been replaced by a fully random value anyway.)
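That pre-check needs nothing from bson. A stdlib replica of the 2.8 _machine_bytes above (md5 of the hostname, first 3 bytes) shows the value is deterministic per hostname, so you can hash all your hosts' names ahead of time and compare:

```python
import hashlib
import socket

# Stdlib replica of pymongo 2.8's _machine_bytes.
def machine_bytes(hostname=None):
    name = hostname or socket.gethostname()
    return hashlib.md5(name.encode()).digest()[:3]

# Deterministic per hostname: two hosts only collide if their
# 3-byte md5 prefixes happen to match.
assert machine_bytes("host-a") == machine_bytes("host-a")
print(machine_bytes("host-a").hex())
```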
Who generates the _id?
First, download the pymongo source; to match our production deployment I fetched 2.8:
pip download pymongo==2.8
tar -xf
Open the source tree and start from the collection module, since the insert path carries the relevant logic. Searching there turns up:
class Collection(common.BaseObject):
    """A Mongo collection.
    """

    def insert(self, doc_or_docs, manipulate=True,
               safe=None, check_keys=True, continue_on_error=False, **kwargs):
        """Insert a document(s) into this collection.

        If `manipulate` is ``True``, the document(s) are manipulated using
        any :class:`~pymongo.son_manipulator.SONManipulator` instances
        that have been added to this :class:`~pymongo.database.Database`.
        In this case an ``"_id"`` will be added if the document(s) does
        not already contain one and the ``"_id"`` (or list of ``"_id"``
        values for more than one document) will be returned.
        ...
        """
        client = self.database.connection
        # Batch inserts require us to know the connected primary's
        # max_bson_size, max_message_size, and max_write_batch_size.
        # We have to be connected to the primary to know that.
        client._ensure_connected(True)

        docs = doc_or_docs
        return_one = False
        if isinstance(docs, dict):
            return_one = True
            docs = [docs]

        ids = []

        if manipulate:
            def gen():
                db = self.__database
                for doc in docs:
                    # Apply user-configured SON manipulators. This order of
                    # operations is required for backwards compatibility,
                    # see PYTHON-709.
                    doc = db._apply_incoming_manipulators(doc, self)
                    if '_id' not in doc:
                        doc['_id'] = ObjectId()
                    doc = db._apply_incoming_copying_manipulators(doc, self)
                    ids.append(doc['_id'])
                    yield doc
        else:
            def gen():
                for doc in docs:
                    ids.append(doc.get('_id'))
                    yield doc

        safe, options = self._get_write_mode(safe, **kwargs)

        if client.max_wire_version > 1 and safe:
            ...
So by default the ObjectId is generated on the client; only when the caller passes manipulate=False and the document carries no _id does the server assign one.
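The manipulate=True branch boils down to the following pattern, sketched here with a placeholder id factory so it runs without a server or a bson install:

```python
import os

# Placeholder standing in for bson's ObjectId(); any unique value works
# for demonstrating the control flow.
def new_id():
    return os.urandom(12)

# Minimal sketch of the manipulate=True branch: assign a client-side
# _id to any document that lacks one, and collect the ids to return.
def prepare(docs):
    ids = []
    for doc in docs:
        if '_id' not in doc:
            doc['_id'] = new_id()
        ids.append(doc['_id'])
    return docs, ids

docs, ids = prepare([{'x': 1}, {'_id': b'fixed', 'x': 2}])
print(ids[1])  # b'fixed'
```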
The Go algorithm
var objectIDCounter = readRandomUint32()
var processUnique = processUniqueBytes()

// NewObjectIDFromTimestamp generates a new ObjectID based on the given time.
func NewObjectIDFromTimestamp(timestamp time.Time) ObjectID {
	var b [12]byte

	binary.BigEndian.PutUint32(b[0:4], uint32(timestamp.Unix()))
	copy(b[4:9], processUnique[:])
	putUint24(b[9:12], atomic.AddUint32(&objectIDCounter, 1))

	return b
}

func processUniqueBytes() [5]byte {
	var b [5]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return b
}

func readRandomUint32() uint32 {
	var b [4]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return (uint32(b[0]) << 0) | (uint32(b[1]) << 8) | (uint32(b[2]) << 16) | (uint32(b[3]) << 24)
}
In the latest official Go driver, the middle 5 bytes have become purely random. So how fast is the Go version?
A for loop generating the counter's full range of 16,777,216 ids takes only 264.349586ms.
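For comparison, here is a stdlib-only Python sketch of that same newer layout: 4-byte time, 5 crypto-random bytes fixed per process, and a 3-byte counter starting at a random value.

```python
import itertools
import os
import struct
import time

# Per-process state, mirroring the Go driver's scheme.
_process_unique = os.urandom(5)
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))

def new_oid():
    ts = struct.pack(">I", int(time.time()))
    cnt = (next(_counter) % 0x1000000).to_bytes(3, "big")
    return ts + _process_unique + cnt

a, b = new_oid(), new_oid()
assert len(a) == 12
# Same process -> same middle 5 bytes; only the counter advances.
assert a[4:9] == b[4:9]
```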
Conclusion
ObjectIds are unique by construction, across different collections and even across different databases. Of course, if your workload really produces tens of millions of ids per second, disregard everything I said.