acting set: the OSDs that are currently serving reads and writes for the PG
up set: the OSDs computed for the PG by the CRUSH algorithm (normally identical to the acting set)
When an OSD goes down or data migration is triggered, the acting set is re-established (via a pg_temp entry in the OSD map); data is migrated toward the up set while the acting set continues to serve reads and writes.
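To make the difference between the two sets visible on a live cluster, here is a minimal Python sketch that shells out to the ceph CLI (it assumes the CLI and an admin keyring are available on the host; the pool and object names are placeholders, and the JSON field names "up"/"acting" are what recent releases emit from `ceph osd map`):

```python
import json
import subprocess

def pg_mapping(pool, obj):
    """'ceph osd map' reports both the up set (computed by CRUSH) and the
    acting set (the OSDs actually serving I/O) for the PG an object maps to."""
    out = subprocess.check_output(
        ["ceph", "osd", "map", pool, obj, "--format", "json"])
    m = json.loads(out)
    return m["up"], m["acting"]

if __name__ == "__main__":
    up, acting = pg_mapping("rbd", "test-object")  # placeholder pool/object names
    print("up set:    ", up)       # where CRUSH says the data should live
    print("acting set:", acting)   # who serves reads/writes right now
```

When the two lists differ, the PG is typically being remapped or backfilled.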
Peering being complete does not mean that all replicas hold identical data; it only means that the replicas have reached agreement (on versions, primary/replica relationships, and so on).
ACTIVE
Means peering has completed and the primary OSD holds valid data; the PG can serve reads and writes.
CLEAN
Means peering has completed and no replica holds stray data; all objects are replicated the correct number of times.
DEGRADED
When a write has completed on the primary OSD but not yet on the other replica OSDs, the PG is in the degraded state.
If some of the OSDs in the PG are down, the PG is in the active+degraded state.
If an OSD stays down and the PG remains degraded for a long time, the OSD is marked out and its data is remapped elsewhere. The down-to-out delay is controlled by mon osd down out interval.
A PG is also marked degraded when some of its objects cannot be found or cannot be read or written; the remaining objects in the PG stay accessible.
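A quick way to check the down-to-out delay from a script, assuming a reasonably recent release where `ceph config get` is available (older clusters expose the same value through the monitor admin socket):

```python
import subprocess

# Read mon_osd_down_out_interval (seconds a down OSD is tolerated before
# being marked out). Assumes 'ceph config get' exists (Mimic and later);
# older releases expose it via 'ceph daemon mon.<id> config get ...'.
interval = subprocess.check_output(
    ["ceph", "config", "get", "mon", "mon_osd_down_out_interval"]
).decode().strip()
print("down -> out after", interval, "seconds")
```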
RECOVERING
When an OSD in the PG comes back up after being down (without a remap having occurred), its data lags behind the other replicas and must be brought up to date; while that happens the PG is marked recovering.
BACKFILLING
When a new OSD joins the cluster, PGs are reassigned to it; once backfilling completes, the new OSD can serve requests.
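A rough sketch for watching recovery/backfill progress from a script by summarising PG states out of `ceph status`; the exact JSON keys used here (`pgmap`, `pgs_by_state`) follow recent releases and may need adjusting:

```python
import json
import subprocess

# Summarise how many PGs are in each state (active+clean, recovering,
# backfilling, ...). Field names are based on recent 'ceph status -f json'
# output and are an assumption for other releases.
status = json.loads(subprocess.check_output(["ceph", "status", "--format", "json"]))
for entry in status["pgmap"].get("pgs_by_state", []):
    print(entry["state_name"], entry["count"])
```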
REMAPPED
The PG has been remapped: data is migrating from the old acting set to the new acting set. During the migration, requests addressed to the new primary OSD are forwarded to the old primary OSD until the migration completes.
Question: how are these states related, and what do the transitions between them look like?
STALE
The PG's primary OSD has not reported to the monitor for some time.
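To see why a particular PG is stale, degraded, and so on, `ceph pg <pgid> query` dumps its current state, up/acting sets and peering history. A small sketch with a placeholder pgid:

```python
import json
import subprocess

def pg_query(pgid):
    """Detailed per-PG view: state, up/acting sets and recovery_state
    history, as reported by 'ceph pg <pgid> query' (JSON output)."""
    return json.loads(subprocess.check_output(["ceph", "pg", pgid, "query"]))

info = pg_query("1.2f")  # placeholder pgid; pick a real one from 'ceph pg dump'
print(info["state"], info["up"], info["acting"])
```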
As previously noted, a placement group is not necessarily problematic just because its state is not active+clean. Generally, Ceph’s ability to self repair may not be working when placement groups get stuck. The stuck states include:
Unclean: Placement groups contain objects that are not replicated the desired number of times. They should be recovering.
Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come back up.
Stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to the monitor cluster in a while (configured by mon osd report timeout).
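The stuck states above can be listed directly; a small sketch that shells out to `ceph pg dump_stuck` for each of them (assumes the ceph CLI and an admin keyring on this host):

```python
import subprocess

# List PGs stuck in each of the states described above. 'ceph pg dump_stuck'
# accepts inactive / unclean / stale (and, on recent releases, undersized
# and degraded as well).
for state in ("inactive", "unclean", "stale"):
    print("--- stuck", state, "---")
    print(subprocess.check_output(["ceph", "pg", "dump_stuck", state]).decode())
```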
Questions:
Why do read and write operations have to go through the primary OSD? What scenarios can lead to data inconsistency?