log_checkpoint_interval和log_checkpoint_timeout

9i以后可能大家都喜欢通过设置fast_start_mttr_target来控制instance recovery的粒度。但是仍然有两个参数一直影响着我们的checkpoint,就是log_checkpoint_interval和log_checkpoint_timeout  
log_checkpoint_interval
Oracle8.1版本后log_checkpoint_interval指的是两次checkpoint之间操作系统数据块的个数。checkpoint时Oracle把内存里修改过的数据块用DBWR写到物理文件,用LGWR写到日志(在8i的时候lgwr进程在兼有ckpt进程的作用,呵呵。为了减轻我们本来就可能在高压情况下疲于奔命的LGWR兄弟的负担,Oracle引入了ckpt来更新我们的控制文件和数据文件头的SCN信息)。  
一般UNIX操作系统的数据块为512bytes。  
从性能优化的角度来说,建议log_checkpoint_interval=redologfilesizebytes/512bytes,根据我们的online redo file的大小来指定我们数据块的个数.假设LOG_CHECKPOINT_INTERVAL 设置为2000,也就是说,当产生的日志量1m之后,就会触发一个checkpoint。我们可以通过设置log_checkpoints_to_alert =true进行观察测试。
from concept:
LOG_CHECKPOINT_INTERVAL specifies the frequency of checkpoints(用来指定检查点发生的频率) in terms of the number of redo log file blocks that can exist between an incremental checkpoint and the last block written to the redo log. This number refers to physical operating system blocks, not database blocks.
Regardless of this value, a checkpoint always occurs when switching from one online redo log file to another. Therefore, if the value exceeds the actual redo log file size, checkpoints occur only when switching logs. Checkpoint frequency is one of the factors that influence the time required for the database to recover from an unexpected failure. 
log_checkpoint_timeout  
Oracle8.1版本后log_checkpoint_timeout指的是两次checkpoint之间时间秒数(单位是秒)。  
Oracle建议不用这个参数来控制,因为事务(transaction)大小不是按时间等量分布的(事务的长短并不是最重要的,重要的是我们的业务逻辑和数据的完整性)。那么我们用log_checkpoint_interval参数控制会更好一些。  
 我们可以通过log_checkpoint_timeout=0来禁用此参数或者按默认的900。  
LOG_CHECKPOINT_TIMEOUT specifies (in seconds) the amount of time that has passed since the incremental checkpoint at the position where the last write to the redo log (sometimes called the tail of the log) occurred. This parameter also signifies that no buffer will remain dirty (in the cache) for more than integer seconds.
Specifying a value of 0 for the timeout disables time-based checkpoints. Hence, setting the value to 0 is not recommended unless FAST_START_MTTR_TARGET is set.

感谢惜分飞的帮助
参考至:http://space.itpub.net/12361284/viewspace-150449
如有错误,欢迎指正
邮箱:czmcj@163.com

/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field `model.action-expert-variant` is annotated with type `typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora']`, but the default value `gemma_300m_lora` has type `<class 'str'>`. We'll try to handle this gracefully, but it may cause unexpected behavior. warnings.warn(message) 19:07:30.004 [I] Running on: shuo-hp (10287:train.py:195) INFO:2025-05-12 19:07:30,228:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' 19:07:30.228 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (10287:xla_bridge.py:945) INFO:2025-05-12 19:07:30,228:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory 19:07:30.228 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (10287:xla_bridge.py:945) 19:07:30.500 [I] Wiped checkpoint directory /home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name (10287:checkpoints.py:25) 19:07:30.500 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10287:base_pytree_checkpoint_handler.py:332) 19:07:30.500 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10287:base_pytree_checkpoint_handler.py:332) 19:07:30.500 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (10287:multihost.py:375) 19:07:30.500 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>}, handler_registry=None (10287:checkpoint_manager.py:622) 19:07:30.501 [I] Deferred registration for item: "assets". Adding handler `<openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>` for item "assets" and save args `<class 'openpi.training.checkpoints.CallbackSave'>` and restore args `<class 'openpi.training.checkpoints.CallbackRestore'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Deferred registration for item: "train_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>` for item "train_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Deferred registration for item: "params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>` for item "params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>}). (10287:composite_checkpoint_handler.py:508) 19:07:30.501 [I] orbax-checkpoint version: 0.11.1 (10287:abstract_checkpointer.py:35) 19:07:30.501 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x72e5cacb85e0> timeout: 7200 secs and primary_host=0 for async checkpoint writes (10287:async_checkpointer.py:80) 19:07:30.501 [I] Found 0 checkpoint steps in /home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name (10287:checkpoint_manager.py:1528) 19:07:30.501 [I] Saving root metadata (10287:checkpoint_manager.py:1569) 19:07:30.501 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (10287:multihost.py:293) 19:07:30.501 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x72e5cadffd10> (10287:checkpoint_manager.py:797) 19:07:30.553 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (10287:config.py:166) Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). 19:07:30.553 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213) Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). 19:07:30.554 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213) Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). 19:07:30.555 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213) Traceback (most recent call last): File "/home/shuo/VLA/openpi/scripts/train.py", line 273, in <module> main(_config.cli()) File "/home/shuo/VLA/openpi/scripts/train.py", line 226, in main batch = next(data_iter) ^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 177, in __iter__ for batch in self._data_loader: File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 257, in __iter__ batch = next(data_iter) ^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 708, in __next__ data = self._next_data() ^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data return self._process_data(data) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data data.reraise() File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/_utils.py", line 733, in reraise raise exception KeyError: Caught KeyError in DataLoader worker process 0. Original Traceback (most recent call last): File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop data = fetcher.fetch(index) # type: ignore[possibly-undefined] ^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp> data = [self.dataset[idx] for idx in possibly_batched_index] ~~~~~~~~~~~~^^^^^ File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 47, in __getitem__ return self._transform(self._dataset[index]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 70, in __call__ data = transform(data) ^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 101, in __call__ return jax.tree.map(lambda k: flat_item[k], self.structure) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree.py", line 155, in map return tree_util.tree_map(f, tree, *rest, is_leaf=is_leaf) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in tree_map return treedef.unflatten(f(*xs) for xs in zip(*all_leaves)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in <genexpr> return treedef.unflatten(f(*xs) for xs in zip(*all_leaves)) ^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 101, in <lambda> return jax.tree.map(lambda k: flat_item[k], self.structure) ~~~~~~~~~^^^ KeyError: 'observation.images.cam_low'
最新发布
05-13
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值