MySQL Source Code Walkthrough: From Data to File, the Logs

1. The Database

A database is itself a managed store, so it must have its own way of organizing that store. How does a traditional relational database like this put data onto disk, and what forms can the files take? MySQL generally persists two kinds of files: log files on the one hand, and the real data files on the other; the data files in turn contain both index data and actual row data. This is also the main reason clustered and non-clustered indexes come up so often: the two kinds are stored on disk differently. The former is itself the data, kept in key order; the latter needs a second hop through a pointer to reach the data, and that data cannot share the index's ordering.
Or, put more plainly: indexes are built as B+ trees, and in a clustered index the leaf nodes are the data itself, while in a non-clustered index the leaf nodes hold pointers to the data.
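To make that concrete, here is a minimal, purely illustrative sketch; the struct names are invented, and InnoDB's real page and record formats are far more involved:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; InnoDB's actual layout differs.
struct Row {
  uint64_t id;          // primary key
  std::string payload;  // the remaining columns
};

// Clustered index: the leaf node *is* the data, ordered by primary key.
struct ClusteredLeaf {
  std::vector<Row> rows;  // full rows stored in key order
};

// Secondary (non-clustered) index: the leaf holds the indexed key plus a
// "pointer" back to the row. In InnoDB that pointer is the primary-key
// value, so a lookup needs a second traversal of the clustered index.
struct SecondaryLeaf {
  struct Entry {
    std::string indexed_key;  // value of the secondary key column
    uint64_t primary_key;     // locator used for the second lookup
  };
  std::vector<Entry> entries;
};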
Databases generally rely on WAL, write-ahead logging, to get data onto disk safely. A data modification in MySQL only touches the in-memory data pages and the redo log; flushing the real data pages is handled asynchronously by background threads. Because the log preserves consistency, the data should, in theory, never be at risk.
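A toy sketch of the WAL idea (the function and type names here are invented, not MySQL's API): the redo record is made durable before the commit is acknowledged, while the modified page itself is written back lazily.

#include <cstdio>
#include <string>

// Append the change to the log and force it toward disk *before*
// acknowledging the commit. If we crash afterwards, recovery replays
// the log; the data page stays in memory and is flushed later by a
// background thread.
void commit_change(std::FILE *wal, const std::string &redo_record) {
  std::fwrite(redo_record.data(), 1, redo_record.size(), wal);
  std::fflush(wal);       // user buffer -> OS page cache
  // fsync(fileno(wal));  // POSIX: OS page cache -> device (crash safety)
  // Only after this point may the transaction be reported as committed.
}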
Below we analyze how this log file reaches disk.

2. Log Flushing to Disk

Early versions of MySQL had no InnoDB engine; the built-in engine was MyISAM, which is why the MySQL server layer has a log system of its own, the familiar binlog (also called the archive log). This brings up the notion of crash-safe: even if the database crashes, the data remains safe. MyISAM cannot provide this property; the InnoDB engine introduced later can.
During the execution of a SQL statement, three logs come into play: the redo log, the undo log, and the binlog (archive log). These three logs are the foundation of MySQL's overall safety.
Redo log: the transaction log. It is specific to InnoDB and records the modifications to each page in the database (note: the granularity is the page, not the row); it is used to recover committed physical pages, and it is the most typical application of WAL. Note that it is fixed in size and written in a circle: once it fills up, the cleanup needed to reclaim space can stall MySQL.
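Because the redo log is a fixed-size ring, mapping an ever-growing LSN to a physical position is just modular arithmetic, and the stall mentioned above happens when the writer catches up with the checkpoint. A hedged sketch (the real mapping in the log0* code also accounts for file headers and block framing):

#include <cstdint>

using lsn_t = uint64_t;

// Hypothetical helper: map a monotonically growing LSN onto a ring of
// redo files with a given total usable capacity in bytes.
inline uint64_t redo_offset_for(lsn_t lsn, uint64_t ring_capacity) {
  return lsn % ring_capacity;  // old blocks are overwritten cyclically
}

// The checkpoint must advance before the writer wraps onto blocks whose
// dirty pages are not yet flushed; otherwise writers stall. This is the
// "MySQL freezes when the redo log fills up" effect mentioned above.
inline bool writer_must_wait(lsn_t current_lsn, lsn_t checkpoint_lsn,
                             uint64_t ring_capacity) {
  return current_lsn - checkpoint_lsn >= ring_capacity;
}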
Undo log: easy to understand. If a transaction does not succeed, the data must be restored to its original state; that rollback capability is what gives transactions their atomicity. On top of it, the undo log also powers MVCC row versioning, covered in detail in Designing Data-Intensive Applications, which provides consistent non-blocking reads and thereby helps protect the safety, integrity, and consistency of the data.
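A minimal sketch of how undo-based MVCC resolves a row version for a snapshot; the names are invented, and InnoDB's actual read-view logic is considerably more subtle:

#include <cstdint>
#include <optional>

// Toy row version chain: each version remembers the transaction that
// wrote it and a pointer to the previous version (rebuilt from undo).
struct Version {
  uint64_t trx_id;       // creator transaction
  const Version *older;  // previous version via the undo chain
  int value;
};

// A snapshot sees a version only if its creator committed before the
// snapshot was taken (simplified here to trx_id < snapshot_ceiling).
std::optional<int> read_visible(const Version *v, uint64_t snapshot_ceiling) {
  for (; v != nullptr; v = v->older) {
    if (v->trx_id < snapshot_ceiling) return v->value;  // visible version
  }
  return std::nullopt;  // the row did not exist for this snapshot
}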
Binlog (archive log): this is the traditional kind of log handling. It does not belong to any particular engine; it belongs to the MySQL server layer itself. It records the SQL statements that modify data (queries excluded). Its characteristic is that it keeps appending until a maximum size is reached (1 GB by default, configurable); past that, a new file is started, and so on. One exception: if a transaction happens to be in progress exactly when the limit is hit, the current file keeps growing instead of a new one being created, because a transaction is never split across binlog files.
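Condensed into code, the rotation rule reads roughly as follows (a hypothetical helper; the real checks live around MYSQL_BIN_LOG::rotate and the max_size member seen later in this article):

#include <cstdint>

// Hypothetical condensation of the rotation policy: rotate once the
// current file reaches max_binlog_size, *unless* a transaction is still
// being written. A transaction is never split across binlog files, so
// in that case the file is allowed to grow past the limit instead.
bool should_rotate(uint64_t file_bytes, uint64_t max_binlog_size,
                   bool transaction_in_progress) {
  return file_bytes >= max_binlog_size && !transaction_in_progress;
}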

It should be noted that, to keep the logs consistent with one another (in particular the redo log and the binlog), 2PC (two-phase commit) is used to guarantee data safety and consistency. We will not expand on it here; interested readers can look up the relevant material.
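For orientation, the essence of that protocol in a hedged sketch (the real flow runs through the TC_LOG::prepare/commit interface shown below, with the transaction's XID written into the binlog acting as the commit marker):

// Hypothetical outline of redo/binlog two-phase commit:
//
// 1. prepare: InnoDB writes the transaction's redo in PREPARED state.
// 2. binlog : the server writes the transaction (with its XID) to the
//             binlog and syncs it; this is the effective commit point.
// 3. commit : InnoDB marks the transaction committed in the redo log.
//
// Crash recovery scans prepared transactions: if the XID is found in
// the binlog, the transaction is committed, otherwise rolled back.
enum class TxOutcome { kCommit, kRollback };

TxOutcome recover_prepared(bool xid_present_in_binlog) {
  return xid_present_in_binlog ? TxOutcome::kCommit : TxOutcome::kRollback;
}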

3. Source Code Analysis

As always, looking at the source is the only way to see this clearly.
First, the relevant class definition (sql/tc_log.h / .cc):

/**
  Transaction Coordinator Log.

  A base abstract class for three different implementations of the
  transaction coordinator.

  The server uses the transaction coordinator to order transactions
  correctly and there are three different implementations: one using
  an in-memory structure, one dummy that does not do anything, and one
  using the binary log for transaction coordination.
*/
class TC_LOG {
 public:
  /**
    Perform heuristic recovery, if --tc-heuristic-recover was used.

    @note no matter whether heuristic recovery was successful or not
    mysqld must exit. So, return value is the same in both cases.

    @retval false  no heuristic recovery was requested
    @retval true   heuristic recovery was performed
  */
  bool using_heuristic_recover();

  TC_LOG() {}
  virtual ~TC_LOG() {}

  enum enum_result { RESULT_SUCCESS, RESULT_ABORTED, RESULT_INCONSISTENT };

  /**
    Initialize and open the coordinator log.
    Do recovery if necessary. Called during server startup.

    @param opt_name  Name of logfile.

    @retval 0  sucess
    @retval 1  failed
  */
  virtual int open(const char *opt_name) = 0;

  /**
    Close the transaction coordinator log and free any resources.
    Called during server shutdown.
  */
  virtual void close() = 0;

  /**
     Log a commit record of the transaction to the transaction
     coordinator log.

     When the function returns, the transaction commit is properly
     logged to the transaction coordinator log and can be committed in
     the storage engines.

     @param thd Session to log transaction for.
     @param all @c True if this is a "real" commit, @c false if it is a
     "statement" commit.

     @return Error code on failure, zero on success.
   */
  virtual enum_result commit(THD *thd, bool all) = 0;

  /**
     Log a rollback record of the transaction to the transaction
     coordinator log.

     When the function returns, the transaction have been aborted in
     the transaction coordinator log.

     @param thd Session to log transaction record for.

     @param all @c true if an explicit commit or an implicit commit
     for a statement, @c false if an internal commit of the statement.

     @return Error code on failure, zero on success.
   */
  virtual int rollback(THD *thd, bool all) = 0;

  /**
     Log a prepare record of the transaction to the storage engines.

     @param thd Session to log transaction record for.

     @param all @c true if an explicit commit or an implicit commit
     for a statement, @c false if an internal commit of the statement.

     @return Error code on failure, zero on success.
   */
  virtual int prepare(THD *thd, bool all) = 0;
};

The comment spells it out: the transaction coordinator has three implementations, one in-memory, one dummy, and one based on the binary log.
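To see how small the contract is, here is a hedged sketch of a do-nothing implementation, given the TC_LOG declaration above (MySQL's own dummy variant is TC_LOG_DUMMY in sql/tc_log.h, which delegates to the engine commit functions; this one is just for illustration):

// Hedged sketch: a no-op coordinator satisfying the TC_LOG interface.
class TC_LOG_NOOP : public TC_LOG {
 public:
  int open(const char *) override { return 0; }  // nothing to open
  void close() override {}                       // nothing to release
  enum_result commit(THD *, bool) override {
    // A real implementation must make the commit durable before the
    // engines commit; a dummy simply reports success.
    return RESULT_SUCCESS;
  }
  int rollback(THD *, bool) override { return 0; }
  int prepare(THD *, bool) override { return 0; }
};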
Next, the binlog class (sql/binlog.h / .cc):

/*
  TODO use mmap instead of IO_CACHE for binlog
  (mmap+fsync is two times faster than write+fsync)
*/
class MYSQL_BIN_LOG : public TC_LOG {
 public:
  class Binlog_ofile;

 private:
  enum enum_log_state { LOG_OPENED, LOG_CLOSED, LOG_TO_BE_OPENED };

  /* LOCK_log is inited by init_pthread_objects() */
  mysql_mutex_t LOCK_log;
  char *name;
  char log_file_name[FN_REFLEN];
  char db[NAME_LEN + 1];
  bool write_error, inited;
  Binlog_ofile *m_binlog_file;

  /** Instrumentation key to use for file io in @c log_file */
  PSI_file_key m_log_file_key;
  /** The instrumentation key to use for @ LOCK_log. */
  PSI_mutex_key m_key_LOCK_log;
  /** The instrumentation key to use for @ LOCK_index. */
  PSI_mutex_key m_key_LOCK_index;
  /** The instrumentation key to use for @ LOCK_binlog_end_pos. */
  PSI_mutex_key m_key_LOCK_binlog_end_pos;
  /** The PFS instrumentation key for @ LOCK_commit_queue. */
  PSI_mutex_key m_key_LOCK_commit_queue;
  /** The PFS instrumentation key for @ LOCK_done. */
  PSI_mutex_key m_key_LOCK_done;
  /** The PFS instrumentation key for @ LOCK_flush_queue. */
  PSI_mutex_key m_key_LOCK_flush_queue;
  /** The PFS instrumentation key for @ LOCK_sync_queue. */
  PSI_mutex_key m_key_LOCK_sync_queue;
  /** The PFS instrumentation key for @ COND_done. */
  PSI_mutex_key m_key_COND_done;
  /** The PFS instrumentation key for @ COND_flush_queue. */
  PSI_mutex_key m_key_COND_flush_queue;
  /** The instrumentation key to use for @ LOCK_commit. */
  PSI_mutex_key m_key_LOCK_commit;
  /** The instrumentation key to use for @ LOCK_sync. */
  PSI_mutex_key m_key_LOCK_sync;
  /** The instrumentation key to use for @ LOCK_xids. */
  PSI_mutex_key m_key_LOCK_xids;
  /** The instrumentation key to use for @ update_cond. */
  PSI_cond_key m_key_update_cond;
  /** The instrumentation key to use for @ prep_xids_cond. */
  PSI_cond_key m_key_prep_xids_cond;
  /** The instrumentation key to use for opening the log file. */
  PSI_file_key m_key_file_log;
  /** The instrumentation key to use for opening the log index file. */
  PSI_file_key m_key_file_log_index;
  /** The instrumentation key to use for opening a log cache file. */
  PSI_file_key m_key_file_log_cache;
  /** The instrumentation key to use for opening a log index cache file. */
  PSI_file_key m_key_file_log_index_cache;

  /* POSIX thread objects are inited by init_pthread_objects() */
  mysql_mutex_t LOCK_index;
  mysql_mutex_t LOCK_commit;
  mysql_mutex_t LOCK_sync;
  mysql_mutex_t LOCK_binlog_end_pos;
  mysql_mutex_t LOCK_xids;
  mysql_cond_t update_cond;

  std::atomic<my_off_t> atomic_binlog_end_pos;
  ulonglong bytes_written;
  IO_CACHE index_file;
  char index_file_name[FN_REFLEN];
  /*
    crash_safe_index_file is temp file used for guaranteeing
    index file crash safe when master server restarts.
  */
  IO_CACHE crash_safe_index_file;
  char crash_safe_index_file_name[FN_REFLEN];
  /*
    purge_file is a temp file used in purge_logs so that the index file
    can be updated before deleting files from disk, yielding better crash
    recovery. It is created on demand the first time purge_logs is called
    and then reused for subsequent calls. It is cleaned up in cleanup().
  */
  IO_CACHE purge_index_file;
  char purge_index_file_name[FN_REFLEN];
  /*
     The max size before rotation (usable only if log_type == LOG_BIN: binary
     logs and relay logs).
     For a binlog, max_size should be max_binlog_size.
     For a relay log, it should be max_relay_log_size if this is non-zero,
     max_binlog_size otherwise.
     max_size is set in init(), and dynamically changed (when one does SET
     GLOBAL MAX_BINLOG_SIZE|MAX_RELAY_LOG_SIZE) by fix_max_binlog_size and
     fix_max_relay_log_size).
  */
  ulong max_size;

  // current file sequence number for load data infile binary logging
  uint file_id;

  /* pointer to the sync period variable, for binlog this will be
     sync_binlog_period, for relay log this will be
     sync_relay_log_period
  */
  uint *sync_period_ptr;
  uint sync_counter;

  mysql_cond_t m_prep_xids_cond;
  std::atomic<int32> m_atomic_prep_xids{0};

  /**
    Increment the prepared XID counter.
   */
  void inc_prep_xids(THD *thd);

  /**
    Decrement the prepared XID counter.

    Signal m_prep_xids_cond if the counter reaches zero.
   */
  void dec_prep_xids(THD *thd);

  int32 get_prep_xids() { return m_atomic_prep_xids; }

  inline uint get_sync_period() { return *sync_period_ptr; }

 public:
  /*
    This is used to start writing to a new log file. The difference from
    new_file() is locking. new_file_without_locking() does not acquire
    LOCK_log.
  */
  int new_file_without_locking(
      Format_description_log_event *extra_description_event);

 private:
  int new_file_impl(bool need_lock,
                    Format_description_log_event *extra_description_event);

  bool open(PSI_file_key log_file_key, const char *log_name,
            const char *new_name, uint32 new_index_number);
  bool init_and_set_log_file_name(const char *log_name, const char *new_name,
                                  uint32 new_index_number);
  int generate_new_name(char *new_name, const char *log_name,
                        uint32 new_index_number = 0);

 public:
  const char *generate_name(const char *log_name, const char *suffix,
                            char *buff);
  bool is_open() { return atomic_log_state != LOG_CLOSED; }

  /* This is relay log */
  bool is_relay_log;

  uint8 checksum_alg_reset;  // to contain a new value when binlog is rotated
  /*
    Holds the last seen in Relay-Log FD's checksum alg value.
    The initial value comes from the slave's local FD that heads
    the very first Relay-Log file. In the following the value may change
    with each received master's FD_m.
    Besides to be used in verification events that IO thread receives
    (except the 1st fake Rotate, see @c Master_info:: checksum_alg_before_fd),
    the value specifies if/how to compute checksum for slave's local events
    and the first fake Rotate (R_f^1) coming from the master.
    R_f^1 needs logging checksum-compatibly with the RL's heading FD_s.

    Legends for the checksum related comments:

    FD     - Format-Description event,
    R      - Rotate event
    R_f    - the fake Rotate event
    E      - an arbirary event

    The underscore indexes for any event
    `_s'   indicates the event is generated by Slave
    `_m'   - by Master

    Two special underscore indexes of FD:
    FD_q   - Format Description event for queuing   (relay-logging)
    FD_e   - Format Description event for executing (relay-logging)

    Upper indexes:
    E^n    - n:th event is a sequence

    RL     - Relay Log
    (A)    - checksum algorithm descriptor value
    FD.(A) - the value of (A) in FD
  */
  binary_log::enum_binlog_checksum_alg relay_log_checksum_alg;

  MYSQL_BIN_LOG(uint *sync_period, bool relay_log = false);
  ~MYSQL_BIN_LOG() override;

  void set_psi_keys(
      PSI_mutex_key key_LOCK_index, PSI_mutex_key key_LOCK_commit,
      PSI_mutex_key key_LOCK_commit_queue, PSI_mutex_key key_LOCK_done,
      PSI_mutex_key key_LOCK_flush_queue, PSI_mutex_key key_LOCK_log,
      PSI_mutex_key key_LOCK_binlog_end_pos, PSI_mutex_key key_LOCK_sync,
      PSI_mutex_key key_LOCK_sync_queue, PSI_mutex_key key_LOCK_xids,
      PSI_cond_key key_COND_done, PSI_cond_key key_COND_flush_queue,
      PSI_cond_key key_update_cond, PSI_cond_key key_prep_xids_cond,
      PSI_file_key key_file_log, PSI_file_key key_file_log_index,
      PSI_file_key key_file_log_cache, PSI_file_key key_file_log_index_cache) {
    m_key_COND_done = key_COND_done;
    m_key_COND_flush_queue = key_COND_flush_queue;

    m_key_LOCK_commit_queue = key_LOCK_commit_queue;
    m_key_LOCK_done = key_LOCK_done;
    m_key_LOCK_flush_queue = key_LOCK_flush_queue;
    m_key_LOCK_sync_queue = key_LOCK_sync_queue;

    m_key_LOCK_index = key_LOCK_index;
    m_key_LOCK_log = key_LOCK_log;
    m_key_LOCK_binlog_end_pos = key_LOCK_binlog_end_pos;
    m_key_LOCK_commit = key_LOCK_commit;
    m_key_LOCK_sync = key_LOCK_sync;
    m_key_LOCK_xids = key_LOCK_xids;
    m_key_update_cond = key_update_cond;
    m_key_prep_xids_cond = key_prep_xids_cond;
    m_key_file_log = key_file_log;
    m_key_file_log_index = key_file_log_index;
    m_key_file_log_cache = key_file_log_cache;
    m_key_file_log_index_cache = key_file_log_index_cache;
  }

 public:
  /** Manage the MTS dependency tracking */
  Transaction_dependency_tracker m_dependency_tracker;

  /**
    Find the oldest binary log that contains any GTID that
    is not in the given gtid set.

    @param[out] binlog_file_name the file name of oldest binary log found
    @param[in]  gtid_set the given gtid set
    @param[out] first_gtid the first GTID information from the binary log
                file returned at binlog_file_name
    @param[out] errmsg the error message outputted, which is left untouched
                if the function returns false
    @return false on success, true on error.
  */
  bool find_first_log_not_in_gtid_set(char *binlog_file_name,
                                      const Gtid_set *gtid_set,
                                      Gtid *first_gtid, const char **errmsg);

  /**
    Reads the set of all GTIDs in the binary/relay log, and the set
    of all lost GTIDs in the binary log, and stores each set in
    respective argument.

    @param gtid_set Will be filled with all GTIDs in this binary/relay
    log.
    @param lost_groups Will be filled with all GTIDs in the
    Previous_gtids_log_event of the first binary log that has a
    Previous_gtids_log_event. This is requested to binary logs but not
    to relay logs.
    @param verify_checksum If true, checksums will be checked.
    @param need_lock If true, LOCK_log, LOCK_index, and
    global_sid_lock->wrlock are acquired; otherwise they are asserted
    to be taken already.
    @param [out] trx_parser  This will be used to return the actual
    relaylog transaction parser state because of the possibility
    of partial transactions.
    @param [out] partial_trx If a transaction was left incomplete
    on the relaylog, its GTID information should be returned to be
    used in the case of the rest of the transaction be added to the
    relaylog.
    @param is_server_starting True if the server is starting.
    @return false on success, true on error.
  */
  bool init_gtid_sets(Gtid_set *gtid_set, Gtid_set *lost_groups,
                      bool verify_checksum, bool need_lock,
                      Transaction_boundary_parser *trx_parser,
                      Gtid_monitoring_info *partial_trx,
                      bool is_server_starting = false);

  void set_previous_gtid_set_relaylog(Gtid_set *previous_gtid_set_param) {
    assert(is_relay_log);
    previous_gtid_set_relaylog = previous_gtid_set_param;
  }
  /**
    If the thread owns a GTID, this function generates an empty
    transaction and releases ownership of the GTID.

    - If the binary log is disabled for this thread, the GTID is
      inserted directly into the mysql.gtid_executed table and the
      GTID is included in @@global.gtid_executed.  (This only happens
      for DDL, since DML will save the GTID into table and release
      ownership inside ha_commit_trans.)

    - If the binary log is enabled for this thread, an empty
      transaction consisting of GTID, BEGIN, COMMIT is written to the
      binary log, the GTID is included in @@global.gtid_executed, and
      the GTID is added to the mysql.gtid_executed table on the next
      binlog rotation.

    This function must be called by any committing statement (COMMIT,
    implicitly committing statements, or Xid_log_event), after the
    statement has completed execution, regardless of whether the
    statement updated the database.

    This logic ensures that an empty transaction is generated for the
    following cases:

    - Explicit empty transaction:
      SET GTID_NEXT = 'UUID:NUMBER'; BEGIN; COMMIT;

    - Transaction or DDL that gets completely filtered out in the
      slave thread.

    @param thd The committing thread

    @retval 0 Success
    @retval nonzero Error
  */
  int gtid_end_transaction(THD *thd);
  /**
    Re-encrypt previous existent binary/relay logs as below.
      Starting from the next to last entry on the index file, iterating
      down to the first one:
        - If the file is encrypted, re-encrypt it. Otherwise, skip it.
        - If failed to open the file, report an error.

    @retval False Success
    @retval True  Error
  */
  bool reencrypt_logs();

 private:
  std::atomic<enum_log_state> atomic_log_state{LOG_CLOSED};

  /* The previous gtid set in relay log. */
  Gtid_set *previous_gtid_set_relaylog;

  int open(const char *opt_name) override { return open_binlog(opt_name); }

  /**
    Enter a stage of the ordered commit procedure.

    Entering is stage is done by:

    - Atomically entering a queue of THD objects (which is just one for
      the first phase).

    - If the queue was empty, the thread is the leader for that stage
      and it should process the entire queue for that stage.

    - If the queue was not empty, the thread is a follower and can go
      waiting for the commit to finish.

    The function will lock the stage mutex if the calling thread was designated
    leader for the phase.

    @param[in] thd    Session structure
    @param[in] stage  The stage to enter
    @param[in] queue  Thread queue for the stage
    @param[in] leave_mutex  Mutex that will be released when changing stage
    @param[in] enter_mutex  Mutex that will be taken when changing stage

    @retval true  In case this thread did not become leader, the function
                  returns true *after* the leader has completed the commit
                  on its behalf, so the thread should continue doing the
                  thread-local processing after the commit
                  (i.e. call finish_commit).

    @retval false The thread is the leader for the stage and should do
                  the processing.
  */
  bool change_stage(THD *thd, Commit_stage_manager::StageID stage, THD *queue,
                    mysql_mutex_t *leave_mutex, mysql_mutex_t *enter_mutex);
  std::pair<int, my_off_t> flush_thread_caches(THD *thd);
  int flush_cache_to_file(my_off_t *flush_end_pos);
  int finish_commit(THD *thd);
  std::pair<bool, bool> sync_binlog_file(bool force);
  void process_commit_stage_queue(THD *thd, THD *queue);
  void process_after_commit_stage_queue(THD *thd, THD *first);

  /**
    Set thread variables used while flushing a transaction.

    @param[in] thd  thread whose variables need to be set
    @param[in] all   This is @c true if this is a real transaction commit, and
                 @c false otherwise.
    @param[in] skip_commit
                 This is @c true if the call to @c ha_commit_low should
                 be skipped (it is handled by the caller somehow) and @c
                 false otherwise (the normal case).
  */
  void init_thd_variables(THD *thd, bool all, bool skip_commit);

  /**
    Fetch and empty BINLOG_FLUSH_STAGE and COMMIT_ORDER_FLUSH_STAGE flush queues
    and flush transactions to the disk, and unblock threads executing slave
    preserve commit order.

    @param[in] check_and_skip_flush_logs
                 if false then flush prepared records of transactions to the log
                 of storage engine.
                 if true then flush prepared records of transactions to the log
                 of storage engine only if COMMIT_ORDER_FLUSH_STAGE queue is
                 non-empty.

    @return Pointer to the first session of the BINLOG_FLUSH_STAGE stage queue.
  */
  THD *fetch_and_process_flush_stage_queue(
      const bool check_and_skip_flush_logs = false);

  /**
    Execute the flush stage.

    @param[out] total_bytes_var Pointer to variable that will be set to total
                                number of bytes flushed, or NULL.

    @param[out] rotate_var Pointer to variable that will be set to true if
                           binlog rotation should be performed after releasing
                           locks. If rotate is not necessary, the variable will
                           not be touched.

    @param[out] out_queue_var  Pointer to the sessions queue in flush stage.

    @return Error code on error, zero on success
  */
  int process_flush_stage_queue(my_off_t *total_bytes_var, bool *rotate_var,
                                THD **out_queue_var);

  /**
    Flush and commit the transaction.

    This will execute an ordered flush and commit of all outstanding
    transactions and is the main function for the binary log group
    commit logic. The function performs the ordered commit in four stages.

    Pre-condition: transactions should have called ha_prepare_low, using
                   HA_IGNORE_DURABILITY, before entering here.

    Stage#0 implements slave-preserve-commit-order for applier threads that
    write the binary log. i.e. it forces threads to enter the queue in the
    correct commit order.

    The stage#1 flushes the caches to the binary log and under
    LOCK_log and marks all threads that were flushed as not pending.

    The stage#2 syncs the binary log for all transactions in the group.

    The stage#3 executes under LOCK_commit and commits all transactions in
    order.

    There are three queues of THD objects: one for each stage.
    The Commit_order_manager maintains it own queue and its own order for the
    commit. So Stage#0 doesn't maintain separate StageID.

    When a transaction enters a stage, it adds itself to a queue. If the queue
    was empty so that this becomes the first transaction in the queue, the
    thread is the *leader* of the queue. Otherwise it is a *follower*. The
    leader will do all work for all threads in the queue, and the followers
    will wait until the last stage is finished.

    Stage 0 (SLAVE COMMIT ORDER):
    1. If slave-preserve-commit-order and is slave applier worker thread, then
       waits until its turn to commit i.e. till it is on the top of the queue.
    2. When it reaches top of the queue, it signals next worker in the commit
       order queue to awake.

    Stage 1 (FLUSH):
    1. Sync the engines (ha_flush_logs), since they prepared using non-durable
       settings (HA_IGNORE_DURABILITY).
    2. Generate GTIDs for all transactions in the queue.
    3. Write the session caches for all transactions in the queue to the binary
       log.
    4. Increment the counter of prepared XIDs.

    Stage 2 (SYNC):
    1. If it is time to sync, based on the sync_binlog option, sync the binlog.
    2. If sync_binlog==1, signal dump threads that they can read up to the
       position after the last transaction in the queue

    Stage 3 (COMMIT):
    This is performed by each thread separately, if binlog_order_commits=0.
    Otherwise by the leader does it for all threads.
    1. Call the after_sync hook.
    2. update the max_committed counter in the dependency_tracker
    3. call ha_commit_low
    4. Call the after_commit hook
    5. Update gtids
    6. Decrement the counter of prepared transactions

    If the binary log needs to be rotated, it is done after this. During
    rotation, it takes a lock that prevents new commit groups from executing the
    flush stage, and waits until the counter of prepared transactions becomes 0,
    before it creates the new file.

    @param[in] thd Session to commit transaction for
    @param[in] all This is @c true if this is a real transaction commit, and
                   @c false otherwise.
    @param[in] skip_commit
                   This is @c true if the call to @c ha_commit_low should
                   be skipped and @c false otherwise (the normal case).
  */
  int ordered_commit(THD *thd, bool all, bool skip_commit = false);
  void handle_binlog_flush_or_sync_error(THD *thd, bool need_lock_log,
                                         const char *message);
  bool do_write_cache(Binlog_cache_storage *cache,
                      class Binlog_event_writer *writer);
  void report_binlog_write_error();

 public:
  int open_binlog(const char *opt_name);
  void close() override;
  enum_result commit(THD *thd, bool all) override;
  int rollback(THD *thd, bool all) override;
  bool truncate_relaylog_file(Master_info *mi, my_off_t valid_pos);
  int prepare(THD *thd, bool all) override;
#if defined(MYSQL_SERVER)

  void update_thd_next_event_pos(THD *thd);
  int flush_and_set_pending_rows_event(THD *thd, Rows_log_event *event,
                                       bool is_transactional);

#endif /* defined(MYSQL_SERVER) */
  void add_bytes_written(ulonglong inc) { bytes_written += inc; }
  void reset_bytes_written() { bytes_written = 0; }
  void harvest_bytes_written(Relay_log_info *rli, bool need_log_space_lock);
  void set_max_size(ulong max_size_arg);
  void signal_update() {
    DBUG_TRACE;
    mysql_cond_broadcast(&update_cond);
    return;
  }

  void update_binlog_end_pos(bool need_lock = true);
  void update_binlog_end_pos(const char *file, my_off_t pos);

  int wait_for_update(const struct timespec *timeout);

 public:
  void init_pthread_objects();
  void cleanup();
  /**
    Create a new binary log.
    @param log_name Name of binlog
    @param new_name Name of binlog, too. todo: what's the difference
    between new_name and log_name?
    @param max_size_arg The size at which this binlog will be rotated.
    @param null_created_arg If false, and a Format_description_log_event
    is written, then the Format_description_log_event will have the
    timestamp 0. Otherwise, it the timestamp will be the time when the
    event was written to the log.
    @param need_lock_index If true, LOCK_index is acquired; otherwise
    LOCK_index must be taken by the caller.
    @param need_sid_lock If true, the read lock on global_sid_lock
    will be acquired.  Otherwise, the caller must hold the read lock
    on global_sid_lock.
    @param extra_description_event The master's FDE to be written by the I/O
    thread while creating a new relay log file. This should be NULL for
    binary log files.
    @param new_index_number The binary log file index number to start from
    after the RESET MASTER TO command is called.
  */
  bool open_binlog(const char *log_name, const char *new_name,
                   ulong max_size_arg, bool null_created_arg,
                   bool need_lock_index, bool need_sid_lock,
                   Format_description_log_event *extra_description_event,
                   uint32 new_index_number = 0);
  bool open_index_file(const char *index_file_name_arg, const char *log_name,
                       bool need_lock_index);
  /* Use this to start writing a new log file */
  int new_file(Format_description_log_event *extra_description_event);

  bool write_event(Log_event *event_info);
  bool write_cache(THD *thd, class binlog_cache_data *cache_data,
                   class Binlog_event_writer *writer);
  /**
    Assign automatic generated GTIDs for all commit group threads in the flush
    stage having gtid_next.type == AUTOMATIC_GTID.

    @param first_seen The first thread seen entering the flush stage.
    @return Returns false if succeeds, otherwise true is returned.
  */
  bool assign_automatic_gtids_to_flush_group(THD *first_seen);
  bool write_transaction(THD *thd, binlog_cache_data *cache_data,
                         Binlog_event_writer *writer);

  /**
     Write a dml into statement cache and then flush it into binlog. It writes
     Gtid_log_event and BEGIN, COMMIT automatically.

     It is aimed to handle cases of "background" logging where a statement is
     logged indirectly, like "DELETE FROM a_memory_table". So don't use it on
     any normal statement.

     @param[in] thd  the THD object of current thread.
     @param[in] stmt the DML statement.
     @param[in] stmt_len the length of the DML statement.
     @param[in] sql_command the type of SQL command.

     @return Returns false if succeeds, otherwise true is returned.
  */
  bool write_dml_directly(THD *thd, const char *stmt, size_t stmt_len,
                          enum enum_sql_command sql_command);

  void report_cache_write_error(THD *thd, bool is_transactional);
  bool check_write_error(const THD *thd);
  bool write_incident(THD *thd, bool need_lock_log, const char *err_msg,
                      bool do_flush_and_sync = true);
  bool write_incident(Incident_log_event *ev, THD *thd, bool need_lock_log,
                      const char *err_msg, bool do_flush_and_sync = true);
  bool write_event_to_binlog(Log_event *ev);
  bool write_event_to_binlog_and_flush(Log_event *ev);
  bool write_event_to_binlog_and_sync(Log_event *ev);
  void start_union_events(THD *thd, query_id_t query_id_param);
  void stop_union_events(THD *thd);
  bool is_query_in_union(THD *thd, query_id_t query_id_param);

  bool write_buffer(const char *buf, uint len, Master_info *mi);
  bool write_event(Log_event *ev, Master_info *mi);

 private:
  bool after_write_to_relay_log(Master_info *mi);

 public:
  void make_log_name(char *buf, const char *log_ident);
  bool is_active(const char *log_file_name);
  int remove_logs_from_index(LOG_INFO *linfo, bool need_update_threads);
  int rotate(bool force_rotate, bool *check_purge);
  void purge();
  int rotate_and_purge(THD *thd, bool force_rotate);

  bool flush();
  /**
     Flush binlog cache and synchronize to disk.

     This function flushes events in binlog cache to binary log file,
     it will do synchronizing according to the setting of system
     variable 'sync_binlog'. If file is synchronized, @c synced will
     be set to 1, otherwise 0.

     @param[in] force if true, ignores the 'sync_binlog' and synchronizes the
     file.

     @retval 0 Success
     @retval other Failure
  */
  bool flush_and_sync(const bool force = false);
  int purge_logs(const char *to_log, bool included, bool need_lock_index,
                 bool need_update_threads, ulonglong *decrease_log_space,
                 bool auto_purge);
  int purge_logs_before_date(time_t purge_time, bool auto_purge);
  int set_crash_safe_index_file_name(const char *base_file_name);
  int open_crash_safe_index_file();
  int close_crash_safe_index_file();
  int add_log_to_index(uchar *log_file_name, size_t name_len,
                       bool need_lock_index);
  int move_crash_safe_index_file_to_index_file(bool need_lock_index);
  int set_purge_index_file_name(const char *base_file_name);
  int open_purge_index_file(bool destroy);
  bool is_inited_purge_index_file();
  int close_purge_index_file();
  int sync_purge_index_file();
  int register_purge_index_entry(const char *entry);
  int register_create_index_entry(const char *entry);
  int purge_index_entry(THD *thd, ulonglong *decrease_log_space,
                        bool need_lock_index);
  bool reset_logs(THD *thd, bool delete_only = false);
  void close(uint exiting, bool need_lock_log, bool need_lock_index);

  // iterating through the log index file
  int find_log_pos(LOG_INFO *linfo, const char *log_name, bool need_lock_index);
  int find_next_log(LOG_INFO *linfo, bool need_lock_index);
  int find_next_relay_log(char log_name[FN_REFLEN + 1]);
  int get_current_log(LOG_INFO *linfo, bool need_lock_log = true);
  int raw_get_current_log(LOG_INFO *linfo);
  uint next_file_id();
  /**
    Retrieves the contents of the index file associated with this log object
    into an `std::list<std::string>` object. The order held by the index file is
    kept.

    @param need_lock_index whether or not the lock over the index file should be
                           acquired inside the function.

    @return a pair: a function status code; a list of `std::string` objects with
            the content of the log index file.
  */
  std::pair<int, std::list<std::string>> get_log_index(
      bool need_lock_index = true);
  inline char *get_index_fname() { return index_file_name; }
  inline char *get_log_fname() { return log_file_name; }
  const char *get_name() const { return name; }
  inline mysql_mutex_t *get_log_lock() { return &LOCK_log; }
  inline mysql_mutex_t *get_commit_lock() { return &LOCK_commit; }
  inline mysql_cond_t *get_log_cond() { return &update_cond; }
  inline Binlog_ofile *get_binlog_file() { return m_binlog_file; }

  inline void lock_index() { mysql_mutex_lock(&LOCK_index); }
  inline void unlock_index() { mysql_mutex_unlock(&LOCK_index); }
  inline IO_CACHE *get_index_file() { return &index_file; }

  /**
    Function to report the missing GTIDs.

    This function logs the missing transactions on master to its error log
    as a warning. If the missing GTIDs are too long to print in a message,
    it suggests the steps to extract the missing transactions.

    This function also informs slave about the GTID set sent by the slave,
    transactions missing on the master and few suggestions to recover from
    the error. This message shall be wrapped by
    ER_MASTER_FATAL_ERROR_READING_BINLOG on slave and will be logged as an
    error.

    This function will be called from mysql_binlog_send() function.

    @param slave_executed_gtid_set     GTID set executed by slave
    @param errmsg                      Pointer to the error message
  */
  void report_missing_purged_gtids(const Gtid_set *slave_executed_gtid_set,
                                   const char **errmsg);

  /**
    Function to report the missing GTIDs.

    This function logs the missing transactions on master to its error log
    as a warning. If the missing GTIDs are too long to print in a message,
    it suggests the steps to extract the missing transactions.

    This function also informs slave about the GTID set sent by the slave,
    transactions missing on the master and few suggestions to recover from
    the error. This message shall be wrapped by
    ER_MASTER_FATAL_ERROR_READING_BINLOG on slave and will be logged as an
    error.

    This function will be called from find_first_log_not_in_gtid_set()
    function.

    @param previous_gtid_set           Previous GTID set found
    @param slave_executed_gtid_set     GTID set executed by slave
    @param errmsg                      Pointer to the error message
  */
  void report_missing_gtids(const Gtid_set *previous_gtid_set,
                            const Gtid_set *slave_executed_gtid_set,
                            const char **errmsg);
  static const int MAX_RETRIES_FOR_DELETE_RENAME_FAILURE = 5;
  /*
    It is called by the threads (e.g. dump thread, applier thread) which want
    to read hot log without LOCK_log protection.
  */
  my_off_t get_binlog_end_pos() const {
    mysql_mutex_assert_not_owner(&LOCK_log);
    return atomic_binlog_end_pos;
  }
  mysql_mutex_t *get_binlog_end_pos_lock() { return &LOCK_binlog_end_pos; }
  void lock_binlog_end_pos() { mysql_mutex_lock(&LOCK_binlog_end_pos); }
  void unlock_binlog_end_pos() { mysql_mutex_unlock(&LOCK_binlog_end_pos); }

  /**
    Deep copy global_sid_map and gtid_executed.
    Both operations are done under LOCK_commit and global_sid_lock
    protection.

    @param[out] sid_map  The Sid_map to which global_sid_map will
                         be copied.
    @param[out] gtid_set The Gtid_set to which gtid_executed will
                         be copied.

    @return the operation status
      @retval 0      OK
      @retval !=0    Error
  */
  int get_gtid_executed(Sid_map *sid_map, Gtid_set *gtid_set);

  /*
    True while rotating binlog, which is caused by logging Incident_log_event.
  */
  bool is_rotating_caused_by_incident;
};

The comments above are extensive as well. Note the member function process_flush_stage_queue of the binlog class: its documentation spells out how log data is flushed from the stage queue to the file on disk. It calls ha_flush_logs (the chain is fetch_and_process_flush_stage_queue -> ha_flush_logs -> flush_handlerton -> hton->flush_logs), which ultimately lands in the log function registered during the engine initialization below. Log records first go into the log buffer before any log file is written; other threads then periodically write the buffer to the file and flush it, with the background threads eventually calling fsync to force it to disk:

/** Initialize the InnoDB storage engine plugin.
@param[in,out]	p	InnoDB handlerton
@return error code
@retval 0 on success */
static int innodb_init(void *p) {
  DBUG_TRACE;

  acquire_plugin_services();

  handlerton *innobase_hton = (handlerton *)p;
  innodb_hton_ptr = innobase_hton;
......
  // the flush-logs hook is assigned its function pointer here
  innobase_hton->flush_logs = innobase_flush_logs;
......

To optimize disk I/O, MySQL provides a group-commit mechanism, which we can see in the code:

bool ha_flush_logs(bool binlog_group_flush) {
  if (plugin_foreach(nullptr, flush_handlerton, MYSQL_STORAGE_ENGINE_PLUGIN,
                     static_cast<void *>(&binlog_group_flush))) {
    return true;
  }
  return false;
}
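flush_handlerton itself is a small callback that plugin_foreach applies to every installed storage engine; its approximate shape is the following (quoted from memory and hedged; see sql/handler.cc for the authoritative body). This is the hop where hton->flush_logs, wired up in innodb_init above, actually gets invoked:

// Approximate shape of the callback passed to plugin_foreach.
static bool flush_handlerton(THD *, plugin_ref plugin, void *arg) {
  handlerton *hton = plugin_data<handlerton *>(plugin);
  // Skip engines that are not enabled or do not implement flush_logs.
  if (hton->state == SHOW_OPTION_YES && hton->flush_logs &&
      hton->flush_logs(hton, *static_cast<bool *>(arg)))
    return true;  // any engine failing aborts the loop with an error
  return false;
}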

From here we can walk the call stack back upwards:

Inside the MYSQL_BIN_LOG class, binlog_xa_commit_or_rollback -> commit -> ordered_commit, which then calls process_flush_stage_queue; the server layer above invokes this commit function to commit the relevant logs, covering rollback and the other operations as well.
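The queueing trick behind this group commit is worth a sketch. Below is a deliberately simplified, hypothetical rendering of the leader/follower pattern that ordered_commit's documentation describes; the real implementation is Commit_stage_manager, which also hands the queue between stages and handles many races this toy ignores:

#include <condition_variable>
#include <mutex>
#include <vector>

// Toy leader/follower stage: the first thread to enqueue becomes the
// leader and performs the shared work, e.g. one binlog flush + fsync
// for the whole group, while followers block until the leader is done.
// This amortizes the fsync cost over every transaction in the group.
class ToyCommitStage {
 public:
  bool enroll(int thd_id) {     // returns true for the leader
    std::unique_lock<std::mutex> lk(mu_);
    const bool leader = queue_.empty();
    if (leader) done_ = false;  // start a new group
    queue_.push_back(thd_id);
    if (!leader)
      cv_.wait(lk, [this] { return done_; });  // follower waits
    return leader;
  }
  void leader_finished() {      // leader: wake all followers
    std::lock_guard<std::mutex> lk(mu_);
    queue_.clear();
    done_ = true;
    cv_.notify_all();
  }
 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<int> queue_;
  bool done_ = false;
};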
And in the log API (handler_api.h / .cc):

/** Commit and flush binlog from cache to binlog file */
void handler_binlog_commit(
    /*==================*/
    void *my_thd,   /*!< in: THD* */
    void *my_table) /*!< in: TABLE structure */
{
  THD *thd = static_cast<THD *>(my_thd);

  if (tc_log) {
    tc_log->commit(thd, true);
  }
  trans_commit_stmt(thd);
}

The end of an operation on a temporary table needs this step too (close_temporary_tables); other paths, such as the cleanup in THD and dropping tables, also call this function.
Having traced up to the interface layer, we can now start analyzing the actual log-writing code:

/** Flush InnoDB redo logs to the file system.
@param[in]	hton			InnoDB handlerton
@param[in]	binlog_group_flush	true if we got invoked by binlog
group commit during flush stage, false in other cases.
@return false */
static bool innobase_flush_logs(handlerton *hton, bool binlog_group_flush) {
  DBUG_TRACE;
  assert(hton == innodb_hton_ptr);

  if (srv_read_only_mode) {
    return false;
  }

  /* If !binlog_group_flush, we got invoked by FLUSH LOGS or similar.
  Else, we got invoked by binlog group commit during flush stage. */

  if (binlog_group_flush && srv_flush_log_at_trx_commit == 0) {
    /* innodb_flush_log_at_trx_commit=0
    (write and sync once per second).
    Do not flush the redo log during binlog group commit. */

    /* This could be unsafe if we grouped at least one DDL transaction,
    and we removed !trx->ddl_must_flush from condition which is checked
    inside trx_commit_complete_for_mysql() when we decide if we could
    skip the flush. */
    return false;
  }

  /* Signal and wait for all GTIDs to persist on disk. */
  if (!binlog_group_flush) {
    auto &gtid_persistor = clone_sys->get_gtid_persistor();
    gtid_persistor.wait_flush(true, true, nullptr);
  }

  /* Flush the redo log buffer to the redo log file.
  Sync it to disc if we are in FLUSH LOGS, or if
  innodb_flush_log_at_trx_commit=1
  (write and sync at each commit). */
  log_buffer_flush_to_disk(!binlog_group_flush ||
                           srv_flush_log_at_trx_commit == 1);

  return false;
}

Opinions vary on MySQL's newer GTID-based replication, and given my limited depth I will not presume to judge it here. In essence it provides a globally unique transaction ID for distributed setups. As the code above shows, during group processing the GTIDs are persisted first. Finally, the log write function is called:


#include <cstring>

#include "mach0data.h"
#include "os0file.h"
#include "srv0mon.h"
#include "srv0srv.h"
#include "ut0crc32.h"

#ifdef UNIV_LOG_LSN_DEBUG
#include "mtr0types.h"
#endif /* UNIV_LOG_LSN_DEBUG */

/** @name Log blocks */

/** @{ */

inline bool log_block_get_flush_bit(const byte *log_block) {
  if (LOG_BLOCK_FLUSH_BIT_MASK &
      mach_read_from_4(log_block + LOG_BLOCK_HDR_NO)) {
    return (true);
  }

  return (false);
}

inline void log_block_set_flush_bit(byte *log_block, bool value) {
  uint32_t field = mach_read_from_4(log_block + LOG_BLOCK_HDR_NO);

  ut_a(field != 0);

  if (value) {
    field = field | LOG_BLOCK_FLUSH_BIT_MASK;
  } else {
    field = field & ~LOG_BLOCK_FLUSH_BIT_MASK;
  }

  mach_write_to_4(log_block + LOG_BLOCK_HDR_NO, field);
}

inline bool log_block_get_encrypt_bit(const byte *log_block) {
  if (LOG_BLOCK_ENCRYPT_BIT_MASK &
      mach_read_from_2(log_block + LOG_BLOCK_HDR_DATA_LEN)) {
    return (true);
  }

  return (false);
}

inline void log_block_set_encrypt_bit(byte *log_block, ibool val) {
  uint32_t field;

  field = mach_read_from_2(log_block + LOG_BLOCK_HDR_DATA_LEN);

  if (val) {
    field = field | LOG_BLOCK_ENCRYPT_BIT_MASK;
  } else {
    field = field & ~LOG_BLOCK_ENCRYPT_BIT_MASK;
  }

  mach_write_to_2(log_block + LOG_BLOCK_HDR_DATA_LEN, field);
}

inline uint32_t log_block_get_hdr_no(const byte *log_block) {
  return (~LOG_BLOCK_FLUSH_BIT_MASK &
          mach_read_from_4(log_block + LOG_BLOCK_HDR_NO));
}

inline void log_block_set_hdr_no(byte *log_block, uint32_t n) {
  ut_a(n > 0);
  ut_a(n < LOG_BLOCK_FLUSH_BIT_MASK);
  ut_a(n <= LOG_BLOCK_MAX_NO);

  mach_write_to_4(log_block + LOG_BLOCK_HDR_NO, n);
}

inline uint32_t log_block_get_data_len(const byte *log_block) {
  return (mach_read_from_2(log_block + LOG_BLOCK_HDR_DATA_LEN));
}

inline void log_block_set_data_len(byte *log_block, ulint len) {
  mach_write_to_2(log_block + LOG_BLOCK_HDR_DATA_LEN, len);
}

inline uint32_t log_block_get_first_rec_group(const byte *log_block) {
  return (mach_read_from_2(log_block + LOG_BLOCK_FIRST_REC_GROUP));
}

inline void log_block_set_first_rec_group(byte *log_block, uint32_t offset) {
  mach_write_to_2(log_block + LOG_BLOCK_FIRST_REC_GROUP, offset);
}

inline uint32_t log_block_get_checkpoint_no(const byte *log_block) {
  return (mach_read_from_4(log_block + LOG_BLOCK_CHECKPOINT_NO));
}

inline void log_block_set_checkpoint_no(byte *log_block, uint64_t no) {
  mach_write_to_4(log_block + LOG_BLOCK_CHECKPOINT_NO, (uint32_t)no);
}

inline uint32_t log_block_convert_lsn_to_no(lsn_t lsn) {
  return ((uint32_t)(lsn / OS_FILE_LOG_BLOCK_SIZE) % LOG_BLOCK_MAX_NO + 1);
}

inline uint32_t log_block_calc_checksum(const byte *log_block) {
  return (log_checksum_algorithm_ptr.load()(log_block));
}

inline uint32_t log_block_calc_checksum_crc32(const byte *log_block) {
  return (ut_crc32(log_block, OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE));
}

inline uint32_t log_block_calc_checksum_none(const byte *log_block) {
  return (LOG_NO_CHECKSUM_MAGIC);
}

inline uint32_t log_block_get_checksum(const byte *log_block) {
  return (mach_read_from_4(log_block + OS_FILE_LOG_BLOCK_SIZE -
                           LOG_BLOCK_CHECKSUM));
}

inline void log_block_set_checksum(byte *log_block, uint32_t checksum) {
  mach_write_to_4(log_block + OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_CHECKSUM,
                  checksum);
}

inline void log_block_store_checksum(byte *log_block) {
  log_block_set_checksum(log_block, log_block_calc_checksum(log_block));
}

/** @} */

#ifndef UNIV_HOTBACKUP
/** @return consistent sn value for locked state */
static inline sn_t log_get_sn(const log_t &log) {
  const sn_t sn = log.sn.load();
  if ((sn & SN_LOCKED) != 0) {
    return log.sn_locked.load();
  } else {
    return sn;
  }
}

inline bool log_needs_free_check(const log_t &log) {
  const sn_t sn = log_get_sn(log);
  return (sn > log.free_check_limit_sn.load());
}

inline bool log_needs_free_check() { return (log_needs_free_check(*log_sys)); }

#ifdef UNIV_DEBUG
/** Performs debug checks to validate some of the assumptions. */
void log_free_check_validate();
#endif /* UNIV_DEBUG */

/** Call this function before starting a mini-transaction.  It will check
for space in the redo log. It assures there is at least
concurrency_safe_free_margin.  If the space is not available, this will
wait until it is. Therefore it is important that the caller does not hold
any latch that may be called by the page cleaner or log flush process.
This includes any page block or file space latch. */
inline void log_free_check() {
  log_t &log = *log_sys;

  ut_d(log_free_check_validate());

  /** We prefer to wait now for the space in log file, because now
  are not holding any latches of dirty pages. */

  if (log_needs_free_check(log)) {
    /* We need to wait, because the concurrency margin could be violated
    if we let all threads to go forward after making this check now.

    The waiting procedure is rather unlikely to happen for proper my.cnf.
    Therefore we extracted the code to seperate function, to make the
    inlined log_free_check() small. */

    log_free_check_wait(log);
  }
}

constexpr inline lsn_t log_translate_sn_to_lsn(lsn_t sn) {
  return (sn / LOG_BLOCK_DATA_SIZE * OS_FILE_LOG_BLOCK_SIZE +
          sn % LOG_BLOCK_DATA_SIZE + LOG_BLOCK_HDR_SIZE);
}

inline lsn_t log_translate_lsn_to_sn(lsn_t lsn) {
  /* Calculate sn of the beginning of log block, which contains
  the provided lsn value. */
  const sn_t sn = lsn / OS_FILE_LOG_BLOCK_SIZE * LOG_BLOCK_DATA_SIZE;

  /* Calculate offset for the provided lsn within the log block.
  The offset includes LOG_BLOCK_HDR_SIZE bytes of block's header. */
  const uint32_t diff = lsn % OS_FILE_LOG_BLOCK_SIZE;

  if (diff < LOG_BLOCK_HDR_SIZE) {
    /* The lsn points to some bytes inside the block's header.
    Return sn for the beginning of the block. Note, that sn
    values don't enumerate bytes of blocks' headers, so the
    value of diff does not matter at all. */
    return (sn);
  }

  if (diff > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE) {
    /* The lsn points to some bytes inside the block's footer.
    Return sn for the beginning of the next block. Note, that
    sn values don't enumerate bytes of blocks' footer, so the
    value of diff does not matter at all. */
    return (sn + LOG_BLOCK_DATA_SIZE);
  }

  /* Add the offset but skip bytes of block's header. */
  return (sn + diff - LOG_BLOCK_HDR_SIZE);
}

#endif /* !UNIV_HOTBACKUP */

inline bool log_lsn_validate(lsn_t lsn) {
  const uint32_t offset = lsn % OS_FILE_LOG_BLOCK_SIZE;

  return (lsn >= LOG_START_LSN && offset >= LOG_BLOCK_HDR_SIZE &&
          offset < OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE);
}

#ifndef UNIV_HOTBACKUP

/** @return total capacity of log files in bytes. */
inline uint64_t log_get_file_capacity(const log_t &log) {
  return (log.files_real_capacity);
}

inline lsn_t log_get_lsn(const log_t &log) {
  return log_translate_sn_to_lsn(log_get_sn(log));
}

inline lsn_t log_get_checkpoint_lsn(const log_t &log) {
  return (log.last_checkpoint_lsn.load());
}

inline lsn_t log_get_checkpoint_age(const log_t &log) {
  const lsn_t last_checkpoint_lsn = log.last_checkpoint_lsn.load();

  const lsn_t current_lsn = log_get_lsn(log);

  if (current_lsn <= last_checkpoint_lsn) {
    /* Writes or reads have been somehow reordered.
    Note that this function does not provide any lock,
    and does not assume any lock existing. Therefore
    the calculated result is already outdated when the
    function is finished. Hence, we might assume that
    this time we calculated age = 0, because checkpoint
    lsn is close to current lsn if such race happened. */
    return (0);
  }

  return (current_lsn - last_checkpoint_lsn);
}

inline void log_buffer_flush_to_disk(bool sync) {
  log_buffer_flush_to_disk(*log_sys, sync);
}

#if defined(UNIV_HOTBACKUP) && defined(UNIV_DEBUG)
/** Print a log file header.
@param[in]	block	pointer to the log buffer */
UNIV_INLINE
void meb_log_print_file_hdr(byte *block) {
  ib::info(ER_IB_MSG_626) << "Log file header:"
                          << " format "
                          << mach_read_from_4(block + LOG_HEADER_FORMAT)
                          << " pad1 "
                          << mach_read_from_4(block + LOG_HEADER_PAD1)
                          << " start_lsn "
                          << mach_read_from_8(block + LOG_HEADER_START_LSN)
                          << " creator '" << block + LOG_HEADER_CREATOR << "'"
                          << " checksum " << log_block_get_checksum(block);
}
#endif /* UNIV_HOTBACKUP && UNIV_DEBUG */

inline lsn_t log_buffer_ready_for_write_lsn(const log_t &log) {
  return (log.recent_written.tail());
}

inline lsn_t log_buffer_dirty_pages_added_up_to_lsn(const log_t &log) {
  return (log.recent_closed.tail());
}

inline lsn_t log_buffer_flush_order_lag(const log_t &log) {
  return (log.recent_closed.capacity());
}

inline bool log_write_to_file_requests_are_frequent(uint64_t interval) {
  return (interval < 1000); /* 1ms */
}

inline bool log_write_to_file_requests_are_frequent(const log_t &log) {
  return (log_write_to_file_requests_are_frequent(
      log.write_to_file_requests_interval.load(std::memory_order_relaxed)));
}

inline bool log_writer_is_active() {
  return (srv_thread_is_active(srv_threads.m_log_writer));
}

inline bool log_write_notifier_is_active() {
  return (srv_thread_is_active(srv_threads.m_log_write_notifier));
}

inline bool log_flusher_is_active() {
  return (srv_thread_is_active(srv_threads.m_log_flusher));
}

inline bool log_flush_notifier_is_active() {
  return (srv_thread_is_active(srv_threads.m_log_flush_notifier));
}

inline bool log_checkpointer_is_active() {
  return (srv_thread_is_active(srv_threads.m_log_checkpointer));
}

#endif /* !UNIV_HOTBACKUP */

In the log0log.ic file above we can see the function log_buffer_flush_to_disk; it is what actually drops into the related files under innobase/log, and the log code in that directory is essentially a wrapper around the corresponding OS file operations.

void log_buffer_flush_to_disk(log_t &log, bool sync) {
  ut_a(!srv_read_only_mode);
  ut_a(!recv_recovery_is_on());

  const lsn_t lsn = log_get_lsn(log);

  log_write_up_to(log, lsn, sync);
}

Wait_stats log_write_up_to(log_t &log, lsn_t end_lsn, bool flush_to_disk) {
  ut_a(!srv_read_only_mode);

  /* If we were updating log.flushed_to_disk_lsn while parsing redo log
  during recovery, we would have valid value here and we would not need
  to explicitly exit because of the recovery. However we do not update
  the log.flushed_to_disk during recovery (it is zero).

  On the other hand, when we apply log records during recovery, we modify
  pages and update their oldest/newest_modification. The modified pages
  become dirty. When size of the buffer pool is too small, some pages
  have to be flushed from LRU, to reclaim a free page for a next read.

  When flushing such dirty pages, we notice that newest_modification != 0,
  so the redo log has to be flushed up to the newest_modification, before
  flushing the page. In such case we end up here during recovery.

  Note that redo log is actually flushed, because changes to the page
  are caused by applying the redo. */

  if (recv_no_ibuf_operations) {
    /* Recovery is running and no operations on the log files are
    allowed yet, which is implicitly deduced from the fact, that
    still ibuf merges are disallowed. */
    return (Wait_stats{0});
  }

  /* We do not need to have exact numbers and we do not care if we
  lost some increments for heavy workload. The value only has usage
  when it is low workload and we need to discover that we request
  redo write or flush only from time to time. In such case we prefer
  to avoid spinning in log threads to save on CPU power usage. */
  log.write_to_file_requests_total.store(
      log.write_to_file_requests_total.load(std::memory_order_relaxed) + 1,
      std::memory_order_relaxed);

  ut_a(end_lsn != LSN_MAX);

  ut_a(end_lsn % OS_FILE_LOG_BLOCK_SIZE == 0 ||
       end_lsn % OS_FILE_LOG_BLOCK_SIZE >= LOG_BLOCK_HDR_SIZE);

  ut_a(end_lsn % OS_FILE_LOG_BLOCK_SIZE <=
       OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE);

  ut_ad(end_lsn <= log_get_lsn(log));

  Wait_stats wait_stats{0};
  bool interrupted = false;

retry:
  if (log.writer_threads_paused.load(std::memory_order_acquire)) {
    /* the log writer threads are paused not to waste CPU resource. */
    wait_stats +=
        log_self_write_up_to(log, end_lsn, flush_to_disk, &interrupted);

    if (UNIV_UNLIKELY(interrupted)) {
      /* the log writer threads might be working. retry. */
      goto retry;
    }

    DEBUG_SYNC_C("log_flushed_by_self");
    return (wait_stats);
  }

  /* the log writer threads are working for high concurrency scale */
  if (flush_to_disk) {
    if (log.flushed_to_disk_lsn.load() >= end_lsn) {
      DEBUG_SYNC_C("log_flushed_by_writer");
      return (wait_stats);
    }

    if (srv_flush_log_at_trx_commit != 1) {
      /* We need redo flushed, but because trx != 1, we have
      disabled notifications sent from log_writer to log_flusher.

      The log_flusher might be sleeping for 1 second, and we need
      quick response here. Log_writer avoids waking up log_flusher,
      so we must do it ourselves here.

      However, before we wake up log_flusher, we must ensure that
      log.write_lsn >= lsn. Otherwise log_flusher could flush some
      data which was ready for lsn values smaller than end_lsn and
      return to sleeping for next 1 second. */

      if (log.write_lsn.load() < end_lsn) {
        wait_stats += log_wait_for_write(log, end_lsn, &interrupted);
      }
    }

    /* Wait until log gets flushed up to end_lsn. */
    wait_stats += log_wait_for_flush(log, end_lsn, &interrupted);

    if (UNIV_UNLIKELY(interrupted)) {
      /* the log writer threads might be paused. retry. */
      goto retry;
    }

    DEBUG_SYNC_C("log_flushed_by_writer");
  } else {
    if (log.write_lsn.load() >= end_lsn) {
      return (wait_stats);
    }

    /* Wait until log gets written up to end_lsn. */
    wait_stats += log_wait_for_write(log, end_lsn, &interrupted);

    if (UNIV_UNLIKELY(interrupted)) {
      /* the log writer threads might be paused. retry. */
      goto retry;
    }
  }

  return (wait_stats);
}
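To see how a caller drives this function, here is a hedged condensation of InnoDB's commit-time decision (the real logic lives in trx_flush_log_if_needed; the helper below is invented for illustration):

// Hypothetical condensation of the commit-time flush decision.
void toy_flush_at_commit(log_t &log, lsn_t commit_lsn,
                         ulong flush_log_at_trx_commit) {
  switch (flush_log_at_trx_commit) {
    case 0:
      // Neither write nor sync here; the log_flusher thread does both
      // roughly once per second. Fastest, weakest durability.
      break;
    case 1:
      // Write the log buffer up to commit_lsn and fsync it before the
      // commit returns: full durability, one sync per commit group.
      log_write_up_to(log, commit_lsn, /*flush_to_disk=*/true);
      break;
    case 2:
      // Write to the OS cache but skip the fsync: survives a mysqld
      // crash, may lose about a second of commits on an OS failure.
      log_write_up_to(log, commit_lsn, /*flush_to_disk=*/false);
      break;
  }
}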

The slave side has a similar mechanism for threads to synchronize the log; it is not covered here. The relevant functions and operations in rpl_slave_commit_order_manager.h make it clear enough.
One more point worth noting:
The binlog has three formats: Statement (the only format before 5.1.5), Row (available since 5.1.5), and Mixed (since 5.1.8), each with its own trade-offs. Row format records the modified row data, which is unambiguous but can be very large. Statement format records the SQL text, which is compact, but the statement may depend on execution context that is not fully captured (non-deterministic functions and the like), and compatibility with newer features has to be considered. Mixed format is the compromise MySQL reached after seeing both sets of problems: for the critical cases it switches to whichever of the two formats is safer.

The analysis above gives a first picture of how a log reaches disk. Some details, such as the 2PC mentioned earlier, XA, and GTID, were simply skipped over; later code walkthroughs will unfold them bit by bit. Some of the analysis may be off, so please treat the official documentation as the baseline, and if you find anything amiss, corrections are most welcome!

4. Summary

In later study and analysis you will find that, whether in large databases, distributed systems, or operating systems, the log plays an ever more important role; without logs, the safety and verifiability of a whole system can hardly be discussed. Studying the logging systems of a few large pieces of software or frameworks carefully will give you a whole new dimension from which to look at problems.
"A ridge when seen head-on, a peak when seen from the side; near or far, high or low, no two views the same." The ancients did not deceive us!
Keep at it, returning youth!
