用 C++ 制作一个随机 DAG 生成器

Je1lyfish

已于 2024-04-08 15:41:27 修改

阅读量484

点赞数 1

文章标签： c++ 开发语言图论网络算法

于 2023-08-13 20:49:35 首次发布

本文链接：https://blog.csdn.net/Kovnt/article/details/132233622

版权

在写最大流问题的时候，因为需要生成大批量的随机网络流，所以花了点时间来制作一个随机 DAG 生成器。

我们都知道一个流网络都具有以下两个性质：

存在源点和汇点；
是一个有向图。

并且由于解题需求，生成的流网络中不应该存在反向边，这一点要求显然让我所需要的流网络在结构上更像一个有向无环图（DAG）；实际上确实是这样。

那么我只需要编写一个 DAG 生成器，并且手动为这个图加上源点和汇点，最后为每条边分配流量，就可以达到我的要求。

很容易想到的就是，我们可以首先生成数个随机的结点，然后在这些结点之间逐对地添加一条有向边，每次建边后判断此时的图是否有环即可。

但这个方法有个很大的劣处：生成一个完整 DAG 的速度完全取决于人品，~~这个方法的时间复杂度大概就是 O(rp)~~ ；而且目标图的规模越大速度越慢。

所以这里需要一个新的方法生成 DAG。我们都知道图论中有个特殊的排序：拓扑排序，一个能被拓扑排序的图本身一定是一个 DAG。

那我们只需要把结点随机打乱，并定义这个被打乱的结点排序为拓扑排序，就可以利用这个拓扑序在不同结点之间单向地建立联系，进而批量生成多个随机 DAG。

在这里，单向地建立连接被实现为：通过遍历先前定义的、长度为 length 的拓扑序，在当前遍历第 i 个结点时，无重复地从剩下的 length - i - 1 个结点中随机取出 n 个结点，逐一建立 i -> n 的联系。这种方法显然可以避免回环出现。

由于我需要的流网络中，源点和汇点分别位于 DAG 的拓扑序结果的首尾两端，所以我还需要将打乱的队列的首尾替换为已知的源点和汇点。

目的明确后，编写代码就很简单了。这里我把生成器封装为了一个 RandDAG 类，通过调用其中的类方法就能批量生成多个具有唯一源点和汇点的流网络，也可以视作是特殊的 DAG。由于这里生成的数据只和图中的边密切相关，所以结果会被保存为邻接表。

由于这个 DAG 中存在源点和汇点，所以也可以当做 AOE 网使用。

代码的语法标准为 C++11，并且使用了比较多的模板编程技巧，看上去会比较繁琐；请参照注释服用。

/* dag_generator.hpp */
#ifndef __DAG_GENERATOR_H__
  #define __DAG_GENERATOR_H__

#include <string>
#include <vector>
#include <random>
#include <stdexcept>
#include <iostream>

#include <cstdint>
#include <algorithm>
#include <numeric>
#include <type_traits>

/* @brief 随机引擎
 * @tparam RngT `RngT` 需要满足以下两点：
 * 1. 有一个接受 `std::random_device {}()` 的值的构造函数
 * 2. 能被 `std::uniform_int_distribution`
 * 和 `std::uniform_int_distribution` 的
 * `operator()` 函数所接受
 */
template<typename RngT = std::mt19937>
class Rng {
  static constexpr uint64_t reset_limit_ = 20000;
  RngT gen_;
  uint64_t reset_threshold_;
  uint64_t invoke_cnt_;

  void reset_rng_seed() {
    gen_ = RngT( std::random_device {}() );
    invoke_cnt_ = {};
  }

  void invoke_counter( uint64_t times = 1 ) {
    if ( (invoke_cnt_ += times) > reset_threshold_ )
      reset_rng_seed();
  }

protected:
  /// @brief 提供内部随机数引擎的开放接口
  /// @param invoke_times 引擎要被调用的次数
  /// @return 返回内部的随机数引擎引用
  RngT& access_engine( uint64_t invoke_times ) {
    invoke_counter( invoke_times );
    return gen_;
  }

public:
  Rng( uint64_t reset_threshold )
    : reset_threshold_ { reset_threshold }, invoke_cnt_ {} {
    gen_ = RngT( std::random_device {}() );
  }
  Rng() : Rng( reset_limit_ ) {}

  /// @brief 在范围 `[lower, upper]` 内生成一个随机的整型数
  /// @tparam numericT1 要生成的整型随机数的类型
  /// @param upper 范围上限
  /// @param lower 范围下界
  /// @return 若 `lower` >= `upper`，返回 `lower`
  template<typename numericT1, typename numericT2 = typename std::decay<numericT1>::type>
  typename std::enable_if<
    std::is_integral<typename std::decay<numericT1>::type>::value
    , numericT1
  >::type get_number( const numericT1 upper, const numericT2 lower = {} ) {
    if ( upper <= static_cast<numericT1>(lower) )
      return lower;
    invoke_counter();
    return std::uniform_int_distribution<numericT1>( static_cast<numericT1>(lower), upper )(gen_);
  }

  /// @brief 在范围 `[lower, upper]` 内生成一个随机的浮点型数
  /// @tparam numericT1 要生成的浮点型随机数的类型
  /// @param upper 范围上限
  /// @param lower 范围下界
  /// @return 若 `lower` >= `upper`，返回 `lower`
  template<typename numericT1, typename numericT2 = typename std::decay<numericT1>::type>
  typename std::enable_if<
    std::is_floating_point<typename std::decay<numericT1>::type>::value
    , numericT1
  >::type get_number( const numericT1 upper, const numericT2 lower = {} ) {
    if ( upper <= static_cast<numericT1>(lower) )
      return lower;
    invoke_counter();
    return std::uniform_real_distribution<numericT1>( static_cast<numericT1>(lower), upper )(gen_);
  }

  /// @brief 生成一个被随机打乱、范围在 `[first, first + size]` 的整型数组
  /// @tparam numericT 数组的元素类型，必须是整型
  /// @param size 数组的大小
  /// @param first 数组的开始数
  /// @return `size` 为 0 时返回空数组
  template<typename numericT>
  std::vector<numericT> get_array( const size_t size, const numericT first = {} ) {
    static_assert(std::is_integral<typename std::decay<numericT>::type>::value,
      "Rng::get_array: `numericT` must be an integral type");
    if ( size == 0 ) return {};
    auto ret = std::vector<typename std::decay<numericT>::type>( size );
    std::iota( ret.begin(), ret.end(), first );
    std::shuffle( ret.begin(), ret.end(), access_engine( size ) );
    return ret;
  }

  /// @brief 从迭代器划定的范围中无重复地取出 `num` 个元素
  /// @param start 迭代器起始点
  /// @param terminus 迭代器终止点
  /// @param num 需要取出的元素数量
  /// @return 无重复取出的 `num` 个元素
  /// @throw `std::out_of_range` 当 `num` 大于迭代器划定的范围时抛出
  template<typename IterT1, typename IterT2
    , typename EleT = typename std::iterator_traits<IterT2>::value_type>
  std::vector<EleT> get_element( IterT1&& start, IterT2&& terminus, size_t num ) {
    size_t range_length = std::distance( start, terminus );
    if ( num > range_length )
      throw std::out_of_range( "Rng::get_element: `number` must be less than the range of iterators" );
    else if ( num == 0 ) return {};
    else if ( num == range_length )
      return { start, terminus };
    auto indexes = get_array<size_t>( range_length - 1 );
    indexes.resize( num );

    std::vector<EleT> ret;
    for ( auto offset : indexes )
      ret.emplace_back( *std::next( start, offset ) );
    return ret;
  }
};

/// @brief 随机的 DAG
/// @tparam numericT 指定边权的类型，要求是数值类型
/// @tparam RngT 内部获取随机数的随机生成器，要求必须实现 Rng 中的对应接口
template<typename numericT, typename RngT = Rng<>>
class RandDAG : private RngT {
  using DataT = std::vector<std::vector<std::pair<size_t, numericT>>>;

  DataT adj_list_;
  size_t graph_size_;
  size_t arc_num_;
  size_t source_, target_;
  numericT weight_ceiling_;

public:
  using DAGType = DataT;

  /// @throw `std::invalid_argument` `size` 小于 2 时抛出非法参数异常
  RandDAG( size_t size ) : graph_size_ {}, arc_num_ {}, source_ {}, target_ {}, weight_ceiling_ {} {
    if ( size < 2 ) throw std::invalid_argument( "RandDAG::RandDAG: `size` must be greater than 2" );
    graph_size_ = size;
  }

  size_t get_graph_size() const noexcept { return graph_size_; }
  size_t get_arc_num() const noexcept { return arc_num_; }
  size_t get_source() const noexcept { return source_; }
  size_t get_target() const noexcept { return target_; }
  numericT get_max_weight() const noexcept { return weight_ceiling_; }

  /// @brief 以只读语义查看邻接表
  /// @return 不可更改的 DAG 的邻接表
  const DAGType& get_adj_list() const noexcept { return adj_list_; }

  /// @brief 生成一个随机 DAG
  /// @param source 目标 DAG 的源点
  /// @param target 目标 DAG 的汇点
  /// @throw `std::out_of_range` 如果 `source` 或 `target` 的值大于等于矩阵尺寸，或者它们相等，则抛出越界异常
  RandDAG& produce( const size_t source, const size_t target ) {
    if ( source >= graph_size_ || target >= graph_size_
        || source == target ) {
      throw std::out_of_range(
        std::string( "RandDAG::produce: `target` and `source` must be less than `graph_size_`\n"
          "\tbut now `graph_size_` = " ) + std::to_string( graph_size_ )
        + std::string( "\n\t        `target` = " ) + std::to_string( target )
        + std::string( "\n\t        `source` = " ) + std::to_string( source )
      );
    }

    size_t new_arc_num {};
    auto new_adj_list = DataT( graph_size_ ); // 生成拓扑序
    auto topolog_order = this->template get_array<size_t>( graph_size_ );

    // 重整源汇点
    std::iter_swap( std::find( topolog_order.begin(), topolog_order.end(), source ), topolog_order.begin() );
    std::iter_swap( std::find( std::next( topolog_order.begin() ), topolog_order.end(), target ), std::prev( topolog_order.end() ) );

    auto iter = std::next( topolog_order.begin() );
    for ( size_t i = 0; i < topolog_order.size() - 1; ++i ) { // 不处理汇点
      const size_t nodes_num = this->get_number( topolog_order.size() - i - 1 );
      // 从拓扑序中随机无重复选出 nodes_num 个可达结点
      const auto feasible_nodes = this->get_element( iter++, topolog_order.end(), nodes_num );
      new_arc_num += feasible_nodes.size();

      bool next_flag = false;
      for ( const auto node : feasible_nodes ) {
        next_flag = node == topolog_order[i + 1] ? true : false;
        new_adj_list[topolog_order[i]].emplace_back( node, static_cast<numericT>(0) );
      } // 强制让该结点指向下一个结点
      if ( !next_flag ) { // 以避免中间存在某些结点是没有入度的情况，即存在多个源点
        new_adj_list[topolog_order[i]].emplace_back( topolog_order[i + 1], static_cast<numericT>(0) );
        ++new_arc_num;
      }
    }

    arc_num_ = new_arc_num; // 强异常安全保证
    source_ = source; target_ = target;
    adj_list_ = std::move( new_adj_list );

    return *this;
  }

  /// @brief 填充流量，使得生成的 DAG 满足网络流性质
  /// @param max_weight 当前网络流的最大流量
  /// @throw `std::runtime_error` 当源点和汇点相同时，抛出运行期错误
  RandDAG& fill( numericT max_weight ) {
    if ( source_ == target_ )
      throw std::runtime_error( "RandDAG::fill: `source_` and `target_` cannot be same" );
    else if ( max_weight <= 0 )
      throw std::runtime_error( "RandDAG::fill: `max_weight` must be greater than 0" );

    /* 生成一个相对平均的分配方案数组，数组元素为以当前结点为弧尾的边的权重 */
    auto allocation = std::vector<numericT>( adj_list_.size() );
    for ( auto& arcs : adj_list_ ) {
      if ( arcs.empty() ) continue; // 汇点为空
      constexpr numericT coefficient = 4; // 用于调整元素之间的差值大小
      if ( arcs.size() >= static_cast<size_t>(max_weight) )
        std::clog << "Warning:\n\tIn RandDAG::fill, the `max_weight` is less than the number of arcs.\n";

      // 确保分配方案中每个元素分配到的权重都不为 0
      const numericT sum = this->get_number( max_weight, arcs.size() + 1 );
      const numericT ele_per_num = sum / arcs.size();
      const numericT remainder = sum - (arcs.size() * ele_per_num); // 计算余量
      std::fill( allocation.begin(), allocation.end(), ele_per_num );

      std::transform( allocation.begin(), std::next( allocation.begin() + static_cast<size_t>(remainder) ),
        allocation.begin(), []( numericT val ) { return val + 1; } );
      if ( ele_per_num > 1 ) {
        for ( size_t i = 0, tail = allocation.size() - 1; i < allocation.size() / 2; ++i ) {
          const numericT delta = this->get_number( ele_per_num / coefficient );
          allocation[i] += delta; allocation[tail--] -= delta;
        }
      }
      std::shuffle( allocation.begin(), allocation.end(), this->access_engine( allocation.size() ) );
      for ( size_t i = 0; i < arcs.size(); ++i )
        arcs[i].second = allocation[i];
    }
    weight_ceiling_ = max_weight; // 弱异常安全

    return *this;
  }
};

template<typename T>
std::ostream& operator<<( std::ostream& os, const RandDAG<T>& dag )
{
  os << std::to_string( dag.get_graph_size() ) + ' '
    + std::to_string( dag.get_arc_num() ) + ' '
    + std::to_string( dag.get_source() ) + ' '
    + std::to_string( dag.get_target() ) + '\n';

  size_t node_number = 0;
  for ( const auto& arcs : dag.get_adj_list() ) {
    std::string stream_buffer;
    for ( const auto& arc : arcs ) {
      stream_buffer += std::to_string( node_number ) + " "
        + std::to_string( arc.first ) + " "
        + std::to_string( arc.second ) + "\n";
    }
    os << stream_buffer;
    ++node_number;
  }
  os << std::endl;

  return os;
}

#endif /* __DAG_GENERATOR_H__ */

类方法 RandDAG::fill 是我用于填充边权并使之满足流网络定义的方法，可以省略。同时由于 RandDAG 是一个模板类，所以可以生成一些边权为 double 的网络流。

代码块中的 rng 是我对随机数引擎 std::mt19937 的封装，这里也可以改为封装 srand() 和 rand()，只要能产出随机数就行。

此外经过测试，程序的主要性能瓶颈在 I/O 上，有需要的话可以自己封装一个多线程生成器，把邻接表输出交给另一个线程执行。并且随着矩阵规模的扩大，RandDAG 内部的邻接表中的元素数量会增长得非常快，在需要生成超大规模随机 DAG 的情况下可以考虑将内部实现更改为邻接矩阵。

经测试，1000 次最大大小为 1000 的 DAG 生成的峰值内存占用在 5.5MB 左右；100 次最大大小为 10000 的 DAG 峰值内存占用在 540 MB 左右。

以下是一个创建并调用 RandDAG 类对象的示例代码，这段代码会把邻接表输出在程序同目录下的一个文本文件中：

/* data_gen.cpp */
#include <iostream>
#include <fstream>
#include "dag_generator.hpp"
using namespace std;

int main()
{
  Rng<mt19937_64> random; // 随机数引擎可以被随便替换
  string filepath { "./Network_flow.in" };
  ofstream ofs { filepath };
  uint64_t scale = 100, max_matrix_size = 100;
  int max_capacity = 0x1000;
  /* 这里已经手动指定了数据集规模，所以就把下面的语句注释掉 */
  //cout << "Input three numbers of data scale, max matrix size and max capcity:\n>>> ";
  //cin >> scale >> max_matrix_size >> max_capacity;

  /* 生成的数据满足以下格式：
   * 1. 先给出 DAG 的个数 scale
   * 2. 然后在每个 DAG 数据开头给出四个数字，分别指出：DAG 大小、边的数量、源点和汇点序号
   * 3. 接下来的 k 行分别是弧尾序号、弧头序号、这条弧的权重；k 是边的个数
   *
   * 详细信息参阅函数 `std::ostream& operator<<(std::ostream& os, const RandDAG<T>& dag)`
   */

  ofs << scale << endl; // 一共 scale 个网络

  while ( scale-- ) {
    auto size = random.get_number( max_matrix_size, 5 );
    auto list = random.get_array<size_t>( size );
    // 大小为 size、边权类型为 int 的随机 DAG
    ofs << RandDAG<int>( size ).produce( list[0], list[1] ).fill( max_capacity );
  }
  cout << "All DAGs are generated in '" << filepath << "'.\n";
}