Integrated gem5 + GPGPU-Sim Simulator

Original source: http://cpu-gpu-sim.ece.wisc.edu/

Last modified on: 04/16/2015 02:52:36 CST



 Overview:

The integrated gem5 + GPGPU-Sim simulator is a CPU-GPU simulator for heterogeneous computing.

The integrated simulator infrastructure is built on gem5 and GPGPU-Sim. gem5 and GPGPU-Sim run as two separate processes and communicate through shared memory in the Linux OS.

gem5 models the CPU cores and the memory subsystem, in which Ruby supports a MOESI directory coherence protocol. GPGPU-Sim models the streaming multiprocessors (SMs) and the on-chip interconnect within the GPU. The memory subsystem and DRAM model on the GPGPU-Sim side are completely removed, leaving only a set of request and response queues per memory controller (MC); GPGPU-Sim services its memory accesses by communicating with the memory subsystem of gem5 through shared memory structures.

Lock-Step Execution

To keep both simulators running in lock-step, gem5 provides periodic SM-blocking ticks and memory ticks to GPGPU-Sim, at rates configured through the GPU core and memory clock multipliers. gem5 issues one blocking tick for all SMs and one memory tick per MC. It triggers the SMs or an MC in GPGPU-Sim by setting a flag in the shared memory structure, then blocks until GPGPU-Sim completes the execution of a GPU cycle and resets the flag, which lets gem5 resume.
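
The handshake can be pictured with a toy model. The sketch below is not the simulator's code: it assumes two Python threads standing in for the gem5 and GPGPU-Sim processes, and a condition-guarded boolean standing in for the flag in the shared memory structure. The names TickFlag and run_lockstep are hypothetical.

```python
import threading


class TickFlag:
    """Toy stand-in for the tick flag kept in the shared memory structure."""

    def __init__(self):
        self._pending = False
        self._cond = threading.Condition()

    def raise_and_wait(self):
        # gem5 side: set the flag, then block until the peer resets it.
        with self._cond:
            self._pending = True
            self._cond.notify_all()
            self._cond.wait_for(lambda: not self._pending)

    def wait_for_tick(self):
        # GPGPU-Sim side: block until gem5 has raised a tick.
        with self._cond:
            self._cond.wait_for(lambda: self._pending)

    def reset(self):
        # GPGPU-Sim side: the GPU cycle is done, so let gem5 resume.
        with self._cond:
            self._pending = False
            self._cond.notify_all()


def run_lockstep(num_ticks):
    """Run num_ticks handshakes and return the order in which work happened."""
    flag = TickFlag()
    trace = []

    def gpgpu_sim():
        for cycle in range(num_ticks):
            flag.wait_for_tick()
            trace.append(("gpu", cycle))  # execute one GPU cycle
            flag.reset()

    worker = threading.Thread(target=gpgpu_sim)
    worker.start()
    for tick in range(num_ticks):
        trace.append(("cpu", tick))       # gem5 reaches a blocking tick
        flag.raise_and_wait()
    worker.join()
    return trace
```

Because each side only proceeds after the other has touched the flag, the recorded trace strictly alternates between the two sides, which is the lock-step property described above.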

Shared Memory System

On the GPGPU-Sim side, each memory tick received for a particular MC causes GPGPU-Sim to push a pending request from that MC's internal queue into the request queue in the shared memory structure, in FIFO order. Similarly, it pops any pending read responses, in FIFO order, from the response queue in the shared memory structure and pushes them into its internal response queue, from which they are returned to the appropriate SM.
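
A minimal sketch of this per-tick exchange follows. It assumes plain Python deques in place of the real shared memory queues, one request pushed per memory tick, and all waiting responses drained per tick; SharedChannel and gpu_memory_tick are hypothetical names, not identifiers from the simulator.

```python
from collections import deque


class SharedChannel:
    """Hypothetical per-MC request and response queues in the shared structure."""

    def __init__(self):
        self.requests = deque()   # GPGPU-Sim -> gem5
        self.responses = deque()  # gem5 -> GPGPU-Sim


def gpu_memory_tick(internal_requests, internal_responses, shared):
    """Model what the GPGPU-Sim side does on one memory tick for one MC."""
    # Push the oldest pending request, if any, into the shared request queue.
    if internal_requests:
        shared.requests.append(internal_requests.popleft())
    # Move waiting read responses, oldest first, into the internal response queue.
    while shared.responses:
        internal_responses.append(shared.responses.popleft())
```

Calling gpu_memory_tick once per memory tick moves requests out and responses in while preserving arrival order on both paths.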

On the gem5 side, once GPGPU-Sim resets a pending memory tick, gem5 resumes and executes its portion of the memory tick. At the front end, an arbiter selects either a CPU or a GPU request to push into the front-end queue for scheduling. If the GPU wins the arbitration, gem5 pops a GPU memory request from the shared memory. Currently, an FR-FCFS policy is applied to the front-end queue to schedule a request and push it into the back-end command queue. The back end scans the command queue and queries the DRAM banks to issue commands. When a read/write command is issued, the request is pushed into a response queue with its ready time set according to the CAS latency. Any response intended for the GPU is popped from gem5's response queue when it is ready and pushed into the response queue in the shared memory structure.
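
For readers unfamiliar with FR-FCFS (first-ready, first-come-first-served), the sketch below shows the textbook form of the policy: a request that hits the row currently open in its bank is preferred, and ties are broken by age. It illustrates the general policy only and is not taken from the simulator; the Request fields and the open_rows mapping are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # order of arrival in the front-end queue
    bank: int     # target DRAM bank
    row: int      # target DRAM row
    source: str   # "cpu" or "gpu"


def pick_fr_fcfs(queue, open_rows):
    """Return the index of the request to schedule next, or None if empty.

    open_rows maps a bank number to the row currently open in that bank.
    """
    if not queue:
        return None

    def priority(item):
        _, req = item
        row_hit = open_rows.get(req.bank) == req.row
        # Row hits sort first (False < True), then the oldest request wins.
        return (not row_hit, req.arrival)

    index, _ = min(enumerate(queue), key=priority)
    return index
```

With no open rows the policy degenerates to plain FCFS; once a row is open, a younger request to that row can overtake older requests that would need a row switch.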

Note that the above procedures are executed in reverse order in the code to model real hardware behavior.


 Simulation Flow:

  1. gem5 starts with AtomicSimpleCPU to create a checkpoint right before the Region of Interest (ROI).
  2. gem5 restores from the checkpoint with the detailed O3CPU and the Ruby memory system.
  3. The integration-related code in gem5 is activated by the "dumpresetStats" pseudo-instruction when the "activate_gsim" option is set, so a "dumpresetStats" pseudo-instruction is inserted at the beginning of the ROI.
  4. gem5 and GPGPU-Sim run separately and communicate with each other through shared memory.
  5. If the GPU simulation finishes first, GPGPU-Sim notifies gem5 to stop providing ticks. If the CPU simulation finishes first, gem5 disables the "m5exit" pseudo-instruction, so the rcS script should keep retrying "exit". See Running Simulator for details.

 Package Layout:

  • /alpha-ruby/codebase_alpha.tar.bz2

    gem5 and GPGPU-Sim package. This version is tested with Alpha ISA and Ruby memory system.

  • /arm-classic/codebase_arm.tar.bz2

    gem5 and GPGPU-Sim package. This version is tested with ARM ISA and classic memory system. (It may need further testing.)

  • /utils-alpha/disk-image-alpha.tar.bz2

    Alpha full system files: a pre-compiled Linux kernel, PAL/Console binaries, and a file system from the gem5 site.

    A set of pre-compiled OpenMP binaries of the Rodinia benchmark suite is installed under /rodinia/bin.ckpt/, with the ROI tagged by m5 pseudo-instructions.

  • /utils-alpha/run_alpha_example.tar

    A sample simulation directory to run Hotspot benchmark.

    • Simulation configuration files for GPGPU-Sim: gpgpusim.config and icnt_config.txt.
    • CUDA binary: hotspot.
    • rcS script for gem5 simulation: hotspot.rcS.
    • Keys for shared memory structures: keys.txt. See Running Simulator for details.
    • A run script: run_alpha.sh. Users may go through the script to change various paths accordingly.

  • /utils-alpha/bench_build.tar

    Hooks tool and hotspot source code.

    • hooks/: Provides a C interface to gem5 pseudo-instructions, so that a benchmark program can interact with the simulator, e.g. to create a checkpoint or to dump and reset statistics.
    • hotspot_omp/: The OpenMP version of the Hotspot benchmark with pseudo-instructions inserted. Search for "wangh" to find the modifications.

    Users may go through the Makefile to set the path to Alpha cross-compiler. A pre-compiled Alpha cross-compiler can be downloaded from gem5 site.

  • /utils-arm/disk-image-arm.tar.bz2

    ARM full system files from gem5 site. A simple C test program (vector add) is installed under /wangh/bin/test.

    Linux 3.3 VExpress_EMM kernel is used to support 1GB memory.

  • /utils-arm/run_arm_example.tar

    A sample simulation directory. Note that the arm-classic package uses the classic memory system, so ignore the Ruby-related material throughout this page.

    • Simulation configuration files for GPGPU-Sim: gpgpusim.config and icnt_config.txt.
    • CUDA binary: hotspot.
    • rcS script for gem5 simulation: test.rcS.
    • Keys for shared memory structures: keys.txt. See Running Simulator for details.
    • A run script: run_arm.sh. Users may go through the script to change various paths accordingly.


 Build Simulator:

gem5 and GPGPU-Sim are still built separately, and no additional requirements or steps are needed. For a quick start, brief instructions are given below. Please refer to the gem5 site and the README file in the GPGPU-Sim distribution for detailed instructions.

  • gem5:

    See gem5 site for dependencies.

    Type "scons build/ALPHA_FS_MOESI_CMP_directory/gem5.opt" in gem5_integ/ for Alpha-Ruby version.

    Type "scons build/ARM/gem5.opt" in gem5/ for ARM - Classic version.

  • GPGPU-Sim:

    Set the CUDA_INSTALL_PATH environment variable. The simulator is built on an older version of GPGPU-Sim, so CUDA Toolkit v3.1 is recommended.

    Type "make" in gpgpu-sim/.


 Running Simulator:

  1. Configure

    • Set the path to disk-image/ in /gem5_integ/configs/Syspath.py.
    • Set the image file name in /gem5_integ/configs/common/Benchmarks.py.

  2. Prepare

    • Place the CUDA binary, the GPGPU-Sim configuration files, and the rcS file for gem5 in the simulation directory.
    • Place a file called keys.txt in the simulation directory. See /run_example/keys.txt as a reference.
      • So that multiple simulations can run on one machine, the simulator reads the keys for the shared memory structures from a file named keys.txt in the working directory.

  3. Run

    Please refer to the run script in run_example/ directory to help with a quick start. Below is a brief explanation of the script.

    1. Set the paths to the Rodinia input data for the CUDA binary and to the gem5 simulator binary.
    2. Clear the shared memory segments, in case a previous simulation did not exit correctly and left shared memory behind in the system.
    3. Run gem5 with AtomicSimpleCPU and the classic memory system to create a checkpoint right before the ROI.
    4. Run gem5 with O3CPU and the Ruby memory system to restore from the checkpoint, and set the "activate_gsim" option and the GPU clock multiplier.
    5. Wait several seconds to ensure that shared memory creation completes, and then launch the GPGPU-Sim simulation.
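
The cleanup in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the provided run script: it assumes the System V keys are already known as integers (the format of keys.txt is not described here, so parsing it is left out) and that the standard Linux ipcrm utility is available. The helper names ipcrm_commands and clear_segments are hypothetical.

```python
import subprocess


def ipcrm_commands(keys):
    """Build one 'ipcrm -M <key>' command per shared memory key."""
    return [["ipcrm", "-M", str(key)] for key in keys]


def clear_segments(keys, run=subprocess.run):
    """Remove leftover shared memory segments identified by their keys.

    'run' is injectable so the function can be exercised without touching
    the system; by default it executes the commands with subprocess.run.
    """
    for cmd in ipcrm_commands(keys):
        # check=False: a key with no leftover segment is not treated as an error.
        run(cmd, check=False)
```

For example, clear_segments([5678, 5679]) would attempt to remove the segments created with those two keys; the key values here are made up for illustration.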


 Configuration Notices:

  • Frequencies:

    CPU frequency is set by the gem5 option "--clock". Memory frequency is set through "--mem_clock_multiplier". GPU frequency is set through the clock multiplier option "--gpu_l2_clock"; the frequency values set in the GPGPU-Sim configuration files are deprecated.

    For example, --clock=4.0GHz, --mem_clock_multiplier=5.0, --gpu_l2_clock=10.0 sets the CPU frequency to 4GHz, the memory frequency to 800MHz (4GHz / 5), and the GPU L2 cache frequency to 400MHz (4GHz / 10); the core frequency is half of the L2 frequency, i.e. 200MHz.
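
    The arithmetic in this example can be written out as a small helper. This is a sketch inferred from the numbers above, assuming each multiplier divides the CPU frequency and the core clock is always half of the L2 clock; derived_frequencies is a hypothetical name and not part of either simulator.

```python
def derived_frequencies(cpu_hz, mem_clock_multiplier, gpu_l2_clock):
    """Derive memory and GPU frequencies (in Hz) from the CPU clock.

    Assumption: both multipliers act as divisors of the CPU frequency,
    as in the 4GHz / 5.0 / 10.0 example.
    """
    mem_hz = cpu_hz / mem_clock_multiplier
    l2_hz = cpu_hz / gpu_l2_clock
    core_hz = l2_hz / 2  # the core runs at half the L2 frequency
    return {"cpu": cpu_hz, "memory": mem_hz, "gpu_l2": l2_hz, "gpu_core": core_hz}


example = derived_frequencies(4.0e9, 5.0, 10.0)
# memory = 800 MHz, gpu_l2 = 400 MHz, gpu_core = 200 MHz
```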

    Note that in GPGPU-Sim the width of the pipeline is equal to the warp size. To compensate for this, SMs run at 1/4 of the frequency reported in the product specification. For example, the 1.3GHz shader clock rate of NVIDIA's Quadro FX 5800 corresponds to a 325MHz SIMT core clock in GPGPU-Sim. See the GPGPU-Sim Manual for details.

  • Memory Settings:

    The number of memory channels is set by "--num-dirs" on the gem5 side and by "gpgpu_n_mem" in the configuration file on the GPGPU-Sim side.

    Note that num-dirs, numa-high-bit, ranks_per_dimm, dimms_per_channel, and mem_addr_map_mask on the gem5 side must be consistent with gpgpu_n_mem, gpgpu_mem_addr_mapping, and the nbk value of gpgpu_dram_timing_opt in the GPGPU-Sim configuration file.

    *numa-high-bit denotes the position of the highest channel bit in the DRAM address map. Search for "m_numa_bit_high" in /gem5_integ/src/mem/ruby/system/MemoryControl.cc for details.

    The provided example (Alpha) spells out the settings above on the command line as a reference.



 Publication:

If you use our Integrated gem5+GPGPU-Sim Simulator in your work, please cite:



 Contact:

For any technical questions, please send an email to hwang223 AT wisc.edu



