GPU Architecture: __DARK__'s blog on CSDN

GPU Architecture

Average article quality score: 77

Articles: 24 · Views: 82673 · Favorites: 57

Author: DARK

Only in darkness can you see the stars

nvprof tx1 or tx2

nvprof --metrics ipc,gld_transactions,gst_transactions,global_hit_rate,tex_cache_transactions,tex_cache_hit_rate,l2_tex_read_hit_rate,l2_tex_read_transactions,l2_tex_write_transactions,l2_read_transact

Original · 2017-08-18 21:42:24 · 731 views · 3 comments
GPU Architecture Basics: Concurrent Kernel Execution in Fermi and Later Architectures

Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. Concurrent kernel execution allows programs that execute a...

Original · 2017-03-07 14:35:42 · 1050 views · 0 comments
PTX ISA: Control Flow Instructions

Control Flow Instructions. The following PTX instructions and syntax are for controlling execution in a PTX program: {}, @, bra, call, ret, exit. 1.1. Control Flow Instructions: {}. {}: instruction grouping. ...

Original · 2017-07-08 16:20:54 · 807 views · 0 comments
PTX ISA: Cache Operators

ld cache operator; st cache operator

Original · 2017-04-25 10:20:58 · 775 views · 0 comments
PTX ISA: Synchronization Instructions bar & membar

bar: Barrier synchronization. Syntax: bar.sync a{, b}; bar.arrive a, b; bar.red.popc.u32 d, a{, b}, {!}c; bar.red.op.pred p, a{, b}, {!}c; .op = { .and, .or }; http://docs.nvidia.com/cuda/parallel-t

Original · 2017-03-07 23:37:00 · 2458 views · 0 comments
L1 Data Cache in NVIDIA GPUs

NVIDIA architecture; local data; global loads; global store; for L1 cache; reference white paper. Fermi: caching; caching; caching; L1/shared mem; not coherent. Kepler: caching; not caching; not cachi...

Original · 2017-03-07 20:11:56 · 425 views · 0 comments
PTX ISA: Usage of volatile

Literal meaning: volatile, unstable. Usage: applies to ld/st instructions. ld.volatile{.ss}.type d, [a]; // load from address ld.volatile{.ss}.vec.type d, [a]; // vector load from addr An ld.volatile operation is always performed and it will not be r...

Original · 2017-02-28 23:30:10 · 740 views · 0 comments
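Analyzing Read/Write Operations and Their Events in the GPU Cache

An analysis of read and write operations, based on the GPGPU-Sim source code. 1. Sending a read request for a cache that does not use a write-back policy: /// Read miss handler without writeback void baseline_cache::send_read_request(new_addr_type addr, new_addr_type block_addr, unsigned cache_index, mem_fetch...

Original · 2017-02-12 22:39:45 · 2815 views · 0 comments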
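GPU Benchmark Compilation Issues

Preface: 1. Most makefiles need only minor tweaks: change the arch version to the one that matches your GPU. 2. Some need special handling or are themselves broken; those are noted here. Summary of issues: 1. cannot find -lcudart. The cause: cudart is the CUDA runtime and -l means library, so why can the library not be found? The path is wrong! The usual fix is to add the following when compiling: nvcc -L/usr/local...

Original · 2017-09-30 20:28:44 · 1376 views · 0 comments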
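Jetson TX2 Performance Mode Tool: nvpmodel

The Jetson TX2 CPU performance mode tool nvpmodel. Contents: TX2 architecture diagram; list of performance modes; usage examples; references. Jetson Tegra systems are used in an ever wider range of applications, and users' performance and power requirements are correspondingly diverse. NVIDIA therefore provides a new command-line tool that lets users conveniently configure the CPU state to get the best performance and energy efficiency in different scenarios. The Jetson TX2 consists of one GPU and...

Original · 2017-12-18 14:15:12 · 12901 views · 0 comments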
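Half-Precision (fp16) and Single-Precision (fp32) Computation in GPU Architectures

For a project, we needed to optimize the convolutional layers in darknet. Deep learning frameworks such as Caffe and darknet already convert convolution into matrix multiplication, which makes it easy to call cuBLAS library functions and the tiled matrix multiplication in cuDNN. CUDA 7.5 introduced support for computing on 16-bit floating-point data and defined two new data types, half and half2. ...

Original · 2018-04-17 15:51:26 · 26499 views · 0 comments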
GPU Architecture Basics: L1 Data Cache & Unified L2 Cache in the Fermi Architecture

NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache. Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many...

Original · 2017-03-07 10:11:20 · 2249 views · 0 comments
GPU Architecture Basics: Unified L1/Texture Cache in Pascal

Unified L1/Texture Cache in Pascal. Like Maxwell, Pascal combines the functionality of the L1 and texture caches into a unified L1/Texture cache which acts as a coalescing buffer for memory accesses, gat...

Repost · 2017-02-26 15:27:57 · 1121 views · 0 comments
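Zero-Copy Issues

Zero copy on TK1, TX1, and TX2. TX1 architecture diagram. The Jetson TK1, TX1, and TX2 are all heterogeneous CPU-GPU architectures that share the main DRAM (at the bottom of the diagram). Top left: quad-core ARM A57. Next: quad-core ARM A53. Right: the GPU, dual-core Maxwell arch sm_53; the TX2 is Pascal arch sm_62. Each side manages its own caches, with no shared last level cach...

Original · 2017-08-12 17:44:44 · 1115 views · 0 comments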
GPGPU-Sim: Block Scheduling

Code parked here for now, to be analyzed later. unsigned simt_core_cluster::issue_block2core(){ unsigned num_blocks_issued=0; for( unsigned i=0; i < m_config->n_simt_cores_per_cluster; i++ ) { unsigned core = (i+m_cta_issue...

Original · 2017-06-01 23:32:54 · 526 views · 0 comments
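GPGPU-Sim: Compiling Benchmarks Step by Step

Running make directly with common.mk sometimes fails, and when the makefile is poorly written or the CPU architecture differs, it is hard to work out why. So I wrote the build commands by hand, following the makefile rules. The end result runs on the TX1. nvcc -c rand.cu -o rand.cu_50.o -arch sm_50 -O2 -g; nvcc -c lbstatic.cu...

Original · 2017-04-25 10:02:05 · 926 views · 0 comments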
GPGPU-Sim: Build Error cannot find -lcutil_x86_64 -lshrutil_x86_64

/usr/bin/ld: cannot find -lcutil_x86_64; /usr/bin/ld: cannot find -lshrutil_x86_64. This is a build error I run into often with GPGPU-Sim. What causes it? The build cannot find libcutil_x86_64.a and libshrutil_x86_64.a. Why can they not be found? This is because in the ma...

Original · 2017-03-30 23:23:13 · 2075 views · 6 comments
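GPU Architecture Basics: How Threads Relate to Memory Accesses in CUDA

On how threads relate to memory accesses in CUDA, ...

Original · 2015-11-16 20:25:38 · 964 views · 1 comment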
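GPGPU-Sim: Compiling the ispass2009 WP Benchmark

ispass2009 contains 12 benchmarks in total, 9 of which compile and run out of the box. WP stands for weather prediction. It is the twelfth benchmark; I wanted to use it, so I compiled it separately. The WP folder has its own makefile. 1. $ cd ispassbenchmark/WP; $ make. The following errors appear. 1) First error: gfortran: not found. Missing a...

Original · 2016-06-01 22:28:25 · 1747 views · 0 comments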
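NVIDIA Pascal GP100 Unified Memory

This is a translated document, recorded on this blog to aid my own understanding. Preface: NVIDIA introduced the concept of unified CPU and GPU memory in CUDA 6. In practice, data still has to travel between the CPU and the GPU over the PCI-e bus, but conceptually it was a step forward. In June of this year (2016), the newly released Pascal architecture and CUDA 8 extended the unified memory of CUDA 6; Pascal GP100 adds simpler programming and CPU-GPU memory sha...

Translation · 2016-11-14 12:08:37 · 481 views · 0 comments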
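GPGPU-Sim: Speeding Up Benchmark Runs (Reposted and Edited)

This post is based on option 4 of the article "GPGPU-Sim（番外）-如何加快GPGPU-Sim的运行速度" (GPGPU-Sim Extra: How to Speed Up GPGPU-Sim) from the column 《大光叔叔的专栏》. Link: http://blog.csdn.net/litdaguang/article/details/50002325 Newcomers to GPGPU-Sim may not even be comfortable with Ubuntu yet when they already have to run all kinds of experiments, and the officially provided virtual machine is far too slow. Luckily I came across Daguang's article, and felt the world...

Repost · 2016-12-13 12:42:53 · 1945 views · 0 comments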
GPU Architecture Basics

1. Fermi architecture. Architecture diagram. SM: streaming multiprocessors with multiple processing cores. Each SM contains 32 processing cores. Execute in a Single Instruction Multi...

Original · 2016-12-27 22:42:33 · 1827 views · 0 comments
How to Cache Global Data in the On-Chip (Level 1) Cache of Modern GPUs

1. Fermi arch. On CC 2.x (CC: Compute Capability, NVIDIA's term), the L1 data cache is still available, so we can cache both local and global data. For both ld (load) and st (store), the default behavior is to cache at all levels; ld.ca and st.wb are the default instructions. But this way, the SMs run into cache coherenc...

Original · 2017-01-03 23:21:10 · 749 views · 0 comments
How to View CPU and GPU Utilization on the Jetson TX1/2

NVIDIA provides a script; just run it with superuser privileges: sudo ~/tegrastats. The output looks like this: RAM 4634/7854MB (lfb 2x512kB) cpu [0%@1112,off,off,0%@1113,0%@1113,0%@1112] EMC 5%@1331 APE 150 VDE 1203 GR3D 0%@...

Original · 2018-03-29 11:08:32 · 16383 views · 13 comments
