GPU Architecture Basics: L1 Data Cache & Unified L2 Cache in the Fermi Architecture

__DARK__

Category: GPU Architecture. Tags: cache, gpu

Article link: https://blog.csdn.net/dark5669/article/details/60753046


NVIDIA Parallel DataCache™ with Configurable L1 and Unified L2 Cache

Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to Shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both Shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of program behavior.

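1. To serve programs that are shared-memory friendly, programs that are cache friendly, and programs that need both, NVIDIA's Fermi architecture introduces a configurable split between L1 cache and shared memory.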

Adding a true cache hierarchy for load / store operations presented significant challenges. Traditional GPU architectures support a read-only "load" path for texture operations and a write-only "export" path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read after write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write / "export" path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data.

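2. Traditional architectures support a read-only "load" path for texture operations and a write-only "export" path for pixel data output. This does not suit general-purpose GPGPU C/C++ programs, which expect reads and writes to be ordered.
e.g. If a register is spilled to memory (a write) and then read back, and the two paths are separate, the read can return stale data: the read path is not coherent with the write path.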
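
The read-after-write hazard described above can be sketched with a toy software model. This is only an illustration under stated assumptions, not a description of real hardware: the classes `SplitPathMemory` and `UnifiedPathMemory` and the helper `spill_and_reload` are hypothetical names invented here, and the model ignores timing, cache lines, and eviction.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

// Toy model (hypothetical): a write-only "export" queue and a separate
// read-only cache on the load path, as in the traditional design.
class SplitPathMemory {
public:
    // Stores are queued on the export path and reach DRAM only on flush().
    void store(std::uint32_t addr, int value) {
        export_queue_.push_back({addr, value});
    }

    // Loads go through the read cache, which never observes queued stores.
    int load(std::uint32_t addr) {
        auto hit = read_cache_.find(addr);
        if (hit != read_cache_.end()) return hit->second;
        int value = dram_[addr];  // unwritten addresses read as 0 in this model
        read_cache_[addr] = value;
        return value;
    }

    // Drain the whole export path, then invalidate the read cache.
    void flush() {
        for (const auto& w : export_queue_) dram_[w.first] = w.second;
        export_queue_.clear();
        read_cache_.clear();
        assert(export_queue_.empty());
    }

private:
    std::deque<std::pair<std::uint32_t, int>> export_queue_;
    std::unordered_map<std::uint32_t, int> read_cache_;
    std::unordered_map<std::uint32_t, int> dram_;
};

// Toy model (hypothetical): one path for loads and stores, so a load
// always observes the most recent store to the same address.
class UnifiedPathMemory {
public:
    void store(std::uint32_t addr, int value) { cache_[addr] = value; }
    int load(std::uint32_t addr) { return cache_[addr]; }

private:
    std::unordered_map<std::uint32_t, int> cache_;
};

// Spill a register value to memory and immediately read it back.
template <typename Memory>
int spill_and_reload(Memory& mem, std::uint32_t slot, int reg_value) {
    mem.store(slot, reg_value);
    return mem.load(slot);
}
```

In this model, `spill_and_reload` on a `SplitPathMemory` returns the stale value until `flush()` is called, while the same call on a `UnifiedPathMemory` returns the spilled value directly.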

The Fermi architecture addresses this challenge by implementing a single unified memory
request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2
cache that services all operations (load, store and texture). The per-SM L1 cache is
configurable to support both shared memory and caching of local and global memory
operations. The 64 KB memory can be configured as either 48 KB of Shared memory with 16
KB of L1 cache, or 16 KB of Shared memory with 48 KB of L1 cache. When configured with
48 KB of shared memory, programs that make extensive use of shared memory (such as
electrodynamic simulations) can perform up to three times faster. For programs whose memory
accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved
performance over direct access to DRAM.
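
The two splits of the 64 KB per-SM memory can be written down as a small model. This is a sketch only: `OnChipConfig`, `OnChipSplit`, `split_for`, and `choose_config` are hypothetical names, and the selection heuristic is an assumption made for illustration, not something the whitepaper prescribes. In CUDA, the corresponding per-kernel preference is requested through the runtime call `cudaFuncSetCacheConfig` with `cudaFuncCachePreferShared` or `cudaFuncCachePreferL1`.

```cpp
#include <cassert>

// Hypothetical names for the two Fermi configurations of the per-SM memory.
enum class OnChipConfig { PreferShared, PreferL1 };

// Sizes in KB of the shared-memory and L1-cache partitions.
struct OnChipSplit {
    int shared_kb;
    int l1_kb;
};

constexpr int kOnChipTotalKB = 64;  // per-SM configurable memory on Fermi

// Map a configuration to its split: 48/16 or 16/48 (shared/L1).
OnChipSplit split_for(OnChipConfig config) {
    OnChipSplit split = (config == OnChipConfig::PreferShared)
                            ? OnChipSplit{48, 16}
                            : OnChipSplit{16, 48};
    assert(split.shared_kb + split.l1_kb == kOnChipTotalKB);
    return split;
}

// Illustrative heuristic (an assumption): take the larger shared partition
// only when one block's shared memory does not fit in 16 KB; otherwise
// leave the extra capacity to the L1 cache.
OnChipConfig choose_config(int shared_kb_per_block) {
    assert(shared_kb_per_block >= 0 && shared_kb_per_block <= 48);
    return shared_kb_per_block > 16 ? OnChipConfig::PreferShared
                                    : OnChipConfig::PreferL1;
}
```

For example, a kernel that needs 32 KB of shared memory per block would select `PreferShared` under this heuristic, while a kernel with irregular global memory accesses and little shared memory would select `PreferL1`.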
