https://blog.csdn.net/love_hot_girl/article/details/81910692
https://www.cnblogs.com/minggoddess/p/10950349.html
https://zhuanlan.zhihu.com/p/259760974
https://www.cnblogs.com/timlly/p/11471507.html
https://www.sohu.com/a/83561143_119711
https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/LoadandStoreActions.html
https://zhuanlan.zhihu.com/p/112120206
Excerpted from: https://www.cnblogs.com/timlly/p/11471507.html#11-为何要了解gpu?
3.2 GPU
Tesla Architecture
The overview diagram of the Tesla microarchitecture is shown above. Its features and concepts are described below:
It has 7 TPCs (Texture/Processor Clusters)
Each TPC contains two SMs (Stream Multiprocessors)
Each SM contains:
6 SPs (Streaming Processors)
2 SFUs (Special Function Units)
An L1 cache, MT Issue (multithreaded instruction fetch), a C-Cache (constant cache), and shared memory
Besides the TPC core units, there are also various components that interact with video memory, the CPU, and system memory.
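As a rough mental model, the hierarchy listed above can be written down as nested structures. This is an illustrative sketch only; the unit counts simply mirror the list above, not any official specification:

// Illustrative nesting of the Tesla units described above.
struct SP {};                        // Streaming Processor (ALU)
struct SFU {};                       // Special Function Unit (sin, cos, ...)
struct SM {                          // Stream Multiprocessor
    SP  sp[6];
    SFU sfu[2];
    // plus L1 cache, MT Issue, C-Cache and shared memory
};
struct TPC { SM sm[2]; };            // Texture/Processor Cluster
struct TeslaGPU { TPC tpc[7]; };     // 7 TPCs in this diagram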
4. GPU Execution Mechanisms
4.1 GPU Rendering Overview
An overview of how the Fermi architecture operates:
Starting with Fermi, NVIDIA has used a similar architectural principle: a Giga Thread Engine manages all the work in flight, and the GPU is divided into multiple GPCs (Graphics Processing Clusters). Each GPC has multiple SMs (SMX, SMM) and one Raster Engine. The units are heavily interconnected, most notably by the Crossbar, which links the GPCs to the other functional modules (such as the ROPs and other subsystems).
The shaders that programmers write run on the SMs. Each SM contains many Cores that perform math operations for threads; a thread can be, for example, a vertex or pixel shader invocation. These Cores and the other units are driven by the Warp Scheduler, which manages groups of 32 threads as warps and hands the instructions to be executed over to the Dispatch Units.
How many of these units a GPU actually has (how many SMs per GPC, how many GPCs, and so on) depends on the chip configuration itself. For example, GM204 has 4 GPCs with 4 SMs each, while Tegra X1 has 1 GPC and 2 SMs, yet both use the Maxwell design. The SM design itself (core count, instruction units, schedulers, and so on) has also changed over time, which helps make the chips efficient enough to scale from high-end desktops down to laptops and mobile devices.
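As a small, self-contained illustration of the warp grouping just described, the snippet below maps a thread count onto 32-wide warps; the workload size is a made-up example:

#include <cstdio>

int main() {
    const int warpSize   = 32;        // threads per warp, as described above
    const int numThreads = 1000000;   // hypothetical workload, e.g. one thread per pixel

    // A partially filled warp still occupies a whole warp slot, so round up.
    const int numWarps = (numThreads + warpSize - 1) / warpSize;
    printf("%d threads -> %d warps\n", numThreads, numWarps);
    return 0;
}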
4.4 GPU Resource Mechanisms
This section describes the GPU's mechanisms for memory access, resource management, and related matters.
4.4.1 Memory Architecture
Some GPU architectures, like CPUs, have a multi-level cache hierarchy: registers, L1 cache, L2 cache, GPU memory (VRAM), and system memory.
Access speed decreases step by step from registers down to system memory:
As you can see, shaders can access registers and the L1/L2 caches fairly quickly, but accessing textures, constant buffers, and global memory is very slow and incurs high latency.
The multi-level cache structure above can be called "CPU-style"; there is also a "GPU-style" memory architecture:
This style is characterized by many ALUs and many GPU contexts, delivering high throughput; it relies on high bandwidth to exchange data with system memory.
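Why so many contexts? A back-of-the-envelope calculation shows how they hide memory latency: while one context stalls on a memory access, the core switches to another. The cycle counts below are assumptions for illustration, not measured figures:

#include <cstdio>

int main() {
    const int memLatencyCycles   = 400;  // assumed global-memory latency
    const int aluCyclesPerAccess = 8;    // assumed ALU work between two memory accesses

    // Roughly how many contexts must be in flight so the ALUs never idle:
    const int contextsNeeded = memLatencyCycles / aluCyclesPerAccess + 1;
    printf("~%d contexts hide a %d-cycle memory latency\n",
           contextsNeeded, memLatencyCycles);
    return 0;
}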
4.4.6 Display Mechanism
Horizontal and Vertical Sync Signals
In early CRT monitors, an electron gun scanned the screen line by line from top to bottom; one complete scan displayed one frame, after which the gun returned to its initial position for the next scan. To keep the monitor's display process in sync with the system's video controller, the monitor used a hardware clock to generate a series of timing signals.
When the electron gun moves to a new scanline, the monitor emits a horizontal synchronization signal, or HSync for short.
When a full frame has been drawn and the electron gun has returned to its starting position, just before the next frame begins, the monitor emits a vertical synchronization signal, or VSync for short.
Monitors usually refresh at a fixed rate, and that refresh rate is exactly the frequency at which the VSync signal is generated. Although most displays today are LCDs, the principle remains essentially the same.
The CPU submits the computed display content to the GPU; once the GPU finishes rendering, it writes the result into the framebuffer. The video controller then reads the framebuffer frame by frame in step with the VSync signal and, after data conversion, the image is finally shown on the display.
Double Buffering
With a single buffer, reading from and updating the framebuffer both suffer significant efficiency problems; the two sides frequently end up waiting on each other, which drags the frame rate down.
To solve this, GPUs typically use two buffers, the double-buffering mechanism. The GPU pre-renders a frame into one buffer for the video controller to read; when the next frame has finished rendering, the GPU simply points the video controller at the second buffer.
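A minimal C++ sketch of the pointer flip just described; the type names are illustrative, and on real hardware the flip happens in the display controller rather than in application code:

struct Framebuffer { /* pixel storage elided */ };

struct DoubleBuffer {
    Framebuffer buffers[2];
    int front = 0;  // index of the buffer the video controller reads

    Framebuffer& frontBuffer() { return buffers[front]; }      // being displayed
    Framebuffer& backBuffer()  { return buffers[1 - front]; }  // being rendered

    // "Swapping" only flips which buffer the controller points at;
    // no pixel data is copied.
    void swap() { front = 1 - front; }
};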
Vertical Sync
Although double buffering solves the efficiency problem, it introduces a new one. If the GPU submits a new frame and swaps the two buffers while the video controller is still mid-read, i.e. the screen has only displayed half of the previous frame, the video controller will show the bottom half of the new frame, producing screen tearing:
To solve this, GPUs usually provide a mechanism called vertical synchronization (also abbreviated V-Sync). With V-Sync enabled, the GPU waits for the display's VSync signal before rendering a new frame and updating the buffers. This eliminates tearing and improves smoothness, but it consumes more computing resources and can add some latency.
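Building on the DoubleBuffer sketch above, a V-Sync-gated loop looks roughly like this; renderScene() and waitForVSync() are hypothetical stand-ins for the application's rendering and the platform's swap-interval/present call:

void renderScene(Framebuffer& target);  // hypothetical: draw the next frame
void waitForVSync();                    // hypothetical: block until the display's VSync signal

void renderLoop(DoubleBuffer& db) {
    for (;;) {
        renderScene(db.backBuffer());  // GPU renders into the hidden buffer
        waitForVSync();                // flip only between scan-outs...
        db.swap();                     // ...so the controller never reads a half-new frame
    }
}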
4.5 Shader Execution Mechanism
Shader code, like code in traditional languages such as C++, must be compiled from a human-oriented high-level language (GLSL, HLSL, Cg) into machine-oriented binary instructions; the binary can in turn be disassembled into assembly code for engineers to inspect and debug.
Compiling from the high-level language down to assembly instructions is usually done offline, to reduce the cost at run time.
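As one concrete example of this offline step, with Direct3D the HLSL shader shown further below could be precompiled with the fxc tool, or compiled at run time with D3DCompileFromFile; the file name and entry point here are illustrative:

// Offline (build time):  fxc /T ps_5_0 /E diffuseShader /Fo diffuse.cso diffuse.hlsl
// Runtime compilation, shown for contrast with the offline path:
#include <d3dcompiler.h>  // link against d3dcompiler.lib

ID3DBlob* compileShader(const wchar_t* file) {
    ID3DBlob* code   = nullptr;
    ID3DBlob* errors = nullptr;
    HRESULT hr = D3DCompileFromFile(file, nullptr, nullptr,
                                    "diffuseShader", "ps_5_0", 0, 0,
                                    &code, &errors);
    if (FAILED(hr)) return nullptr;  // diagnostics, if any, are in the errors blob
    return code;  // binary instructions, ready to hand to the GPU driver
}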
At execution time, the CPU pushes the shader's binary instructions to the GPU over PCI-e; while executing the code, the GPU uses contexts to split the instructions into several channels that are pushed into each Core's storage space.
On modern GPUs, more and more stages are programmable, including but not limited to: vertex shaders (Vertex Shader), tessellation control shaders (Tessellation Control Shader), geometry shaders (Geometry Shader), pixel/fragment shaders (Fragment Shader), compute shaders (Compute Shader), ...
These shaders form a pipelined, parallelized rendering pipeline. A concrete example follows.
Below is the classic code for computing diffuse lighting:
SamplerState mySamp;       // texture sampler state
Texture2D<float3> myTex;   // diffuse albedo texture
float3 lightDir;           // light direction, supplied via a constant buffer

float4 diffuseShader(float3 norm, float2 uv)
{
    float3 kd;
    kd = myTex.Sample(mySamp, uv);               // fetch the albedo texel
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);  // Lambertian N·L term
    return float4(kd, 1.0);
}
After compilation, it becomes assembly code:
<diffuseShader>:
sample r0, v4, t0, s0           // fetch the texel (kd) from texture t0 with sampler s0
mul    r3, v0, cb0[0]           // r3  = norm.x * lightDir.x
madd   r3, v1, cb0[1], r3       // r3 += norm.y * lightDir.y
madd   r3, v2, cb0[2], r3       // r3 += norm.z * lightDir.z  (dot product complete)
clmp   r3, r3, l(0.0), l(1.0)   // r3  = clamp(r3, 0.0, 1.0)
mul    o0, r0, r3               // output red   = kd.r * N·L
mul    o1, r1, r3               // output green = kd.g * N·L
mul    o2, r2, r3               // output blue  = kd.b * N·L
mov    o3, l(1.0)               // output alpha = 1.0
During execution, the assembly code above is pushed by the GPU into an execution context (Execution Context); the ALU then fetches and decodes the assembly instructions one by one and executes them.
The diagram above shows only a single ALU at work; in reality, a GPU has dozens or even hundreds of execution units running shader instructions at the same time:
On a GPU with a SIMT architecture, the assembly is somewhat different, becoming SIMT-specific instruction code:
<VEC8_diffuseShader>:   // same flow as <diffuseShader>, but each vec_* register carries 8 threads' data
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul vec_o0, vec_r0, vec_r3
VEC8_mul vec_o1, vec_r1, vec_r3
VEC8_mul vec_o2, vec_r2, vec_r3
VEC8_mov o3, l(1.0)
Moreover, contexts are organized into shared structures with the Core as the unit: the multiple ALUs within one Core share a single group of contexts:
With multiple Cores, even more ALUs participate in the shader computation at the same time; each Core executes different data, which may be vertices, primitives, pixels, or any other kind of work:
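A minimal sketch of what one VEC8 instruction means: a single fetched and decoded instruction drives eight ALU lanes in lock-step, each lane on its own thread's data. The loop below stands in for lanes that, in hardware, execute in parallel; this is also why the lanes can share one context, since they all follow the same instruction stream:

const int kLanes = 8;  // the "VEC8" width in the listing above

struct Vec8 { float lane[kLanes]; };  // one vec_* register: 8 data lanes

// One VEC8_mul: one instruction fetch/decode, eight executions.
Vec8 vec8_mul(const Vec8& a, const Vec8& b) {
    Vec8 dst;
    for (int i = 0; i < kLanes; ++i)  // each iteration models one hardware lane
        dst.lane[i] = a.lane[i] * b.lane[i];
    return dst;
}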
https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
And no offence to the folks who designed said GeForces and Radeons (a few of whom I am friends with and many more I know quite well), but the front-end architecture of a modern discrete IMR GPU is not the most exciting thing in the world.
Designed around having plenty of dedicated bandwidth, those GPUs go about the job of painting pixels in a reasonably inefficient way, but one that is conceptually simple and manifests itself in silicon in a similarly simple way, which makes it relatively easy for the GPU architect to design, spec and have built by the hardware team.
Immediate Mode Rendering at work
With an IMR you send work to the GPU and it gets drawn straight away. There's little connection to what else has already been drawn, or will be drawn in the future. You send triangles, you shade them. You rasterise them into pixels, you shade those. You send the rendered pixels to the screen. Triangles in, pixels out, job done! But, crucially, the job is done with no context of what's already happened, or what might happen in the future.
PowerVR GPUs are about as different as they come in that respect, and it's that which made me take the job here: to figure out how PowerVR's architects, hardware teams and software folks had made TBDR actually work in real products. My instinct was that TBDRs would be too complex to build in a way that worked well and actually provided a benefit. I had a chance to figure it out, and five years later I'm still here, helping figure out how we'll evolve it in the future, along with the rest of the GPU's microarchitecture.
As far as the graphics programmer is concerned, PowerVR still looks like triangles in, pixels out, job done. But under the hood something much more exciting is happening. And while the exciting part put me in this chair so I could write about it 5 years later, crucially it’s also that other good E word: efficient!
It always starts with the classic TBDR vs. IMR debate
To help understand why, let’s keep talking about IMRs. One of the biggest things stopping a desktop-class IMR scaling down to fit the power, performance and area budgets of modern embedded application processors is bandwidth. It’s such a scarce resource, even in high-end processors – mostly because of power, area, wiring and packaging limitations, among other things – that you really need to use it as efficiently as possible.
IMRs don’t do that very well, especially when pixel shading. Remember that there are usually a great many more pixels being rendered than the vertices used to build triangles. On top of that, with an IMR pixels are often still shaded despite never being visible on the screen, and that costs large amounts of precious bandwidth and power. Here’s why.
Textures for those pixels need to be sampled, and those pixels need to be written out to memory – and often read back in and written out again! – before being drawn on the screen. While all modern IMRs have means in hardware to try and avoid some of that redundant work, say one building in the background being completely obscured by one drawn closer to you, there are things the application developer can do to effectively disable those mechanisms, such as always drawing the building in the background first.
In our architecture it doesn’t really matter how the application developer draws what’s on the screen. There are exceptions for non-opaque geometry, which the developer still needs to manage, but otherwise we’re submission order independent. That capability is something we’ve had in our hardware since before we were ever an IP company and still made our own standalone PC and console GPUs. You could draw the building in the background first, then the one in the foreground on top, and we’ll never perform pixel shading for the first one, unlike an IMR.
We effectively sort all of the opaque geometry in the GPU, regardless of how and when it was submitted by the application, to figure out the top-most triangles. Sure, if a developer perfectly sorts their geometry then an IMR can get much closer to our efficiency, but that’s not the common case by any means.
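A toy illustration of why submission order stops mattering for opaque geometry: if only the nearest fragment per pixel is recorded, and shading happens once after all geometry is in, background-first and foreground-first submissions produce identical shading work. The names here are hypothetical, not PowerVR's actual hardware interfaces:

#include <limits>

struct Fragment {
    float depth      = std::numeric_limits<float>::infinity();  // "nothing yet"
    int   triangleId = -1;  // which triangle currently owns this pixel
};

// Called once per covering triangle, in whatever order the app submitted them.
void submitOpaque(Fragment& pixel, float depth, int triangleId) {
    if (depth < pixel.depth) {     // keep only the nearest fragment;
        pixel.depth      = depth;  // nothing is shaded at this point
        pixel.triangleId = triangleId;
    }
}

void shadePixel(const Fragment&);  // hypothetical: runs exactly once per pixel, afterwards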
PowerVR TBDRs
Think again about all the work that saves, especially for modern content: for every pixel shaded there are going to be a non-trivial number of texture lookups for various things, dozens and sometimes hundreds of ALU cycles spent running computation on that texture data in order to apply the right effects, which often means writing the pixel out to an intermediate surface to be read back in again in a further rendering pass, and then the pixel needs to be stored in memory at the end of shading, so it can be displayed on screen.
And that’s just one optimisation that we have. So even though we’ve avoided processing completely occluded geometry, there’s still bandwidth saving work we can do at the pixel processing stage. Because we split the screen up into tiles, where we figure out all of the geometry that contributes to the tile so we only process what we need to, and we know exactly how big the tile is (currently 32×32 pixels, but it’s been smaller and even non-square in prior designs), we can build enough on-chip storage to process a few of those tiles at a time, without having to use storage in external memory again until we’ve finished and want to write the final pixels out.
PowerVR GPUs split the screen into tiles
There are secondary benefits to working on the screen one region at a time; benefits that other GPUs take advantage of too: because it's highly likely that a pixel on the screen will share some data with its immediate neighbours, it's likely that by the time we move on to processing the neighbouring pixels we've already fetched the data into cache and don't have to wait for another set of external memory accesses, saving bandwidth again. It's a classic exploitation of the spatial locality that's present in a lot of modern 3D rendering.
So that’s the top-level view of the biggest benefits of a TBDR versus an IMR in terms of processing and (especially) bandwidth efficiency. But how does it actually work in hardware? If you’re not too hardware inclined you can stop here! If you go no further, you’ll still have understood the big top-level benefits of how we go about making best use of the available and very precious bandwidth, throughout rendering on in embedded, low-power systems.
How TBDR works in hardware
For those interested in how things happen in the hardware, let’s talk about the tiler in context of a modern Rogue GPU.
A 3D graphics application starts by telling us where its geometry is in memory, so we ask the GPU to fetch it and perform vertex shading. We have blocks in our GPUs that are responsible for the non-programmable steps of each kind of task type, called the data masters. They do a bunch of different things on behalf of the Universal Shading Cluster or USC (our shading core) to do the fixed-function bits of any workload, including fetching data from memory. So because we're vertex shading, it's the vertex data master (VDM) that gets involved at this point, to fetch the vertex data from memory based on information provided by the driver.
PowerVR Series7XT is the latest family of Rogue GPUs
The data could be stored in memory as lines, triangles or points. It could be indexed or non-indexed. There are associated shader programs and accompanying data for those programs. The VDM fetches whatever’s needed, using another couple of internally-programmable blocks to help, and emits it all to the USC for vertex shading. The USC runs the shader program and the output vertices are stored on-chip.
They’re then consumed by hardware that performs primitive assembly, certain kinds of culling, and then clipping. If the geometry is back-facing or can be determined to be completely off the screen, it’s culled. All of the remaining on-screen front-facing geometry is sent to be clipped. There’s a fast path here for geometry that doesn’t intersect a clip plane, to let it make onwards progress with no extra processing bar the intersection test. If the geometry intersects with a plane, the clipper generates new geometry so that passed-on vertices are fully on-screen (even though they might be right at the edges). The clipper can do some other cool things, but they’re not too relevant for a big picture explanation like this.
Then we’re on to where a lot of the magic happens, compared to IMRs. We’re obviously aiming to run computation in multiple phases in the hardware, to maximise efficiency and occupancy: one front-end phase to figure out what’s going on with shaded geometry and bin it into tiles, then one phase to consume that binned data, rasterise it and pass it on for pixel shading and final write-out. To keep things as efficient as possible that intermediate acceleration structure between the two main phases has to be as optimal as we can make it.
Clearly it's a bandwidth cost to create it and read it back, one which our competitors like to pick on as a competitive advantage they have over us. And it's true; an IMR doesn't have to deal with it. But given the bandwidth savings inherent in our processing model, even after creating that acceleration structure (which we call the Parameter Buffer, or PB) and feeding it to our trick rasteriser, we still end up with a huge bandwidth advantage in typical rendering situations, especially complex game-like scenes.
So how do we generate the PB? The clipper outputs a stream of primitives and render target IDs into memory, grouped by render target. Think of it as a container around collections of related geometry. The relationship is critical: we don't want to read it back later and fail to consume the majority of the data inside, since that would be wasteful. The output at this stage is the main data structure stored in the PB. We then compress that data, and in the general case it compresses very well, so we save quite a lot of PB-creation bandwidth from that step alone.
The concept of tiling
Now for the bit that most people sort of understand about our hardware architecture: tiling. The tiling engine has one main job: output some data that marks out a tiled region, some associated state, and a set of pointers to the geometry that contributes to that region. We also store masks just in case a primitive doesn’t actually contribute to the region, but is stored in memory anyway. That lets us save some bandwidth and processing for that geometry, because it doesn’t contribute to the tile.
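The exact format is proprietary, but conceptually each per-tile entry carries the pieces just listed: the tiled region, some associated state, pointers into the shared geometry, and a coverage mask. A hypothetical shape in C++:

#include <vector>

struct PrimitiveRef {
    const void* geometry;      // pointer into the shared geometry data
    unsigned    coverageMask;  // marks which stored primitives really touch this tile
};

struct TileEntry {
    int tileX, tileY;                 // the tiled region this entry marks out
    // ... associated render state elided ...
    std::vector<PrimitiveRef> prims;  // the geometry contributing to the region
};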
We call the resulting data structure a primitive list. If you’ve ever consumed any of our developer documentation, you’ll have seen mention of primitive lists as the intermediate data structure between the front-end phase and the pixel processing phase. Next, some more magic that’s specific to the PowerVR way of doing things.
Imagine you were tasked with building this bit of the architecture yourself, where you had to determine what regions need to be rasterised for a given set of geometry. There's one obvious algorithm you could choose: the bounding box. Draw a box around the triangle that covers its extents, and whatever tiles that box touches are the ones you rasterise for that triangle. That falls down pretty quickly though, efficiency-wise.
Imagine a fairly long and thin triangle drawn across the screen in any orientation. You can quickly picture that the bounding box for that triangle is going to lie over tiles that the triangle doesn’t actually touch. So when you rasterise, you’re going to generate work for your shading core, but where nothing is actually going to happen in terms of a contribution to the screen.
Instead, we have an algorithm baked into the hardware which we call perfect tiling. It works as you’d expect: we only generate tile lists where the geometry actually covers some area in the tile. It’s one of the most optimised and most efficient parts of the design. The perfect tiling engine generates that perfect list of tiles for a given set of geometry.
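The contrast is easy to see in code. The sketch below implements bounding-box binning, which lists every tile the box touches, including tiles a long, thin triangle never covers; perfect tiling would add an exact tile-versus-triangle overlap test before emitting. This is illustrative only (the hardware algorithm is not public), and the 32-pixel tile size follows the figure given earlier:

#include <algorithm>

const int kTileSize = 32;

struct Triangle { float x[3], y[3]; };  // screen-space vertex positions

// Bounding-box binning: over-conservative for long, thin triangles.
void binByBoundingBox(const Triangle& t, void (*emitTile)(int tx, int ty)) {
    const float minX = std::min({t.x[0], t.x[1], t.x[2]});
    const float maxX = std::max({t.x[0], t.x[1], t.x[2]});
    const float minY = std::min({t.y[0], t.y[1], t.y[2]});
    const float maxY = std::max({t.y[0], t.y[1], t.y[2]});
    for (int ty = (int)(minY / kTileSize); ty <= (int)(maxY / kTileSize); ++ty)
        for (int tx = (int)(minX / kTileSize); tx <= (int)(maxX / kTileSize); ++tx)
            emitTile(tx, ty);  // may emit tiles the triangle never actually touches

    // Perfect tiling would call emitTile(tx, ty) only after an exact
    // tile-rectangle-vs-triangle overlap test passes for that tile.
}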
PowerVR perfect tiling vs. bounding box or hierarchical tiling
That tile information plus the primitive lists are packed into the PB as efficiently as we can, and that’s conceptually pretty much it. In reality there’s a heck of a lot that still happens here in the hardware at the back-end phase of tiling, to fill the PB and organise and marshal the actual memory accesses for the external memory writes, but in terms of functionality to wrap your head around, we’re pretty much done.
That front-end hardware architecture is really where a big chunk of the efficiency gains can be found in a modern PowerVR GPU, compared to some of our competition. Surprisingly to those who find out, it's a part of the hardware architecture that's not actually that different, at least at the top level, between Rogue and SGX. While we completely redesigned the shader core for Rogue, the front-end architecture actually bears a strong resemblance to the one you'll find in the later-generation SGX GPU IPs. It works and it works very well.
And now that I’m done explaining the tiling part of our TBDR, it’s a good excuse to stop! I’ll come back to the deferred rendering part in a future blog post, so stay tuned.