Mobile Graphics Architecture: Tile-Based Rendering in Imagination's PowerVR

Contents

PowerVR

IMR

Drawbacks of IMR

TBDR

TBDR Architecture Analysis:


PowerVR

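This article takes a detailed look at PowerVR's Tile-Based Deferred Rendering, referred to below as TBDR.

There are currently two main rendering modes. The first is Immediate Mode Rendering, referred to below as IMR. The second is tile-based rendering, which the rest of this article covers.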

IMR

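IMR is a relatively simple pipelined architecture used mainly on PCs. It does not have to worry about heavy bandwidth consumption: rendering data is submitted and drawn straight away, so the rendering rate is high. Its rendering flow is shown in the figure below: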

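Drawbacks of IMR

Several drawbacks keep IMR out of embedded devices (including mobile). The main reason is that embedded hardware has to be compact and low-power, and the factor that matters most for performance there is bandwidth consumption. IMR does poorly on this front, especially in shading. IMR usually also shades pixels that end up invisible on screen, and the number of pixels is typically much larger than the number of vertices. In addition, before a pixel reaches the screen it has to be written out to external memory and then read and written again several times, which consumes a lot of precious bandwidth. Finally, opaque objects are not sorted inside the GPU, so to improve performance the programmer has to sort them manually on the CPU side.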
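The effect of submission order on wasted shading can be illustrated with a small model of a single pixel. This is only a sketch: it assumes the GPU runs a depth test before shading, and the `shade_count` helper and the depth values are made up for illustration.

```python
def shade_count(fragments):
    """Count shader invocations for one pixel.

    `fragments` lists the depths (smaller = closer) of the opaque
    fragments that land on the pixel, in submission order. A fragment
    is shaded only if it is closer than everything drawn before it
    (assumed early depth test).
    """
    nearest = float("inf")
    shaded = 0
    for depth in fragments:
        if depth < nearest:  # passes the depth test
            nearest = depth  # depth write
            shaded += 1      # fragment shader runs
    return shaded


layers = [0.9, 0.7, 0.5, 0.3]                 # four opaque layers over one pixel
back_to_front = shade_count(layers)           # 4: every layer is shaded
front_to_back = shade_count(sorted(layers))   # 1: only the nearest layer is shaded
```

Sorting front to back on the CPU is what brings the count from 4 down to 1 in this model, which is the manual work the paragraph above refers to.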

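PowerVR Architecture Analysis

The following looks at its advantages from an architectural standpoint.

A 3D graphics application starts by determining where its geometry data lives in memory. It then tells the GPU to fetch that data and run vertex shading. Inside the GPU, non-programmable units called data masters are responsible for dispatching the different types of work. They do a range of things on behalf of the Universal Shading Cluster (USC, the shader core), including fetching data from memory. Vertex shading tasks go through the vertex data master (VDM), and pixel shading tasks go through the pixel data master (PDM). Taking the VDM as an example, it fetches vertex data from memory based on information supplied by the driver.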

PowerVR Series7XT is the latest family of Rogue GPUs

 

 

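The VDM uses other built-in blocks to fetch the data it needs, and that data is eventually handed to the USC for vertex shading. The USC runs the shader program and stores the result in on-chip storage inside the GPU.

The hardware then processes the shaded primitives, performing culling and clipping. Culling can be back-face or front-face. If a piece of geometry intersects a clipping plane, the clipper generates new geometry so that the resulting vertices lie entirely on screen.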
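Face culling is commonly decided from the winding order of the projected triangle. The sketch below shows that idea using the signed area in screen space. It assumes counter-clockwise triangles are front-facing and a y-up coordinate system; the function names are illustrative and do not describe the actual PowerVR hardware.

```python
def signed_area(a, b, c):
    """Twice the signed area of screen-space triangle abc (x right, y up).

    Positive means counter-clockwise winding, negative means clockwise.
    """
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def is_culled(tri, cull="back"):
    """Return True if the triangle should be discarded.

    Assumption: counter-clockwise triangles are front-facing.
    `cull` is "back" or "front".
    """
    area = signed_area(*tri)
    if area == 0:
        return True  # degenerate triangle covers no pixels
    front_facing = area > 0
    return front_facing if cull == "front" else not front_facing


ccw = ((0, 0), (1, 0), (0, 1))  # front-facing under the assumption above
cw = ((0, 0), (0, 1), (1, 0))   # same triangle with reversed winding
```

With back-face culling, `is_culled(ccw)` is false and `is_culled(cw)` is true; with `cull="front"` the results swap.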

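Like other mobile architectures, PowerVR splits the screen into tiles and processes each tile separately, which reduces bandwidth. Tiling is the process of placing the processed geometry into small rectangular regions called tiles. Rasterization and pixel processing happen inside each tile. TBR consists of two phases: vertex processing and per-tile rasterization.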
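As a rough illustration of the tiling phase, the sketch below bins triangles into per-tile lists using their bounding boxes. The 32-pixel tile size, the function name, and the dictionary layout are assumptions chosen for the example, not the real hardware format.

```python
from collections import defaultdict

TILE = 32  # tile edge in pixels; an assumed value for illustration


def bin_by_bounding_box(triangles, width, height, tile=TILE):
    """Return {(tile_x, tile_y): [triangle indices]} using bounding boxes."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    lists = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen, then convert to tile indices.
        x0 = max(0, int(min(xs)) // tile)
        y0 = max(0, int(min(ys)) // tile)
        x1 = min(tiles_x - 1, int(max(xs)) // tile)
        y1 = min(tiles_y - 1, int(max(ys)) // tile)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                lists[(tx, ty)].append(index)  # submission order is preserved
    return dict(lists)


triangles = [
    ((5, 5), (20, 5), (5, 20)),      # fits inside tile (0, 0)
    ((10, 10), (70, 10), (10, 40)),  # spans tiles 0..2 in x and 0..1 in y
]
tile_lists = bin_by_bounding_box(triangles, width=128, height=64)
```

Each tile's list is then all the second phase needs in order to rasterize that tile on its own.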

PowerVR GPUs split the screen into tiles

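Next, a look at how the tiled data is organized. The figure below shows three tiling methods. When binning a triangle, the form that first comes to mind is the bounding box in the middle, but it also marks tiles that the triangle does not actually cover. The third method is less efficient still. To save as much bandwidth as possible, TBDR uses the first method in the figure, which clearly leaves the fewest tiles not covered by the triangle. This method is called perfect tiling.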


 

PowerVR perfect tiling vs. bounding box or hierarchical tiling
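The difference between bounding-box tiling and perfect tiling can be checked numerically. The sketch below counts the tiles each method assigns to one triangle. The exact test uses a separating-axis check between the triangle and each tile rectangle, which is one possible way to get an exact answer and is not claimed to be how the hardware does it. The tile size and all names are assumptions.

```python
def bbox_overlaps(tri, x0, y0, x1, y1):
    """True if the triangle's bounding box touches the rectangle."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return not (max(xs) < x0 or min(xs) > x1 or max(ys) < y0 or min(ys) > y1)


def triangle_overlaps(tri, x0, y0, x1, y1):
    """True if the triangle itself touches the rectangle (separating axes)."""
    if not bbox_overlaps(tri, x0, y0, x1, y1):
        return False  # separated along the x or y axis
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    for i in range(3):
        (ax, ay), (bx, by) = tri[i], tri[(i + 1) % 3]
        nx, ny = by - ay, ax - bx  # normal of edge i
        tri_proj = [nx * px + ny * py for px, py in tri]
        box_proj = [nx * px + ny * py for px, py in corners]
        if max(tri_proj) < min(box_proj) or min(tri_proj) > max(box_proj):
            return False  # this edge normal separates the two shapes
    return True


def count_tiles(tri, width, height, test, tile=32):
    """Count the tiles for which `test` reports an overlap."""
    count = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            if test(tri, tx, ty, tx + tile, ty + tile):
                count += 1
    return count


tri = ((0, 0), (127, 0), (0, 127))
bbox_tiles = count_tiles(tri, 128, 128, bbox_overlaps)       # 16 tiles
exact_tiles = count_tiles(tri, 128, 128, triangle_overlaps)  # 10 tiles
```

For this right triangle on a 4x4 tile grid, the bounding box marks all 16 tiles while the exact test marks 10, so six tiles avoid carrying a primitive that never touches them.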

That tile information plus the primitive lists are packed into the parameter buffer (PB) as efficiently as possible.

Deferred rendering proceeds as follows: pack into the PB -> fetch the PB back into the core -> triangle edge equations -> MSAA
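The "triangle edge equations" step refers to the standard way of deciding which pixels a triangle covers. Below is a minimal sketch of that technique for one tile. It assumes counter-clockwise triangles, samples at pixel centers, uses an 8-pixel tile to keep the example small, and ignores tie-breaking rules for pixels that lie exactly on a shared edge.

```python
def edge(a, b, p):
    """Edge function: positive when p lies to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize_tile(tri, x0, y0, tile=8):
    """Return the pixels of one tile whose centers are inside `tri`.

    Assumes `tri` is counter-clockwise, so a point is inside when all
    three edge functions are non-negative.
    """
    a, b, c = tri
    covered = set()
    for y in range(y0, y0 + tile):
        for x in range(x0, x0 + tile):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                covered.add((x, y))
    return covered


tri = ((0, 0), (8, 0), (0, 8))
covered = rasterize_tile(tri, 0, 0)  # 36 of the 64 pixels in the tile
```

With MSAA the same test would be evaluated at several sample positions per pixel instead of only at the center.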

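The screen is split into tiles so that each one is easy to process and small enough to fit entirely inside the GPU, which reduces the need to access memory. For each tile, the geometry that affects it is stored in a structure called the primitive list. Generating pixels from each tile's primitive list is deferred as long as possible, which keeps the amount of rendering work to a minimum.

On tile count: is it better to have more tiles or fewer? What the GPU cares about here is triangles overlapping inside a tile, and the principle is to keep the handling of overlapping triangles as small as possible. The larger a tile is, the more triangles it contains, and when triangles inside a tile overlap or intersect, the GPU needs extra on-chip space to handle them. In addition, when 4x MSAA is used, the number of samples to store is four times the number of pixels, so the storage needed per tile grows accordingly. Both points favor splitting the screen into more, smaller tiles.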

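Now for MSAA. MSAA improves rendering quality by generating additional sub-pixel samples, and on embedded GPUs it has essentially no impact on memory bandwidth. I used to wonder why MSAA was practical on mobile, and now it finally makes sense: the GPU provides extra on-chip storage for these sub-pixel samples instead of keeping them in system memory, so the additional samples do not cost external bandwidth.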
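A quick back-of-the-envelope calculation shows why tile size and MSAA interact. The numbers below are assumptions for illustration only (a 32x32 tile, 4 bytes of color and 4 bytes of depth/stencil per sample, a 1920x1080 target); they are not taken from PowerVR documentation.

```python
def tile_bytes(tile_w, tile_h, samples=1, color_bytes=4, depth_stencil_bytes=4):
    """On-chip storage needed to hold one tile's color and depth/stencil."""
    return tile_w * tile_h * samples * (color_bytes + depth_stencil_bytes)


def framebuffer_write_bytes(width, height, color_bytes=4):
    """Bytes written to memory per frame when only resolved colors leave the chip."""
    return width * height * color_bytes


no_msaa = tile_bytes(32, 32)             # 8192 bytes
msaa_4x = tile_bytes(32, 32, samples=4)  # 32768 bytes, four times as much
written = framebuffer_write_bytes(1920, 1080)  # 8294400 bytes with or without MSAA
```

In this model the on-chip requirement grows four times with 4x MSAA, while the amount written to memory stays the same because the sample count never appears in `framebuffer_write_bytes`.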

 

How traditional mobile architectures reduce bandwidth

 

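Each tile is backed by an on-chip buffer sized to match it, which holds the color, depth, and stencil buffers and lets them be read and written efficiently. This avoids moving those buffers back and forth between the GPU and memory, which saves bandwidth.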

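Another benefit of tiled rendering is data sharing. Because the data for every tile is recorded, neighboring tiles can reuse each other's data: once it has been fetched into the cache, there is no need to wait for another external memory access, which saves bandwidth again.

To sum up, TBR uses tiling to save as much bandwidth as possible, but it does not reduce overdraw. (This is where I originally went wrong: the TBR pipeline above contains a visibility test, and note that there is no "early" in front of it, yet I first read it as early-Z.) When a tile is rendered, geometry is processed in submission order. Occluded objects are still processed and shaded unnecessarily, and the results are thrown away at the end, which produces serious overdraw.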

 

TBDR Architecture Analysis:

TBDR was introduced to solve the overdraw problem.

Tile Based Deferred Rendering (TBDR) pipeline. TBDR rendering splits the per-tile rendering process into two stages namely Hidden Surface Removal (HSR) and deferred pixel shading. When a scene composed of three-dimensional objects is created, some of the objects and surfaces may obscure all or parts of others. Hidden Surface Removal is the process by which the obscured sections of objects in a scene are removed from the render. Deferred rendering means that the architecture will defer all texturing and shading operations until all objects that could be deferred, primarily opaque geometry, have been tested for visibility. The efficiency of HSR is such that overdraw can be removed entirely for completely opaque renders. This significantly reduces system memory bandwidth requirements.
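The difference between shading in submission order and deferring shading until after HSR can be shown with a toy model of one tile. This sketch handles opaque geometry only and treats each primitive as a constant depth plus a set of covered pixels; both function names and the data layout are assumptions made for the example.

```python
def shade_in_submission_order(primitives):
    """TBR-style: depth test and shade primitive by primitive."""
    depth = {}
    invocations = 0
    for prim_depth, covered in primitives:
        for pixel in covered:
            if prim_depth < depth.get(pixel, float("inf")):
                depth[pixel] = prim_depth
                invocations += 1  # shaded now, may be overwritten later
    return invocations


def shade_after_hsr(primitives):
    """TBDR-style: resolve visibility for the whole tile, then shade once per pixel."""
    visible = {}  # pixel -> depth of the nearest opaque primitive
    for prim_depth, covered in primitives:
        for pixel in covered:
            if prim_depth < visible.get(pixel, float("inf")):
                visible[pixel] = prim_depth
    return len(visible)  # one shader invocation per visible pixel


tile_pixels = {(x, y) for x in range(4) for y in range(4)}  # a 4x4 tile
# Three full-tile opaque quads submitted back to front.
primitives = [(0.9, tile_pixels), (0.5, tile_pixels), (0.1, tile_pixels)]

tbr_invocations = shade_in_submission_order(primitives)  # 48
tbdr_invocations = shade_after_hsr(primitives)           # 16
```

In this worst-case ordering the submission-order path shades every layer, while the deferred path shades each of the 16 pixels exactly once, whatever the submission order.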

Why alpha test is more expensive than alpha blend

1 The reason alpha test/discard is more costly is that it goes through the HSR stage twice; once to perform depth/stencil tests to see if any of the fragments are obscured, and a second time to write depth/stencil values for any fragments that were not discarded when the fragment shader executed.

2 In the blended primitive case, depth tests and writes can be performed before the shader executes (one pass through HSR). For alpha tested primitives, depth testing can be performed on the initial pass, but a second pass through HSR is required after shader execution to update the on-chip depth buffer with data for the visible fragments.

3 As soon as the ISP has finished processing a primitive, it can begin processing another.

4 In the alpha blend pipeline, the next primitive can enter the ISP earlier than with alpha test. In the blend scenario, the ISP can continue to process new primitives while the USCs are calculating colours for primitives that have already propagated through the pipeline.

5 The ISP unit (where HSR is performed) has access to the position data of all primitives within the tile. With this information, it can perform depth and stencil reads/writes immediately. Blended primitives go through the exact same stages as opaque primitives. As all fragments of a blended object are considered to be visible, depth tests and writes can be performed up front. The only difference is that blended primitives must be sent down the pipeline one at a time to ensure they are processed in the submission order specified by the application.

Blended primitives will do the following:

  1. ISP HSR: Depth and stencil tests and writes
  2. Shading: Colours are calculated for fragments that pass the tests

For alpha testing, there isn’t any preprocessing either. The only difference between alpha test and the blended path is that depth and stencil writes must be deferred until the shader has executed.

Alpha tested primitives will do the following:

  1. ISP HSR: Depth and stencil tests (no writes)
  2. Shading: Colours are calculated for fragments that pass the tests
  3. Visibility feedback to ISP: After the shader has executed, the GPU knows which fragments were discarded and which were kept. Visibility information is fed back to the ISP so depth and stencil writes can be performed for the fragments that passed the alpha test
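The two step lists above can be turned into a small model that tracks which stages a single fragment visits and when the on-chip depth value is updated. This is a sketch of the described flow only: the stage labels, the `run_fragment` helper, and the alpha cutoff of 0.5 are assumptions, and stencil is left out.

```python
def run_fragment(kind, frag_depth, stored_depth, alpha=1.0, cutoff=0.5):
    """Return (stages visited, depth value stored afterwards) for one fragment.

    kind: "blend" or "alpha_test". Smaller depth means closer.
    """
    if kind not in ("blend", "alpha_test"):
        raise ValueError("kind must be 'blend' or 'alpha_test'")
    passed = frag_depth < stored_depth
    if kind == "blend":
        # Depth is tested and written up front, in a single ISP pass.
        stages = ["isp_test_write"]
        if not passed:
            return stages, stored_depth
        stages.append("shade")
        return stages, frag_depth
    # Alpha test: the depth write has to wait for the shader result.
    stages = ["isp_test"]
    if not passed:
        return stages, stored_depth
    stages.append("shade")
    stages.append("isp_feedback")  # second trip through the ISP
    if alpha >= cutoff:
        return stages, frag_depth  # fragment kept, depth written now
    return stages, stored_depth    # fragment discarded, depth unchanged


blend = run_fragment("blend", 0.3, 1.0)
kept = run_fragment("alpha_test", 0.3, 1.0, alpha=0.8)
discarded = run_fragment("alpha_test", 0.3, 1.0, alpha=0.2)
```

The blended fragment visits two stages, while both alpha-tested fragments visit three, and only the kept one ends up updating the stored depth.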

 

Reference:

https://www.imaginationtech.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/

http://cdn.imgtec.com/sdk-documentation/PowerVR+Hardware.Architecture+Overview+for+Developers.pdf

https://forums.imgtec.com/t/alpha-test-vs-alpha-blend/2291/3

 
