1. Target Configuration & Target Quality
Hardware limits the maximum achievable quality; low-end hardware cannot afford high quality. To balance quality and performance, every game should define a recommended configuration and a minimum configuration. Normally, we estimate the final publish date and take the mainstream configuration at that time as the recommended configuration. The minimum configuration is determined by performance and API restrictions. A target configuration should include CPU, GPU, memory size, hard disk size, etc. For a VR game, the target hardware platform also needs to be considered.
Target quality also needs to be determined at the start of the project, such as final graphic resolution, run-time FPS (minimum and average), and image quality (old gen/next gen). Normally, FPS is determined by game type: slow-paced games such as RPGs can target 30 fps, action games should target 60 fps or suffer from obvious latency, and VR games need an even higher frame rate to avoid VR-specific discomfort. The final resolution can then vary according to hardware performance and target fps. Normally, 1080p, 900p and 720p are good options. Currently, most targets should be able to afford next-gen graphic quality.
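As a rough illustration of how the target FPS and resolution interact, the sketch below computes the per-frame time budget and the relative pixel load of the resolutions mentioned above. The function names are made up for this example, and the standard 16:9 pixel dimensions for 1080p, 900p and 720p are assumed.

```python
# Assumed 16:9 pixel dimensions for the resolutions named in the text.
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "900p": (1600, 900),
    "720p": (1280, 720),
}


def frame_budget_ms(target_fps):
    """Time available to produce one frame, in milliseconds."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1000.0 / target_fps


def relative_pixel_load(name, reference="1080p"):
    """Pixel count of a resolution relative to the reference resolution."""
    w, h = RESOLUTIONS[name]
    ref_w, ref_h = RESOLUTIONS[reference]
    return (w * h) / (ref_w * ref_h)


for fps in (30, 60):
    print(f"{fps} fps -> {frame_budget_ms(fps):.2f} ms per frame")
for name in RESOLUTIONS:
    print(f"{name}: {relative_pixel_load(name):.2f}x the pixels of 1080p")
```

Dropping from 1080p to 720p cuts the number of shaded pixels by more than half, which is why resolution is usually the first knob to turn when the frame budget is missed.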
2. Technique Standard
1) Graphic API
Most hardware offers more than one graphics API, such as OpenGL/OpenGL ES, Vulkan, D3D11, D3D12 and Metal. Basically, newer APIs have better CPU and GPU performance. For PC, prefer D3D12 or D3D11; for mobile, use Vulkan when possible, which reduces CPU time by about 30% compared with OpenGL/OpenGL ES. For VR, some GPUs have hardware features to support it, but they need a new API or an API extension. Try to use them when possible.
2) Global Graphic Statistics
For OpenGL/OpenGL ES/D3D11, every API draw call costs a lot of CPU time to validate state and fill the GPU command buffer. For Vulkan and D3D12, this overhead is much lower, but batching draw calls is still a good way to optimize performance. The cost of each draw call varies a lot. Normally, keep the draw call count below 100 (VR: 50 per eye) for mobile and below 1000 for desktop or high-end console.
In theory, every GPU has maximum vertex and primitive processing rates, such as triangles/s, pixels/s and texels/s. These determine the global art budget. For example, the Adreno 420 is rated at 337.5 MTriangles/s and 4.8 GPixels/s at 600 MHz, which is about 0.56 triangles/clock and about 8 pixels/clock. The NVIDIA GeForce GTX 970 fill rate is 63.1 GPixels/s and 117.2 GTexels/s. That means a desktop GPU is about 10x more powerful than a mobile GPU. Normally, keep the global triangle count below 100K for mobile (50K~60K) and 500K for desktop/console.
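The per-clock figures can be re-derived from the quoted rates with simple division. The sketch below uses only the numbers given in the text and decimal prefixes (1 G = 10^9); the helper name `per_clock` is invented for the example. With decimal prefixes the pixel rate comes out at 8 pixels/clock; a figure of 8.19 appears if the giga prefix is read as 1024 mega.

```python
def per_clock(rate_per_second, clock_hz):
    """Average amount of work completed per GPU clock cycle."""
    return rate_per_second / clock_hz


ADRENO_420_CLOCK_HZ = 600e6

adreno_triangles_per_clock = per_clock(337.5e6, ADRENO_420_CLOCK_HZ)
adreno_pixels_per_clock = per_clock(4.8e9, ADRENO_420_CLOCK_HZ)

# Ratio of the quoted pixel fill rates: GTX 970 vs Adreno 420.
desktop_vs_mobile_fill = 63.1e9 / 4.8e9

print(f"Adreno 420: {adreno_triangles_per_clock:.4f} triangles/clock")
print(f"Adreno 420: {adreno_pixels_per_clock:.2f} pixels/clock")
print(f"GTX 970 / Adreno 420 pixel fill rate: {desktop_vs_mobile_fill:.1f}x")
```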
3) Texture
After a texture is created in DCC tools (PS), the texture file should be imported into UE4 in a raw data format to keep image quality. For the runtime game (packaged game), textures should be compressed with a hardware-supported compression method such as ASTC.
Compared with the DXTn and ETCn texture formats, ASTC and PVRTC2 have better quality. NVIDIA has a very detailed document about it (ASTC Texture Compression For Game Assets). The good news is that, as of 2017, all mobile GPUs support ASTC. More details are on the ARM website (https://developer.arm.com/products/software-development-tools/graphics-development-tools/astc-evaluation-codec).
Do generate mipmaps for textures; this gives better performance and quality. For UI or HUD textures, however, the artist needs to determine the texture size and should not generate mipmaps (they are never used). Smaller textures have better run-time performance, so do restrict the mipmap 0 texture size as long as the graphic quality is fine. Be careful about texture size: on some platforms, power-of-two texture sizes have better performance.
In VR mode, normal mapping does not look as good as in non-VR mode. Artists must check the final quality in VR mode, or use parallax mapping, tessellation or a more detailed model instead.
In the UE4 Editor, go to 'Project Settings' → 'Cooker' → 'Textures' to configure the PVRTC and ASTC compression quality for the final project.
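To make the cost of mipmaps concrete, here is a minimal sketch that lists the levels of a full mip chain and compares its texel count with the base level alone. It assumes the usual "halve each dimension, rounding down, until 1x1" rule; the function names are illustrative only.

```python
def mip_chain(width, height):
    """Return the (width, height) of every mip level, starting at level 0."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


def texel_count(levels):
    """Total number of texels stored across the given mip levels."""
    return sum(w * h for w, h in levels)


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... which some platforms handle more efficiently."""
    return n > 0 and (n & (n - 1)) == 0


chain = mip_chain(1024, 1024)
overhead = texel_count(chain) / (1024 * 1024)
print(f"{len(chain)} levels, {overhead:.3f}x the memory of level 0 alone")
```

A full chain costs about one third more memory than the base level, which is wasted on UI and HUD textures that are always sampled at level 0.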
4) Static Mesh
After a mesh is created in DCC tools (3ds Max, Maya), the mesh topology may not be the best for a modern GPU (geometry cache). The final 3D mesh must be optimized with a tool such as Polygon Cruncher. This document has some discussion about mesh optimization. At the same time, generate LODs according to graphic quality. For all view-locked elements (UI, HUD, eye-attached meshes), don't use any LOD (similar to no mipmaps for textures); the artist needs to estimate the detail according to the final render resolution. UE4 supports a Hierarchical LOD (HLOD) system.
Static meshes should restrict vertex and primitive counts as well. Remove unused vertex properties such as a second UV set or vertex color. Keep in mind that the texture count affects runtime performance a lot; keep it below 3 textures per mesh for most meshes.
The number of polygons you should use depends on the quality you require and the platform you are targeting. For mobile devices, somewhere between 300 and 1500 polygons per mesh will give good results, whereas for desktop platforms the ideal range is about 1500 to 4000. You may need to reduce the polygon count per mesh if the game has lots of characters on screen at any given time.
5) Skeletal Mesh
Basically, a skeletal mesh has similar restrictions to a static mesh, except for the skeleton bones: they take a lot of GPU constant buffer space when using GPU skinning, or a lot of SIMD instructions when using CPU skinning. Keep the number of bone weights per vertex to at most 4, or performance will drop.
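One common way to enforce a per-vertex influence limit is to keep the largest weights and renormalize them so they still sum to 1. The sketch below shows that idea only; it is not taken from UE4 or any DCC tool, and the function name and the dictionary representation of a vertex's weights are assumptions made for the example.

```python
def limit_influences(weights, max_influences=4):
    """Keep the strongest bone weights of one vertex and renormalize them.

    weights: dict mapping bone name -> weight for a single vertex.
    """
    strongest = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    strongest = strongest[:max_influences]
    total = sum(weight for _, weight in strongest)
    if total == 0:
        return {}
    return {bone: weight / total for bone, weight in strongest}


vertex_weights = {"spine": 0.4, "chest": 0.3, "neck": 0.15, "clavicle": 0.1, "head": 0.05}
limited = limit_influences(vertex_weights)
print(limited)
```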
Use as Few Bones as Possible
A bone hierarchy in a typical desktop game uses somewhere between fifteen and sixty bones. The fewer bones you use, the better the performance will be. You can achieve very good quality on desktop platforms and fairly good quality on mobile platforms with about thirty bones. Ideally, keep the number below thirty for mobile devices and don't go too far above thirty for desktop games.
We also need to verify whether a GPU cache-friendly vertex layout actually helps performance on the same mesh.
3. Art Optimization(For Content Artist/Level Artist)
Every engine has some general tips for different platforms, such as the UE4 tips for mobile platforms.
1) Do Obey Technique Standard
The technique standards above are the baseline when creating any resource.
2) Balance Mesh merging with Mesh Culling
Merge meshes (static mesh/skeletal mesh) as much as possible, except when keeping meshes separate lets them be culled before drawing; a culled mesh is not drawn at all, which saves a lot of time.
3) Batch mesh rendering
Good batching removes a lot of draw overhead. Identical meshes with the same material should be rendered as a group with instanced drawing; for example, UE4 lets you add instances to an InstancedStaticMeshComponent.
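The grouping step behind instanced drawing can be sketched as follows: objects that share the same mesh and material are collected into one batch, and each batch would then be submitted as a single instanced draw. This is engine-agnostic illustration code, not UE4 API; the tuple layout and names are assumptions.

```python
from collections import defaultdict


def build_instance_batches(objects):
    """Group (mesh, material, transform) tuples by their (mesh, material) pair.

    Each resulting batch holds the per-instance transforms for one draw call.
    """
    batches = defaultdict(list)
    for mesh, material, transform in objects:
        batches[(mesh, material)].append(transform)
    return dict(batches)


scene = [
    ("tree", "bark", (0, 0, 0)),
    ("rock", "stone", (5, 0, 1)),
    ("tree", "bark", (9, 0, 3)),
    ("tree", "bark", (2, 0, 7)),
    ("tree", "snow_bark", (4, 0, 4)),
]
batches = build_instance_batches(scene)
print(f"{len(scene)} objects -> {len(batches)} instanced draw calls")
```

Note that the tree with a different material ends up in its own batch, which is one reason reusing materials across meshes matters.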
4) Merge Texture
Keep the individual texture sizes, then try to merge many small textures into one large texture; for example, many UI icons can be merged together. All texture compression methods have some kind of locality, which is good for the GPU texture cache. In UE4, use texture LOD groups to control the maximum mipmap count for different texture groups.
Another way to merge textures is to use a texture array, which avoids texture switches.
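When small textures are merged into one large atlas, each mesh or UI element has to remap its original 0..1 UVs into the sub-rectangle it occupies. A minimal sketch of that remapping, with a hypothetical rectangle layout of (x, y, width, height) in pixels:

```python
def atlas_uv(u, v, rect, atlas_size):
    """Map a UV in the original texture to the corresponding UV in the atlas.

    rect: (x, y, width, height) of the sub-image inside the atlas, in pixels.
    atlas_size: (width, height) of the whole atlas, in pixels.
    """
    x, y, w, h = rect
    atlas_w, atlas_h = atlas_size
    return ((x + u * w) / atlas_w, (y + v * h) / atlas_h)


# A 256x256 icon placed at pixel (256, 0) inside a 1024x1024 atlas.
icon_rect = (256, 0, 256, 256)
print(atlas_uv(0.0, 0.0, icon_rect, (1024, 1024)))
print(atlas_uv(1.0, 1.0, icon_rect, (1024, 1024)))
```

In practice a small padding border around each sub-image is often added to avoid bleeding between neighbours when filtering or mipmapping; that detail is left out here.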
5) Merge material(shader)
Different materials mean the GPU needs to switch states before drawing primitives, whereas using the same material saves a lot of GPU time. Normally, meshes should use as few materials as possible; try to reuse the same material for different meshes. An uber shader is also worth trying, and UE4 supports material layering. On mobile platforms, however, an uber shader occupies too many GPRs, so fewer GPU threads can run in parallel; DO NOT use it there.
6) Reduce GPU calculation
Constant buffer values are calculated on the CPU and can be used directly by all GPU shaders. The vertex shader passes some registers to the pixel shader as input. Because the vertex count is significantly lower than the pixel count, a calculation can be done once in the vertex shader and the result passed to the pixel shader. In a word, try to do calculations on the CPU or in the vertex shader. In UE4, pass values from the vertex shader to the pixel shader through Customized UVs; they can carry any kind of calculation, not only UVs. UE4 supports parameters for materials and material instances, and also supports material parameter collections. A material parameter collection has better performance because it is copied to the GPU only once per frame, whereas material parameters are copied from CPU to GPU on every draw call.
7) Use Multi-quality Material
Different hardware has completely different performance. A general solution is to use different quality levels. UE4 supports three graphic levels: Low, Medium and High. In the material, the TA can add a 'Quality Switch' node to define different shaders for different graphic qualities. This is also the key to deploying different quality on different platforms.
A shader LOD system also works well: use a simplified shader for distant meshes and a high-quality shader for nearby meshes.
8) Use Billboard & Skybox/Skydome
They simplify the rendered meshes a lot and make image quality easy to control. Similarly, using a very simplified mesh is also good.
9) Create Model to be reused by different case
Create models that can serve as several different props with just a simple rotation. This lets many meshes be rendered with instancing. A good choice is a module library: complex meshes are assembled from different modules.
10) Vertex Format & Layout
When preparing vertex data sets, for optimal performance always use the vertex format that provides a satisfactory level of precision and also takes the least amount of space.
For vertex fetch performance, interleaved vertex attributes ("xyz uv|xyz uv|...") may work more efficiently than separate arrays ("xyz|xyz|...|uv|uv|..."). For binning pass optimization (mobile tile-based rendering), consider one array with the position and any other attributes needed to compute position, and another interleaved array with the remaining attributes.
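The two layouts can be illustrated by packing the same two vertices both ways with Python's `struct` module. This is only a byte-layout sketch with assumed 32-bit float positions and UVs; real engines build these buffers natively. The last line shows how storing UVs as 16-bit half floats (`e` format) shrinks the per-vertex stride.

```python
import struct

positions = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
uvs = [(0.0, 0.5), (1.0, 0.25)]


def pack_interleaved(positions, uvs):
    """Layout: xyz uv | xyz uv | ... (one 20-byte record per vertex)."""
    out = bytearray()
    for (x, y, z), (u, v) in zip(positions, uvs):
        out += struct.pack("<3f2f", x, y, z, u, v)
    return bytes(out)


def pack_planar(positions, uvs):
    """Layout: xyz | xyz | ... | uv | uv | ... (one array per attribute)."""
    out = bytearray()
    for position in positions:
        out += struct.pack("<3f", *position)
    for uv in uvs:
        out += struct.pack("<2f", *uv)
    return bytes(out)


interleaved = pack_interleaved(positions, uvs)
planar = pack_planar(positions, uvs)

full_stride = struct.calcsize("<3f2f")     # float positions, float UVs
compact_stride = struct.calcsize("<3f2e")  # float positions, half-float UVs
print(full_stride, compact_stride)
```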
11) Special Hardware Tips
Almost all mobile GPUs use a tile-based rendering architecture. That means the position-calculation part of the vertex shader runs first to determine which tile each primitive belongs to (position only), and then the whole vertex shader is executed for the normal VS/PS pipeline. This adds extra GPU overhead.
Different mobile GPUs have their own methods, similar to early-z, to reduce vertex or pixel shader invocations. Alpha masking or the 'discard' instruction disables these features, so the pixel shader keeps being invoked. In contrast, alpha blending has no bandwidth cost and depth/stencil testing is cheap, so sometimes alpha blending performs better. If you are not sure, profile and compare the results. Most of the time, MSAA and alpha blending are almost free, and depth/stencil testing is cheap. Some mobile GPUs provide free hidden surface removal for opaque geometry, so the related CPU overhead can be removed.
Mobile GPUs have bad performance on render buffers. Don't change render targets dynamically (discarding and creating new ones). Sometimes clearing the render buffer improves performance, because the GPU only quickly flags the buffer and can manage it better.
4. CPU Optimization(For Engine Programmer)
1) Maximum Culling
Rendering nothing is the most efficient rendering. Before sending draw calls to the GPU, the game should remove as many mesh draw calls as possible. Check the following culling systems.
- View frustum Culling
The viewing frustum is a geometric representation of the volume visible to the virtual camera. Naturally, objects outside this volume will not be visible in the final image, so they are discarded. Often, objects lie on the boundary of the viewing frustum. These objects are cut into pieces along this boundary in a process called clipping, and the pieces that lie outside the frustum are discarded as there is no place to draw them. The CPU culls meshes according to the view frustum and the mesh bounding box, and the GPU does very detailed culling (clipping) on every primitive (triangle). In UE4, the view frustum is based on the camera properties; once the camera attributes are set correctly, this is done automatically. For each mesh, UE4 generates a bounding box and bounding sphere when the mesh is imported into the UE4 editor. In some cases, the mesh bounding box (bounding sphere) should be adjusted manually, for example to change the mesh rendering order.
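A CPU-side frustum test against a bounding sphere can be sketched as below. It assumes the frustum is given as planes (nx, ny, nz, d) with unit normals pointing into the visible volume; this is a generic formulation, not UE4's implementation, and the box-shaped "frustum" in the example is only a stand-in for real camera planes.

```python
def sphere_in_frustum(center, radius, planes):
    """Return False only if the sphere lies completely outside some plane.

    planes: iterable of (nx, ny, nz, d) with unit normals pointing inward,
    so a point p is on the visible side when nx*px + ny*py + nz*pz + d >= 0.
    """
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        signed_distance = nx * cx + ny * cy + nz * cz + d
        if signed_distance < -radius:
            return False  # entirely behind this plane: cull
    return True  # inside or intersecting: keep (conservative)


# Stand-in frustum: the axis-aligned box -1 <= x, y, z <= 1.
box_planes = [
    (1, 0, 0, 1), (-1, 0, 0, 1),
    (0, 1, 0, 1), (0, -1, 0, 1),
    (0, 0, 1, 1), (0, 0, -1, 1),
]

print(sphere_in_frustum((0, 0, 0), 0.5, box_planes))    # fully inside
print(sphere_in_frustum((1.2, 0, 0), 0.5, box_planes))  # straddles a plane
print(sphere_in_frustum((3, 0, 0), 0.5, box_planes))    # fully outside
```

The test is conservative: a sphere near a frustum corner may be kept even though it is invisible, which costs a wasted draw but never removes something visible.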
- Backface Culling
With 3D objects, only half of the surfaces are facing the camera, and the rest are facing away from the camera, i.e. are on the back side of the object, hindered by the front side. If the object is completely opaque, those surfaces never need to be drawn. They are determined by the vertex winding order: if the triangle drawn has its vertices in clockwise order on the projection plane when facing the camera, they switch into counter-clockwise order when the surface turns away from the camera.
Incidentally, this also makes the objects completely transparent when the viewpoint camera is located inside them, because then all the surfaces of the object are facing away from the camera and are culled by the renderer. To prevent this the object must be set as double-sided (i.e. no backface culling is done) or have separate inside surfaces.
For locked-view or limited-view cases, we can directly remove all back faces when modeling the mesh. Diablo III optimized a lot of scene meshes this way. The GPU removes back faces automatically, but that still costs time.
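The winding-order test described above comes down to the sign of the triangle's area after projection. The sketch below assumes 2D screen coordinates with the y axis pointing up, where a positive signed area means counter-clockwise order; which winding counts as front-facing is a convention, so it is left as a parameter.

```python
def signed_area(a, b, c):
    """Signed area of a 2D triangle; positive when a, b, c are counter-clockwise
    (with the y axis pointing up)."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def is_back_facing(a, b, c, front_is_clockwise=True):
    """Decide whether a projected triangle should be culled as a back face."""
    area = signed_area(a, b, c)
    if area == 0:
        return True  # degenerate triangle covers no pixels
    clockwise = area < 0
    return clockwise != front_is_clockwise


p0, p1, p2 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)  # counter-clockwise order
print(is_back_facing(p0, p1, p2))  # culled when clockwise means front
print(is_back_facing(p0, p2, p1))  # same triangle, reversed winding
```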
- Contribution Culling
Often, objects are so far away that they do not contribute significantly to the final image. These objects are thrown away if their screen projection is too small. On the CPU side, skip drawing all meshes that are too far from the camera. On the GPU side, the clip plane can be configured to remove all primitives beyond it.
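A simple way to estimate the screen projection is to project the bounding sphere radius with the perspective formula and compare it with a pixel threshold. The sketch assumes a symmetric perspective camera; the function names and the 2-pixel threshold are arbitrary choices for illustration.

```python
import math


def projected_radius_px(radius, distance, fov_y_deg, screen_height_px):
    """Approximate on-screen radius, in pixels, of a bounding sphere."""
    if distance <= 0:
        return float("inf")  # camera is at or inside the object: never cull
    half_fov_tan = math.tan(math.radians(fov_y_deg) / 2.0)
    return radius / (distance * half_fov_tan) * (screen_height_px / 2.0)


def should_cull(radius, distance, fov_y_deg, screen_height_px, min_radius_px=2.0):
    """Cull objects whose projection is smaller than the pixel threshold."""
    return projected_radius_px(radius, distance, fov_y_deg, screen_height_px) < min_radius_px


near = projected_radius_px(1.0, 100.0, 90.0, 1080)
far = projected_radius_px(1.0, 1000.0, 90.0, 1080)
print(f"at 100 units: {near:.2f} px, at 1000 units: {far:.2f} px")
```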
- Occlusion Culling
Objects that are entirely behind other opaque objects may be culled. This is a very popular mechanism to speed up the rendering of large scenes that have a moderate to high depth complexity. There are several types of occlusion culling approaches:
- Potentially visible set or PVS rendering, divides a scene into regions and pre-computes visibility for them. These visibility sets are then indexed at run-time to obtain high quality visibility sets (accounting for complex occluder interactions) quickly.
- Portal rendering divides a scene into cells/sectors (rooms) and portals (doors), and computes which sectors are visible by clipping them against portals.
Hansong Zhang's dissertation "Effective Occlusion Culling for the Interactive Display of Arbitrary Models" describes an occlusion culling approach.
Bounding volume hierarchies (BVHs) are often used to subdivide the scene's space (examples are the BSP tree, the octree and the kd-tree). This allows visibility determination to be performed hierarchically: effectively, if a node in the tree is considered to be invisible then all of its child nodes are also invisible, and no further processing is necessary (they can all be rejected by the renderer). If a node is considered visible, then each of its children needs to be evaluated. This traversal is effectively a tree walk where invisibility/occlusion or reaching a leaf node determines whether to stop or whether to recurse respectively.
For a large scene (with a lot of meshes), culling is a very heavy operation that needs to be profiled and optimized carefully. One tip: all fully transparent objects should be hidden; there is no need to render them on the GPU.
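The hierarchical rejection can be sketched with a tiny tree walk. To keep the example short, each node's bounding volume is a 1D interval and the "view" is another interval; a real implementation would test 3D bounds against the frustum or an occlusion query. All names here are invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    bounds: tuple  # (min_x, max_x): 1D stand-in for a bounding volume
    children: list = field(default_factory=list)


def overlaps(bounds, view):
    """True when two closed 1D intervals intersect."""
    return bounds[0] <= view[1] and view[0] <= bounds[1]


def visible_leaves(node, view, visited=None):
    """Collect visible leaf names, skipping whole subtrees that fail the test."""
    if visited is not None:
        visited.append(node.name)
    if not overlaps(node.bounds, view):
        return []  # node invisible: all children are rejected without testing
    if not node.children:
        return [node.name]
    result = []
    for child in node.children:
        result.extend(visible_leaves(child, view, visited))
    return result


root = Node("root", (0, 100), [
    Node("left", (0, 50), [Node("a", (0, 20)), Node("b", (30, 50))]),
    Node("right", (60, 100), [Node("c", (60, 70)), Node("d", (90, 100))]),
])

tested = []
print(visible_leaves(root, (25, 40), tested))
print(tested)
```

Only five of the seven nodes are tested here: once "right" is rejected, its leaves "c" and "d" are never visited.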
2) Batch Drawcall
A draw call itself is expensive, and GPU state changes are expensive too. Batching draw calls reduces this overhead as much as possible. Current systems support the following methods:
- Instanced Draw
This video has a detailed description of 'DrawInstance' in UE4. It removes a huge amount of draw call CPU driver overhead and draws tons of identical meshes in only one draw call. If many identical meshes with the same material are rendered, it must be used.
- Single Pass Multiview
NVIDIA Pascal-based GPUs support a new feature, Single Pass Stereo. It draws the same mesh for both eyes in a single pass, which can effectively double the geometric complexity that VR applications can afford. If the hardware supports it, enable it.
- Static Batch
If several meshes use the same material, they can be combined into one mesh, reducing several separate draw calls to one combined draw call. It is similar to an artist combining meshes. Some extra memory is needed to store the combined mesh.
- Dynamic Batch
Similar to static batching, it also combines meshes with the same material, but the combining happens at runtime, so it costs extra CPU time to combine the mesh data and extra memory to store the combined mesh. It affects performance a lot; profile and compare the results.
Unlike Unity (Unity Draw Call Batching), UE4 doesn't support dynamic batching. With Vulkan, draw call overhead is much lower; profile to compare performance.
3) Sort Drawcall
Once all meshes to draw have been determined, the draw call order affects performance a lot. There are basically two ways to sort draw calls:
- Sort by Distance
All meshes can be divided into two types: opaque and translucent. Translucent objects must be rendered from back to front, or the result looks odd. Opaque objects can be rendered in any order, because the depth test ensures a correct final result. To avoid overdraw (wasted pixel operations), however, they should be rendered from front to back, so that front objects block the objects behind them from being shaded.
- Sort by Render State
Avoid render state changes as much as possible; every state change costs a lot of CPU and GPU time. Sorting draw calls with similar render state together removes a lot of render state changes.
Most GPUs need a combination of the two kinds of sorting. Different platforms give different performance results, so try them and compare profiling results. On some platforms (PowerVR GPUs), the GPU already ensures that all hidden surfaces are removed, so you only need to sort by render state. A data-oriented system and clustered rendering can also be used.
- 1. Group draws by material (shader) to reduce state changes
Then, for all platforms except ImgTec:
- 2. Skybox last: 5 ms/frame savings (vs drawing skybox first)
- 3. Sort groups nearest first: extra 3 ms/frame savings
- 4. Sort inside groups nearest first: extra 7 ms/frame savings
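The ordering rules listed above can be combined into one sort, sketched here: opaque draws are grouped by material, groups and the draws inside them go nearest first, the skybox follows, and translucent draws come last from back to front. The dictionary-based draw records and the helper `make_draw` are hypothetical; an engine would normally encode the same idea in a packed integer sort key.

```python
def make_draw(name, material, distance, translucent=False, skybox=False):
    """Build a hypothetical draw record for the example."""
    return {"name": name, "material": material, "distance": distance,
            "translucent": translucent, "skybox": skybox}


def order_draws(draws):
    """Return draws in submission order: opaque, skybox, then translucent."""
    opaque = [d for d in draws if not d["translucent"] and not d["skybox"]]
    skybox = [d for d in draws if d["skybox"]]
    translucent = [d for d in draws if d["translucent"]]

    # Nearest distance of each material group, used to order the groups.
    nearest = {}
    for d in opaque:
        current = nearest.get(d["material"], d["distance"])
        nearest[d["material"]] = min(current, d["distance"])

    # Group by material, nearest group first, nearest draw first inside a group.
    opaque.sort(key=lambda d: (nearest[d["material"]], d["material"], d["distance"]))
    # Translucent objects must be blended from back to front.
    translucent.sort(key=lambda d: d["distance"], reverse=True)
    return opaque + skybox + translucent


draws = [
    make_draw("A", "rock", 10),
    make_draw("T1", "glass", 3, translucent=True),
    make_draw("sky", "sky", 1000, skybox=True),
    make_draw("B", "wood", 5),
    make_draw("C", "rock", 2),
    make_draw("T2", "glass", 8, translucent=True),
]
ordered_names = [d["name"] for d in order_draws(draws)]
print(ordered_names)
```

On GPUs with built-in hidden surface removal, the distance component for opaque draws could be dropped and only the material grouping kept.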
4) Mobile Technique Features
- Sustained Performance Mode
Android N supports a sustained performance mode, enabling OEMs to provide hints about device performance capabilities for long-running apps.
- Frontbuffer Strip Rendering
Screen tearing is a bad graphic issue when rendering directly to the front buffer. Previously, games used double or triple buffering and swapped buffers to avoid it, but that causes a lot of latency. New hardware supports front buffer strip rendering: there is no back buffer anymore, and as long as the game renders faster than 60 Hz, it renders directly into the front buffer. This document has more detail. Enable it when possible.
- Render Pipeline
A render frame can be designed as forward rendering, deferred rendering, forward+ rendering or clustered rendering. For PC and console games, deferred rendering is a good choice; for mobile games, normally choose forward or forward+ rendering. UE4 supports them now. Oculus has also done some optimization on the UE4 renderer for VR.
- Post Process
As mentioned before, mobile GPUs have very bad frame buffer performance, so it's better to avoid all kinds of post processing.
- Light & Shadow
PBR can be calculated on mobile platforms without affecting performance much. Shadow maps, however, are bad for mobile games. Try to use lightmaps, or be prepared to optimize shadow map performance.
HDR involves heavy post processing. LDR is the choice for current mobile games.
Almost all mobile GPUs offer free 2x or 4x MSAA, so just use it. TAA and FXAA currently cost too much time.
Unreal Engine has a very detailed document about how to configure graphic options for mobile.
5) VR Technique Features
- Asynchronous Time Warp/Asynchronous Space Warp
ATW, also known as reprojection, can reduce latency and increase or maintain frame rate by warping the rendered image before sending it to the display. Most of the time, a game with high graphic quality will have a low frame rate; enable ATW in UE4 to avoid VR discomfort. Details are on the Oculus website. The XinReality VR & AR wiki compares ATW implementations on different platforms.
- Instanced Stereo
On some platforms, UE4 can use instanced drawing to reduce the CPU and GPU overhead of rendering the same mesh for both eyes. Do enable it when possible.
Some other new render optimization features can be tried:
- Monoscopic Rendering
- Multi-Res Shading
- Lens Matched Shading
- HMD Direct Render Mode
Refer to Oculus website and NVIDIA VRWorks.
5. GPU Optimization(For Graphic Programmer)
1) Compute Shader
A compute shader has less overhead than a graphics shader (VS/PS). It's very useful for general computation.
2) Async Compute
A compute shader can execute at the same time as a graphics shader when the latter is blocked by fixed-function pipeline stages. That means the GPU can be kept fully busy all the time.
3) GPU Instructions Standard
Keep shaders under 125 instructions for mobile.
4) Alpha Mask
Discard/clip has bad performance on mobile platforms. Try using alpha blending instead.
5) Minimum GPRs
Threads = (Total GPRs) / (GPRs per Thread). The more GPRs each shader thread uses, the fewer threads can run concurrently.
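A quick calculation shows how register pressure limits concurrency. The register file size of 1024 used below is a made-up number for illustration, not a figure for any real GPU, and the function name is invented.

```python
def concurrent_threads(total_gprs, gprs_per_thread):
    """Threads that fit in the register file: total GPRs / GPRs per thread."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return total_gprs // gprs_per_thread


HYPOTHETICAL_TOTAL_GPRS = 1024  # assumed register file size, illustration only

for gprs in (8, 16, 32):
    threads = concurrent_threads(HYPOTHETICAL_TOTAL_GPRS, gprs)
    print(f"{gprs} GPRs per thread -> {threads} concurrent threads")
```

Quadrupling the registers a shader needs cuts the number of threads available to hide memory latency to a quarter.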
6) Vectorization and Explicit MAD
Some older GPUs use SIMD and process a 4-component vector per cycle. Most recent GPUs, however, are scalar: every vector operation is turned into several scalar operations, so vectorization and explicit MAD optimizations are useless there. Instead, try to use scalar operations.
7) Share Preceding Computation
Move computation out of loops as much as possible so the result can be shared.
8) Rearrange Scalar/Vector Operation
For example, M' = M * a * b should be changed to M' = M * (a*b). If M is a 4x4 matrix, a and b are scalars, and every vector operation is converted into several scalar operations, the original version costs 32 multiplications, almost twice the 17 of the modified version.
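The 32 versus 17 figures can be checked by counting scalar multiplications, assuming M is a 4x4 matrix and a and b are scalars. The counter class below exists only for this demonstration.

```python
class MulCounter:
    """Performs multiplications while counting how many were issued."""

    def __init__(self):
        self.count = 0

    def mul(self, x, y):
        self.count += 1
        return x * y


def scale_matrix(matrix, scalar, ops):
    """Multiply every element of the matrix by a scalar."""
    return [[ops.mul(value, scalar) for value in row] for row in matrix]


M = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
a, b = 2, 3

naive_ops = MulCounter()
naive = scale_matrix(scale_matrix(M, a, naive_ops), b, naive_ops)  # (M * a) * b

regrouped_ops = MulCounter()
regrouped = scale_matrix(M, regrouped_ops.mul(a, b), regrouped_ops)  # M * (a * b)

print(naive_ops.count, regrouped_ops.count)
```

Both versions produce the same matrix; only the grouping of the scalar work differs.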
9) Use Modifiers as Input
For example, V' += -normalize(V)*a should be written as V' += normalize(-V)*a.
10) Balance ALU or Texture Lookup
Some functions can either be calculated at runtime by the GPU with ALU instructions, or be precomputed into an extra texture and fetched with a texture lookup. The two methods perform differently on different platforms, so choose the faster one for the specific platform. For example, for Qualcomm Adreno, a 16:1 ratio is a good rule of thumb: as long as the texture fetch replaces 16 or more arithmetic instructions, it can be worthwhile.
Current-generation mobile hardware has improved a lot in performance but still has many restrictions, such as battery life, tile-based rendering and bad framebuffer performance. Different platforms have their own hardware-specific optimizations, so hardware details are very important when optimizing rendering performance. In the end, every optimization must be decided according to profiling results.
1. High Quality Mobile VR with Unreal Engine and Oculus (Presented by ARM)[Daniele Di Donato, Remi Palandri, Ryan Vance] http://gdcvault.com/play/1024393/
2. Next-gen Mobile Rendering[Niklas Smedberg, Timothy Lottes] https://de45xmedrsdbp.cloudfront.net/Resources/files/GDC2014_Next_Generation_Mobile_Rendering-2033767592.pdf
3. UE4 VR Best Practices [Luis Cataldi] https://www.slideshare.net/luiscataldi/luis-cataldiue4vrbestpractices2-58934932?qid=9f3add3f-fe0e-4e8b-8d9a-16854120fd4f&v=&b=&from_search=44
4. The Mali GPU: An Abstract Machine[Peter Harris] https://community.arm.com/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-1---frame-pipelining
5. A Trip through The Graphics Pipeline 2011[Fabian "ryg" Giesen] https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/
6. GPU Architecture Introduction[Liou Jhe-Yu] http://caslab.ee.ncku.edu.tw/dokuwiki/_media/member:elvis:gpu_architecture_introduction.pptx
7. UE4 - Overview of Static Mesh Optimization Options[Bob Cober] http://www.casualdistractiongames.com/single-post/2017/01/07/UE4---Overview-of-Static-Mesh-Optimization-Options
8. UE4 Performance Guidelines for Mobile Devices: https://docs.unrealengine.com/latest/INT/Platforms/Mobile/Performance/index.html [Good Mobile Graphic Quality Configuration Guide]
9. Bring AAA graphics to mobile platforms[Niklas Smedberg] https://cdn2.unrealengine.com/Resources/files/Smedberg_Niklas_Bringing_AAA_Graphics-v2-1449610220.pdf
10. Unreal Engine 4 Optimization Tutorial[Errin M. Jeff Rous] https://software.intel.com/en-us/articles/unreal-engine-4-optimization-tutorial-part-1
11. Optimizing the UE4 renderer for Ethan Carter VR[Leszek Godlewski] https://medium.com/@TheIneQuation/the-vanishing-of-milliseconds-dfe7572d9856#.84a967rrx
12. Using UE4 to develop VR project[Mullin] http://gad.qq.com/article/detail/7170307
13. Adreno OpenGL ES Developer Guide https://developer.qualcomm.com/qfile/28557/80-nu141-1_b_adreno_opengl_es_developer_guide.pdf
14. Content Creation for Mobile Platforms[UE4 Docs]: https://docs.unrealengine.com/latest/INT/Platforms/Mobile/Content/
15. UE4 - Performance Guidelines for Artists and Designers: https://docs.unrealengine.com/latest/INT/Engine/Performance/Guidelines/
16. High Quality Mobile VR with Unreal Engine and Oculus[ARM GDC17] https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/High_Quality_Mobile_VR_with_Unreal_Engine_and_Oculus.pdf
17. Introduction to Best Practices[Oculus] https://developer3.oculus.com/documentation/intro-vr/latest/concepts/bp_intro/
18. Performance Optimization for Mobile Devices[Chris] http://robotinvader.com/blog/?p=438
19. Optimized Effects for Mobile Devices http://malideveloper.arm.com/downloads/Optimized-Effects-For-Mobile-Devices.pdf
20. Low Level Thinking in High[Emil Persson] http://gdcvault.com/play/1018182/Low-Level-Thinking-in-High
21. Approaching Zero Driver Overhead[Cass Everitt, Graham Sellers, John McDonald, Tim Foley] https://www.slideshare.net/CassEveritt/approaching-zero-driver-overhead
22. Performance Optimization Guide for UE4-based VR Content on the AMD GCN Architecture[锐VR] http://mp.weixin.qq.com/s/XxWmIDYBQQ52S9bIbRKkAQ
23. UE4 - Unreal Open Day 2017 UE4 for Mobile The Future of High Quality Mobile Games[Jack Porter] https://www.slideshare.net/EpicGamesChina/unreal-open-day-2017-ue4-for-mobile-the-future-of-high-quality-mobile-games
24. Unity - ARM Guide for Unity Developers http://malideveloper.arm.com/downloads/DeveloperGuides/arm_guide_for_unity_developers_3_3.pdf
25. Performance Optimization for VR APPs[DARSHAN SHANKAR] https://dshankar.svbtle.com/performance-optimization-for-vr-apps
26. Rendering in UE4(Epic Game TA-Homam Bahnassi presentation notes) https://zhuanlan.zhihu.com/p/35075351
27. Polycounts in next gen games thread! https://polycount.com/discussion/141061/polycounts-in-next-gen-games-thread
28. Texture and buffer access performance http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/