SIGGRAPH '97
Course 24: OpenGL and Window System Integration
OpenGL Performance Optimization
Contents
- 1. Hardware vs. Software
- 2. Application Organization
- 3. OpenGL Optimization
- 4. Evaluation and tuning
1. Hardware vs. Software
2. Application Organization
2.1 High Level Organization
Multiprocessing
SGI's Performer is an example of a high level toolkit designed for this purpose.
Image quality vs. performance
- During interactive rotation (i.e. mouse button held down) render a reduced-polygon model. When drawing a static image draw the full polygon model.
- During animation, disable dithering, smooth shading, and/or texturing. Enable them for the static image.
- If texturing is required, use GL_NEAREST sampling and glHint( GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST ).
- During animation, disable antialiasing. Enable antialiasing for the static image.
- Use coarser NURBS/evaluator tessellation during animation. Use glPolygonMode( GL_FRONT_AND_BACK, GL_LINE ) to inspect tessellation granularity and reduce if possible.
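As a rough sketch of the trade-offs listed above, the quality switches can be collected in one place and chosen according to whether the user is interacting. The `render_settings` struct and `choose_settings` function are hypothetical names, not part of OpenGL; a real program would translate each field into the matching glEnable/glDisable, glShadeModel, or glHint call.

```c
#include <assert.h>

/* Hypothetical bundle of quality switches; not an OpenGL type. */
struct render_settings {
    int dither;         /* 1 -> glEnable(GL_DITHER), 0 -> glDisable     */
    int smooth_shading; /* 1 -> glShadeModel(GL_SMOOTH), 0 -> GL_FLAT   */
    int antialias;      /* 1 -> enable antialiasing, 0 -> disable it    */
    int full_model;     /* 1 -> full polygon model, 0 -> reduced model  */
};

/* Pick fast settings while the user is interacting (for example while a
 * mouse button is held down) and high-quality settings for a static image. */
struct render_settings choose_settings(int interacting)
{
    struct render_settings s;
    int quality = !interacting;

    assert(interacting == 0 || interacting == 1);
    s.dither = quality;
    s.smooth_shading = quality;
    s.antialias = quality;
    s.full_model = quality;
    return s;
}
```

Keeping the decision in one function means the drawing code only has to apply the settings it is handed, rather than testing the interaction state in several places.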
Level of detail management and culling
2.2 Low Level Organization
An Example
struct city {
    float latitude, longitude; /* city location */
    char *name;                /* city's name */
    int large_flag;            /* 0 = small, 1 = large */
};
A list of cities may be stored as an array of city structs.
Our first attempt at rendering this information may be:
void draw_cities( int n, struct city citylist[] )
{
    int i;
    for (i=0; i < n; i++) {
        if (citylist[i].large_flag) {
            glPointSize( 4.0 );
        }
        else {
            glPointSize( 2.0 );
        }
        glBegin( GL_POINTS );
        glVertex2f( citylist[i].longitude, citylist[i].latitude );
        glEnd();
        glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
        glCallLists( strlen(citylist[i].name),
                     GL_BYTE,
                     citylist[i].name );
    }
}
This is a poor implementation for a number of reasons:
- glPointSize is called for every loop iteration.
- only one point is drawn between glBegin and glEnd
- the vertices aren't being specified in the most efficient manner
Here's a better implementation:
void draw_cities( int n, struct city citylist[] )
{
    int i;
    /* draw small dots first */
    glPointSize( 2.0 );
    glBegin( GL_POINTS );
    for (i=0; i < n ;i++) {
        if (citylist[i].large_flag==0) {
            glVertex2f( citylist[i].longitude, citylist[i].latitude );
        }
    }
    glEnd();
    /* draw large dots second */
    glPointSize( 4.0 );
    glBegin( GL_POINTS );
    for (i=0; i < n ;i++) {
        if (citylist[i].large_flag==1) {
            glVertex2f( citylist[i].longitude, citylist[i].latitude );
        }
    }
    glEnd();
    /* draw city labels third */
    for (i=0; i < n ;i++) {
        glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
        glCallLists( strlen(citylist[i].name),
                     GL_BYTE,
                     citylist[i].name );
    }
}
The data structures can also be reorganized so that cities of each size are kept in separate lists with their coordinates stored contiguously, which allows the vertex array extension to be used when it is available:
struct city_list {
    int num_cities;  /* how many cities in the list */
    float *position; /* pointer to lat/lon coordinates */
    char **name;     /* pointer to city names */
    float size;      /* size of city points */
};
/* indicates if server can do GL_EXT_vertex_array: */
GLboolean varray_available;
void draw_cities( struct city_list *list )
{
    int i;
    GLboolean use_begin_end;
    /* draw the points */
    glPointSize( list->size );
#ifdef GL_EXT_vertex_array
    if (varray_available) {
        glVertexPointerEXT( 2, GL_FLOAT, 0, list->num_cities, list->position );
        glDrawArraysEXT( GL_POINTS, 0, list->num_cities );
        use_begin_end = GL_FALSE;
    }
    else
#endif
    {
        /* extension missing at compile time or at run time */
        use_begin_end = GL_TRUE;
    }
    if (use_begin_end) {
        glBegin(GL_POINTS);
        for (i=0; i < list->num_cities; i++) {
            glVertex2fv( &list->position[i*2] );
        }
        glEnd();
    }
    /* draw city labels */
    for (i=0; i < list->num_cities ;i++) {
        glRasterPos2fv( &list->position[i*2] );
        glCallLists( strlen(list->name[i]),
                     GL_BYTE, list->name[i] );
    }
}
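The `city_list` layout assumes the application has already split its cities by point size and packed their coordinates. One way that conversion might look is sketched below. `build_city_list` is a hypothetical helper, the coordinates are packed as longitude/latitude pairs to match the glVertex2f calls above, and the caller is assumed to free the two arrays it allocates.

```c
#include <assert.h>
#include <stdlib.h>

struct city {
    float latitude, longitude; /* city location */
    char *name;                /* city's name */
    int large_flag;            /* 0 = small, 1 = large */
};

struct city_list {
    int num_cities;  /* how many cities in the list */
    float *position; /* packed pairs: lon0, lat0, lon1, lat1, ... */
    char **name;     /* pointers to city names (strings are not copied) */
    float size;      /* size of city points */
};

/* Hypothetical helper: collect every city whose large_flag matches
 * want_large into one list with contiguous coordinates.
 * Returns 0 on success, -1 if allocation fails. */
int build_city_list(int n, const struct city cities[], int want_large,
                    float size, struct city_list *out)
{
    int i, count = 0;

    assert(out != NULL);
    for (i = 0; i < n; i++) {
        if ((cities[i].large_flag != 0) == (want_large != 0)) {
            count++;
        }
    }
    out->num_cities = 0;
    out->size = size;
    /* Allocate at least one element so malloc(0) is never relied upon. */
    out->position = malloc((size_t) (count ? count : 1) * 2 * sizeof(float));
    out->name = malloc((size_t) (count ? count : 1) * sizeof(char *));
    if (out->position == NULL || out->name == NULL) {
        free(out->position);
        free(out->name);
        out->position = NULL;
        out->name = NULL;
        return -1;
    }
    for (i = 0; i < n; i++) {
        if ((cities[i].large_flag != 0) == (want_large != 0)) {
            int k = out->num_cities++;
            out->position[k * 2] = cities[i].longitude;
            out->position[k * 2 + 1] = cities[i].latitude;
            out->name[k] = cities[i].name;
        }
    }
    return 0;
}
```

Calling the helper once with `want_large` set to 0 and once with it set to 1 yields the two lists, so the drawing loops no longer need a conditional per city.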
In the following sections the techniques for maximizing performance, as seen above, are explained.
3. OpenGL Optimization
- H - beneficial for high-end hardware
- L - beneficial for low-end hardware
- S - beneficial for software implementations
- all - probably beneficial for all implementations
3.1 Traversal
-
Use connected primitives
- Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP require fewer vertices to describe an object than individual line, triangle, or polygon primitives. This reduces data transfer and transformation workload. [all]
-
Use the vertex array extension
- On some architectures function calls are somewhat expensive so replacing many glVertex/glColor/glNormal calls with the vertex array mechanism may be very beneficial. [all]
-
Store vertex data in consecutive memory locations
- When maximum performance is needed on high-end systems it's good to store vertex data in contiguous memory to maximize throughput of data from host memory to the graphics subsystem. [H,L]
-
Use the vector versions of glVertex, glColor, glNormal and glTexCoord
- The glVertex, glColor, etc. functions which take a pointer to their arguments such as glVertex3fv(v) may be much faster than those which take individual arguments such as glVertex3f(x,y,z) on systems with DMA-driven graphics hardware. [H,L]
-
Reduce quantity of primitives
- Be careful not to render primitives which are over-tessellated. Experiment with the GLU primitives, for example, to determine the best compromise of image quality vs. tessellation level. Textured objects in particular may still be rendered effectively with low geometric complexity. [all]
-
Display lists
- Use display lists to encapsulate frequently drawn objects. Display list data may be stored in the graphics subsystem rather than host memory thereby eliminating host-to-graphics data movement. Display lists are also very beneficial when rendering remotely. [all]
-
Don't specify unneeded per-vertex information
- If lighting is disabled don't call glNormal. If texturing is disabled don't call glTexCoord, etc.
-
Minimize code between glBegin/glEnd
- For maximum performance on high-end systems it's extremely important to send vertex data to the graphics system as fast as possible. Avoid extraneous code between glBegin/glEnd.
glBegin( GL_TRIANGLE_STRIP );
for (i=0; i < n; i++) {
    if (lighting) {
        glNormal3fv( norm[i] );
    }
    glVertex3fv( vert[i] );
}
glEnd();
This is a very bad construct. The following is much better:
if (lighting) {
    glBegin( GL_TRIANGLE_STRIP );
    for (i=0; i < n ;i++) {
        glNormal3fv( norm[i] );
        glVertex3fv( vert[i] );
    }
    glEnd();
}
else {
    glBegin( GL_TRIANGLE_STRIP );
    for (i=0; i < n ;i++) {
        glVertex3fv( vert[i] );
    }
    glEnd();
}
Also consider manually unrolling important rendering loops to maximize the function call rate.
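Manual unrolling of a rendering loop, as suggested in the glBegin/glEnd item above, could look roughly like the following. To keep the sketch self-contained the vertex call is passed in as a function pointer; `emit_fn`, `send_vertices_unrolled`, and `count_vertex` are hypothetical names, and in a real renderer the loop body would call glVertex3fv directly between glBegin and glEnd.

```c
#include <assert.h>

typedef void (*emit_fn)(const float *v);

/* Stand-in for glVertex3fv that only counts how often it is called. */
int vertices_emitted = 0;

void count_vertex(const float *v)
{
    (void) v;
    vertices_emitted++;
}

/* Send n vertices, four per loop iteration, so the loop test and branch
 * are paid once per four calls instead of once per call. */
void send_vertices_unrolled(int n, const float vert[][3], emit_fn emit)
{
    int i = 0;

    assert(n >= 0);
    for (; i + 4 <= n; i += 4) {
        emit(vert[i]);
        emit(vert[i + 1]);
        emit(vert[i + 2]);
        emit(vert[i + 3]);
    }
    for (; i < n; i++) { /* the remaining 0 to 3 vertices */
        emit(vert[i]);
    }
}
```

Whether this helps depends on the compiler and the system; many compilers unroll such loops themselves, so it is worth measuring before and after.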
3.2 Transformation
-
Lighting
-
- Avoid using positional lights, i.e. light positions should be of the form (x,y,z,0) [L,S]
- Avoid using spotlights. [all]
- Avoid using two-sided lighting. [all]
- Avoid using negative material and light color coefficients [S]
- Avoid using the local viewer lighting model. [L,S]
- Avoid frequent changes to the GL_SHININESS material parameter. [L,S]
- Some OpenGL implementations are optimized for the case of a single light source.
- Consider pre-lighting complex objects before rendering, ala radiosity. You can get the effect of lighting by specifying vertex colors instead of vertex normals. [S]
-
Two sided lighting
- If you want both the front and back of polygons shaded the same try using two light sources instead of two-sided lighting. Position the two light sources on opposite sides of your object. That way, a polygon will always be lit correctly whether it's back or front facing. [L,S]
-
Disable normal vector normalization when not needed
- glEnable/Disable(GL_NORMALIZE) controls whether normal vectors are scaled to unit length before lighting. If you do not use glScale you may be able to disable normalization without ill effects. Normalization is disabled by default. [L,S]
-
Use connected primitives
- Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP decrease traversal and transformation load.
-
glRect usage
- If you have to draw many rectangles consider using glBegin(GL_QUADS) ... glEnd() instead. [all]
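To make the saving from connected primitives concrete, the vertex counts can be compared directly. The two helpers below are hypothetical and simply encode the standard counts: n independent triangles need 3n vertices, while a strip or fan of n triangles needs n + 2.

```c
#include <assert.h>

/* Vertices needed to draw n triangles with GL_TRIANGLES. */
int independent_triangle_vertices(int n)
{
    assert(n >= 0);
    return 3 * n;
}

/* Vertices needed to draw n triangles as one GL_TRIANGLE_STRIP or
 * GL_TRIANGLE_FAN: the first triangle costs three vertices and each
 * further triangle costs one more. */
int strip_triangle_vertices(int n)
{
    assert(n >= 0);
    return n > 0 ? n + 2 : 0;
}
```

For 100 triangles that is 300 vertices sent and transformed as independent triangles versus 102 as a single strip.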
3.3 Rasterization
-
Disable smooth shading when not needed
- Smooth shading is enabled by default. Flat shading doesn't require interpolation of the four color components and is usually faster than smooth shading in software implementations. Hardware may perform flat and smooth-shaded rendering at the same rate though there's at least one case in which smooth shading is faster than flat shading (E&S Freedom). [S]
-
Disable depth testing when not needed
- Background objects, for example, can be drawn without depth testing if they're drawn first. Foreground objects can be drawn without depth testing if they're drawn last. [L,S]
-
Disable dithering when not needed
- This is easy to forget when developing on a high-end machine. Disabling dithering can make a big difference in software implementations of OpenGL on lower-end machines with 8 or 12-bit color buffers. Dithering is enabled by default. [S]
-
Use back-face culling whenever possible.
- If you're drawing closed polyhedra or other objects for which back facing polygons aren't visible there's probably no point in drawing those polygons. [all]
-
The GL_SGI_cull_vertex extension
- SGI's Cosmo GL supports a new culling extension which looks at vertex normals to try to improve the speed of culling.
-
Avoid extra fragment operations
- Stenciling, blending, stippling, alpha testing and logic ops can all take extra time during rasterization. Be sure to disable the operations which aren't needed. [all]
-
Reduce the window size or screen resolution
- A simple way to reduce rasterization time is to reduce the number of pixels drawn. If a smaller window or reduced display resolution is acceptable, this is an easy way to improve rasterization speed. [L,S]
3.4 Texturing
-
Use efficient image formats
- The GL_UNSIGNED_BYTE component format is typically the fastest for specifying texture images. Experiment with the internal texture formats offered by the GL_EXT_texture extension. Some formats are faster than others on some systems (16-bit texels on the Reality Engine, for example). [all]
-
Encapsulate texture maps in texture objects or display lists
- This is especially important if you use several texture maps. By putting textures into display lists or texture objects the graphics system can manage their storage and minimize data movement between the client and graphics subsystem. [all]
-
Use smaller texture maps
- Smaller images can be moved from host to texture memory faster than large images. More small textures can be stored simultaneously in texture memory, reducing texture memory swapping. [all]
-
Use simpler sampling functions
- Experiment with the minification and magnification texture filters to determine which performs best while giving acceptable results. Generally, GL_NEAREST is fastest and GL_LINEAR is second fastest. [all]
-
Use the same sampling function for minification and magnification
- If both the minification and magnification filters are GL_NEAREST or GL_LINEAR then there's no reason OpenGL has to compute the lambda value which determines whether to use minification or magnification sampling for each fragment. Avoiding the lambda calculation can be a good performance improvement.
-
Use a simpler texture environment function
- Some texture environment modes may be faster than others. For example, the GL_DECAL or GL_REPLACE_EXT functions for 3 component textures are a simple assignment of texel samples to fragments, while GL_MODULATE multiplies texel samples with incoming fragment colors. [S,L]
-
Combine small textures
- If you are using several small textures consider tiling them together as a larger texture and modifying your texture coordinates to address the subtexture you want. This technique can eliminate texture bindings.
-
Use glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST)
- This hint can improve the speed of texturing when perspective-correct texture coordinate interpolation isn't needed, such as when using a glOrtho() projection.
-
Animated textures
- If you want to use an animated texture, perhaps live video textures, don't use glTexImage2D to repeatedly change the texture. Use glTexSubImage2D or glCopyTexSubImage2D. These functions are standard in OpenGL 1.1 and available as extensions to 1.0.
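For the "Combine small textures" advice above, the texture coordinate adjustment might be computed as follows. The sketch assumes equally sized tiles arranged in a regular grid and numbered row by row from the texture origin; `atlas_coord` and its parameters are hypothetical, and a real atlas may also need a border around each tile so that filtering does not sample across tile edges.

```c
#include <assert.h>

/* Map a coordinate (s, t) in [0,1] within one tile to the coordinate of
 * the same point in a cols x rows atlas. Tiles are numbered row by row,
 * with tile 0 at s = 0, t = 0. */
void atlas_coord(int tile, int cols, int rows, float s, float t,
                 float *atlas_s, float *atlas_t)
{
    int col, row;

    assert(cols > 0 && rows > 0);
    assert(tile >= 0 && tile < cols * rows);
    col = tile % cols;
    row = tile / cols;
    *atlas_s = ((float) col + s) / (float) cols;
    *atlas_t = ((float) row + t) / (float) rows;
}
```

The result would be passed to glTexCoord2f in place of the original (s, t), with the atlas bound once for all of the objects that share it.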
3.5 Clearing
-
Use glClear carefully [all]
- Clear all relevant buffers with one glClear.
Wrong:
glClear( GL_COLOR_BUFFER_BIT );
if (stenciling) {
    glClear( GL_STENCIL_BUFFER_BIT );
}
Right:
if (stenciling) {
    glClear( GL_COLOR_BUFFER_BIT | GL_STENCIL_BUFFER_BIT );
}
else {
    glClear( GL_COLOR_BUFFER_BIT );
}
-
Disable dithering
- Disable dithering before clearing the color buffer. Visually, the difference between dithered and undithered clears is usually negligible.
-
Use scissoring to clear a smaller area
- If you don't need to clear the whole buffer use glScissor(), with GL_SCISSOR_TEST enabled, to restrict clearing to a smaller area. [L]
-
Don't clear the color buffer at all
- If the scene you're drawing opaquely covers the entire window there is no reason to clear the color buffer.
-
Eliminate depth buffer clearing
- If the scene you're drawing covers the entire window there is a trick which lets you omit the depth buffer clear. The idea is to only use half the depth buffer range for each frame and alternate between using GL_LESS and GL_GREATER as the depth test function.
int EvenFlag;
/* Call this once during initialization and whenever the window
 * is resized.
 */
void init_depth_buffer( void )
{
    glClearDepth( 1.0 );
    glClear( GL_DEPTH_BUFFER_BIT );
    glDepthRange( 0.0, 0.5 );
    glDepthFunc( GL_LESS );
    EvenFlag = 1;
}
/* Your drawing function */
void display_func( void )
{
    if (EvenFlag) {
        glDepthFunc( GL_LESS );
        glDepthRange( 0.0, 0.5 );
    }
    else {
        glDepthFunc( GL_GREATER );
        glDepthRange( 1.0, 0.5 );
    }
    EvenFlag = !EvenFlag;
    /* draw your scene */
}
-
Avoid glClearDepth( d ) where d!=1.0
- Some software implementations may have optimized paths for clearing the depth buffer to 1.0. [S]
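To see why the "Eliminate depth buffer clearing" trick above works, it may help to simulate a single pixel of the depth buffer. The sketch below is plain C with no GL calls; `depth_setup`, `setup_for_frame`, and `depth_test` are hypothetical stand-ins for the state set by glDepthFunc and glDepthRange and for the per-fragment depth test.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-frame depth state. */
struct depth_setup {
    int less;        /* 1 = GL_LESS, 0 = GL_GREATER */
    double near_val; /* window depth assigned to the near plane */
    double far_val;  /* window depth assigned to the far plane */
};

/* Same alternation as display_func: even frames use [0, 0.5] with
 * GL_LESS, odd frames use [1, 0.5] with GL_GREATER. */
struct depth_setup setup_for_frame(int even)
{
    struct depth_setup s;
    if (even) { s.less = 1; s.near_val = 0.0; s.far_val = 0.5; }
    else      { s.less = 0; s.near_val = 1.0; s.far_val = 0.5; }
    return s;
}

/* Depth test and write for one pixel. z is the normalized depth
 * (0 = near plane, 1 = far plane). Returns 1 if the fragment passed. */
int depth_test(const struct depth_setup *s, double *stored, double z)
{
    double d;
    int pass;

    assert(z >= 0.0 && z <= 1.0);
    d = s->near_val + z * (s->far_val - s->near_val);
    pass = s->less ? (d < *stored) : (d > *stored);
    if (pass) {
        *stored = d;
    }
    return pass;
}
```

In an odd frame every value left behind by the previous even frame is at most 0.5, so the first fragment written to each pixel passes GL_GREATER without a clear, and nearer fragments map to larger values and still win. A fragment lying exactly on the far plane maps to 0.5 in both kinds of frame, which is the boundary case of this trick.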
3.6 Miscellaneous
-
Avoid "round-trip" calls
- Calls such as glGetFloatv, glGetIntegerv, glIsEnabled, glGetError, glGetString require a slow, round trip transaction between the application and renderer. Especially avoid them in your main rendering code.
-
Avoid glPushAttrib
- If only a few pieces of state need to be saved and restored it's often faster to maintain the information in the client program. glPushAttrib( GL_ALL_ATTRIB_BITS ) in particular can be very expensive on hardware systems. This call may be faster in software implementations than in hardware. [H,L]
-
Check for GL errors during development
- During development call glGetError inside your rendering/event loop to catch errors. GL errors raised during rendering can slow down rendering speed. Remove the glGetError call for production code since it's a "round trip" command and can cause delays. [all]
-
Use glColorMaterial instead of glMaterial
- If you need to change a material property on a per vertex basis, glColorMaterial may be faster than glMaterial. [all]
-
Avoid using viewports which are larger than the window
- Software implementations may have to do additional clipping in this situation. [S]
-
Alpha planes
- Don't allocate alpha planes in the color buffer if you don't need them. Specifically, they are not needed for transparency effects. Systems without hardware alpha planes may have to resort to a slow software implementation. [L,S]
-
Accumulation, stencil, overlay planes
- Do not allocate accumulation, stencil or overlay planes if they are not needed. [all]
-
Be aware of the depth buffer's depth
- Your OpenGL may support several different sizes of depth buffers, 16 and 24-bit for example. Shallower depth buffers may be faster than deep buffers both for software and hardware implementations. However, the precision of a 16-bit depth buffer may not be sufficient for some applications. [L,S]
-
Transparency may be implemented with stippling instead of blending
- If you need simple transparent objects consider using polygon stippling instead of alpha blending. Stippling is typically faster and may actually look better in some situations. [L,S]
-
Group state changes together
- Try to minimize the number of GL state changes in your code. When GL state is changed, internal state may have to be recomputed, introducing delays. [all]
-
Avoid using glPolygonMode
- If you need to draw many polygon outlines or vertex points use glBegin with GL_POINTS, GL_LINES, GL_LINE_LOOP or GL_LINE_STRIP instead as it can be much faster. [all]
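Two of the points above, keeping state in the client program instead of using glPushAttrib and minimizing state changes, can be combined in a small shadow copy of the GL state. This is only a sketch: `state_shadow` and `set_point_size` are hypothetical, and the counter marks the place where a real program would call glPointSize.

```c
#include <assert.h>

/* Client-side copy of one piece of GL state. */
struct state_shadow {
    float point_size; /* last value sent to GL */
    int known;        /* 0 until the first value has been sent */
    int gl_calls;     /* how many times GL would have been called */
};

/* Change the point size only if it differs from the shadowed value. */
void set_point_size(struct state_shadow *s, float size)
{
    assert(s != 0);
    if (s->known && s->point_size == size) {
        return; /* redundant change: skip the GL call */
    }
    /* A real program would call glPointSize(size) here. */
    s->point_size = size;
    s->known = 1;
    s->gl_calls++;
}
```

Saving and restoring then needs neither a glGet round trip nor glPushAttrib: copy `point_size` into a local variable, set the temporary value, draw, and call `set_point_size` again with the saved value. The shadow is only valid if every change to that state goes through it.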
3.7 Window System Integration
-
Minimize calls to the make current call
- The glXMakeCurrent call, for example, can be expensive on hardware systems because the context switch may involve moving a large amount of data in and out of the hardware.
-
Visual / pixel format performance
- Some X visuals or pixel formats may be faster than others. On PCs for example, 24-bit color buffers may be slower to read/write than 12 or 8-bit buffers. There is often a tradeoff between performance and quality of frame buffer configurations. 12-bit color may not look as nice as 24-bit color. A 16-bit depth buffer won't have the precision of a 24-bit depth buffer.
-
Avoid mixing OpenGL rendering with native rendering
- OpenGL allows both itself and the native window system to render into the same window. For this to be done correctly synchronization is needed. The GLX glXWaitX and glXWaitGL functions serve this purpose.
-
Don't redraw more than necessary
- Be sure that you're not redrawing your scene unnecessarily. For example, expose/repaint events may come in batches describing separate regions of the window which must be redrawn. Since one usually redraws the whole window image with OpenGL you only need to respond to one expose/repaint event. In the case of X, look at the count field of the XExposeEvent structure. Only redraw when it is zero.
-
SwapBuffer calls and graphics pipe blocking
- On systems with 3-D graphics hardware the SwapBuffers call is synchronized to the monitor's vertical retrace. Input to the OpenGL command queue may be blocked until the buffer swap has completed. Therefore, don't put more OpenGL calls immediately after SwapBuffers. Instead, put application computation instructions which can overlap with the buffer swap delay.
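The expose-event advice above ("Don't redraw more than necessary") might be sketched like this. To stay self-contained the example uses a hypothetical `expose_info` struct that mirrors only the count field of Xlib's XExposeEvent; a real X11 program would read `event.xexpose.count` after XNextEvent returns an Expose event.

```c
#include <assert.h>

/* Hypothetical stand-in for the one field of XExposeEvent used here. */
struct expose_info {
    int count; /* number of Expose events still to come in this batch */
};

/* Redraw only on the last event of a batch (count == 0). */
int should_redraw(const struct expose_info *e)
{
    assert(e != 0);
    return e->count == 0;
}

/* Count how many full redraws a sequence of n events would trigger. */
int redraws_for_events(int n, const struct expose_info events[])
{
    int i, redraws = 0;

    for (i = 0; i < n; i++) {
        if (should_redraw(&events[i])) {
            redraws++;
        }
    }
    return redraws;
}
```

A batch of three events with counts 2, 1, 0 followed by a single event with count 0 therefore causes two redraws rather than four.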
3.8 Mesa-specific
-
Double buffering
- The X driver supports two back color buffer implementations: Pixmaps and XImages. The MESA_BACK_BUFFER environment variable controls which is used. Which of the two is faster depends on the nature of your rendering. Experiment.
-
X Visuals
- As described above, some X visuals can be rendered into more quickly than others. The MESA_RGB_VISUAL environment variable can be used to determine the quickest visual by experimentation.
-
Depth buffers
- Mesa may use a 16 or 32-bit depth buffer as specified in the src/config.h configuration file. 16-bit depth buffers are faster but may not offer the precision needed for all applications.
-
Flat-shaded primitives
- If one is drawing a number of flat-shaded primitives all of the same color the glColor command should be put before the glBegin call.
Wrong:
glBegin(...);
glColor(...);
glVertex(...);
...
glEnd();
Right:
glColor(...);
glBegin(...);
glVertex(...);
...
glEnd();
-
Avoid double precision valued functions
- Mesa does all internal floating point computations in single precision floating point. API functions which take double precision floating point values must convert them to single precision. This can be expensive in the case of glVertex, glNormal, etc.
4. Evaluation and Tuning
4.1 Pipeline tuning
The graphics system can be divided into three subsystems for the purpose of performance evaluation:
- CPU subsystem - application code which drives the graphics subsystem
- Geometry subsystem - transformation of vertices, lighting, and clipping
- Rasterization subsystem - drawing filled polygons, line segments and per-pixel processing
4.1.1 CPU subsystem
4.1.2 Geometry subsystem
4.1.3 Rasterization subsystem
4.2 Double buffering
Measure the performance of rendering in single buffer mode to determine how far you really are from your target frame rate.
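One reason single-buffer timing is informative: if buffer swaps are synchronized to the vertical retrace, as described in the SwapBuffers item in section 3.7, a double-buffered frame rate can only be the refresh rate divided by a whole number. The sketch below works through that arithmetic. `swap_limited_fps` is a hypothetical helper, and it assumes that each frame waits for the first retrace after rendering finishes.

```c
#include <assert.h>

/* Frame rate seen in double-buffered mode when each frame takes
 * render_us microseconds to draw and every swap waits for the next
 * vertical retrace of a display refreshing refresh_hz times a second. */
double swap_limited_fps(long render_us, long refresh_hz)
{
    long periods;

    assert(render_us > 0 && refresh_hz > 0);
    /* Whole refresh periods needed per frame, rounded up. */
    periods = (render_us * refresh_hz + 999999L) / 1000000L;
    return (double) refresh_hz / (double) periods;
}
```

At 60 Hz, for example, a frame that takes 16 ms to draw runs at 60 frames per second, but one that takes 17 ms drops to 30, even though single-buffer timing would show it at roughly 59. Measuring in single buffer mode reveals how close the application really is to the next step.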