123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285 |
- .. _doc_gpu_optimization:
- GPU optimization
- ================
- Introduction
- ~~~~~~~~~~~~
- The demand for new graphics features and progress almost guarantees that you
- will encounter graphics bottlenecks. Some of these can be on the CPU side, for
- instance in calculations inside the Godot engine to prepare objects for
- rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
- sorts instructions to pass to the GPU, and in the transfer of these
- instructions. And finally, bottlenecks also occur on the GPU itself.
- Where bottlenecks occur in rendering is highly hardware-specific.
- Mobile GPUs in particular may struggle with scenes that run easily on desktop.
- Understanding and investigating GPU bottlenecks is slightly different to the
- situation on the CPU. This is because, often, you can only change performance
- indirectly by changing the instructions you give to the GPU. Also, it may be
- more difficult to take measurements. In many cases, the only way of measuring
- performance is by examining changes in the time spent rendering each frame.
- Draw calls, state changes, and APIs
- ===================================
- .. note:: The following section is not relevant to end-users, but is useful to
- provide background information that is relevant in later sections.
- Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
- Vulkan). The communication and driver activity involved can be quite costly,
- especially in OpenGL and OpenGL ES. If we can provide these instructions in a
- way that is preferred by the driver and GPU, we can greatly increase
- performance.
- Nearly every API command in OpenGL requires a certain amount of validation to
- make sure the GPU is in the correct state. Even seemingly simple commands can
- lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
- reduce these instructions to a bare minimum and group together similar objects
- as much as possible so they can be rendered together, or with the minimum number
- of these expensive state changes.
- 2D batching
- ~~~~~~~~~~~
- In 2D, the costs of treating each item individually can be prohibitively high -
- there can easily be thousands of them on the screen. This is why 2D *batching*
- is used. Multiple similar items are grouped together and rendered in a batch,
- via a single draw call, rather than making a separate draw call for each item.
- In addition, this means state changes, material and texture changes can be kept
- to a minimum.
- For more information on 2D batching, see :ref:`doc_batching`.
- 3D batching
- ~~~~~~~~~~~
- In 3D, we still aim to minimize draw calls and state changes. However, it can be
- more difficult to batch together several objects into a single draw call. 3D
- meshes tend to comprise hundreds or thousands of triangles, and combining large
- meshes in real-time is prohibitively expensive. The costs of joining them quickly
- exceeds any benefits as the number of triangles grows per mesh. A much better
- alternative is to **join meshes ahead of time** (static meshes in relation to each
- other). This can either be done by artists, or programmatically within Godot.
- There is also a cost to batching together objects in 3D. Several objects
- rendered as one cannot be individually culled. An entire city that is off-screen
- will still be rendered if it is joined to a single blade of grass that is on
- screen. Thus, you should always take objects' location and culling into account
- when attempting to batch 3D objects together. Despite this, the benefits of
- joining static objects often outweigh other considerations, especially for large
- numbers of distant or low-poly objects.
- For more information on 3D specific optimizations, see
- :ref:`doc_optimizing_3d_performance`.
- Reuse Shaders and Materials
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~
- The Godot renderer is a little different to what is out there. It's designed to
- minimize GPU state changes as much as possible. :ref:`SpatialMaterial
- <class_SpatialMaterial>` does a good job at reusing materials that need similar
- shaders. if custom shaders are used, make sure to reuse them as much as
- possible. Godot's priorities are:
- - **Reusing Materials:** The fewer different materials in the
- scene, the faster the rendering will be. If a scene has a huge amount
- of objects (in the hundreds or thousands), try reusing the materials.
- In the worst case, use atlases to decrease the amount of texture changes.
- - **Reusing Shaders:** If materials can't be reused, at least try to re-use
- shaders. Note: shaders are automatically reused between
- SpatialMaterials that share the same configuration (features
- that are enabled or disabled with a check box) even if they have different
- parameters.
- If a scene has, for example, ``20,000`` objects with ``20,000`` different
- materials each, rendering will be slow. If the same scene has ``20,000``
- objects, but only uses ``100`` materials, rendering will be much faster.
- Pixel cost versus vertex cost
- =============================
- You may have heard that the lower the number of polygons in a model, the faster
- it will be rendered. This is *really* relative and depends on many factors.
- On a modern PC and console, vertex cost is low. GPUs originally only rendered
- triangles. This meant that every frame:
- 1. All vertices had to be transformed by the CPU (including clipping).
- 2. All vertices had to be sent to the GPU memory from the main RAM.
- Nowadays, all this is handled inside the GPU, greatly increasing performance.
- 3D artists usually have the wrong feeling about polycount performance because 3D
- DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
- be edited, reducing actual performance. Game engines rely on the GPU more, so
- they can render many triangles much more efficiently.
- On mobile devices, the story is different. PC and console GPUs are
- brute-force monsters that can pull as much electricity as they need from
- the power grid. Mobile GPUs are limited to a tiny battery, so they need
- to be a lot more power efficient.
- To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
- when the same pixel on the screen is being rendered more than once. Imagine a
- town with several buildings. GPUs don't know what is visible and what is hidden
- until they draw it. For example, a house might be drawn and then another house
- in front of it (which means rendering happened twice for the same pixel). PC
- GPUs normally don't care much about this and just throw more pixel processors to
- the hardware to increase performance (which also increases power consumption).
- Using more power is not an option on mobile so mobile devices use a technique
- called *tile-based rendering* which divides the screen into a grid. Each cell
- keeps the list of triangles drawn to it and sorts them by depth to minimize
- *overdraw*. This technique improves performance and reduces power consumption,
- but takes a toll on vertex performance. As a result, fewer vertices and
- triangles can be processed for drawing.
- Additionally, tile-based rendering struggles when there are small objects with a
- lot of geometry within a small portion of the screen. This forces mobile GPUs to
- put a lot of strain on a single screen tile, which considerably decreases
- performance as all the other cells must wait for it to complete before
- displaying the frame.
- To summarize, don't worry about vertex count on mobile, but
- **avoid concentration of vertices in small parts of the screen**.
- If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
- a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
- avoid having triangles smaller than the size of a pixel on screen.
- Pay attention to the additional vertex processing required when using:
- - Skinning (skeletal animation)
- - Morphs (shape keys)
- - Vertex-lit objects (common on mobile)
- Pixel/fragment shaders and fill rate
- ====================================
- In contrast to vertex processing, the costs of fragment (per-pixel) shading have
- increased dramatically over the years. Screen resolutions have increased (the
- area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
- screen, that is 27x the area), but also the complexity of fragment shaders has
- exploded. Physically-based rendering requires complex calculations for each
- fragment.
- You can test whether a project is fill rate-limited quite easily. Turn off
- V-Sync to prevent capping the frames per second, then compare the frames per
- second when running with a large window, to running with a very small window.
- You may also benefit from similarly reducing your shadow map size if using
- shadows. Usually, you will find the FPS increases quite a bit using a small
- window, which indicates you are to some extent fill rate-limited. On the other
- hand, if there is little to no increase in FPS, then your bottleneck lies
- elsewhere.
- You can increase performance in a fill rate-limited project by reducing the
- amount of work the GPU has to do. You can do this by simplifying the shader
- (perhaps turn off expensive options if you are using a :ref:`SpatialMaterial
- <class_SpatialMaterial>`), or reducing the number and size of textures used.
- **When targeting mobile devices, consider using the simplest possible shaders
- you can reasonably afford to use.**
- Reading textures
- ~~~~~~~~~~~~~~~~
- The other factor in fragment shaders is the cost of reading textures. Reading
- textures is an expensive operation, especially when reading from several
- textures in a single fragment shader. Also, consider that filtering may slow it
- down further (trilinear filtering between mipmaps, and averaging). Reading
- textures is also expensive in terms of power usage, which is a big issue on
- mobiles.
- **If you use third-party shaders or write your own shaders, try to use
- algorithms that require as few texture reads as possible.**
- Texture compression
- ~~~~~~~~~~~~~~~~~~~
- By default, Godot compresses textures of 3D models when imported using video RAM
- (VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
- JPG when stored, but increases performance enormously when drawing large enough
- textures.
- This is because the main goal of texture compression is bandwidth reduction
- between memory and the GPU.
- In 3D, the shapes of objects depend more on the geometry than the texture, so
- compression is generally not noticeable. In 2D, compression depends more on
- shapes inside the textures, so the artifacts resulting from 2D compression are
- more noticeable.
- As a warning, most Android devices do not support texture compression of
- textures with transparency (only opaque), so keep this in mind.
- .. note::
- Even in 3D, "pixel art" textures should have VRAM compression disabled as it
- will negatively affect their appearance, without improving performance
- significantly due to their low resolution.
- Post-processing and shadows
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Post-processing effects and shadows can also be expensive in terms of fragment
- shading activity. Always test the impact of these on different hardware.
- **Reducing the size of shadowmaps can increase performance**, both in terms of
- writing and reading the shadowmaps. On top of that, the best way to improve
- performance of shadows is to turn shadows off for as many lights and objects as
- possible. Smaller or distant OmniLights/SpotLights can often have their shadows
- disabled with only a small visual impact.
- Transparency and blending
- =========================
- Transparent objects present particular problems for rendering efficiency. Opaque
- objects (especially in 3D) can be essentially rendered in any order and the
- Z-buffer will ensure that only the front most objects get shaded. Transparent or
- blended objects are different. In most cases, they cannot rely on the Z-buffer
- and must be rendered in "painter's order" (i.e. from back to front) to look
- correct.
- Transparent objects are also particularly bad for fill rate, because every item
- has to be drawn even if other transparent objects will be drawn on top
- later on.
- Opaque objects don't have to do this. They can usually take advantage of the
- Z-buffer by writing to the Z-buffer only first, then only performing the
- fragment shader on the "winning" fragment, the object that is at the front at a
- particular pixel.
- Transparency is particularly expensive where multiple transparent objects
- overlap. It is usually better to use transparent areas as small as possible to
- minimize these fill rate requirements, especially on mobile, where fill rate is
- very expensive. Indeed, in many situations, rendering more complex opaque
- geometry can end up being faster than using transparency to "cheat".
- Multi-platform advice
- =====================
- If you are aiming to release on multiple platforms, test *early* and test
- *often* on all your platforms, especially mobile. Developing a game on desktop
- but attempting to port it to mobile at the last minute is a recipe for disaster.
- In general, you should design your game for the lowest common denominator, then
- add optional enhancements for more powerful platforms. For example, you may want
- to use the GLES2 backend for both desktop and mobile platforms where you target
- both.
- Mobile/tiled renderers
- ======================
- As described above, GPUs on mobile devices work in dramatically different ways
- from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
- split up the screen into regular-sized tiles that fit into super fast cache
- memory, which reduces the number of read/write operations to the main memory.
- There are some downsides though. Tiled rendering can make certain techniques
- much more complicated and expensive to perform. Tiles that rely on the results
- of rendering in different tiles or on the results of earlier operations being
- preserved can be very slow. Be very careful to test the performance of shaders,
- viewport textures and post processing.
|