edited Jan 7, 2023 at 21:37

G. Sliepen


You create a SpriteParameter, and initialize it with a glm::vec4 that has 4 division operations in it. Divisions are slow; even if the throughput of 32-bit integer divisions on the latest CPUs is now just a handful of cycles, it still has tens of cycles of latency.
In loadData() you do 3 operations on a 4 by 4 matrix. That's actually a lot of floating point operations, especially because of glm::rotate(), which uses trigonometric functions behind the scenes. These operations together easily use hundreds of cycles.
A lot of data is being generated: at least 26 floats, or 104 bytes. Remember, this needs to be multiplied by 512000 to get the bandwidth per second: about 53 MB/s. And then it has to be read by the GPU as well, so you have to double that number to about 106 MB/s.



answered Jan 7, 2023 at 16:09

G. Sliepen


Expected performance

The code works fine but it is quite slow; I am getting about 20 frames a second when rendering 25600 copies. Really would appreciate any insight into how to improve it.

20 fps is of course low; you want 60 fps or more to get fluid motion on screen. But it's instructive to ask yourself: what performance did you expect? That's not the same as the performance you want.

If you have 25600 objects and 20 fps, then that's 512000 objects per second to handle. Assuming roughly a 1 GHz CPU (it's probably the right order of magnitude), you have \$10^9 / 25600 / 20 \approx 2000\$ cycles per object to do whatever needs to be done. 2000 cycles seems like a lot, but let's look at what you are doing per object:

You create a SpriteParameter, and initialize it with a glm::vec4 that has 4 division operations in it. Divisions are slow; even if the throughput of 32-bit integer divisions on the latest CPUs is now just a handful of cycles, it still has tens of cycles of latency.
In loadData() you do 3 operations on a 4 by 4 matrix. That's actually a lot of floating point operations, especially because of glm::rotate(), which uses trigonometric functions behind the scenes. These operations together easily use hundreds of cycles.
A lot of data is being copied: at least 26 floats, or 104 bytes. Remember, this needs to be multiplied by 512000 to get the bandwidth per second: about 53 MB/s.
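The back-of-the-envelope numbers above can be reproduced with a few lines of C++. This is only a sketch: the 1 GHz clock is the same order-of-magnitude assumption as in the text, and the function names are made up for illustration.

```cpp
#include <cassert>

// Cycles available per object, given a clock rate, an object count and a frame rate.
inline double cyclesPerObject(double clockHz, double objects, double fps) {
    assert(objects > 0 && fps > 0);
    return clockHz / (objects * fps);
}

// Bytes produced per second when every object writes bytesPerObject each frame.
inline double bytesPerSecond(double bytesPerObject, double objects, double fps) {
    assert(objects > 0 && fps > 0);
    return bytesPerObject * objects * fps;
}
```

With `cyclesPerObject(1e9, 25600, 20)` you get roughly 1953 cycles, and `bytesPerSecond(26 * sizeof(float), 25600, 20)` gives 104 × 512000 = 53,248,000 bytes per second. Plugging in 60 fps instead shows how much tighter the per-object budget becomes at the frame rate you actually want.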

Did you compile your code with optimizations enabled? If not, then that would explain a lot. If you are running on an older CPU or on a laptop or mobile phone, then even with compiler optimizations this might explain the framerate you see.

Move more work into the vertex shader

You are doing a lot of work on the CPU that could be done by the GPU. If you know you are CPU-bound, then that is what I would try to do first. Consider that the GPU can do all these matrix transformations for you. Also, when doing instanced rendering, you can use gl_InstanceID in the vertex shader to know which instance you are rendering. Thus, the shader can even calculate the rect itself.

If you have certain effects, like tint varying gradually based on position and time, you can also let the vertex shader calculate that based on gl_InstanceID, and pass the time as a uniform to the shader.
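As a rough illustration of this idea, here is what such a vertex shader could look like, embedded as a C++ string. It assumes the sprites sit on a regular grid and all rotate with time, which may not match your scene; the uniform names (`uColumns`, `uCellSize`, `uTime`, `uViewProj`) and the helper `cellForInstance()` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical vertex shader: derives each sprite's grid cell and rotation
// from gl_InstanceID, so the CPU uploads no per-instance matrix at all.
const std::string kSpriteVertexShader = R"glsl(
#version 330 core
layout(location = 0) in vec2 aPos;   // corner of a unit quad, e.g. in [-0.5, 0.5]
uniform mat4  uViewProj;
uniform int   uColumns;              // sprites per row (assumed grid layout)
uniform float uCellSize;             // spacing between sprites
uniform float uTime;                 // seconds, set once per frame

void main() {
    int col = gl_InstanceID % uColumns;
    int row = gl_InstanceID / uColumns;
    float angle = uTime;             // example effect: every sprite spins with time
    float c = cos(angle), s = sin(angle);
    vec2 rotated = vec2(c * aPos.x - s * aPos.y, s * aPos.x + c * aPos.y);
    vec2 world = (vec2(col, row) + rotated) * uCellSize;
    gl_Position = uViewProj * vec4(world, 0.0, 1.0);
}
)glsl";

struct Cell { int col; int row; };

// CPU mirror of the shader's index math, handy for checking the layout.
inline Cell cellForInstance(int instanceID, int columns) {
    assert(columns > 0 && instanceID >= 0);
    return Cell{instanceID % columns, instanceID / columns};
}
```

With a shader along these lines, the per-frame CPU work shrinks to setting a few uniforms and issuing a single instanced draw call such as `glDrawArraysInstanced()`. The trigonometry now runs on the GPU, once per vertex rather than once per object on the CPU.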

Ideally, you don't have data[]; you just pass the minimum number of uniforms to the shader to have it calculate everything you want, so that both the number of calculations and the bandwidth needed are minimal.

If you still need per-instance data that cannot be calculated per-frame on the GPU, then at least try to minimize the amount of data that the CPU has to generate and pass to the shader.
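To give an idea of how small per-instance data can get, here is a sketch of a compact record. The field choice (position, angle, packed tint) is an assumption about what a sprite needs, and `InstanceData`, `packTint()` and `bytesPerFrame()` are illustrative names, not something from your code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical compact per-instance record: 16 bytes instead of 104.
struct InstanceData {
    float x, y;          // sprite position
    float angle;         // rotation in radians; the shader builds the matrix
    std::uint32_t tint;  // RGBA packed as 4 x 8 bits
};
static_assert(sizeof(InstanceData) == 16, "expected a tightly packed 16-byte record");

// Pack four 8-bit channels into one 32-bit value, red in the lowest byte.
inline std::uint32_t packTint(std::uint8_t r, std::uint8_t g,
                              std::uint8_t b, std::uint8_t a) {
    return static_cast<std::uint32_t>(r)
         | (static_cast<std::uint32_t>(g) << 8)
         | (static_cast<std::uint32_t>(b) << 16)
         | (static_cast<std::uint32_t>(a) << 24);
}

// Bytes the CPU has to generate per frame for a given number of instances.
inline std::size_t bytesPerFrame(std::size_t instances) {
    return instances * sizeof(InstanceData);
}
```

For 25600 instances this is 409600 bytes per frame, about 6.5 times less than with 104 bytes per object. The vertex shader would then rebuild the transformation from `x`, `y` and `angle`, and the tint could be fed in as a normalized unsigned-byte vertex attribute.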

What if you are GPU-bound?

The above assumes the bottleneck is the CPU. It could also be that it is the GPU that is the bottleneck. How big are the sprites? How many fragments does it need to render per second? How complex is the fragment shader? You could do some rough estimations again to see if your GPU can actually handle the load.
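A rough estimate of the fragment load could look like the sketch below. The sprite size is something only you know, so the 32 by 32 pixels used in the note afterwards is purely a placeholder; the estimate also ignores overdraw, clipping and any early depth rejection.

```cpp
#include <cassert>

// Rough number of fragments shaded per second, assuming every sprite is
// fully visible and covers widthPx x heightPx pixels.
inline double fragmentsPerSecond(double sprites, double widthPx,
                                 double heightPx, double fps) {
    assert(sprites >= 0 && widthPx >= 0 && heightPx >= 0 && fps >= 0);
    return sprites * widthPx * heightPx * fps;
}
```

For example, `fragmentsPerSecond(25600, 32, 32, 60)` comes to about 1.6 billion fragments per second. Compare a number like that with the fill rate your GPU's vendor specifies, and keep in mind that a complex fragment shader lowers the rate you can actually reach.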

Use a profiling tool

I recommend that you use a profiling tool to figure out exactly where the bottlenecks are in your code. Linux perf is a good way to do this, assuming you are running on Linux of course. If the bottleneck is the GPU, then you might have to look for profiling tools from your GPU's vendor.