
Conversation

@lrahmann
Contributor

@lrahmann lrahmann commented May 25, 2025

This PR refactors/improves the RenderingDeviceDriverD3D12::DescriptorsHeap implementation by:

  • Separating RenderingDeviceDriverD3D12::DescriptorsHeap into RenderingDeviceDriverD3D12::CPUDescriptorsHeap and RenderingDeviceDriverD3D12::GPUDescriptorsHeap.
  • Implementing a fixed-size pool for RenderingDeviceDriverD3D12::CPUDescriptorsHeap to prevent RenderingDeviceDriverD3D12::CPUDescriptorsHeap::allocate from failing.
  • Adding a "fallback" CPUDescriptorsHeap in case allocate ever fails. While this is not ideal, it's better than the slow heap corruption that would happen otherwise. This may be optional, as the points above already fix the issue on my side.

With this, a CPUDescriptorsHeap is allocated with a minimum size of 1024 elements, which is then handed out in fixed-size blocks.
This massively reduces the number of ID3D12DescriptorHeap objects allocated.

This addresses descriptor allocation failures and heap corruption in D3D12, specifically fixing issues like #102463 and #103501. The primary issue was DescriptorsHeap::allocate failing for CPU-only descriptor heaps, which this PR solves.
It is somewhat inspired by the SDL3 approach of allocating a single CPU descriptor heap with 1024 entries and distributing it.
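As a rough illustration of the pooling idea described above, here is a simplified linear sketch. It is not the PR's actual code: the class name, the 1024-entry heap size, and the bump-style handout are simplifying assumptions.

```cpp
// Minimal sketch: hand out CPU (non-shader-visible) descriptors from pooled
// 1024-entry heaps instead of creating one ID3D12DescriptorHeap per request.
// All names here are illustrative, not the PR's actual classes.
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdint>
#include <vector>

using Microsoft::WRL::ComPtr;

class CPUDescriptorPool {
public:
	static const uint32_t HEAP_SIZE = 1024; // Minimum heap size used by the pool.

	CPUDescriptorPool(ID3D12Device *p_device, D3D12_DESCRIPTOR_HEAP_TYPE p_type) :
			device(p_device), type(p_type) {
		increment = device->GetDescriptorHandleIncrementSize(type);
	}

	// Allocates p_count contiguous CPU descriptors (assumes p_count <= HEAP_SIZE);
	// a new 1024-entry heap is created only when the current one is exhausted.
	bool allocate(uint32_t p_count, D3D12_CPU_DESCRIPTOR_HANDLE &r_base) {
		if (heaps.empty() || used + p_count > HEAP_SIZE) {
			D3D12_DESCRIPTOR_HEAP_DESC desc = {};
			desc.Type = type;
			desc.NumDescriptors = HEAP_SIZE;
			desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_NONE; // CPU-only, not shader-visible.
			ComPtr<ID3D12DescriptorHeap> heap;
			if (FAILED(device->CreateDescriptorHeap(&desc, IID_PPV_ARGS(&heap)))) {
				return false;
			}
			heaps.push_back(heap);
			used = 0;
		}
		r_base = heaps.back()->GetCPUDescriptorHandleForHeapStart();
		r_base.ptr += SIZE_T(used) * increment;
		used += p_count;
		return true;
	}

private:
	ID3D12Device *device = nullptr;
	D3D12_DESCRIPTOR_HEAP_TYPE type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
	uint32_t increment = 0;
	uint32_t used = 0;
	std::vector<ComPtr<ID3D12DescriptorHeap>> heaps;
};
```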

While my minimal example in #102463 seems to work now even without this PR, local game testing confirms these changes prevent the allocation failures and subsequent corruption. I will try to make another example if needed.

As I am neither a C++ nor a DirectX expert, further review and refinement by experienced developers would be appreciated.

_Bugsquad edit: Fixes #102463, fixes #103501_

@lrahmann lrahmann requested a review from a team as a code owner May 25, 2025 23:13
@clayjohn clayjohn added this to the 4.x milestone May 25, 2025
@clayjohn clayjohn requested a review from DarioSamo May 25, 2025 23:38
@lrahmann lrahmann force-pushed the d3d12_cpu_descriptor_heap branch from 712bb2a to 50856ab on May 26, 2025 09:07
@DarioSamo
Contributor

The primary issue was DescriptorsHeap::allocate failing for CPU-only descriptor heaps

What was the reason behind the failure?

@lrahmann
Contributor Author

The primary issue was DescriptorsHeap::allocate failing for CPU-only descriptor heaps

What was the reason behind the failure?

I honestly have no idea. Probably just the fact that nobody else is doing this, so they don't test their drivers accordingly? We do have crashes with these stack traces across all vendors.

When debugging it locally, the NVIDIA driver spews E_OUTOFMEMORY exceptions whenever you try to allocate a new descriptor heap (non-shader-visible) after hitting this limit. I can send you driver stack traces if you are interested in those.

This then results in broken uniform buffers and the like, and later a segmentation fault.
If needed, I can try to make another minimal viable example.
Getting it to crash needs a somewhat more complex scene, as this doesn't happen immediately.

But this is really affecting us, preventing us from using DirectX as our default backend.

@DarioSamo
Contributor

DarioSamo commented May 26, 2025

When debugging it locally, the NVIDIA driver spews E_OUTOFMEMORY exceptions whenever you try to allocate a new descriptor heap (non-shader-visible) after hitting this limit. I can send you driver stack traces if you are interested in those.

I think the part that bugs me about it is that non-shader-visible descriptor heaps are mostly CPU-side structures. Could the reason be that the driver is unable to create heaps past a certain number (i.e. not the total number of descriptors, but rather the number of heaps)? It would be easy to verify this if you just allocate them linearly until it fails during initialization.
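For reference, a minimal standalone sketch of that verification (assumed code, not part of Godot): keep creating tiny non-shader-visible heaps until the driver refuses, which distinguishes a heap-count limit from a descriptor-count limit.

```cpp
// Creates 1-descriptor, non-shader-visible heaps in a loop until
// CreateDescriptorHeap fails, printing how many heaps the driver allowed.
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
#include <vector>

size_t count_max_cpu_heaps(ID3D12Device *p_device) {
	std::vector<Microsoft::WRL::ComPtr<ID3D12DescriptorHeap>> heaps;
	D3D12_DESCRIPTOR_HEAP_DESC desc = {};
	desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
	desc.NumDescriptors = 1; // Tiny heaps, so only the heap count can be the limit.
	desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_NONE;
	for (size_t i = 0; i < 1000000; i++) { // Safety cap in case no limit is hit.
		Microsoft::WRL::ComPtr<ID3D12DescriptorHeap> heap;
		HRESULT hr = p_device->CreateDescriptorHeap(&desc, IID_PPV_ARGS(&heap));
		if (FAILED(hr)) {
			printf("CreateDescriptorHeap failed with 0x%08x after %zu heaps.\n", (unsigned)hr, heaps.size());
			break;
		}
		heaps.push_back(heap);
	}
	return heaps.size();
}
```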

@lrahmann
Contributor Author

@DarioSamo

It bugs me as well, but the exception does come from one of the NVIDIA driver DLLs. That is probably the reason why it's not documented anywhere.

If you check out 4.3, my example in #102463 is more or less doing this. It's basically a few lines of GDScript creating new GPUParticles3D nodes; after a certain count these errors pop up on the D3D12 backend. This was fixed on 4.5/master, but I'm not sure why, probably because some optimization prevents the uniform sets from being created.

I don't know enough about modern graphics APIs to build an example outside of Godot.

@lrahmann
Contributor Author

When debugging it locally, the NVIDIA driver spews E_OUTOFMEMORY exceptions whenever you try to allocate a new descriptor heap (non-shader-visible) after hitting this limit. I can send you driver stack traces if you are interested in those.

I think the part that bugs me about it is that non-shader-visible descriptor heaps are mostly CPU-side structures. Could the reason be that the driver is unable to create heaps past a certain number (i.e. not the total number of descriptors, but rather the number of heaps)? It would be easy to verify this if you just allocate them linearly until it fails during initialization.

To reply to your edit: it's definitely the number of heaps, not the number of descriptors. Basically, this PR reduces the number of heaps we need to allocate in total but increases the number of descriptors. In fact, I first tried simply pooling the heaps instead of recreating them all the time; that did not fix the issue.

If we do not need linear CPU descriptor sets later on, we could make this more efficient by distributing the descriptor handles separately, but I'm not sure what performance impact that would entail.

@lrahmann lrahmann force-pushed the d3d12_cpu_descriptor_heap branch 2 times, most recently from 4bed1c2 to 63e302f on May 27, 2025 14:51
@DarioSamo
Contributor

DarioSamo commented May 27, 2025

It took me a while, but I do get the scheme the proposed allocator goes for. That said, it seems like it leads to a lot of potential fragmentation and wasted descriptor space if the numbers don't happen to line up well with powers of two, which I can see happening pretty often across the codebase, as uniform sets have irregular sizes most of the time.

I don't think we create uniform sets often enough to warrant an allocator that optimizes so much for speed, so I think we should go with a solution that prioritizes packing descriptors together, especially if the total number of descriptor heaps we're allowed is limited.

@DarioSamo
Contributor

DarioSamo commented May 27, 2025

Here's some bonus reading on the topic, showing how Diligent Engine handles this:

https://diligentgraphics.com/diligent-engine/architecture/d3d12/managing-descriptor-heaps/

The part that might be of particular interest is the allocator:

https://diligentgraphics.com/diligent-engine/architecture/d3d12/variable-size-memory-allocations-manager/
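As a rough illustration of the linked allocator's idea (my own simplified sketch, not Diligent's or Godot's code): free ranges are tracked in a map ordered by offset plus a multimap ordered by size, so allocation is a best-fit lookup and freeing merges adjacent ranges back together.

```cpp
// Minimal variable-size free-list allocator over a single descriptor range.
#include <cstdint>
#include <map>

class VariableSizeAllocator {
public:
	static const uint32_t INVALID_OFFSET = UINT32_MAX;

	explicit VariableSizeAllocator(uint32_t p_capacity) {
		_add_free_range(0, p_capacity);
	}

	uint32_t allocate(uint32_t p_size) {
		// Best fit: smallest free range that can hold the request.
		auto it = free_by_size.lower_bound(p_size);
		if (it == free_by_size.end()) {
			return INVALID_OFFSET;
		}
		uint32_t offset = it->second;
		uint32_t range_size = it->first;
		free_by_size.erase(it);
		free_by_offset.erase(offset);
		if (range_size > p_size) {
			_add_free_range(offset + p_size, range_size - p_size); // Keep the tail free.
		}
		return offset;
	}

	void free(uint32_t p_offset, uint32_t p_size) {
		// Merge with the following free range if it is adjacent.
		auto next = free_by_offset.lower_bound(p_offset);
		if (next != free_by_offset.end() && p_offset + p_size == next->first) {
			p_size += next->second;
			_remove_free_range(next);
		}
		// Merge with the preceding free range if it is adjacent.
		auto prev = free_by_offset.lower_bound(p_offset);
		if (prev != free_by_offset.begin()) {
			--prev;
			if (prev->first + prev->second == p_offset) {
				p_offset = prev->first;
				p_size += prev->second;
				_remove_free_range(prev);
			}
		}
		_add_free_range(p_offset, p_size);
	}

private:
	void _add_free_range(uint32_t p_offset, uint32_t p_size) {
		free_by_offset[p_offset] = p_size;
		free_by_size.insert({ p_size, p_offset });
	}

	void _remove_free_range(std::map<uint32_t, uint32_t>::iterator p_it) {
		// Erase the matching (size, offset) entry from the size multimap.
		auto range = free_by_size.equal_range(p_it->second);
		for (auto it = range.first; it != range.second; ++it) {
			if (it->second == p_it->first) {
				free_by_size.erase(it);
				break;
			}
		}
		free_by_offset.erase(p_it);
	}

	std::map<uint32_t, uint32_t> free_by_offset; // offset -> size
	std::multimap<uint32_t, uint32_t> free_by_size; // size -> offset
};
```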

@lrahmann
Contributor Author

lrahmann commented May 27, 2025

@DarioSamo
An alternative to that would be to go down the SDL route and simply hand out single descriptors.

I can also look into implementing a variable-size allocation manager there, but only if there is a realistic chance of this getting merged afterwards, as it creates a much more complex data structure and we don't really care about the extra memory used here (from my measurements, the additional memory needed for our game is negligible).

I also rechecked the minimal example from #102463 using the current master; in fact it is still printing an error message for me, but only if I pipe stdout somewhere else.

@DarioSamo
Contributor

DarioSamo commented May 27, 2025

As it creates a much more complex data structure and we don't really care about the extra memory used here

My concern isn't really the memory used by it, but that we're dealing with an unknown, limited number of descriptors handled by the driver, and that users will keep finding ways to run into that limit. We may have to come back to this at some point if we fail to optimize for the problem properly: the number of descriptors, or of heaps, seems to be limited, and we can't know what the limit is until we hit it.

but only if there is a realistic chance of this getting merged afterwards

While I can't guarantee that, I'd say there's a high chance considering it fixes a critical runtime error.

@lrahmann lrahmann force-pushed the d3d12_cpu_descriptor_heap branch from 63e302f to 918b861 on May 30, 2025 09:00
@lrahmann
Contributor Author

lrahmann commented May 30, 2025

@DarioSamo

I pushed a version that uses red-black trees to create a structure similar to the one described in the post you referenced.
It allocates multiple buffers within a unified global address space instead of separating them.
Additionally, it currently does not free heaps that are no longer needed; that could easily be added.
The editor is currently prone to crashing for me under DirectX, both on this commit and on the master branch.
I'm not sure how to handle failure within the data structure, but if any of the crash conditions is true, the code would fail later on anyway.

@DarioSamo
Contributor

Where does it crash under this latest commit?

@lrahmann
Contributor Author

Where does it crash under this latest commit?
It crashes under the master commit as well, so I don't think it's related to my PR.
I just retested this with the latest commit b89c47b.
It crashes, for example, when right-clicking on a node in the scene tree.

================================================================
CrashHandlerException: Program crashed
Engine version: Godot Engine v4.5.dev.mono.custom_build (b89c47b)
Dumping the backtrace. Please include this when reporting the bug on: https://github.com/godotengine/godot/issues
[0] RenderingDeviceDriverD3D12::command_buffer_begin (D:\GodotBuild\godotRecentClean\drivers\d3d12\rendering_device_driver_d3d12.cpp:2455)
[1] RenderingDeviceDriverD3D12::command_buffer_begin (D:\GodotBuild\godotRecentClean\drivers\d3d12\rendering_device_driver_d3d12.cpp:2455)
[2] RenderingDeviceGraph::_run_render_commands (D:\GodotBuild\godotRecentClean\servers\rendering\rendering_device_graph.cpp:1061)
[3] RenderingDeviceGraph::end (D:\GodotBuild\godotRecentClean\servers\rendering\rendering_device_graph.cpp:2408)
[4] RenderingDevice::_end_frame (D:\GodotBuild\godotRecentClean\servers\rendering\rendering_device.cpp:6453)
[5] RenderingDevice::swap_buffers (D:\GodotBuild\godotRecentClean\servers\rendering\rendering_device.cpp:6260)
[6] RenderingServerDefault::_draw (D:\GodotBuild\godotRecentClean\servers\rendering\rendering_server_default.cpp:91)
[7] Main::iteration (D:\GodotBuild\godotRecentClean\main\main.cpp:4777)
[8] OS_Windows::run (D:\GodotBuild\godotRecentClean\platform\windows\os_windows.cpp:2278)
[9] widechar_main (D:\GodotBuild\godotRecentClean\platform\windows\godot_windows.cpp:100)
[10] _main (D:\GodotBuild\godotRecentClean\platform\windows\godot_windows.cpp:122)
[11] WinMain (D:\GodotBuild\godotRecentClean\platform\windows\godot_windows.cpp:149)
[12] __scrt_common_main_seh (D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288)
[13] <couldn't map PC to fn name>
-- END OF C++ BACKTRACE --

@DarioSamo
Contributor

DarioSamo commented May 30, 2025

As far as I can tell, under the current implementation it seems like it's just unable to allocate more than 1024 descriptors per type, or whatever larger size the first allocation forced if it was bigger than that.

Once that's done, it has no way to grow, so it makes sense that you reach the limit pretty quickly and end up crashing.

@lrahmann
Contributor Author

@DarioSamo
see my comment: I can allocate new blocks, it's just done in chunks of 1024.
I tested this PR with our game and against the minimal example from the issue before; it allocates a lot more than 1024 handles.

I just checked again; the error being spammed to the console is:
ERROR: Can't create buffer of size: 768, error 0x887a0005.
at: (drivers\d3d12\rendering_device_driver_d3d12.cpp:844)
The crash happens afterwards.
This happens on master and on this PR.
Error 0x887a0005 is DXGI_ERROR_DEVICE_REMOVED, which suggests that something else messed up the rendering pipeline earlier.

@DarioSamo
Contributor

Yeah, my bad, I understand it now. It threw me off a bit that there's no mechanism for storing and freeing the heaps afterwards, so I was under the impression there was only one of them.

@lrahmann
Contributor Author

lrahmann commented May 30, 2025

No problem.
I dug a bit deeper; for reference, here are the constants for Diligent Engine:
https://github.com/DiligentGraphics/DiligentCore/blob/e8f871737f33619de5ceec7474e6906131472c5b/Graphics/GraphicsEngine/interface/GraphicsTypes.h#L3763
They split this number between the different heap types.

Also, they create a separate manager per heap, whereas I create just one manager but lay out the heaps so that their address space has holes between them. This could probably be done in a nicer fashion; I just didn't want to manage the lifecycle of yet another object.
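To make the "holes between heaps" idea concrete, here is a tiny sketch (hypothetical names, not the PR's code) of resolving a global offset back to the heap it lives in with a single ordered-map lookup:

```cpp
// Each heap is registered at a fixed base inside one global offset space; a
// global offset is resolved back to its heap by finding the last base <= offset.
#include <d3d12.h>
#include <cstdint>
#include <map>

struct HeapEntry {
	ID3D12DescriptorHeap *heap = nullptr;
	uint32_t size = 0; // Number of descriptors in this heap.
};

std::map<uint32_t, HeapEntry> heaps_by_base; // global base offset -> heap

// Returns the heap containing p_global_offset and its local descriptor index.
bool resolve(uint32_t p_global_offset, ID3D12DescriptorHeap *&r_heap, uint32_t &r_local_index) {
	auto it = heaps_by_base.upper_bound(p_global_offset);
	if (it == heaps_by_base.begin()) {
		return false; // Offset lies before the first heap.
	}
	--it; // Last heap whose base is <= the global offset.
	uint32_t local = p_global_offset - it->first;
	if (local >= it->second.size) {
		return false; // Offset falls into a hole between heaps.
	}
	r_heap = it->second.heap;
	r_local_index = local;
	return true;
}
```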

@lrahmann
Contributor Author

Okay, my bad, the crash on master was related to a version mismatch in the Agility SDK.
Now both editors run fine.

@DarioSamo
Contributor

Seems alright then. Will you be polishing this to pass CI and do the necessary cleanup on shutdown? Or do you want a review for those bits?

@lrahmann
Contributor Author

I will fix the CI issue ASAP.
There are a few open questions from my side:

  • Is crashing inside the functions an acceptable way to handle errors?
  • When trying to clean up the data structure, it seems like all the heaps are destroyed when the ID3D12Device gets destroyed; at least, calling ->Release() on them inside the destructor results in an exception.
  • Do we want to deallocate DescriptorHeaps that are no longer in use?
@lrahmann lrahmann force-pushed the d3d12_cpu_descriptor_heap branch from 918b861 to 0769d59 on May 30, 2025 17:55
@DarioSamo
Contributor

DarioSamo commented May 30, 2025

Is crashing inside the functions an acceptable way to handle errors?

The preferred way is to use the error macros (ERR_FAIL, ERR_FAIL_COND, etc.).
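For example, a failed allocation might be reported like this. This is an illustrative sketch only: allocate_block, p_count, and BLOCK_SIZE are hypothetical, while the macros themselves come from Godot's core/error/error_macros.h.

```cpp
// Hypothetical allocation wrapper showing the error macros instead of a crash.
bool example_allocate(uint32_t p_count) {
	uint32_t offset = allocate_block(p_count); // Hypothetical helper; returns UINT32_MAX on failure.
	ERR_FAIL_COND_V_MSG(offset == UINT32_MAX, false,
			"Failed to allocate CPU descriptors from the pool.");

	// DEV_ASSERT documents invariants that should only break if the implementation
	// itself is wrong (it is only active in dev builds).
	DEV_ASSERT(offset % BLOCK_SIZE == 0);
	return true;
}
```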

When trying to clean up the data structure, it seems like all the heaps are destroyed when the ID3D12Device gets destroyed; at least, calling ->Release() on them inside the destructor results in an exception.

The application should be properly releasing all uniform sets so this doesn't happen; see the next point.

Do we want to deallocate DescriptorHeaps that are no longer in use?

Yes, Godot actively uses manual resource management. You'll get an explicit free for every uniform set. When uniform sets are deleted, they should deallocate their respective chunk from the descriptor heap pool. When the pool is free, we can choose to also delete the heap if it no longer has any allocations.

The pools should be freed by the driver when shutting down, and we can issue a warning (there's another macro for this) if allocations are still present in the pool when shutdown is reached.
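As a hypothetical sketch of that lifecycle (none of these names are Godot's actual code): each heap tracks how many live chunks it serves, freeing a uniform set returns its chunk, an empty heap can then be released, and shutdown drops whatever is left.

```cpp
#include <d3d12.h>
#include <cstdint>
#include <vector>

struct PooledHeap {
	ID3D12DescriptorHeap *heap = nullptr;
	uint32_t live_allocations = 0; // Chunks currently served from this heap.
};

std::vector<PooledHeap> pooled_heaps;

// Called when a uniform set (or any other owner of a descriptor chunk) is freed.
void release_descriptor_chunk(uint32_t p_heap_index) {
	PooledHeap &ph = pooled_heaps[p_heap_index];
	ph.live_allocations--;
	if (ph.live_allocations == 0 && ph.heap != nullptr) {
		ph.heap->Release(); // No allocations left; the heap itself can be freed.
		ph.heap = nullptr;
	}
}

// At driver shutdown: warn about leaks (WARN_PRINT in Godot), then drop the heaps.
void shutdown_pools() {
	for (PooledHeap &ph : pooled_heaps) {
		if (ph.heap != nullptr) {
			ph.heap->Release();
			ph.heap = nullptr;
		}
	}
	pooled_heaps.clear();
}
```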

@lrahmann
Contributor Author

lrahmann commented May 30, 2025

Yes, Godot actively uses manual resource management. You'll get an explicit free for every uniform set. When uniform sets are deleted, they should deallocate their respective chunk from the descriptor heap pool. When the pool is free, we can choose to also delete the heap if it no longer has any allocations.

There are also a few other things using CPU-bound descriptor heaps, not only uniform sets.
I currently manage all of them in a single pool by using a unified global address space.

The pools should be freed by the driver when shutting down, and we can issue a warning (there's another macro for this) if allocations are still present in the pool when shutdown is reached.

This already happens; the only things left to clean up are the ID3D12DescriptorHeap* objects, as I currently do not free unused DescriptorHeaps like before. This can easily be added.

Anyway, give it a review in its current state and I'll check back next week.
Thank you for your effort.

@DarioSamo
Contributor

DarioSamo commented Jun 12, 2025

I gave it another look, and I'm a bit confused by the use of the concept of a global offset. I figured it would be more straightforward to just store the block indices along with a vector of the currently allocated blocks and reference them directly that way, instead of needing to access the map and search for the closest offset.

It's not exactly a performance concern, just a bit of a barrier to understanding the code and how it lays out the descriptors across the allocated blocks, because the "global offsets" are also placed somewhat arbitrarily and all the blocks seem to share the same map for the free blocks, according to this.

Apart from that, there is work pending to clean up the comments and implement proper error handling, as there are a lot of crash conditions remaining, which is not how we usually handle errors. If you wish to catch implementation errors that should never happen at runtime if the code is correct, DEV_ASSERT is preferred. Under other circumstances, use the error macros and propagate the error accordingly.

@lrahmann
Contributor Author

Hey Dario,

thanks for taking the time to check; I have little availability, so excuse the late reply.

I gave it another look, and I'm a bit confused by the use of the concept of a global offset. I figured it would be more straightforward to just store the block indices along with a vector of the currently allocated blocks and reference them directly that way, instead of needing to access the map and search for the closest offset.

Not sure what you mean; you would still need to search for the closest offset because blocks can be broken down, so the code would look almost the same, there would just be a set of pools per type instead of a single one. This is what Diligent Engine does in the documentation you pointed me at. See https://diligentgraphics.com/diligent-engine/architecture/d3d12/managing-descriptor-heaps/#CPU-Descriptor-Heap
For clarity, I could also use a page-table-like approach where e.g. the lower 10 bits are always within the same block (see the sketch below), but that would mean we need a maximum allocation size.
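A minimal sketch of what that would look like (illustrative constants, assuming a fixed block size of 1024): the lower bits index inside the block and the upper bits select the block, so no search is needed, at the cost of capping allocations at the block size.

```cpp
#include <cstdint>

static const uint32_t BLOCK_BITS = 10;
static const uint32_t BLOCK_SIZE = 1u << BLOCK_BITS; // 1024 descriptors per block.
static const uint32_t BLOCK_MASK = BLOCK_SIZE - 1;

// Which block a global handle belongs to (upper bits).
inline uint32_t block_of(uint32_t p_handle) { return p_handle >> BLOCK_BITS; }
// Index of the descriptor inside its block (lower 10 bits).
inline uint32_t index_in_block(uint32_t p_handle) { return p_handle & BLOCK_MASK; }
```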

It's not exactly a performance concern, just a bit of a barrier to understanding the code and how it lays out the descriptors across the allocated blocks, because the "global offsets" are also placed somewhat arbitrarily and all the blocks seem to share the same map for the free blocks, according to this.

It's actually slightly more performant that way, because we can skip the linear search through all allocations to see which one has free blocks. But that should have no impact, because I don't expect there to be more than a few of these.

Apart from that, there is work pending to clean up the comments and implement proper error handling, as there are a lot of crash conditions remaining, which is not how we usually handle errors. If you wish to catch implementation errors that should never happen at runtime if the code is correct, DEV_ASSERT is preferred. Under other circumstances, use the error macros and propagate the error accordingly.

Yeah, that's what my earlier question was referring to: whether there are any guidelines on how to handle error cases. Almost all of the error cases I have mean that the data structure is likely corrupted and can't be recovered; these could go under a DEV_ASSERT. The open question is what the expected behaviour of a failed allocation should be, as the current one in master leads to heap corruption later on.

@lrahmann
Contributor Author

Hey @DarioSamo ,

is this fix still intended to be merged? It has been running on our export template for quite a while now.

@DarioSamo
Contributor

Hey @DarioSamo ,

is this fix still intended to be merged? It has been running on our export template for quite a while now.

We should definitely have the fix, but the PR itself needs another pass to clean up the commented-out code, adjust it to the code style, and remove the usage of crash conditions; that's not how we traditionally handle errors.

  • Use DEV_ASSERT to catch implementation errors: preconditions or situations that should never happen if the code is implemented correctly.
  • Use the error macros to propagate and handle failures that can happen at runtime. We don't really use crashes for those scenarios.
@DarioSamo
Contributor

I'm in need of this PR as well. Do you mind if I take it over and finish polishing the points I brought up?

@lrahmann
Contributor Author

lrahmann commented Sep 2, 2025

@DarioSamo
I don't mind. I can probably finish this off at the end of this week or next week, but I'm not sure I'll manage to fix all the issues you raised. Sorry for the delay.
Lukas

@DarioSamo
Contributor

@DarioSamo I don't mind. I can probably finish this off at the end of this week or next week, but I'm not sure I'll manage to fix all the issues you raised. Sorry for the delay. Lukas

No worries, I'll take a look at it tomorrow and push the changes I wanted directly.

@DarioSamo DarioSamo force-pushed the d3d12_cpu_descriptor_heap branch from 0769d59 to feac3fd on September 3, 2025 14:41
@DarioSamo
Contributor

I rebased and pushed an extra commit on top to indicate the changes I wanted to add.

I'll do a functionality review next to make sure there are no implementation errors and to see if we can simplify it a bit. If you're okay with the changes, I'll just rebase and squash into a single commit.

Contributor

@DarioSamo DarioSamo left a comment

I believe the PR should be in a good spot now.

The PR is pretty essential for using D3D12 with bigger projects, so its merging shouldn't be delayed too much, but it is not critical if we deem it too big of a change for 4.5.0. If it can't make this milestone, it should be considered for 4.5.1.

@DarioSamo
Contributor

I'll wait for OP to confirm whether the changes are fine and squash it.

@clayjohn clayjohn modified the milestones: 4.x, 4.6 Sep 3, 2025
@clayjohn clayjohn added the cherrypick:4.5 Considered for cherry-picking into a future 4.5.x release label Sep 3, 2025
@lrahmann
Contributor Author

lrahmann commented Sep 3, 2025

Seems good, thanks for your help. You can squash if you like.

@DarioSamo DarioSamo force-pushed the d3d12_cpu_descriptor_heap branch from 8d402c2 to f7fd659 on September 4, 2025 12:18
@DarioSamo
Contributor

Should be squashed now.

@Repiteo Repiteo merged commit 5f5b214 into godotengine:master Sep 16, 2025
20 checks passed
@Repiteo
Contributor

Repiteo commented Sep 16, 2025

Thanks! Congratulations on your first merged contribution! 🎉

@Repiteo
Contributor

Repiteo commented Sep 16, 2025

Cherry-picked to 4.5

@Repiteo Repiteo removed the cherrypick:4.5 Considered for cherry-picking into a future 4.5.x release label Sep 16, 2025
ProfessorOctopus pushed a commit to ProfessorOctopus/godot that referenced this pull request Sep 17, 2025
…tor_heap

DirectX DescriptorsHeap pooling on CPU