DirectX DescriptorsHeap pooling on CPU #106809
Force-pushed from 712bb2a to 50856ab
What was the reason behind the failure?

I honestly have no idea. Probably just the fact that nobody else is doing this, so they don't test their drivers accordingly? We do have crashes with these stack traces across all vendors. When debugging it locally, the NVIDIA driver is spewing E_OUT_OF_MEMORY exceptions whenever you try to allocate a new DescriptorHeap (non shader visible) after hitting this limit. I can send you driver stack traces if you are interested in those. This then results in broken uniform buffers and the like, and somewhere later a segmentation fault. But this is really affecting us, preventing us from using DirectX as a default backend.

I think the part that bugs me about it is that non-shader-visible descriptor heaps are mostly CPU-side structures. Would the reason be that the driver is unable to create heaps past a certain amount (e.g. not the total amount of descriptors but rather the amount of heaps)? It'd be easy to verify this if you just allocate them linearly until it fails during initialization.

It bugs me as well, but the exception does come from one of the NVIDIA driver DLLs. That is probably the reason why it's not documented anywhere. If you check out 4.3, my example in #102463 is more or less doing this. It's basically a few lines of GDScript creating new GPUParticles3D nodes; after a certain count these errors will pop up on the D3D12 backend. This was fixed with 4.5 master, but I'm not sure why; probably some optimization is preventing the uniform sets from being created. I don't know enough about modern graphics APIs to build an example outside of Godot.

To reply to your edit: it's definitely the amount of heaps, not the amount of descriptors. Basically, this PR reduces the amount of heaps we need to allocate in total but increases the amount of descriptors. In fact, I first tried simply pooling the heaps instead of recreating them all the time; that did not fix the issue. If we do not need linear CPU descriptor sets later on, we can make this more optimal by distributing the DescriptorHandles separately, but I'm not sure what performance impact that would entail.
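The heap-count reduction described above can be illustrated with a small standalone sketch (the names here are mine, not the PR's): instead of creating one descriptor heap per uniform set, large heaps are carved into fixed-size blocks, so one driver-side heap object serves many sets.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: one heap of HEAP_CAPACITY descriptors serves
// HEAP_CAPACITY / BLOCK_SIZE uniform sets, so the number of
// ID3D12DescriptorHeap objects the driver sees shrinks by that factor.
// `heaps_created` stands in for real heap handles.
struct BlockPool {
    static const uint32_t HEAP_CAPACITY = 1024; // descriptors per heap
    static const uint32_t BLOCK_SIZE = 16;      // descriptors per block

    uint32_t heaps_created = 0;
    std::vector<uint32_t> free_blocks; // global block indices

    // Returns a global block index; (heap, offset) is derived from it.
    uint32_t allocate_block() {
        if (free_blocks.empty()) {
            // One new heap yields many blocks at once.
            uint32_t first = heaps_created * (HEAP_CAPACITY / BLOCK_SIZE);
            for (uint32_t i = 0; i < HEAP_CAPACITY / BLOCK_SIZE; i++) {
                free_blocks.push_back(first + i);
            }
            heaps_created++;
        }
        uint32_t block = free_blocks.back();
        free_blocks.pop_back();
        return block;
    }

    void free_block(uint32_t block) { free_blocks.push_back(block); }
};
```

With these numbers, 64 uniform sets cost a single heap instead of 64 separate ones, which is exactly the pressure relief the driver limit seems to need.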
Force-pushed from 4bed1c2 to 63e302f
It took me a while, but I do get the scheme the proposed allocator goes for. That said, it seems like it leads to a lot of potential fragmentation and wasted descriptor space if the numbers don't happen to line up well with powers of two, which I can see happening pretty often across the codebase, as uniform sets have irregular sizes most of the time. I don't think we create uniform sets often enough to warrant an allocator that optimizes so much for speed, so I think we should go with a solution that prioritizes packing descriptors together more, especially if the amount of total descriptor heaps we're allowed is limited.
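The waste being worried about here is easy to quantify with a quick sketch (function names are mine): a buddy-style allocator rounds every request up to the next power of two, so an irregular 9-descriptor uniform set occupies a 16-slot block and wastes 7 slots, roughly 44%.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two greater than or equal to v.
uint32_t next_pow2(uint32_t v) {
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

// Descriptors wasted when a request of `size` slots is rounded up
// to a power-of-two block.
uint32_t waste(uint32_t size) {
    return next_pow2(size) - size;
}
```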
Here's some bonus reading on the topic, from how Diligent Engine handles this: https://diligentgraphics.com/diligent-engine/architecture/d3d12/managing-descriptor-heaps/ The part that might be of interest in particular is the allocator:
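For reference, the variable-size free-block manager that article describes can be sketched roughly like this (a simplified standalone version, names are illustrative): free ranges live in two maps, one keyed by offset for coalescing on free and one keyed by size so allocation is an O(log n) best-fit lookup instead of a linear scan.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

class VariableSizeAllocator {
public:
    static const uint32_t INVALID_OFFSET = UINT32_MAX;

    explicit VariableSizeAllocator(uint32_t capacity) {
        add_free_block(0, capacity);
    }

    uint32_t allocate(uint32_t size) {
        // Smallest free block that fits (best fit).
        auto it = free_by_size.lower_bound(size);
        if (it == free_by_size.end()) return INVALID_OFFSET;
        uint32_t offset = it->second;
        uint32_t block_size = it->first;
        remove_free_block(offset, block_size);
        if (block_size > size) {
            add_free_block(offset + size, block_size - size); // keep the tail
        }
        return offset;
    }

    void free(uint32_t offset, uint32_t size) {
        // Merge with the following free block if adjacent.
        auto next = free_by_offset.lower_bound(offset);
        if (next != free_by_offset.end() && offset + size == next->first) {
            size += next->second;
            remove_free_block(next->first, next->second);
        }
        // Merge with the preceding free block if adjacent.
        auto prev = free_by_offset.lower_bound(offset);
        if (prev != free_by_offset.begin()) {
            --prev;
            if (prev->first + prev->second == offset) {
                offset = prev->first;
                size += prev->second;
                remove_free_block(prev->first, prev->second);
            }
        }
        add_free_block(offset, size);
    }

private:
    std::map<uint32_t, uint32_t> free_by_offset;    // offset -> size
    std::multimap<uint32_t, uint32_t> free_by_size; // size -> offset

    void add_free_block(uint32_t offset, uint32_t size) {
        free_by_offset[offset] = size;
        free_by_size.insert({size, offset});
    }

    void remove_free_block(uint32_t offset, uint32_t size) {
        free_by_offset.erase(offset);
        auto range = free_by_size.equal_range(size);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == offset) {
                free_by_size.erase(it);
                break;
            }
        }
    }
};
```

Because freed neighbors are coalesced, irregular uniform set sizes pack tightly rather than being rounded up to powers of two.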
@DarioSamo I can also look into implementing a variable-size allocation manager there, but only if there is a realistic chance of this getting merged afterwards, as it creates a much more complex data structure and we don't really care about the extra memory used here (from my measurements for our game, the additional memory needs are negligible). I also rechecked the minimal example from #102463 using the current master; in fact it is still printing an error message for me, but only if I pipe stdout somewhere else.

My concern isn't really the memory used up by it, just that we're dealing with an unknown amount of limited descriptors handled by the driver and that users will keep finding ways to run into that limit, so we may have to go back to this at some point if we fail to optimize for the problem properly: the problem being that the amount of descriptors, either total or heaps, seems to be limited and we can't know what it is until we hit it.

While I can't guarantee that, I'd say there's a high chance considering it fixes a critical runtime error.
Force-pushed from 63e302f to 918b861
Pushed a version which uses red-black trees to create a structure similar to the one described in your referenced post.

Where does it crash under this latest commit?

As far as I can tell, under the current implementation it seems like it's just unable to allocate more than 1024 descriptors per type (or whatever bigger size an allocation that reached it first might have requested). Once that's done, it has no way to grow, so it makes sense that you reach the limit pretty quickly and end up crashing.

@DarioSamo I just checked again; the error being spammed to the console is

Yeah, my bad, I understand it now. It threw me off a bit that there's no mechanism for storing and freeing the heaps afterwards, so I was under the impression there was only one of them.
No issue. Also, they create a singular manager per heap, whereas I create just one manager but create the heaps in a way that their address space has holes in between them. This could probably be done in a nicer fashion; I just didn't want to manage the lifecycle of yet another object.
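The "one manager, holes in the address space" idea can be sketched like this (the stride value and names are assumptions for illustration, not the PR's actual constants): each heap owns a fixed stride of a single global offset space, so one free-block map covers every heap while a global offset still decodes to a (heap index, local offset) pair. Offsets between a heap's real capacity and the stride are never handed out; those are the holes.

```cpp
#include <cassert>
#include <cstdint>

// Assumed slice of the global offset space reserved per heap.
static const uint32_t HEAP_STRIDE = 1 << 20;

struct DescriptorLocation {
    uint32_t heap_index;
    uint32_t local_offset;
};

// A global offset maps back to a concrete heap and a descriptor slot
// inside it with simple division/modulo.
DescriptorLocation decode_global_offset(uint32_t global_offset) {
    return { global_offset / HEAP_STRIDE, global_offset % HEAP_STRIDE };
}

uint32_t encode_global_offset(uint32_t heap_index, uint32_t local_offset) {
    return heap_index * HEAP_STRIDE + local_offset;
}
```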
Okay, my bad: the crash on master was related to a version mismatch in the Agility SDK.

Seems alright then. Will you be polishing this to pass CI and do the necessary cleanup on shutdown? Or do you want a review for those bits?

Will fix the issue with the CI ASAP.
Force-pushed from 918b861 to 0769d59
The preferred way is to use the error macros (ERR_FAIL, ERR_FAIL_COND, etc.).

The application should be properly releasing all uniform sets so this doesn't happen, see the next point.

Yes, Godot actively uses manual resource management. You'll get an explicit free for every uniform set. When uniform sets are deleted, they should deallocate their respective chunk from the descriptor heap pool. When a pool is empty, we can choose to also delete the heap, as it no longer has any allocations. The pools should be freed by the driver when shutting down, and we can issue a warning (there's another macro for this) if allocations are still present in the pool despite reaching shutdown time.

There are also a few other things using CPU-bound descriptor heaps, not only uniform sets.

This already happens; the only thing left to clean up are the D3D12DescriptorHeap* objects, as I currently do not free unused DescriptorHeaps like before. This can simply be added. Anyway, give it a review in its current state and I'll check back next week.
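The error-macro convention the review asks for looks roughly like the sketch below. The macro here is a simplified stand-in for the real one in Godot's core/error/error_macros.h (the real macro also logs the message with file and line); `allocate_checked` is a hypothetical entry point, not actual driver code.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in: the real ERR_FAIL_COND_V also prints an error
// before bailing out of the current function.
#define ERR_FAIL_COND_V(m_cond, m_retval) \
    do {                                  \
        if (m_cond) {                     \
            return m_retval;              \
        }                                 \
    } while (0)

static const uint32_t INVALID_OFFSET = UINT32_MAX;

// Hypothetical allocation entry point illustrating the pattern: invalid
// input reports an error and returns a sentinel the caller can check,
// instead of taking the process down with a crash condition.
uint32_t allocate_checked(uint32_t requested, uint32_t available) {
    ERR_FAIL_COND_V(requested == 0, INVALID_OFFSET);
    ERR_FAIL_COND_V(requested > available, INVALID_OFFSET);
    return 0; // first free offset in this toy example
}
```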
I gave it another look and I'm a bit confused by the usage of the concept of a global offset. I figured it'd be more straightforward to just store the block indices along with a vector of the currently allocated blocks and reference them directly that way, instead of needing to access the map and do a search for the closest offset. It's not exactly a performance concern, just a bit of a barrier to understanding the code and how it lays out the descriptors across the allocated blocks, because the "global offsets" are also a bit arbitrarily placed and all the blocks seem to share the same map for the free blocks according to this. Apart from that, there's work pending to clean up the comments and implement proper error handling, as there are a lot of crash conditions remaining, which is not usually how we handle errors. If you wish to catch errors in the implementation that should never happen at runtime if the code is correct, use DEV_ASSERT.
Hey Dario, thanks for taking the time to check. I have little availability, so excuse the late reply.

Not sure what you mean; you would still need to search for the closest offset because blocks can be broken down, so the code would look almost the same, there would just be a set of pools per type instead of a single one. This is what Diligent Engine does in the documentation you pointed me at. See https://diligentgraphics.com/diligent-engine/architecture/d3d12/managing-descriptor-heaps/#CPU-Descriptor-Heap

It's actually slightly more performant that way, because we can skip the linear search through all allocations to see which one has free blocks. But that should have no impact because I don't expect there to be more than a few of these.

Yeah, that was what my earlier question was referring to: whether there are any guidelines on how to handle error cases. Almost all of the error cases I have mean that the data structure was likely corrupted and can't be recovered; these could be under a DEV_ASSERT. The open question is what the expected behaviour of a failed allocation should be, as the current one in master will lead to heap corruption later on.
Hey @DarioSamo, is this fix still intended to be merged? It has been running on our export template for quite a while now.

We should definitely have the fix, but the PR itself needs another pass to clean up the commented-out code, adjust the code to the style guide, and remove the usage of crash conditions; that's not how we traditionally handle errors.

I'm in need of this PR as well. Do you mind if I take it over and finish polishing the points I brought up?
@DarioSamo |
No worries, I'll take a look at it tomorrow and push the changes I wanted directly. |
Force-pushed from 0769d59 to feac3fd
I rebased and pushed an extra commit on top to indicate the changes I wanted to add. I'll do a functionality review next to make sure there are no implementation errors and to see if we can simplify it a bit. If you're okay with the changes, I'll just rebase and squash into a single commit.
DarioSamo left a comment
I believe the PR should be in a good spot now.
The PR is pretty essential for using bigger projects with D3D12, so its merging shouldn't be delayed too long, but it is not critical if we deem it too big of a change for 4.5.0. If it can't make this milestone, it should be considered for 4.5.1.
I'll wait for OP to confirm whether the changes are fine and squash it.
Seems good, thanks for your help. You can squash if you like.
Force-pushed from 8d402c2 to f7fd659
Should be squashed now.
Thanks! Congratulations on your first merged contribution! 🎉
Cherry-picked to 4.5 |
This PR refactors/improves the `RenderingDeviceDriverD3D12::DescriptorsHeap` implementation by:

- splitting `RenderingDeviceDriverD3D12::DescriptorsHeap` into `RenderingDeviceDriverD3D12::CPUDescriptorsHeap` and `RenderingDeviceDriverD3D12::GPUDescriptorsHeap`
- pooling `RenderingDeviceDriverD3D12::CPUDescriptorsHeap` to prevent `RenderingDeviceDriverD3D12::CPUDescriptorsHeap::allocate` from failing

With this, each CPUDescriptorsHeap will be allocated with a minimal size of 1024 elements, which is then spread across fixed-size blocks. This massively reduces the number of `ID3D12DescriptorHeap` objects allocated.

This addresses descriptor allocation failures and heap corruption in D3D12, specifically fixing issues like #102463 and #103501. The primary issue was `DescriptorsHeap::allocate` failing for CPU-only descriptor heaps, which this PR solves. It is somewhat inspired by the SDL3 approach of allocating one CPUDescriptorHeap with 1024 entries and distributing it.

While my minimal example in #102463 seems to work now even without this PR, local game testing confirms these changes prevent the allocation failures and subsequent corruption. I will try to make another example if this is needed.

As I am neither a C++ nor a DirectX expert, further review and refinement by experienced developers would be appreciated.

_Bugsquad edit: Fixes #102463, fixes #103501_