The Wayback Machine - https://web.archive.org/web/20221201232750/https://github.com/godotengine/godot/pull/67680

Fix physics/3d/run_on_separate_thread race condition in WorkerThreadPool (crash). #67680

Merged

Conversation

haasanen
Contributor

@haasanen haasanen commented Oct 20, 2022

Fixes #67333, fixes #63671, fixes #63879
May also fix this: #61250

This mutex fix addresses a race condition in WorkerThreadPool, as far as I can tell. The race condition causes an eventual segmentation fault (crash). The comment on the groups.erase(p_group); line is only true when the physics/3d/run_on_separate_thread option in the project settings is disabled.

For example, add the following line next to groups.erase(p_group);:

OS::get_singleton()->print("Thread id: %u\n", Thread::get_caller_id()); // Debug line
groups.erase(p_group); // Threads do not access this, so safe to erase here.

If you have physics/3d/run_on_separate_thread disabled, it prints only a single thread id:

...
Thread id: 3401590889
Thread id: 3401590889
Thread id: 3401590889
Thread id: 3401590889
Thread id: 3401590889
Thread id: 3401590889
...

If you enable physics/3d/run_on_separate_thread, it prints two thread ids:

...
Thread id: 3840933713
Thread id: 3840933713
Thread id: 1959143244
Thread id: 3840933713
Thread id: 3840933713
Thread id: 1959143244
Thread id: 3840933713
Thread id: 3840933713
...

The same is true if you enable the 2D thread in physics/2d/run_on_separate_thread. If you have separate threads enabled for both 2D and 3D, there are still only two thread ids printed.

I understand that this might not be the ideal place to fix the race condition, as the comment on the line suggests that there should not be multiple threads accessing it. However, this does fix the UI crashing in 4.0 beta 3 (or since #63354).

To test this:

  1. Create an empty project and enable physics/3d/run_on_separate_thread in the settings.
  2. Save an empty scene with a Control node as root.
  3. Run the project.
  4. To speed up the crash (could be a placebo): move your mouse on and off the window to generate internal events. Within a minute or so, the program crashes without this PR's fix.

Or you can run any existing project with physics/3d/run_on_separate_thread enabled, and it will crash within minutes without this PR's fix.

I'd really appreciate it if someone else also verified this, as it works on my machine™.

…Physics 2D or 3D is selected to run on a separate thread.
@haasanen haasanen requested a review from a team as a code owner Oct 20, 2022
@Calinou Calinou added this to the 4.0 milestone Oct 20, 2022
@rburing
Member

rburing commented Oct 21, 2022

To me the name run_on_separate_thread suggests that only one non-main thread should be starting those worker thread pools. Using your debug code, can you check if one of the threads is the main thread? If so, we should probably prevent the main thread from starting the worker thread pools in the first place. (I'm no expert on threads, so take my comment with a grain of salt.)

@haasanen
Contributor Author

haasanen commented Oct 21, 2022

To me the name run_on_separate_thread suggests that only one non-main thread should be starting those worker thread pools. Using your debug code, can you check if one of the threads is the main thread? If so, we should probably prevent the main thread from starting the worker thread pools in the first place. (I'm no expert on threads, so take my comment with a grain of salt.)

I modified the debug print statement to:

OS::get_singleton()->print("Thread id: %u, Main id: %u, is main: %s\n",
		Thread::get_caller_id(),
		Thread::get_main_id(),
		Thread::get_main_id() == Thread::get_caller_id() ? "yes" : "no"); // Debug line

With run_on_separate_thread disabled for both 3D and 2D:

...
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
...

With run_on_separate_thread enabled only for 3D:

...
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
...

With run_on_separate_thread enabled only for 2D:

...
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 2197952072, Main id: 2197952072, is main: yes
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 2197952072, Main id: 2197952072, is main: yes
...

With run_on_separate_thread enabled for both 3D and 2D:

...
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 1959143244, Main id: 2197952072, is main: no
Thread id: 1959143244, Main id: 2197952072, is main: no
Thread id: 1959143244, Main id: 2197952072, is main: no
Thread id: 1959143244, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
Thread id: 3401590889, Main id: 2197952072, is main: no
...
@jtnicholl
Contributor

jtnicholl commented Oct 25, 2022

I've tested and can confirm this does fix #67333 for me.

@haasanen
Contributor Author

haasanen commented Oct 25, 2022

I've tested and can confirm this does fix #67333 for me.

Thanks for testing and confirming!

lyuma
lyuma previously requested changes Nov 26, 2022
Contributor

@lyuma lyuma left a comment

Argument in favor of the change: every other access to the groups array was guarded by task_mutex, except this one. The PR locks task_mutex around this last remaining access.

In the original change, @reduz left a comment // Threads do not access this, so safe to erase here. Why, then, was the other access to the groups HashMap in this function already guarded by the task_mutex lock? I think that inconsistency implies there was a mistake in this function, but I have some questions about the proposed fix. See my comment.

-	groups.erase(p_group); // Threads do not access this, so safe to erase here.
+	task_mutex.lock(); // This mutex is needed when Physics 2D and/or 3D is selected to run on a separate thread.
+	groups.erase(p_group);
+	task_mutex.unlock();
Contributor

@lyuma lyuma Nov 26, 2022


I think performance could be improved if that groups.erase(p_group); were moved inside the matching task_mutex.lock(): this would remove the overhead of taking two locks in a row.

	if (group->low_priority_native_tasks.size() > 0) {
		...
		task_mutex.lock();
		group_allocator.free(group);
		groups.erase(p_group);
		task_mutex.unlock();
	} else {
		...
		if (finished_users == max_users) {
			// All tasks using this group are gone (finished before the group), so clear the group too.
			task_mutex.lock();
			group_allocator.free(group);
			groups.erase(p_group);
			task_mutex.unlock();
		} else {
			task_mutex.lock(); // This mutex is needed when Physics 2D and/or 3D is selected to run on a separate thread.
			groups.erase(p_group);
			task_mutex.unlock();
		}
	}

Actually... I wonder if the intent of the original code might have been that groups.erase(p_group); should only happen in the cases where group_allocator.free(group) was called, that is to say if (group->low_priority_native_tasks.size() > 0) in the original case, or if (finished_users == max_users) in the else. If this is true, I might suggest the following code instead.

	if (group->low_priority_native_tasks.size() > 0) {
		...
		task_mutex.lock();
		group_allocator.free(group);
		groups.erase(p_group);
		task_mutex.unlock();
	} else {
		...
		if (finished_users == max_users) {
			// All tasks using this group are gone (finished before the group), so clear the group too.
			task_mutex.lock();
			group_allocator.free(group);
			groups.erase(p_group);
			task_mutex.unlock();
		}
	}
reduz
reduz approved these changes Nov 28, 2022
@reduz
Member

reduz commented Nov 28, 2022

This should be fine, performance should not really be an issue in this context.

@lyuma lyuma dismissed their stale review Nov 29, 2022

My comment is just a performance nit, and reduz responded to it. It shouldn't block merging.

@akien-mga akien-mga merged commit 6fdbf79 into godotengine:master Nov 29, 2022
13 checks passed
@akien-mga
Member

akien-mga commented Nov 29, 2022

Thanks!
