Race condition in thread creation

zeoman · 14 December 2024 13:55

Hi everyone,

I came across Odin a few days ago and must say I’ve very much enjoyed learning it so far (I suppose I am the intended audience since C has been my mainstay language for many years). However, while having some fun testing threads, I came across what seems to be a race condition when starting threads in the following code:

package main
import "core:os"
import "core:fmt"
import thr "core:thread"
import "core:sync"
import "core:time"

nb_threads :: 8
end_barrier : sync.Barrier

say_hello :: proc (self: ^thr.Thread)
{
	fmt.printf ("Hello from %d! (barrier count: %d)\n", self.user_index, end_barrier.thread_count)
	os.flush (os.stdout)
	sync.barrier_wait (&end_barrier)
}

main :: proc ()
{
	fmt.println ("BEGIN")
	fmt.printf ("nb_threads=%d\n", nb_threads)
	sync.barrier_init (&end_barrier, nb_threads)

	threads: [nb_threads]^thr.Thread
	for i in 0 ..< nb_threads
	{
		fmt.printf ("Creating thread %d\n", i)
		threads [i] = thr.create (say_hello)
		threads [i].user_index = i;
	}

	for thread in threads
	{
		fmt.printf ("Starting thread %d\n", thread.user_index)
		thr.start (thread)
	}

	//time.sleep (1.0e9)
	for thread in threads
	{
		thr.join (thread)
		fmt.printf ("Joined thread %d\n", thread.user_index)
	}

	for thread in threads
	{
		saved_index := thread.user_index
		thr.destroy (thread)
		fmt.printf ("Destroyed thread %d\n", saved_index)
	}

	fmt.println ("END")
}

The code runs fine if every thread prints its hello message before any join occurs. However, if a join occurs before one of the threads has printed its message, the program seems to deadlock. Whenever this happens, it is always thread 0 (that is, the first one created and started) that doesn’t print its message and immediately joins with the main thread, apparently ignoring the barrier as well (so I imagine the body of the say_hello proc isn’t executed at all on this thread). The same happens if we reverse the joining order: the last thread may be joined without printing its message, and then the deadlock occurs. The additional 1 second sleep resolves the problem in every case, so it does look like a race condition.

I don’t think similar code using C/Pthreads would lock in this manner, so maybe I am missing some difference in semantics? Maybe an explicit memory barrier is required after thread creation?

Laytan · 14 December 2024 15:07

Hey! I think you are running into this (all duplicates):

thread.join kills thread work instead of waiting for it to finish · Issue #4518 · odin-lang/Odin · GitHub
Odin join does not wait for thread to start before joining. · Issue #3924 · odin-lang/Odin · GitHub
[core/thread] "unexpected" Odin thread semantics that differ from pthread's semantics · Issue #3622 · odin-lang/Odin · GitHub

zeoman · 14 December 2024 16:16

Ok, thanks for your answer! Indeed, I didn’t think of checking the issue tracker. I don’t have an account on GitHub, but if it helps, the following lines seem to be involved:

https://github.com/odin-lang/Odin/blob/8b1c9b0ff5b7ad392fa48e050eae460da8edb982/core/thread/thread_unix.odin#L36-L38

I added a debug message there, and it indeed appears only when there is a deadlock. Commenting out the block entirely seems to prevent the race and the deadlock. I didn’t dive too deeply into the code, but in my understanding, this is used to prevent previously joined threads from being created over?

Feoramund · 14 December 2024 19:03

The barrier is never satisfied, so the program halts once the main thread joins one of the threads waiting on the barrier.

You have a barrier in place as your child thread synchronization method.
This barrier depends on all threads starting.
Thread A is marked as Joined by your call to join.
Thread A, per core:thread, eventually starts, sees that it has permission to start, sees that it has been marked as Joined and returns early without ever calling the thread procedure.
Therefore, the barrier never reaches its count, and all the threads stall.

This code assumes that joining a child thread makes the main thread implicitly wait for it; however, this is not currently the case in Odin. This means it is possible for a thread to never start the user procedure if you issue a join before waiting for and confirming some signal from it.

The way I handle this is to bring the main thread into the synchronization and always wait before joining. You can use a barrier or a wait group for this. In actual fact, if you change the barrier_init to include the main thread and use barrier_wait before joining, your program works as expected.
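A minimal sketch of that change, assuming the same core:thread and core:sync calls used in the original program. The names NB_THREADS and barrier are chosen here for illustration, and error handling (for example, thread.create returning nil) is left out:

```odin
package main

import "core:fmt"
import "core:sync"
import "core:thread"

NB_THREADS :: 8

// Shared by the workers and the main thread.
barrier: sync.Barrier

say_hello :: proc(self: ^thread.Thread) {
	fmt.printf("Hello from %d!\n", self.user_index)
	sync.barrier_wait(&barrier)
}

main :: proc() {
	// One extra participant: the main thread.
	sync.barrier_init(&barrier, NB_THREADS + 1)

	threads: [NB_THREADS]^thread.Thread
	for i in 0 ..< NB_THREADS {
		threads[i] = thread.create(say_hello)
		threads[i].user_index = i
	}
	for t in threads {
		thread.start(t)
	}

	// Blocks until every worker has entered say_hello and reached the barrier,
	// so no worker can still be sitting in front of the early-exit check.
	sync.barrier_wait(&barrier)

	for t in threads {
		thread.join(t)
	}
	for t in threads {
		thread.destroy(t)
	}
	fmt.println("END")
}
```

Because the main thread is counted in the barrier, the first join cannot be issued until all eight workers are already running the user procedure.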

I am responsible for the commit that introduced this code. I added it some time ago to prevent a thread from blocking on the start_ok semaphore if it joined another thread which was created but never started.

However, I had to think for a moment on why I added the other check in the entry proc which you’ve pointed out, since it’s been a while. This early exit prevents a thread which has been created using the Odin core:thread API - but not started using the Odin API - from ever performing any work.

There is no pthreads start, since as soon as you create a thread, it’s running off to do some work. You have to start it if you eventually want to join it, otherwise it’ll just sit on that semaphore and block the main thread, and we’re back to the original situation we were trying to avoid.

In short, this condition helpfully keeps a thread from ever doing anything if it’s never started but is eventually joined.
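For the wait group option mentioned above, a sketch could look like the following. It assumes sync.Wait_Group with wait_group_add, wait_group_done and wait_group_wait from core:sync; the name started is made up for this example, and the workers no longer synchronize with each other:

```odin
package main

import "core:fmt"
import "core:sync"
import "core:thread"

NB_THREADS :: 8

// Counts workers that have not yet entered their procedure.
started: sync.Wait_Group

say_hello :: proc(self: ^thread.Thread) {
	// Signal first: once this line runs, the user procedure has been entered.
	sync.wait_group_done(&started)
	fmt.printf("Hello from %d!\n", self.user_index)
}

main :: proc() {
	sync.wait_group_add(&started, NB_THREADS)

	threads: [NB_THREADS]^thread.Thread
	for i in 0 ..< NB_THREADS {
		threads[i] = thread.create(say_hello)
		threads[i].user_index = i
	}
	for t in threads {
		thread.start(t)
	}

	// Returns once every worker has called wait_group_done.
	sync.wait_group_wait(&started)

	// Joining is now safe: each worker is already inside say_hello.
	for t in threads {
		thread.join(t)
		thread.destroy(t)
	}
}
```

The wait group only confirms that each worker has started; join still does the waiting for the workers to finish.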

zeoman · 16 December 2024 12:53

I see, thanks very much for the insight!