In parallel computing, a barrier is a synchronization method. A barrier for a group of threads or processes in the source code means that every thread or process must stop at that point and cannot proceed until all other threads/processes reach this barrier.

Many collective routines and directive-based parallel languages impose implicit barriers. For example, a parallel do loop in Fortran with OpenMP will not be allowed to continue on any thread until the last iteration is completed.[citation needed] This is in case the program relies on the result of the loop immediately after its completion. In message passing, any global communication (such as reduction or scatter) may imply a barrier.

In concurrent computing, a barrier may be in a "raised" or "lowered" state. The term "latch" is sometimes used to refer to a barrier that starts in the raised state and cannot be re-raised once it is in the lowered state. The term "count-down latch" is sometimes used to refer to a latch that is automatically lowered once a predetermined number of threads/processes have arrived.

Implementation

Consider the following example of a thread barrier. The thread barrier requires a variable to keep track of the total number of threads that have entered the barrier. Once enough threads have entered the barrier, it is lifted. A synchronization primitive like a mutex is also needed when implementing the thread barrier.

This thread barrier method is also known as a "centralized barrier", as the threads wait at a single, central point until the expected number of threads have reached the barrier, at which point it is lifted.

This can be demonstrated by the following C example using POSIX threads.

In this program, the thread barrier, struct Barrier, consists of:

  • lock: A POSIX thread mutex lock
  • thread_count: The number of threads that have entered the thread barrier so far
  • barrier_count: Total number of threads expected to enter the thread barrier so that it can be lifted

Based on the definition of a barrier, the implementation requires a function like thread_barrier_wait() in this program, which "monitors" the total number of threads that have arrived in order to lift the barrier.

In this program, every thread that calls thread_barrier_wait() is blocked until THREAD_BARRIERS_NUMBER threads have reached the thread barrier. As the main thread blocks waiting on threads that never pass the barrier, the line "Thread barrier is lifted" is never reached. The result of that program is:

Only two threads are created, each running thread_func() as its thread routine, which calls thread_barrier_wait(&barrier), while the thread barrier expects three threads to call thread_barrier_wait() before it is lifted.

Upon changing TOTAL_THREADS to 3, the thread barrier is lifted:

Sense-reversal centralized barrier

Besides tracking the number of threads that have stopped at the thread barrier, thread barriers may use opposite values to mark each thread's state as passing or stopping. For example, 0 may denote stopping at the barrier while 1 denotes passing it. This is known as "sense reversal", as demonstrated by the following:

This version of the previous centralized barrier implementation introduces two new variables:

  • local_flag: A thread-local boolean to check whether THREAD_BARRIERS_NUMBER threads have arrived at the barrier.
  • flag: A boolean flag in struct Barrier indicating whether THREAD_BARRIERS_NUMBER threads have arrived at the barrier.

When a thread stops at the barrier, its local_flag value is flipped. While fewer than THREAD_BARRIERS_NUMBER threads have stopped at the thread barrier, those threads keep waiting on the condition that barrier.flag is not equal to local_flag.

When there are exactly THREAD_BARRIERS_NUMBER threads stopping at the thread barrier, thread_count is reset to 0 and flag is set to local_flag.

Combining tree barrier

A potential problem with the centralized barrier implementation is that all threads repeatedly access the same global variable to pass or stop, so communication traffic is high, which limits scalability. This problem can be resolved by regrouping the threads and using a multi-level barrier, such as a combining tree barrier. Hardware implementations may also offer greater scalability. A combining tree barrier is a hierarchical way of implementing a barrier that improves scalability by avoiding the situation in which all threads spin on the same location.

In a k-tree barrier, all threads are equally divided into subgroups of k threads, and a first round of synchronization is performed within these subgroups. Once all subgroups have completed their synchronization, the first thread in each subgroup enters a second level for further synchronization. In the second level, as in the first, the threads form new subgroups of k threads and synchronize within each group, sending one thread from each subgroup up to the next level, and so on, until in the final level only one subgroup needs to be synchronized. After the final level of synchronization, the releasing signal is transmitted to the upper levels, and all threads pass the barrier.

Hardware barrier implementation

A hardware barrier uses hardware to implement the above basic barrier model.

The simplest hardware implementation uses dedicated wires to transmit signals to implement a barrier. These dedicated wires perform OR/AND operations to act as pass/block flags and thread counters. For small systems, such a model is sufficient and communication speed is not a major concern. In large multiprocessor systems, this hardware design can lead to high latency in the barrier implementation. The network connection among processors is one means to lower latency, analogous to using a combining tree barrier.

POSIX Threads barrier functions

The POSIX Threads standard directly supports thread barrier functions, which can be used to block the specified threads or the whole process at the barrier until the other threads reach it. POSIX Threads offers three main APIs:

  • pthread_barrier_init(): Initializes the thread barrier with the number of threads necessary to wait at the barrier to lift it.
  • pthread_barrier_destroy(): Destroys the thread barrier and releases its resources.
  • pthread_barrier_wait(): Blocks the current thread until the number of threads specified in pthread_barrier_init() have called pthread_barrier_wait(), lifting the barrier.

The following example in C using POSIX Threads uses thread barriers to block all the threads of the main process, thus blocking the whole process.

As the main process is blocked, the line "Thread barrier is lifted" is never reached, and the result of that source code is:

Only two threads are created, both with thread_func() as the thread function handler, which calls pthread_barrier_wait(&barrier), while the thread barrier expects three threads to call pthread_barrier_wait() in order to be lifted.

Upon changing TOTAL_THREADS to 3, the thread barrier is lifted:

As main() itself is treated as a thread (i.e. the "main" thread of the process), calling pthread_barrier_wait() inside main() will block the whole process until the other threads reach the barrier. The following example, using thread barriers with pthread_barrier_wait() inside main(), blocks the main thread for 5 seconds while waiting for the two "newly created" threads to reach the thread barrier.

This example avoids using pthread_join() to wait for the two "newly created" threads to complete. It calls pthread_barrier_wait() inside main() to block the main thread, so that the process is blocked until the two threads finish their operations after their 5-second wait.

See also

External links

sourceallies.com. March 2012.