Memory Ordering in Multicore Systems: The Hidden Complexity
The Surprising World of Memory Ordering
Most programmers have an intuitive model of how memory operations work: when you write a value, it’s immediately visible to all parts of the program. This model works fine for single-threaded code, but breaks down completely in multicore systems.
Consider this seemingly simple C++ code:
// Shared variables
int x = 0, y = 0;

// Thread 1
void thread1_func() {
    x = 1;
    int r1 = y;
}

// Thread 2
void thread2_func() {
    y = 1;
    int r2 = x;
}
Intuitively, after both threads run, we’d expect one of these outcomes:
- r1 = 0, r2 = 1 (Thread 1 ran first)
- r1 = 1, r2 = 0 (Thread 2 ran first)
- r1 = 1, r2 = 1 (The threads interleaved)
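But on many modern architectures, including x86, you can actually get r1 = 0, r2 = 0! This happens because CPUs and compilers reorder memory operations for performance when the reordering does not change single-threaded behavior. Here, each thread's store can still be sitting in a store buffer while its later load executes, so both loads can read the old value. (Strictly speaking, the code above accesses plain int variables from two threads without synchronization, which is a data race and undefined behavior in C++. The atomic versions that follow are the well-defined way to write it.)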
To manage this complexity, C++11 introduced a formal memory model with different memory ordering options:
#include <atomic>
#include <thread>
std::atomic<int> x{0}, y{0};
void thread1_func() {
    x.store(1, std::memory_order_release);
    int r1 = y.load(std::memory_order_acquire);
}

void thread2_func() {
    y.store(1, std::memory_order_release);
    int r2 = x.load(std::memory_order_acquire);
}
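With these memory ordering specifications, we still might get r1 = 0, r2 = 0. Release-acquire only orders operations around a matching pair: an acquire load that reads the value written by a release store is guaranteed to see everything the writing thread did before that store. It does not stop a thread's own store from being reordered after its later load of a different variable, which is exactly the reordering at work here.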
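To make concrete what release-acquire does guarantee, here is a minimal sketch of the usual message-passing pattern. The names payload, ready, producer, consumer, and run_message_passing are hypothetical and chosen only for illustration. The consumer's acquire load reads the value written by the producer's release store, so the plain write to payload is guaranteed to be visible afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration: payload is ordinary (non-atomic) data,
// ready is the atomic flag that publishes it.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain write, sequenced before the store
    ready.store(true, std::memory_order_release);  // publish: earlier writes become visible
}

int consumer() {
    // Spin until the acquire load observes the release store.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    assert(payload == 42);  // holds: the acquire load synchronized with the release store
    return payload;
}

// Resets the shared state, runs one producer and one consumer, and
// returns the value the consumer observed.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread t1(producer);
    std::thread t2([&seen] { seen = consumer(); });
    t1.join();
    t2.join();
    return seen;
}
```

Calling run_message_passing() should always return 42. Note that this pattern has one writer and one reader of the flag, which is a different shape from the two-store, two-load example above.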
If we want to prevent the r1 = 0, r2 = 0 outcome, we would need stronger ordering:
void thread1_func() {
    x.store(1, std::memory_order_seq_cst);
    int r1 = y.load(std::memory_order_seq_cst);
}

void thread2_func() {
    y.store(1, std::memory_order_seq_cst);
    int r2 = x.load(std::memory_order_seq_cst);
}
With sequential consistency (std::memory_order_seq_cst), the unintuitive outcome is impossible, but at a potential performance cost.
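One way to see the sequentially consistent version in action is a small harness that repeats the two-thread test many times and counts how often both loads return 0. This is only a sketch: count_both_zero is a hypothetical helper, and the atomics are made local to each iteration so that every run starts from a fresh state.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the two-thread test `iterations` times with
// seq_cst operations and counts the runs where r1 == 0 and r2 == 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};  // fresh shared state for each run
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });

        a.join();  // join makes r1 and r2 safe to read here
        b.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Because all four operations take part in a single total order, the count should stay at 0 no matter how many iterations you run. Swapping the orderings for release and acquire makes a nonzero count permitted, although whether you actually observe one depends on the hardware and timing.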
Understanding these subtleties is crucial for high-performance concurrent programming:
- Memory operations can be reordered by both the compiler and the CPU
- This reordering is usually invisible in single-threaded code
- In multi-threaded code, it can lead to surprising behaviors
- Using proper synchronization primitives and memory ordering creates the necessary barriers to control reordering
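As an illustration of the last point, an explicit barrier can do the same job as sequentially consistent operations. The sketch below uses relaxed stores and loads separated by std::atomic_thread_fence(std::memory_order_seq_cst). The helper names fenced_run_sees_both_zero and count_fenced_both_zero are hypothetical, and this is one possible way to write it rather than the only one.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the two-thread test once, using relaxed accesses
// with a seq_cst fence between each thread's store and its load.
// Returns true if both loads read 0.
bool fenced_run_sees_both_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier
        r2 = x.load(std::memory_order_relaxed);
    });

    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the fenced test and counts how often both loads read 0.
int count_fenced_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        if (fenced_run_sees_both_zero()) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

The two fences are totally ordered with respect to each other, so whichever thread's fence comes second is guaranteed to see the other thread's store, and the count should remain 0. Removing either fence makes the r1 = 0, r2 = 0 outcome permitted again.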
The key insight is that modern CPUs don’t present a simple unified view of memory to all cores. Instead, they create an intricate illusion that mostly works as expected, but occasionally requires explicit management to maintain correctness.