The Surprising World of Memory Ordering

Most programmers have an intuitive model of how memory operations work: when you write a value, it’s immediately visible to every other part of the program. This model works fine for single-threaded code, but it breaks down completely once multiple threads share memory on a multicore system.

Consider this seemingly simple C++ code:

// Shared variables (plain ints: these accesses are unsynchronized)
int x = 0, y = 0;

// Thread 1
void thread1_func() {
    x = 1;
    int r1 = y;
}

// Thread 2
void thread2_func() {
    y = 1;
    int r2 = x;
}

Intuitively, after both threads run, we’d expect one of these outcomes:

  • r1 = 0, r2 = 1 (Thread 1 ran first)
  • r1 = 1, r2 = 0 (Thread 2 ran first)
  • r1 = 1, r2 = 1 (The threads interleaved)

But on many modern architectures, you can actually get r1 = 0, r2 = 0! This happens because CPUs and compilers reorder memory operations for performance whenever the reordering can’t change single-threaded behavior. On x86, for instance, each core’s stores sit in a private store buffer for a while, so a store may not yet be visible to the other core when that core’s load executes. (Strictly speaking, the plain-int version above is also a data race, which is undefined behavior in C++, so the standard allows any outcome at all.)
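
You can even try to observe this on your own machine. Below is a minimal litmus-test sketch; it uses relaxed atomics so that the program stays data-race-free while still leaving the hardware free to reorder each thread’s store and load. Whether the final count is nonzero depends on your CPU and scheduler, and thread-startup overhead can easily hide the effect:

#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;
    int reordered = 0;

    for (int i = 0; i < 100000; ++i) {
        x.store(0, std::memory_order_relaxed);
        y.store(0, std::memory_order_relaxed);

        // Relaxed atomics: no data race, but also no ordering constraints
        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();  // join() synchronizes, so reading r1/r2 below is safe
        t2.join();

        if (r1 == 0 && r2 == 0)
            ++reordered;
    }
    std::printf("r1 == 0 && r2 == 0 observed %d times\n", reordered);
}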

To manage this complexity, C++11 introduced a formal memory model with different memory ordering options:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

void thread1_func() {
    x.store(1, std::memory_order_release);
    int r1 = y.load(std::memory_order_acquire);
}

void thread2_func() {
    y.store(1, std::memory_order_release);
    int r2 = x.load(std::memory_order_acquire);
}

With these memory ordering specifications, we can still get r1 = 0, r2 = 0. Release-acquire only creates ordering between a paired store and load: a release store keeps the thread’s earlier operations from moving after it, and an acquire load keeps the thread’s later operations from moving before it. Neither constraint stops a store from becoming visible after the load that follows it in the same thread, so the store-load reordering that produces the surprising outcome is still allowed.
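
What release-acquire is designed for is the message-passing pattern: data written before a release store becomes visible to any thread whose acquire load reads that store. A minimal sketch:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                // plain data, guarded by the flag below
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                 // 1. write the data
    ready.store(true, std::memory_order_release); // 2. publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // 3. wait for the flag
        ;
    assert(payload == 42); // 4. guaranteed to see the write from step 1
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}

The assert can never fire: the acquire load that reads true synchronizes-with the release store, so everything the producer wrote beforehand is visible to the consumer. Our litmus test has no such paired store and load, which is why release-acquire doesn’t help there.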

If we want to prevent the r1 = 0, r2 = 0 outcome, we would need stronger ordering:

void thread1_func() {
    x.store(1, std::memory_order_seq_cst);
    int r1 = y.load(std::memory_order_seq_cst);
}

void thread2_func() {
    y.store(1, std::memory_order_seq_cst);
    int r2 = x.load(std::memory_order_seq_cst);
}

With sequential consistency (std::memory_order_seq_cst), all such operations participate in a single total order that every thread observes, so the unintuitive outcome is impossible. The guarantee comes at a potential performance cost: on most architectures the compiler has to emit extra barrier instructions.
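
If only a few hot operations need the stronger guarantee, explicit fences are an alternative. In the following sketch, the seq_cst fences are totally ordered with respect to each other, which again rules out r1 = 0, r2 = 0 even though the individual loads and stores are relaxed:

#include <atomic>

std::atomic<int> x{0}, y{0};

void thread1_func() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
    int r1 = y.load(std::memory_order_relaxed);
}

void thread2_func() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full barrier
    int r2 = x.load(std::memory_order_relaxed);
}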

Understanding these subtleties is crucial for high-performance concurrent programming:

  1. Memory operations can be reordered by both the compiler and the CPU
  2. This reordering is usually invisible in single-threaded code
  3. In multi-threaded code, it can lead to surprising behaviors
  4. Using proper synchronization primitives and memory ordering creates the necessary barriers to control reordering (see the mutex sketch after this list)
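
As a baseline for point 4, the bluntest fix is a lock. In the sketch below, std::mutex serializes the two critical sections, so whichever thread runs second must observe the other’s store, and r1 = 0, r2 = 0 becomes impossible:

#include <mutex>

int x = 0, y = 0;
std::mutex m;

void thread1_func() {
    std::lock_guard<std::mutex> lock(m); // critical sections run one at a time
    x = 1;
    int r1 = y;
}

void thread2_func() {
    std::lock_guard<std::mutex> lock(m);
    y = 1;
    int r2 = x;
}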

The key insight is that modern CPUs don’t present a single, instantly consistent view of memory to all cores. Instead, they maintain an intricate illusion that mostly works as expected, but occasionally requires explicit management to preserve correctness.