# Weak Memory Concurrency in C/C++11 and LLVM

Viktor Vafeiadis

Max Planck Institute for Software Systems (MPI-SWS)

March 2017

### Quiz #1. Should these transformations be allowed?

#### 1. CSE over acquiring a lock:

$$\begin{array}{ll} a = x; & a = x; \\ lock(); & \sim & lock(); \\ b = x; & b = a; \end{array}$$

2. Load hoisting:

$$\begin{array}{ll} \text{if } (c) & t = x; \\ a = x; & & a = c?t:a \end{array}$$

[x is a global variable; a, b, c are local; t is a fresh temporary.]

## Allowing both is clearly wrong!

Consider the transformation sequence:

When c is false, x is moved out of the critical region!

So we have to forbid one transfomation.

- C11 forbids load hoisting, allows CSE over lock().
- LLVM allows load hoisting, forbids CSE over lock().

## Formal model

#### Unambiguous specification

- Which are the possible outcomes of a program.
- Which optimizations may the compiler perform.

#### Typically called a weak memory model (WMM)

Allows more behaviors than thread interleaving.

#### Amenable to formal reasoning

- Can prove theorems about the model.
- Objectively compare memory models.

## Formal model

#### Unambiguous specification

- Which are the possible outcomes of a program.
- Which optimizations may the compiler perform.

#### Typically called a weak memory model (WMM)

Allows more behaviors than thread interleaving.

#### Amenable to formal reasoning

- Can prove theorems about the model.
- Objectively compare memory models.

#### But it is not easy to get right

- ► The Java memory model is flawed.
- ► The C/C++11 model is also flawed.

## Overview



#### WMM desiderata

- 1. Mathematically sane (e.g., monotone)
- 2. Not too weak (good for programmers)
- 3. Not too strong (good for hardware)
- 4. Admits optimizations (good for compilers :-)

## Overview



#### Outline

- How to define a weak memory model?
- ► The C/C++ memory model (a.k.a. C11)
- Unfortunate flaws in C11
- The OOTA problem
- A 'promising' solution

## Three approaches for defining WMMs

#### Operational

Define program semantics with an abstract machine.

#### Transformational

 Define the model as a sequence of program transformations over some basic model (e.g., SC).

#### Axiomatic

 Define the model as a set of consistency constraints on program *executions*.

## Operational approach

Define program semantics with an abstract machine.

- Works well for most hardware models.
- ▶ Very low-level ~→ cumbersome to reason about.
- May require elaborate features for PL models.



## Transformational approach

Define the model as a sequence of program transformations over some basic operational model, such as SC.

For example,

 $\mathsf{TSO} = \mathsf{SC} + \mathsf{WR}\text{-reordering} + \mathsf{RaW}\text{-elimination}$ 

## Transformational approach

Define the model as a sequence of program transformations over some basic operational model, such as SC.

For example,

 $\mathsf{TSO} = \mathsf{SC} + \mathsf{WR}\text{-reordering} + \mathsf{RaW}\text{-elimination}$ 

#### But:

- Applicable only in very few cases.
- Does not work for ARM.

#### ARM weak

$$a = x; \ // 1 \ x = 1; \ y = x; \ x = y;$$

Define the model as a set of consistency constraints on program executions.

Example: Load-buffering

$$a = x; // 1$$
 ||  $b = y; // 1$   
 $y = 1;$  ||  $x = b;$ 

- Works well for hardware models.
- ▶ Followed by C11.
- Problematic for programming languages because of OOTA ("out of thin air") values.

$$[x = y = 0]$$

$$R x, 1 \qquad R y, 1$$

$$\psi y, 1 \qquad \forall x, 1$$

$$Program order$$

$$reads from$$

## The C11 memory model

- Introduced by the ISO C/C++ 2011 standards.
- Defines the semantics of concurrent memory accesses.
- Adopted by the LLVM IR with some changes. (The differences are not relevant for this talk.)

The C11 memory model: Atomics

Two types of locations

Ordinary (Non-Atomic)

Races are errors

Atomic

Welcome to the expert mode

#### A spectrum of accesses



Explicit primitives for fences

## An execution in C11: actions and relations (and axioms)

Initially 
$$x = y = 0$$
.  
 $x = 5;$   
 $y.store(1, release);$  while  $(y.load(acq) == 0);$   
 $print(x);$ 

#### One possible execution



## Relaxed behavior: store buffering

This can return a = b = 0.

#### Justification



Behavior observed on x86/Power/ARM

## Coherence

Programs with a single shared variable behave as under SC.

x.store(1, 
$$rlx$$
);  $a = x.load( $rlx$ );  
x.store(2,  $rlx$ );  $b = x.load( $rlx$ );$$ 

The outcome  $a = 2 \land b = 1$  is forbidden.



## Coherence

Programs with a single shared variable behave as under SC.

x.store(1, 
$$rlx$$
);  $a = x.load( $rlx$ );  
x.store(2,  $rlx$ );  $b = x.load( $rlx$ );$$ 

The outcome  $a = 2 \land b = 1$  is forbidden.

$$\begin{array}{c|c} \mathbb{W}_{\texttt{rlx}} x, 1 & \mathbb{R}_{\texttt{rlx}} x, 2 \\ \mathbb{M}_{\texttt{mox}} & & \downarrow \\ \mathbb{W}_{\texttt{rlx}} x, 2 & \mathbb{R}_{\texttt{rlx}} x, 1 \end{array}$$

- ▶ Modification order, mo<sub>x</sub>, total order of writes to x.
- ▶ Reads-before :  $\mathbf{rb}_{x} \triangleq (\mathtt{rf}^{-1}; \mathtt{mo}_{x}) \cap (\neq)$
- Coherence :  $hb \cup rf_x \cup mo_x \cup rb_x$  is acyclic for all x.

## Causality cycles with relaxed accesses

Initially 
$$x = y = 0$$
.

$$\begin{array}{c|c} \text{if } (x.\textit{load}(\textit{rlx}) == 1) \\ y.\textit{store}(1,\textit{rlx}); \end{array} \quad \begin{array}{c|c} \text{if } (y.\textit{load}(\textit{rlx}) == 1) \\ x.\textit{store}(1,\textit{rlx}); \end{array}$$

C11 allows the outcome x = y = 1.



## No causality cycles with non-atomics

Initially 
$$x = y = 0$$
.  
**if**  $(x == 1) \parallel$ **if**  $(y == 1)$   
 $y = 1; \parallel$  $x = 1;$ 

C11 forbids the outcome x = y = 1.

#### Justification

Non-atomic read axiom:

 $\texttt{rf} \cap (\_ \times \textit{NA}) \subseteq \texttt{hb}$ 

- 1. Mathematically sane?
  - For example, it is monotone.
- 2. Not too weak?
  - Provides useful reasoning principles.
- 3. Not too strong?
  - Can be implemented efficiently.
- 4. Actually useful?
  - Admits the intended program optimizations.

- 1. Mathematically sane?
  - For example, it is monotone.
- 2. Not too weak?
  - Provides useful reasoning principles.
- 3. Not too strong?

✓ Compilation to x86/Power/ARM.

- 4. Actually useful?
  - Admits the intended program optimizations.

- 1. Mathematically sane?
  - For example, it is monotone.
- 2. Not too weak?

 $\approx$  Reasoning principles for C11 subsets.

3. Not too strong?

✓ Compilation to x86/Power/ARM.

4. Actually useful?

Admits the intended program optimizations.

1. Mathematically sane?

X No, it is not monotone.

2. Not too weak?

 $\approx$  Reasoning principles for C11 subsets.

3. Not too strong?

✓ Compilation to x86/Power/ARM.

- 4. Actually useful?
  - Admits the intended program optimizations.



X No, it is not monotone.

2. Not too weak?

 $\approx$  Reasoning principles for C11 subsets.

3. Not too strong?

✓ Compilation to x86/Power/ARM.

4. Actually useful?

X No, it disallows intended program transformations.

1. Mathematically sane?

X No, it is not monotone.

2. Not too weak?

 $\approx$  Reasoning principles for C11 subsets.

3. Not too strong?

X Compilation to Power and ARM is broken.

4. Actually useful?

X No, it disallows intended program transformations.

#### Non-atomic reads of atomic variables are unsound!

Initially, x = 0. x.store(1, rlx);  $\| \begin{array}{l} \text{if } (x.\text{load}(rlx) == 1) \\ t = (\text{int}) x$ ;

The program can get stuck!



- Reading 0 contradicts coherence.
- Reading 1 contradicts the non-atomic read axiom.

## Sequentialization is invalid

Initially, 
$$a = x = y = 0$$
.  

$$a = 1; \quad \begin{vmatrix} \text{if } (x.\text{load}(rlx) == 1) \\ \text{if } (a == 1) \\ y.\text{store}(1, rlx); \end{vmatrix} \quad \text{if } (y.\text{load}(rlx) == 1) \\ x.\text{store}(1, rlx); \end{vmatrix}$$

The only possible output is:

$$a=1, \quad x=y=0.$$

Recall the non-atomic read axiom:

 $\texttt{rf} \cap (\_ \times \textit{NA}) \subseteq \texttt{hb}$ 

## Tentative fixes

Remove non-atomic read axiom.

- gives extremely weak guarantees, if any
- In addition, forbid ( $po \cup rf$ )-cycles.
  - rules out causal loops
  - forbids some reorderings
  - more costly on ARM/Power

Related to the OOTA problem....

More in a couple of slides

## Monotonicity

# "Adding synchronization should not introduce new behaviors"

Examples:

- Reducing parallelism,  $C_1 || C_2 \rightsquigarrow C_1$ ;  $C_2$
- Expression evaluation linearization:

$$x=a+b$$
;  $\rightsquigarrow$   $t_1=a$ ;  $t_2=b$ ;  $x=t_1+t_2$ ;

- Adding a memory fence
- Strengthening the access mode of an operation
- (Roach motel reorderings)

#### C11 semantics for SC accesses is broken!

#### IRIW-acq-sc

$$\begin{array}{c|c} x_{\rm sc} = 1; \\ & a = x_{\rm acq}; \ // \ 1 \\ & c = y_{\rm sc}; \ // \ 0 \end{array} \begin{array}{c|c} b = y_{\rm acq}; \ // \ 1 \\ & d = x_{\rm sc}; \ // \ 0 \end{array} \begin{array}{c|c} y_{\rm sc} = 1; \\ & d = x_{\rm sc}; \ // \ 0 \end{array}$$

C11 disallows the annotated behavior:



The behavior is, however, allowed on POWER/ARM following the "trailing sync" compilation scheme.

(PLDI'17)

## Other problems fixed

The axiom of SC reads is too weak.

Makes strengthening unsound.

The axioms of SC fences are too weak.

They do not guarantee sequential consistency.

The definition of release sequences is too strong.

▶ Removing (po ∪ rf)-final events is unsound.

# The OOTA problem

#### The out-of-thin-air problem in C11

- Initially, x = y = 0.
- All accesses are "relaxed".

#### Load-buffering $a = x; \ // 1$ y = 1; x = y;

This behavior must be allowed: Power/ARM allow it



## The out-of-thin-air problem in C11

Load-buffering + data dependency

$$a = x; // 1 y = a;$$
  $x = y;$ 

The behavior should be forbidden: Values appear out-of-thin-air!



Same execution as before! C11 allows these behaviors

## The out-of-thin-air problem in C11

Load-buffering + data dependency

$$a = x; // 1$$
  
 $y = a;$   $x = y;$ 

The behavior should be forbidden: **Values appear out-of-thin-air!** 

Load-buffering + control dependencies

$$\begin{array}{c} a = x; \ // 1 \\ \text{if} (a == 1) \\ y = 1; \end{array} \quad \begin{array}{c} \text{if} (y == 1) \\ x = 1; \end{array}$$



The behavior should be forbidden: **DRF guarantee is broken!** 

Same execution as before! C11 allows these behaviors

## The hardware solution

Keep track of syntactic dependencies, and forbid "dependency cycles".

Load-buffering + data dependency a = x; // 1y = a; x = y;



## The hardware solution

Keep track of syntactic dependencies, and forbid "dependency cycles".



This approach is not suitable for a programming language: **Compilers do not preserve syntactic dependencies.** 

# A 'promising' solution to OOTA

We propose a model that satisfies all WMM desiderata, and covers nearly all features of C11.

- ► No "out-of-thin-air" values ► Efficient h/w mappings
- DRF guarantees
- Compiler optimizations

**Key idea:** Start with an operational interleaving semantics, but allow threads to **promise** to write in the future.

Store-buffering

 
$$x = y = 0$$
 $x = 1;$ 
 $a = y; \# 0$ 
 $y = 1;$ 
 $b = x; \# 0$ 



Global memory is a pool of messages of the form

*(location : value @ timestamp)* 



Global memory is a pool of messages of the form

*(location : value @ timestamp)* 



Global memory is a pool of messages of the form

*(location : value @ timestamp)* 



Global memory is a pool of messages of the form

*(location : value @ timestamp)* 



Global memory is a pool of messages of the form

*(location : value @ timestamp)* 

Store-bufferingMemory
$$x = y = 0$$
 $\langle x : 0@0 \rangle$  $x = 1;$  $y = 1;$  $a = y; \# 0$  $b = x; \# 0$  $\downarrow$  $\langle y : 0@0 \rangle$  $\langle y : 1@1 \rangle$  $1$ 

Coherence Test

 
$$x = 0$$
 $x = 1;$ 
 $x = 2;$ 
 $a = x; \ // 2$ 
 $b = x; \ // 1$ 

Store-bufferingMemory
$$x = y = 0$$
 $\langle x : 0@0 \rangle$  $x = 1;$  $y = 1;$  $a = y; \# 0$  $b = x; \# 0$  $\downarrow$  $\langle y : 0@0 \rangle$  $\langle y : 1@1 \rangle$  $1$ 



Store-bufferingMemory
$$x = y = 0$$
 $\langle x : 0@0 \rangle$  $x = 1;$  $y = 1;$  $a = y; \# 0$  $b = x; \# 0$  $\downarrow$  $\langle y : 0@0 \rangle$  $\langle y : 1@1 \rangle$  $1$ 

Coherence Test  

$$\begin{array}{c|c}
x = 0 \\
x = 1; \\
a = x; \ // 2
\end{array} \xrightarrow{k = 2; \\
b = x; \ // 1
\end{array}$$
Memory  

$$\begin{array}{c|c}
x : 0@0 \\
\langle x : 0@0 \rangle \\
\langle x : 1@1 \rangle
\end{array}$$

$$\begin{array}{c|c}
T_1's \text{ view} \\
x \\
\hline
X \\
1
\end{array}$$

$$\begin{array}{c|c}
T_2's \text{ view} \\
x \\
\hline
0
\end{array}$$

Store-bufferingMemory
$$x = y = 0$$
 $\langle x : 0@0 \rangle$  $x = 1;$  $y = 1;$  $a = y; \# 0$  $b = x; \# 0$  $\downarrow$  $\langle y : 0@0 \rangle$  $\langle y : 1@1 \rangle$  $1$ 



Store-bufferingMemory
$$x = y = 0$$
 $\langle x : 0@0 \rangle$  $x = 1;$  $y = 1;$  $a = y; \# 0$  $b = x; \# 0$  $\downarrow$  $\langle y : 0@0 \rangle$  $\langle y : 1@1 \rangle$  $1$ 



Store-bufferingMemory
$$x = y = 0$$
 $\langle x : 0@0 \rangle$  $x = 1;$  $y = 1;$  $a = y; \# 0$  $b = x; \# 0$  $\downarrow$  $\langle y : 0@0 \rangle$  $\langle y : 1@1 \rangle$  $1$ 





- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



- To model load-store reordering, we allow "promises".
- At any point, a thread may promise to write a message in the future, allowing other threads to read from the promised message.



| Memory                                                                                              |                                                                                                     |                                                                                              |
|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| $\langle x: 0@0 \rangle$ $\langle y: 0@0 \rangle$ $\langle y: 1@1 \rangle$ $\langle x: 1@1 \rangle$ | $\begin{array}{c c} T_1 \text{'s view} \\ \hline X & y \\ \hline X & \swarrow \\ 1 & 1 \end{array}$ | $\begin{array}{c c} T_2 \text{'s view} \\ \hline X & y \\ \hline X & X \\ 1 & 1 \end{array}$ |
|                                                                                                     |                                                                                                     |                                                                                              |

| Load-buffering + dependency                              |        |  |
|----------------------------------------------------------|--------|--|
| $ \begin{array}{c} a = x; \ // 1 \\ y = a; \end{array} $ | x = y; |  |

Must not admit the same execution!

# Load-buffering x = y = 0 $a = x; \ // 1$ y = 1; x = y;

Load-buffering + dependency  

$$a = x; \# 1$$
  
 $y = a; \# x = y;$ 

## Key Idea

A thread can only promise if it can perform the write anyway (even without having made the promise)

# Certified promises

#### Thread-local certification

A thread can promise to write a message, if it can *thread-locally certify* that its promise will be fulfilled.

# Certified promises

#### Thread-local certification

A thread can promise to write a message, if it can *thread-locally certify* that its promise will be fulfilled.

Load-buffering
$$a = x; \# 1$$
 $y = 1;$  $x = y;$ 

Load buffering + fake dependency  

$$a = x; // 1$$
  
 $y = a + 1 - a; || x = y;$ 

 $T_1$  may promise y = 1, since it is able to write y = 1 by itself.

Load buffering + dependency a = x; // 1y = a; x = y;

 $T_1$  may NOT promise y = 1, since it is not able to write y = 1 by itself.

## Is this behavior possible?

$$a = x; \# 1$$
  
 $x = 1;$ 

## Is this behavior possible?

$$a = x; \# 1$$
  
 $x = 1;$ 

#### No.

Suppose the thread promises x = 1. Then, once a = x reads 1, the thread view is increased and so the promise cannot be fulfilled.

Quiz #3

## Is this behavior possible?

$$a = x; \ // 1 \ x = 1; \ y = x; \ x = y;$$

Quiz #3

#### Is this behavior possible?

$$a = x; \ // 1 \ x = 1; \ y = x; \ x = y;$$

#### Yes. And the ARM model allows it!

Quiz #3

#### Is this behavior possible?

$$egin{array}{c|c} a = x; \ \# \ 1 \ x = 1; \end{array} & y = x; \ \# \ x = y; \end{array}$$

### Yes. And the ARM model allows it!

This behavior can be also explained by sequentialization:

$$\begin{array}{c} a = x; \ // \ 1 \\ x = 1; \end{array} \right\| y = x; \\ x = y; \\ y = x; \end{array} \xrightarrow{\begin{subarray}{c} a = x; \ // \ 1 \\ x = y; \\ y = x; \end{array} } x = y;$$

#### But, note that sequentialization is generally unsound in our model:

$$\begin{array}{c} a := x; \ // 1 \\ \text{if } a = 0 \text{ then} \\ x := 1; \end{array} \left\| y := x; \right\| x := y; \quad \sim \quad \begin{array}{c} a := x; \ // 1 \\ \text{if } a = 0 \text{ then} \\ x := 1; \\ y := x; \end{array} \right\| x := y; \quad \sim \quad \begin{array}{c} a := x; \ // 1 \\ \text{if } a = 0 \text{ then} \\ x := 1; \\ y := x; \end{array} \right\| x := y; \quad \quad \end{array}$$

#### The full model

- Atomic updates (e.g., CAS, fetch-and-add)
- Release/acquire fences and accesses
- Release sequences
- SC fences (no SC accesses)
- Plain accesses (C11's non-atomics & Java's normal accesses)

To achieve all of this we enrich our timestamps, messages, and thread views.

#### Message-passing

$$\begin{array}{c} x = y = 0 \\ x := 1; \\ y :=_{rel} 1; \end{array} \quad \begin{vmatrix} a := y_{acq}; & //1 \\ b := x; & //1 \end{vmatrix}$$

#### Message-passing

$$\begin{array}{c} x = y = 0 \\ \blacktriangleright x := 1; \\ y :=_{rel} 1; \end{array} \qquad \begin{array}{c} a := y_{acq}; \ \ // 1 \\ b := x; \ \ // 1 \end{array}$$



#### Message-passing

$$x = y = 0$$
  

$$x := 1;$$
  

$$y :=_{rel} 1;$$
  

$$a := y_{acq}; //1$$
  

$$b := x; //1$$



#### Message-passing

$$\begin{array}{c|c} \textbf{Memory} \\ \langle x: 0@0 \rangle \\ \langle y: 0@0 \rangle \\ \langle x: 1@1 \rangle \\ \langle y: 1@1 x@1 \rangle \end{array} \end{array} \begin{array}{c} T_1 \text{'s view} \\ \hline T_2 \text{'s view} \\ \hline x y \\ \hline 0 0 \\ \hline 0 \end{array}$$

#### Message-passing

$$\begin{array}{ccc}
x = y = 0 \\
x := 1; \\
y :=_{rel} 1; \\
& \flat := x; \ // 1
\end{array}$$

$$\begin{array}{c|c} \textbf{Memory} \\ \langle x: 0@0 \rangle \\ \langle y: 0@0 \rangle \\ \langle x: 1@1 \rangle \\ \langle y: 1@1 x@1 \rangle \end{array} \end{array} \begin{array}{c} T_1 \text{'s view} \\ T_2 \text{'s view} \\ \hline x y \\ \hline 1 1 \end{array} \end{array}$$

#### Message-passing

$$\begin{array}{c} x = y = 0 \\ x := 1; \\ y :=_{rel} 1; \\ \end{array} \quad \begin{array}{c} a := y_{acq}; \ \ \# 1 \\ b := x; \ \ \# 1 \\ \end{array}$$

$$\begin{array}{c|c} \textbf{Memory} \\ \langle x: 0@0 \rangle \\ \langle y: 0@0 \rangle \\ \langle x: 1@1 \rangle \\ \langle y: 1@1 x@1 \rangle \end{array} \end{array} \begin{array}{c} T_1 \text{'s view} \\ T_2 \text{'s view} \\ x y \\ \hline \chi & \chi \\ 1 1 \end{array} \end{array} \begin{array}{c} T_2 \text{'s view} \\ \hline \chi & y \\ \hline \chi & \chi \\ 1 1 \end{array}$$

#### Certification is needed at every step

#### Key lemma for DRF

Races only on RA under promise-free semantics  $\Rightarrow$  only promise-free behaviors

$$w :=_{rel} 1; \qquad \begin{array}{c} \text{if } w_{acq} = 1 \text{ then} \\ z := 1; \\ \text{else} \\ y :=_{rel} 1; \\ a := x; \ //1 \\ \text{if } a = 1 \text{ then} \\ z := 1; \end{array} \qquad \begin{array}{c} \text{if } y_{acq} = 1 \text{ then} \\ \text{if } z = 1 \text{ then} \\ x := 1; \end{array}$$

# The full model

(POPL'17)

We have extended this basic idea to handle:

- Atomic updates (e.g., CAS, fetch-and-add)
- Release/acquire fences and accesses
- Release sequences
- SC fences
- ▶ Plain accesses (C11's non-atomics & Java's normal accesses)

## Results

- No "out-of-thin-air" values
- DRF guarantees
- Efficient h/w mappings (x86-TSO, Power, ARM)
- Compiler optimizations (incl. reorderings, eliminations)



#### Summary

- ► The need for a WMM.
- C11 is very broken.
- Many of the problems are locally fixable.
- But ruling out OOTA requires an entirely different approach.
- The promising model may be the solution.

Initially, 
$$X = Y = 0$$
.  
 $X = 2;$   
 $a = X; // 3$   
if  $(a \neq 2)$   
 $Y = 1;$   
 $X = 1;$   
 $b = X; // 2$   
 $c = Y; // 1$   
if  $(c)$   
 $X = 3;$   
(Coh-CYC)

This example is taken from the paper "Grounding Thin-Air Reads with Event Structures" by Soham Chakraborty and Viktor Vafeiadis (POPL 2019).