2012-06-29 09:00:01 +00:00
|
|
|
.. _atomics:
|
|
|
|
|
|
|
|
==============================================
|
|
|
|
LLVM Atomic Instructions and Concurrency Guide
|
|
|
|
==============================================
|
|
|
|
|
|
|
|
.. contents::
|
|
|
|
:local:
|
|
|
|
|
|
|
|
Introduction
|
|
|
|
============
|
|
|
|
|
|
|
|
Historically, LLVM has not had very strong support for concurrency; some minimal
|
|
|
|
intrinsics were provided, and ``volatile`` was used in some cases to achieve
|
|
|
|
rough semantics in the presence of concurrency. However, this is changing;
|
|
|
|
there are now new instructions which are well-defined in the presence of threads
|
|
|
|
and asynchronous signals, and the model for existing instructions has been
|
|
|
|
clarified in the IR.
|
|
|
|
|
|
|
|
The atomic instructions are designed specifically to provide readable IR and
|
|
|
|
optimized code generation for the following:
|
|
|
|
|
|
|
|
* The new C++0x ``<atomic>`` header. (`C++0x draft available here
|
|
|
|
<http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
|
|
|
|
<http://www.open-std.org/jtc1/sc22/wg14/>`_.)
|
|
|
|
|
|
|
|
* Proper semantics for Java-style memory, for both ``volatile`` and regular
|
|
|
|
shared variables. (`Java Specification
|
|
|
|
<http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)
|
|
|
|
|
|
|
|
* gcc-compatible ``__sync_*`` builtins. (`Description
|
|
|
|
<http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)
|
|
|
|
|
|
|
|
* Other scenarios with atomic semantics, including ``static`` variables with
|
|
|
|
non-trivial constructors in C++.
|
|
|
|
|
|
|
|
Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
|
|
|
|
which ensures that every volatile load and store happens and is performed in the
|
|
|
|
stated order. A couple examples: if a SequentiallyConsistent store is
|
|
|
|
immediately followed by another SequentiallyConsistent store to the same
|
|
|
|
address, the first store can be erased. This transformation is not allowed for a
|
|
|
|
pair of volatile stores. On the other hand, a non-volatile non-atomic load can
|
|
|
|
be moved across a volatile load freely, but not an Acquire load.
|
|
|
|
|
|
|
|
This document is intended to provide a guide to anyone either writing a frontend
|
|
|
|
for LLVM or working on optimization passes for LLVM with a guide for how to deal
|
|
|
|
with instructions with special semantics in the presence of concurrency. This
|
|
|
|
is not intended to be a precise guide to the semantics; the details can get
|
|
|
|
extremely complicated and unreadable, and are not usually necessary.
|
|
|
|
|
|
|
|
.. _Optimization outside atomic:
|
|
|
|
|
|
|
|
Optimization outside atomic
|
|
|
|
===========================
|
|
|
|
|
|
|
|
The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
|
|
|
|
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
|
|
|
|
section specifically goes into the one optimizer restriction which applies in
|
|
|
|
concurrent environments, which gets a bit more of an extended description
|
|
|
|
because any optimization dealing with stores needs to be aware of it.
|
|
|
|
|
|
|
|
From the optimizer's point of view, the rule is that if there are not any
|
|
|
|
instructions with atomic ordering involved, concurrency does not matter, with
|
|
|
|
one exception: if a variable might be visible to another thread or signal
|
|
|
|
handler, a store cannot be inserted along a path where it might not execute
|
|
|
|
otherwise. Take the following example:
|
|
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
|
|
|
|
/* C code, for readability; run through clang -O2 -S -emit-llvm to get
|
|
|
|
equivalent IR */
|
|
|
|
int x;
|
|
|
|
void f(int* a) {
|
|
|
|
for (int i = 0; i < 100; i++) {
|
|
|
|
if (a[i])
|
|
|
|
x += 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
The following is equivalent in non-concurrent situations:
|
|
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
|
|
|
|
int x;
|
|
|
|
void f(int* a) {
|
|
|
|
int xtemp = x;
|
|
|
|
for (int i = 0; i < 100; i++) {
|
|
|
|
if (a[i])
|
|
|
|
xtemp += 1;
|
|
|
|
}
|
|
|
|
x = xtemp;
|
|
|
|
}
|
|
|
|
|
|
|
|
However, LLVM is not allowed to transform the former to the latter: it could
|
|
|
|
indirectly introduce undefined behavior if another thread can access ``x`` at
|
|
|
|
the same time. (This example is particularly of interest because before the
|
|
|
|
concurrency model was implemented, LLVM would perform this transformation.)
|
|
|
|
|
|
|
|
Note that speculative loads are allowed; a load which is part of a race returns
|
|
|
|
``undef``, but does not have undefined behavior.
|
|
|
|
|
|
|
|
Atomic instructions
|
|
|
|
===================
|
|
|
|
|
|
|
|
For cases where simple loads and stores are not sufficient, LLVM provides
|
|
|
|
various atomic instructions. The exact guarantees provided depend on the
|
|
|
|
ordering; see `Atomic orderings`_.
|
|
|
|
|
|
|
|
``load atomic`` and ``store atomic`` provide the same basic functionality as
|
|
|
|
non-atomic loads and stores, but provide additional guarantees in situations
|
|
|
|
where threads and signals are involved.
|
|
|
|
|
|
|
|
``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
|
|
|
|
atomic store (where the store is conditional for ``cmpxchg``), but no other
|
|
|
|
memory operation can happen on any thread between the load and store. Note that
|
|
|
|
LLVM's cmpxchg does not provide quite as many options as the C++0x version.
|
|
|
|
|
|
|
|
A ``fence`` provides Acquire and/or Release ordering which is not part of
|
|
|
|
another operation; it is normally used along with Monotonic memory operations.
|
|
|
|
A Monotonic load followed by an Acquire fence is roughly equivalent to an
|
|
|
|
Acquire load.
|
|
|
|
|
|
|
|
Frontends generating atomic instructions generally need to be aware of the
|
|
|
|
target to some degree; atomic instructions are guaranteed to be lock-free, and
|
|
|
|
therefore an instruction which is wider than the target natively supports can be
|
|
|
|
impossible to generate.
|
|
|
|
|
|
|
|
.. _Atomic orderings:
|
|
|
|
|
|
|
|
Atomic orderings
|
|
|
|
================
|
|
|
|
|
|
|
|
In order to achieve a balance between performance and necessary guarantees,
|
|
|
|
there are six levels of atomicity. They are listed in order of strength; each
|
|
|
|
level includes all the guarantees of the previous level except for
|
|
|
|
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)
|
|
|
|
|
|
|
|
.. _NotAtomic:
|
|
|
|
|
|
|
|
NotAtomic
|
|
|
|
---------
|
|
|
|
|
|
|
|
NotAtomic is the obvious, a load or store which is not atomic. (This isn't
|
|
|
|
really a level of atomicity, but is listed here for comparison.) This is
|
|
|
|
essentially a regular load or store. If there is a race on a given memory
|
|
|
|
location, loads from that location return undef.
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This is intended to match shared variables in C/C++, and to be used in any
|
|
|
|
other context where memory access is necessary, and a race is impossible. (The
|
|
|
|
precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
The rule is essentially that all memory accessed with basic loads and stores
|
|
|
|
by multiple threads should be protected by a lock or other synchronization;
|
|
|
|
otherwise, you are likely to run into undefined behavior. If your frontend is
|
|
|
|
for a "safe" language like Java, use Unordered to load and store any shared
|
|
|
|
variable. Note that NotAtomic volatile loads and stores are not properly
|
|
|
|
atomic; do not try to use them as a substitute. (Per the C/C++ standards,
|
|
|
|
volatile does provide some limited guarantees around asynchronous signals, but
|
|
|
|
atomics are generally a better solution.)
|
|
|
|
|
|
|
|
Notes for optimizers
|
|
|
|
Introducing loads to shared variables along a codepath where they would not
|
|
|
|
otherwise exist is allowed; introducing stores to shared variables is not. See
|
|
|
|
`Optimization outside atomic`_.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
The one interesting restriction here is that it is not allowed to write to
|
|
|
|
bytes outside of the bytes relevant to a store. This is mostly relevant to
|
|
|
|
unaligned stores: it is not allowed in general to convert an unaligned store
|
|
|
|
into two aligned stores of the same width as the unaligned store. Backends are
|
|
|
|
also expected to generate an i8 store as an i8 store, and not an instruction
|
|
|
|
which writes to surrounding bytes. (If you are writing a backend for an
|
|
|
|
architecture which cannot satisfy these restrictions and cares about
|
|
|
|
concurrency, please send an email to llvmdev.)
|
|
|
|
|
|
|
|
Unordered
|
|
|
|
---------
|
|
|
|
|
|
|
|
Unordered is the lowest level of atomicity. It essentially guarantees that races
|
|
|
|
produce somewhat sane results instead of having undefined behavior. It also
|
|
|
|
guarantees the operation to be lock-free, so it do not depend on the data being
|
|
|
|
part of a special atomic structure or depend on a separate per-process global
|
|
|
|
lock. Note that code generation will fail for unsupported atomic operations; if
|
|
|
|
you need such an operation, use explicit locking.
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This is intended to match the Java memory model for shared variables.
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
This cannot be used for synchronization, but is useful for Java and other
|
|
|
|
"safe" languages which need to guarantee that the generated code never
|
|
|
|
exhibits undefined behavior. Note that this guarantee is cheap on common
|
|
|
|
platforms for loads of a native width, but can be expensive or unavailable for
|
|
|
|
wider loads, like a 64-bit store on ARM. (A frontend for Java or other "safe"
|
|
|
|
languages would normally split a 64-bit store on ARM into two 32-bit unordered
|
|
|
|
stores.)
|
|
|
|
|
|
|
|
Notes for optimizers
|
|
|
|
In terms of the optimizer, this prohibits any transformation that transforms a
|
|
|
|
single load into multiple loads, transforms a store into multiple stores,
|
|
|
|
narrows a store, or stores a value which would not be stored otherwise. Some
|
|
|
|
examples of unsafe optimizations are narrowing an assignment into a bitfield,
|
|
|
|
rematerializing a load, and turning loads and stores into a memcpy
|
|
|
|
call. Reordering unordered operations is safe, though, and optimizers should
|
|
|
|
take advantage of that because unordered operations are common in languages
|
|
|
|
that need them.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
These operations are required to be atomic in the sense that if you use
|
|
|
|
unordered loads and unordered stores, a load cannot see a value which was
|
|
|
|
never stored. A normal load or store instruction is usually sufficient, but
|
|
|
|
note that an unordered load or store cannot be split into multiple
|
|
|
|
instructions (or an instruction which does multiple memory operations, like
|
|
|
|
``LDRD`` on ARM).
|
|
|
|
|
|
|
|
Monotonic
|
|
|
|
---------
|
|
|
|
|
|
|
|
Monotonic is the weakest level of atomicity that can be used in synchronization
|
|
|
|
primitives, although it does not provide any general synchronization. It
|
|
|
|
essentially guarantees that if you take all the operations affecting a specific
|
|
|
|
address, a consistent ordering exists.
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
|
|
|
|
standards for the exact definition.
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
If you are writing a frontend which uses this directly, use with caution. The
|
|
|
|
guarantees in terms of synchronization are very weak, so make sure these are
|
|
|
|
only used in a pattern which you know is correct. Generally, these would
|
|
|
|
either be used for atomic operations which do not protect other memory (like
|
|
|
|
an atomic counter), or along with a ``fence``.
|
|
|
|
|
|
|
|
Notes for optimizers
|
|
|
|
In terms of the optimizer, this can be treated as a read+write on the relevant
|
|
|
|
memory location (and alias analysis will take advantage of that). In addition,
|
|
|
|
it is legal to reorder non-atomic and Unordered loads around Monotonic
|
|
|
|
loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
|
|
|
|
operations are unlikely to be used in ways which would make those
|
|
|
|
optimizations useful.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
Code generation is essentially the same as that for unordered for loads and
|
|
|
|
stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
|
|
|
|
to appear as a single operation.
|
|
|
|
|
|
|
|
Acquire
|
|
|
|
-------
|
|
|
|
|
|
|
|
Acquire provides a barrier of the sort necessary to acquire a lock to access
|
|
|
|
other memory with normal loads and stores.
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
|
|
|
|
used for C++0x/C1x ``memory_order_consume``.
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
If you are writing a frontend which uses this directly, use with caution.
|
|
|
|
Acquire only provides a semantic guarantee when paired with a Release
|
|
|
|
operation.
|
|
|
|
|
|
|
|
Notes for optimizers
|
|
|
|
Optimizers not aware of atomics can treat this like a nothrow call. It is
|
|
|
|
also possible to move stores from before an Acquire load or read-modify-write
|
|
|
|
operation to after it, and move non-Acquire loads from before an Acquire
|
|
|
|
operation to after it.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
Architectures with weak memory ordering (essentially everything relevant today
|
|
|
|
except x86 and SPARC) require some sort of fence to maintain the Acquire
|
|
|
|
semantics. The precise fences required varies widely by architecture, but for
|
|
|
|
a simple implementation, most architectures provide a barrier which is strong
|
|
|
|
enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
|
|
|
|
such a fence after the equivalent Monotonic operation is sufficient to
|
|
|
|
maintain Acquire semantics for a memory operation.
|
|
|
|
|
|
|
|
Release
|
|
|
|
-------
|
|
|
|
|
|
|
|
Release is similar to Acquire, but with a barrier of the sort necessary to
|
|
|
|
release a lock.
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This corresponds to the C++0x/C1x ``memory_order_release``.
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
If you are writing a frontend which uses this directly, use with caution.
|
|
|
|
Release only provides a semantic guarantee when paired with a Acquire
|
|
|
|
operation.
|
|
|
|
|
|
|
|
Notes for optimizers
|
|
|
|
Optimizers not aware of atomics can treat this like a nothrow call. It is
|
|
|
|
also possible to move loads from after a Release store or read-modify-write
|
|
|
|
operation to before it, and move non-Release stores from after an Release
|
|
|
|
operation to before it.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
See the section on Acquire; a fence before the relevant operation is usually
|
|
|
|
sufficient for Release. Note that a store-store fence is not sufficient to
|
|
|
|
implement Release semantics; store-store fences are generally not exposed to
|
|
|
|
IR because they are extremely difficult to use correctly.
|
|
|
|
|
|
|
|
AcquireRelease
|
|
|
|
--------------
|
|
|
|
|
|
|
|
AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
|
|
|
|
barrier (for fences and operations which both read and write memory).
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This corresponds to the C++0x/C1x ``memory_order_acq_rel``.
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
If you are writing a frontend which uses this directly, use with caution.
|
|
|
|
Acquire only provides a semantic guarantee when paired with a Release
|
|
|
|
operation, and vice versa.
|
|
|
|
|
|
|
|
Notes for optimizers
|
2012-07-23 08:51:15 +00:00
|
|
|
In general, optimizers should treat this like a nothrow call; the possible
|
2012-06-29 09:00:01 +00:00
|
|
|
optimizations are usually not interesting.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
This operation has Acquire and Release semantics; see the sections on Acquire
|
|
|
|
and Release.
|
|
|
|
|
|
|
|
SequentiallyConsistent
|
|
|
|
----------------------
|
|
|
|
|
|
|
|
SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
|
|
|
|
and Release semantics for stores. Additionally, it guarantees that a total
|
|
|
|
ordering exists between all SequentiallyConsistent operations.
|
|
|
|
|
|
|
|
Relevant standard
|
|
|
|
This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile, and
|
|
|
|
the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.
|
|
|
|
|
|
|
|
Notes for frontends
|
|
|
|
If a frontend is exposing atomic operations, these are much easier to reason
|
|
|
|
about for the programmer than other kinds of operations, and using them is
|
|
|
|
generally a practical performance tradeoff.
|
|
|
|
|
|
|
|
Notes for optimizers
|
|
|
|
Optimizers not aware of atomics can treat this like a nothrow call. For
|
|
|
|
SequentiallyConsistent loads and stores, the same reorderings are allowed as
|
|
|
|
for Acquire loads and Release stores, except that SequentiallyConsistent
|
|
|
|
operations may not be reordered.
|
|
|
|
|
|
|
|
Notes for code generation
|
|
|
|
SequentiallyConsistent loads minimally require the same barriers as Acquire
|
|
|
|
operations and SequentiallyConsistent stores require Release
|
|
|
|
barriers. Additionally, the code generator must enforce ordering between
|
|
|
|
SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
|
|
|
|
is usually done by emitting either a full fence before the loads or a full
|
|
|
|
fence after the stores; which is preferred varies by architecture.
|
|
|
|
|
|
|
|
Atomics and IR optimization
|
|
|
|
===========================
|
|
|
|
|
|
|
|
Predicates for optimizer writers to query:
|
|
|
|
|
|
|
|
* ``isSimple()``: A load or store which is not volatile or atomic. This is
|
|
|
|
what, for example, memcpyopt would check for operations it might transform.
|
|
|
|
|
|
|
|
* ``isUnordered()``: A load or store which is not volatile and at most
|
|
|
|
Unordered. This would be checked, for example, by LICM before hoisting an
|
|
|
|
operation.
|
|
|
|
|
|
|
|
* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicate, but note
|
|
|
|
that they return true for any operation which is volatile or at least
|
|
|
|
Monotonic.
|
|
|
|
|
|
|
|
* Alias analysis: Note that AA will return ModRef for anything Acquire or
|
|
|
|
Release, and for the address accessed by any Monotonic operation.
|
|
|
|
|
|
|
|
To support optimizing around atomic operations, make sure you are using the
|
|
|
|
right predicates; everything should work if that is done. If your pass should
|
|
|
|
optimize some atomic operations (Unordered operations in particular), make sure
|
|
|
|
it doesn't replace an atomic load or store with a non-atomic operation.
|
|
|
|
|
|
|
|
Some examples of how optimizations interact with various kinds of atomic
|
|
|
|
operations:
|
|
|
|
|
|
|
|
* ``memcpyopt``: An atomic operation cannot be optimized into part of a
|
|
|
|
memcpy/memset, including unordered loads/stores. It can pull operations
|
|
|
|
across some atomic operations.
|
|
|
|
|
|
|
|
* LICM: Unordered loads/stores can be moved out of a loop. It just treats
|
|
|
|
monotonic operations like a read+write to a memory location, and anything
|
|
|
|
stricter than that like a nothrow call.
|
|
|
|
|
|
|
|
* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
|
|
|
|
be DSE'ed in some cases, but it's tricky to reason about, and not especially
|
|
|
|
important.
|
|
|
|
|
|
|
|
* Folding a load: Any atomic load from a constant global can be constant-folded,
|
|
|
|
because it cannot be observed. Similar reasoning allows scalarrepl with
|
|
|
|
atomic loads and stores.
|
|
|
|
|
|
|
|
Atomics and Codegen
|
|
|
|
===================
|
|
|
|
|
|
|
|
Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
|
|
|
|
On architectures which use barrier instructions for all atomic ordering (like
|
|
|
|
ARM), appropriate fences are split out as the DAG is built.
|
|
|
|
|
|
|
|
The MachineMemOperand for all atomic operations is currently marked as volatile;
|
|
|
|
this is not correct in the IR sense of volatile, but CodeGen handles anything
|
|
|
|
marked volatile very conservatively. This should get fixed at some point.
|
|
|
|
|
|
|
|
Common architectures have some way of representing at least a pointer-sized
|
|
|
|
lock-free ``cmpxchg``; such an operation can be used to implement all the other
|
|
|
|
atomic operations which can be represented in IR up to that size. Backends are
|
|
|
|
expected to implement all those operations, but not operations which cannot be
|
|
|
|
implemented in a lock-free manner. It is expected that backends will give an
|
|
|
|
error when given an operation which cannot be implemented. (The LLVM code
|
|
|
|
generator is not very helpful here at the moment, but hopefully that will
|
|
|
|
change.)
|
|
|
|
|
|
|
|
The implementation of atomics on LL/SC architectures (like ARM) is currently a
|
|
|
|
bit of a mess; there is a lot of copy-pasted code across targets, and the
|
|
|
|
representation is relatively unsuited to optimization (it would be nice to be
|
|
|
|
able to optimize loops involving cmpxchg etc.).
|
|
|
|
|
|
|
|
On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
|
|
|
|
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
|
|
|
|
fences generate an ``MFENCE``, other fences do not cause any code to be
|
|
|
|
generated. cmpxchg uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
|
|
|
|
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
|
|
|
|
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
|
|
|
|
on the users of the result, some ``atomicrmw`` operations can be translated into
|
|
|
|
operations like ``LOCK AND``, but that does not work in general.
|
|
|
|
|
|
|
|
On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
|
|
|
|
SequentiallyConsistent semantics require barrier instructions for every such
|
|
|
|
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
|
|
|
|
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
|
|
|
|
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
|
|
|
|
on ARM, etc.). At the moment, the IR does not provide any way to represent a
|
|
|
|
weak ``cmpxchg`` which would not require a loop.
|