Sphinxify the Atomics documentation.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@159416 91177308-0d34-0410-b5e6-96231b3b80d8
parent 16eeb6f5eb
commit 3bf24bdb1b
docs/Atomics.html (deleted; 569 lines removed)
docs/Atomics.rst (new file, 441 lines)
@@ -0,0 +1,441 @@
.. _atomics:

==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

Historically, LLVM has not had very strong support for concurrency; some minimal
intrinsics were provided, and ``volatile`` was used in some cases to achieve
rough semantics in the presence of concurrency. However, this is changing;
there are now new instructions which are well-defined in the presence of threads
and asynchronous signals, and the model for existing instructions has been
clarified in the IR.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The new C++0x ``<atomic>`` header. (`C++0x draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C1x draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://java.sun.com/docs/books/jls/third_edition/html/memory.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in the
stated order. A couple of examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for a
pair of volatile stores. On the other hand, a non-volatile non-atomic load can
be moved across a volatile load freely, but not an Acquire load.
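
For example, in the following minimal IR sketch (the function and its pointer
argument are hypothetical), the first ``seq_cst`` store is dead and may be
erased, while a pair of volatile stores in the same position would both have to
remain:

.. code-block:: llvm

  ; Sketch only: two back-to-back SequentiallyConsistent stores to the same
  ; address. The first store may be erased; volatile stores could not be.
  define void @double_store(i32* %p) {
    store atomic i32 1, i32* %p seq_cst, align 4
    store atomic i32 2, i32* %p seq_cst, align 4
    ret void
  }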

This document is intended to provide a guide for anyone either writing a frontend
for LLVM or working on optimization passes for LLVM, describing how to deal
with instructions with special semantics in the presence of concurrency. This
is not intended to be a precise guide to the semantics; the details can get
extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically goes into the one optimizer restriction which applies in
concurrent environments, which gets a bit more of an extended description
because any optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store. Note that
LLVM's cmpxchg does not provide quite as many options as the C++0x version.

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load.
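
A minimal sketch of what these instructions look like in IR follows; the global
``@x`` and the particular orderings chosen here are arbitrary examples, not
recommendations:

.. code-block:: llvm

  @x = global i32 0

  define void @instruction_zoo(i32* %p) {
    %v   = load atomic i32* @x acquire, align 4      ; atomic load
    store atomic i32 %v, i32* %p release, align 4    ; atomic store
    %old = cmpxchg i32* @x, i32 0, i32 1 acq_rel     ; compare-and-exchange
    %sum = atomicrmw add i32* @x, i32 1 monotonic    ; read-modify-write
    fence seq_cst                                    ; stand-alone fence
    ret void
  }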

Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can be
impossible to generate.

.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible. (The
  precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals, but
  atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not. See
  `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends are
  also expected to generate an i8 store as an i8 store, and not an instruction
  which writes to surrounding bytes. (If you are writing a backend for an
  architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvmdev.)

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees the operation to be lock-free, so it does not depend on the data being
part of a special atomic structure or on a separate per-process global
lock. Note that code generation will fail for unsupported atomic operations; if
you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads of a native width, but can be expensive or unavailable for
  wider loads, like a 64-bit store on ARM. (A frontend for Java or other "safe"
  languages would normally split a 64-bit store on ARM into two 32-bit unordered
  stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms a
  single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM).
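
For instance, a frontend for a "safe" language might emit something like the
following sketch (the ``@shared`` global and function are hypothetical) for a
plain access to a shared field:

.. code-block:: llvm

  @shared = global i32 0

  define i32 @update_shared(i32 %v) {
    ; Unordered keeps a racing program well-defined, and maps to plain loads
    ; and stores on common targets for native-width types.
    store atomic i32 %v, i32* @shared unordered, align 4
    %r = load atomic i32* @shared unordered, align 4
    ret i32 %r
  }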

Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the relevant
  memory location (and alias analysis will take advantage of that). In addition,
  it is legal to reorder non-atomic and Unordered loads around Monotonic
  loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
  operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as that for unordered for loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.
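
The atomic counter mentioned above is the typical use; a minimal sketch
(hypothetical ``@counter`` global):

.. code-block:: llvm

  @counter = global i32 0

  define i32 @bump() {
    ; Monotonic is sufficient because the counter does not protect any other
    ; memory; the old value is still returned atomically.
    %old = atomicrmw add i32* @counter, i32 1 monotonic
    ret i32 %old
  }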

Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acquire``. It should also be
  used for C++0x/C1x ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant today
  except x86 and SPARC) require some sort of fence to maintain the Acquire
  semantics. The precise fences required vary widely by architecture, but for
  a simple implementation, most architectures provide a barrier which is strong
  enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
  such a fence after the equivalent Monotonic operation is sufficient to
  maintain Acquire semantics for a memory operation.
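
As an illustration, taking a simple spinlock might look like the following
sketch (hypothetical ``@lock`` global, 0 meaning free and 1 meaning held); the
Acquire ordering keeps the critical section's memory accesses from moving above
the exchange:

.. code-block:: llvm

  @lock = global i32 0

  define void @take_lock() {
  entry:
    br label %spin

  spin:
    %old = atomicrmw xchg i32* @lock, i32 1 acquire
    %free = icmp eq i32 %old, 0
    br i1 %free, label %locked, label %spin

  locked:
    ret void
  }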

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.
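
Continuing the spinlock sketch from the Acquire section, releasing the lock is a
single Release store, which keeps the critical section's effects visible before
the lock is observed as free:

.. code-block:: llvm

  define void @drop_lock(i32* %lock) {
    store atomic i32 0, i32* %lock release, align 4
    ret void
  }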

AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.
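
A sketch of a read-modify-write that wants both barriers, for example handing a
value to another thread while also consuming the one it published (the function
and slot are hypothetical):

.. code-block:: llvm

  define i32 @swap_slot(i32* %slot, i32 %v) {
    ; acq_rel: the store half publishes earlier writes (Release) and the load
    ; half synchronizes with the previous owner's Release (Acquire).
    %old = atomicrmw xchg i32* %slot, i32 %v acq_rel
    ret i32 %old
  }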

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.

Relevant standard
  This corresponds to the C++0x/C1x ``memory_order_seq_cst``, Java volatile, and
  the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.
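
The classic pattern that needs SequentiallyConsistent rather than plain
Acquire/Release is the store-then-load handshake sketched below (hypothetical
flags ``@a`` and ``@b``); with weaker orderings, a thread running ``@thread1``
and another running its mirror image could both read 0:

.. code-block:: llvm

  @a = global i32 0
  @b = global i32 0

  define i32 @thread1() {
    store atomic i32 1, i32* @a seq_cst, align 4
    ; the mirror-image thread stores @b and then loads @a
    %r = load atomic i32* @b seq_cst, align 4
    ret i32 %r
  }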

Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important.

* Folding a load: Any atomic load from a constant global can be constant-folded,
  because it cannot be observed (see the sketch below). Similar reasoning allows
  scalarrepl with atomic loads and stores.
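
A sketch of the load-folding case (hypothetical constant global ``@c``): because
``@c`` can never change, even a ``seq_cst`` load from it can simply be replaced
by the constant:

.. code-block:: llvm

  @c = constant i32 42

  define i32 @folded() {
    %v = load atomic i32* @c seq_cst, align 4   ; foldable to "ret i32 42"
    ret i32 %v
  }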

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences are split out as the DAG is built.

The MachineMemOperand for all atomic operations is currently marked as volatile;
this is not correct in the IR sense of volatile, but CodeGen handles anything
marked volatile very conservatively. This should get fixed at some point.

Common architectures have some way of representing at least a pointer-sized
lock-free ``cmpxchg``; such an operation can be used to implement all the other
atomic operations which can be represented in IR up to that size. Backends are
expected to implement all those operations, but not operations which cannot be
implemented in a lock-free manner. It is expected that backends will give an
error when given an operation which cannot be implemented. (The LLVM code
generator is not very helpful here at the moment, but hopefully that will
change.)
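
As an illustration of that reduction, an ``atomicrmw add`` can be expressed as a
``cmpxchg`` loop; this is only a sketch of the idea, not the code the backends
actually emit:

.. code-block:: llvm

  define i32 @rmw_add_via_cmpxchg(i32* %p, i32 %v) {
  entry:
    %init = load atomic i32* %p monotonic, align 4
    br label %loop

  loop:
    %old = phi i32 [ %init, %entry ], [ %seen, %loop ]
    %new = add i32 %old, %v
    ; cmpxchg yields the value that was in memory; retry until %new was installed
    %seen = cmpxchg i32* %p, i32 %old, i32 %new seq_cst
    %ok = icmp eq i32 %seen, %old
    br i1 %ok, label %done, label %loop

  done:
    ret i32 %old
  }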

The implementation of atomics on LL/SC architectures (like ARM) is currently a
bit of a mess; there is a lot of copy-pasted code across targets, and the
representation is relatively unsuited to optimization (it would be nice to be
able to optimize loops involving cmpxchg etc.).

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.

On ARM, MIPS, and many other RISC architectures, Acquire, Release, and
SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.). At the moment, the IR does not provide any way to represent a
weak ``cmpxchg`` which would not require a loop.