Berkeley SoftFloat Release 2c General Documentation

John R. Hauser
2015 January 30


----------------------------------------------------------------------------
Introduction

Berkeley SoftFloat is a software implementation of binary floating-point
that conforms to the IEEE Standard for Floating-Point Arithmetic.  For
Release 2c of SoftFloat, as many as four formats are supported:  32-bit
single-precision, 64-bit double-precision, 80-bit double-extended-precision,
and 128-bit quadruple-precision.  All operations required by the older 1985
version of the IEEE Standard are implemented, except for conversions to and
from decimal.

This document gives information about the types defined and the routines
implemented by this release of SoftFloat.  It does not attempt to define or
explain the IEEE Floating-Point Standard.  Details about the standard are
available elsewhere.


----------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code.  The
SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
has been made to accommodate compilers that are not ISO-conformant.  In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the 80-bit double-extended-precision and 128-bit quadruple-
precision formats depends on a C compiler that implements 64-bit integer
arithmetic.  If the largest integer format supported by the C compiler is
32 bits, SoftFloat is limited to only 32-bit single-precision and 64-bit
double-precision.  When that is the case, all references in this document
to 80-bit double-extended-precision, 128-bit quadruple-precision, and 64-bit
integers should be ignored.


----------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Double-Extended-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Basic Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



----------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser.  Release 2c of SoftFloat was made
possible in part by the International Computer Science Institute, located
at Suite 600, 1947 Center Street, Berkeley, California 94704.  Funding
was partially provided by the National Science Foundation under grant
MIP-9311980.  The original version of this code was written as part of a
project to build a fixed-point vector processor in collaboration with the
University of California at Berkeley, overseen by Profs. Nelson Morgan and
John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TOLERATE ALL LOSSES, COSTS, OR
OTHER PROBLEMS THEY INCUR DUE TO THE SOFTWARE WITHOUT RECOMPENSE FROM JOHN
HAUSER OR THE INTERNATIONAL COMPUTER SCIENCE INSTITUTE, AND WHO FURTHERMORE
EFFECTIVELY INDEMNIFY JOHN HAUSER AND THE INTERNATIONAL COMPUTER SCIENCE
INSTITUTE (possibly via similar legal notice) AGAINST ALL LOSSES, COSTS, OR
OTHER PROBLEMS INCURRED BY THEIR CUSTOMERS AND CLIENTS DUE TO THE SOFTWARE,
OR INCURRED BY ANYONE DUE TO A DERIVATIVE WORK THEY CREATE USING ANY PART OF
THE SOFTWARE.

The following are expressly permitted, even for commercial purposes:
(1) distribution of SoftFloat in whole or in part, as long as this and
other legal notices remain and are prominent, and provided also that, for a
partial distribution, prominent notice is given that it is a subset of the
original; and
(2) inclusion or use of SoftFloat in whole or in part in a derivative
work, provided that the use restrictions above are met and the minimal
documentation requirements stated in the source code are satisfied.


----------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types:  `float32' (32-bit single-precision), `float64'
(64-bit double-precision), `floatx80' (80-bit double-extended-precision),
and `float128' (128-bit quadruple-precision).  The `float32' and `float64'
types are defined in terms of 32-bit and 64-bit integer types, respectively,
while the `float128' type is defined as a structure of two 64-bit integers,
taking into account the byte order of the particular machine being used.
The `floatx80' type is defined as a structure containing one 16-bit and one
64-bit integer, with the machine's byte order again determining the order
within the structure.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types:  `float32' and `float64'.  Because
the ISO/ANSI C Standard guarantees at least one built-in integer type of
32 bits, the `float32' type is identified with an appropriate integer type.
The `float64' type is defined as a structure of two 32-bit integers, with
the machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types.  (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)
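
As a rough illustration of the memory-layout claim above, the following
sketch copies a native `float' into a 32-bit integer and back.  It assumes
a platform where `float' is the 32-bit IEEE single-precision format.  The
names `f32_bits', `bits_from_native', and `native_from_bits' are
hypothetical and are not part of SoftFloat.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a float32 type that is defined as a 32-bit
   integer.  Assumes sizeof(float) == 4 and IEEE single-precision. */
typedef uint32_t f32_bits;

/* Copy the bytes of a native float into the integer representation. */
f32_bits bits_from_native(float x)
{
    f32_bits b;
    memcpy(&b, &x, sizeof b);
    return b;
}

/* Copy the integer representation back into a native float. */
float native_from_bits(f32_bits b)
{
    float x;
    memcpy(&x, &b, sizeof x);
    return x;
}
```

Using `memcpy' rather than a pointer cast keeps the copy within the rules
of ISO C; a call such as `bits_from_native(1.0f)' would then yield a value
that could be handed to a routine expecting the integer-based type.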

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations for
   all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEEE Standard.

-- For each floating-point format, a "round to integer" operation that
   rounds to the nearest integer value in the same format.  (The floating-
   point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the 1985 IEEE Standard that are not provided
are conversions to and from decimal.


----------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the 1985 IEEE Standard are implemented
for all operations that require rounding.  The rounding mode is selected
by the global variable `float_rounding_mode'.  This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'.  The rounding mode is initialized
to nearest/even.
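
To make the difference between the four modes concrete, here is a small
integer-only sketch that rounds a value given as a count of quarters (so
10 means 2.5) to a whole number.  It does not use SoftFloat; the enum and
the function `round_quarters' are hypothetical names chosen for this
illustration, and only the rounding rule itself mirrors the modes listed
above.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the four rounding modes; the real constants
   are the ones defined by softfloat.h. */
enum round_mode { ROUND_NEAREST_EVEN, ROUND_TO_ZERO, ROUND_DOWN, ROUND_UP };

/* Round the value q/4 to an integer under the given mode. */
int32_t round_quarters(int32_t q, enum round_mode mode)
{
    /* floor(q/4), computed without relying on shifts of negative values */
    int32_t lower = (q >= 0) ? q / 4 : -((-q + 3) / 4);
    int32_t frac = q - lower * 4;          /* 0..3 quarters above lower */

    if (frac == 0) return lower;           /* already an integer: exact */
    switch (mode) {
    case ROUND_DOWN:    return lower;                       /* toward -infinity */
    case ROUND_UP:      return lower + 1;                   /* toward +infinity */
    case ROUND_TO_ZERO: return (q < 0) ? lower + 1 : lower; /* drop the fraction */
    default:                                                /* nearest, ties to even */
        if (frac < 2) return lower;
        if (frac > 2) return lower + 1;
        return (lower % 2 == 0) ? lower : lower + 1;
    }
}
```

With these definitions, 2.5 (q = 10) rounds to 2 under nearest/even and to
3 under round-up, while -2.5 (q = -10) rounds to -3 under round-down and
to -2 under the other three modes.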


----------------------------------------------------------------------------
Double-Extended-Precision Rounding Precision

For 80-bit double-extended-precision (`floatx80') only, the rounding
precision of the basic arithmetic operations is controlled by the global
variable `floatx80_rounding_precision'.  The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80,
these operations are rounded (as usual) to the full precision of the 80-bit
double-extended-precision format.  Setting `floatx80_rounding_precision' to
32 or to 64 causes the operations listed to be rounded to reduced precision
equivalent to 32-bit single-precision (`float32') or to 64-bit double-
precision (`float64'), respectively.  When rounding to reduced precision,
additional bits in the result significand beyond the rounding point are set
to zero.  The consequences of setting `floatx80_rounding_precision' to a
value other than 32, 64, or 80 are not specified.  Operations other than the
ones listed above are not affected by `floatx80_rounding_precision'.


----------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEEE Standard are implemented.
Each flag is stored as a separate bit in the global variable
`float_exception_flags'.  The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'.  The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name.  To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).
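
The flag handling described above can be sketched with a stand-alone
model.  Everything in the following block is hypothetical: the variable
`exception_flags', the mask values, and the helpers `raise_flags' and
`test_and_clear' only imitate the pattern and are not SoftFloat's own
definitions (the real masks and their bit positions come from
`softfloat.h' and may differ between targets).

```c
#include <stdint.h>

/* Hypothetical mask values, one bit per exception. */
enum {
    flag_inexact   = 1,
    flag_underflow = 2,
    flag_overflow  = 4,
    flag_divbyzero = 8,
    flag_invalid   = 16
};

/* Stand-in for the global flags variable; 0 means no exceptions. */
uint8_t exception_flags = 0;

/* Set every flag named in mask (flags are sticky until cleared). */
void raise_flags(uint8_t mask)
{
    exception_flags |= mask;
}

/* Report whether any flag in mask was set, then clear those flags using
   the same "&= ~mask" idiom shown in the text. */
int test_and_clear(uint8_t mask)
{
    int was_set = (exception_flags & mask) != 0;
    exception_flags &= (uint8_t)~mask;
    return was_set;
}
```

A typical use is to clear the flags of interest before a computation and
call something like `test_and_clear(flag_overflow)' afterward; flags that
were not named in the mask are left untouched.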

In the terminology of the IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding.  The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.  The other option is provided for compatibility
with some systems.  Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


----------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.  The complete set of conversion functions is:

    int32_to_float32      int64_to_float32
    int32_to_float64      int64_to_float64
    int32_to_floatx80     int64_to_floatx80
    int32_to_float128     int64_to_float128

    float32_to_int32      float32_to_int64
    float64_to_int32      float64_to_int64
    floatx80_to_int32     floatx80_to_int64
    float128_to_int32     float128_to_int64

    float32_to_float64    float32_to_floatx80   float32_to_float128
    float64_to_float32    float64_to_floatx80   float64_to_float128
    floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
    float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result.  Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding.  Conversions from 32-bit
integers to 64-bit double-precision and larger formats are also exact, and
likewise for conversions from 64-bit integers to 80-bit double-extended-
precision and 128-bit quadruple-precision.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
positive integer is returned.  Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already
an integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.  Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
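
The NaN and overflow rules above, combined with rounding toward zero, can
be sketched for the 32-bit single-precision format as follows.  This is
an independent illustration and not SoftFloat's source: the function name
`f32_bits_to_int32_rtz' and its `invalid' output parameter are
hypothetical, the operand is passed as a raw IEEE bit pattern, and the
inexact flag is left out for brevity.

```c
#include <stdint.h>

/* Convert an IEEE single-precision bit pattern to int32, rounding toward
   zero.  *invalid is set to 1 when the value is a NaN or does not fit. */
int32_t f32_bits_to_int32_rtz(uint32_t bits, int *invalid)
{
    uint32_t sign = bits >> 31;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF);   /* biased exponent */
    uint32_t frac = bits & 0x007FFFFF;

    *invalid = 0;
    if (exp == 0xFF && frac != 0) {        /* NaN: largest positive integer */
        *invalid = 1;
        return INT32_MAX;
    }
    if (exp < 127) return 0;               /* magnitude below 1 truncates to 0 */
    if (exp >= 127 + 31) {                 /* magnitude of at least 2^31 */
        if (sign && exp == 158 && frac == 0) return INT32_MIN;  /* exactly -2^31 */
        *invalid = 1;                      /* overflow (includes infinities) */
        return sign ? INT32_MIN : INT32_MAX;
    }

    /* Here 1 <= magnitude < 2^31.  The value is sig * 2^shift. */
    uint32_t sig   = frac | 0x00800000;    /* restore the implicit leading 1 */
    int      shift = exp - 127 - 23;       /* between -23 and 7 */
    uint32_t mag   = (shift >= 0) ? (sig << shift) : (sig >> -shift);
    return sign ? -(int32_t)mag : (int32_t)mag;
}
```

Shifting the significand right simply discards the fraction bits, which is
what makes the round-toward-zero variant cheaper than one that has to
consult a rounding mode.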

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Basic Arithmetic Functions

The following basic arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the 80-bit double-extended-precision (`floatx80') functions is
affected by the `floatx80_rounding_precision' variable, as explained above
in the section _Double-Extended-Precision Rounding Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEEE Standard.  The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands.  The operands and result are all
of the same type.  Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.  The
remainder functions are always exact and so require no rounding.
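
The choice of n can be seen in a short integer-only sketch.  Integers
stand in for the floating-point operands here, and the function name
`ieee_style_rem' is hypothetical; only the rule "nearest n, ties to even"
is taken from the description above.

```c
#include <stdint.h>

/* Return x - n*y, where n is the integer nearest to x/y and a tie picks
   the even n.  The caller must ensure y != 0. */
int64_t ieee_style_rem(int32_t x, int32_t y)
{
    /* The result does not depend on the sign of y, so work with |y|. */
    int64_t ay = (y < 0) ? -(int64_t)y : (int64_t)y;
    int64_t q  = x / ay;     /* quotient truncated toward zero */
    int64_t r  = x % ay;     /* remainder with the sign of x, |r| < ay */
    int64_t twice = (r < 0) ? -2 * r : 2 * r;

    /* Step to the neighbouring n when it is strictly nearer, or when it is
       equally near and the truncated quotient is odd. */
    if (twice > ay || (twice == ay && (q % 2) != 0)) {
        r += (r > 0) ? -ay : ay;
    }
    return r;
}
```

For example, 5 and 3 give -1 (n = 2), while 5 and 2 give 1 because the tie
at 2.5 resolves to the even n = 2.  Unlike the C `%' operator, the result
can therefore have the opposite sign from x.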

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.  This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEEE Standard.  The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type.  (Note that the result is not an integer type.)  The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq    float32_le    float32_lt
    float64_eq    float64_le    float64_lt
    floatx80_eq   floatx80_le   floatx80_lt
    float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_.  The abbreviation `eq' stands for
"equal" (=); `le' stands for "less than or equal" (<=); and `lt' stands for
"less than" (<).

The usual greater-than (>), greater-than-or-equal (>=), and not-equal (!=)
functions are easily obtained using the functions provided.  The not-equal
function is just the logical complement of the equal function.  The greater-
than-or-equal function is identical to the less-than-or-equal function with
the operands reversed, and the greater-than function is identical to the
less-than function with the operands reversed.
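
The derivations just described amount to three one-line wrappers.  The
sketch below uses the native `float' type and hypothetical names (`f32',
`f32_eq', `f32_le', `f32_lt', and the derived `f32_ne', `f32_ge',
`f32_gt') so that it is self-contained; with SoftFloat the three base
functions would be the library's own `eq', `le', and `lt' routines for
the format in question.

```c
/* Stand-ins for the three provided comparisons, built on native float. */
typedef float f32;

static int f32_eq(f32 a, f32 b) { return a == b; }
static int f32_le(f32 a, f32 b) { return a <= b; }
static int f32_lt(f32 a, f32 b) { return a <  b; }

/* Not-equal is the logical complement of equal. */
int f32_ne(f32 a, f32 b) { return !f32_eq(a, b); }

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
int f32_ge(f32 a, f32 b) { return f32_le(b, a); }

/* Greater-than is less-than with operands reversed. */
int f32_gt(f32 a, f32 b) { return f32_lt(b, a); }
```

Reversing the operands, rather than negating `lt' or `le', matters when a
NaN is involved: every ordered comparison with a NaN is false, so the
complement of less-than is not the same as greater-than-or-equal.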

The IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs.  For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
    float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input.  Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
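
For the 32-bit single-precision format, such a test might look like the
sketch below.  It assumes the common encoding in which a NaN whose most
significant fraction bit is 0 is signaling; how a given SoftFloat target
actually distinguishes signaling from quiet NaNs is target-specific, so
this should be read as an illustration only.  The name
`f32_bits_is_signaling_nan' is hypothetical.

```c
#include <stdint.h>

/* Return 1 if the IEEE single-precision bit pattern is a signaling NaN,
   assuming "top fraction bit clear" marks a signaling NaN. */
int f32_bits_is_signaling_nan(uint32_t bits)
{
    /* Eight exponent bits followed by the top fraction bit. */
    uint32_t exp_and_quiet = (bits >> 22) & 0x1FF;
    /* The remaining 22 fraction bits. */
    uint32_t low_fraction = bits & 0x003FFFFF;

    /* Exponent all ones, top fraction bit 0, and a nonzero fraction
       (a zero fraction would be an infinity, not a NaN). */
    return exp_and_quiet == 0x1FE && low_fraction != 0;
}
```

The sign bit is masked off by the `& 0x1FF', so negative and positive NaN
patterns are treated alike.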

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise.  No
result is returned.  In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


----------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about SoftFloat
and the latest release can be found at the Web page
`http://www.jhauser.us/arithmetic/SoftFloat.html'.