Update to IEEE 754-2019 Terminology

This document describes changes to the Java Language Specification in support of JDK-7074799 to use the floating-point terminology of IEEE 754-2019 rather than continuing to use the obsolete terminology of IEEE 754-1985.

Changes are described with respect to existing sections of the JLS. New text is indicated like this and deleted text is indicated ~~like this~~. Explanation and discussion, as needed, are set aside in grey boxes.

Chapter 1: Introduction

1.7 References

Apple Computer. Dylan Reference Manual. Apple Computer Inc., Cupertino, California. September 29, 1995.

Bobrow, Daniel G., Linda G. DeMichiel, Richard P. Gabriel, Sonya E. Keene, Gregor Kiczales, and David A. Moon. Common Lisp Object System Specification, X3J13 Document 88-002R, June 1988; appears as Chapter 28 of Steele, Guy. Common Lisp: The Language, 2nd ed. Digital Press, 1990, ISBN 1-55558-041-6, 770-864.

Ellis, Margaret A., and Bjarne Stroustrup. The Annotated C++ Reference Manual. Addison-Wesley, Reading, Massachusetts, 1990, reprinted with corrections October 1992, ISBN 0-201-51459-1.

Goldberg, Adele and Robson, David. Smalltalk-80: The Language. Addison-Wesley, Reading, Massachusetts, 1989, ISBN 0-201-13688-0.

Harbison, Samuel. Modula-3. Prentice Hall, Englewood Cliffs, New Jersey, 1992, ISBN 0-13-596396.

Hoare, C. A. R. Hints on Programming Language Design. Stanford University Computer Science Department Technical Report No. CS-73-403, December 1973. Reprinted in SIGACT/SIGPLAN Symposium on Principles of Programming Languages. Association for Computing Machinery, New York, October 1973.

IEEE Standard for ~~Binary~~ Floating-Point Arithmetic. ANSI/IEEE Std. 754-~~1985~~2019. Available from Global Engineering Documents, 15 Inverness Way East, Englewood, Colorado 80112-5704 USA; 800-854-7179.

Kernighan, Brian W., and Dennis M. Ritchie. The C Programming Language, 2nd ed. Prentice Hall, Englewood Cliffs, New Jersey, 1988, ISBN 0-13-110362-8.

Madsen, Ole Lehrmann, Birger Møller-Pedersen, and Kristen Nygaard. Object-Oriented Programming in the Beta Programming Language. Addison-Wesley, Reading, Massachusetts, 1993, ISBN 0-201-62430-3.

Mitchell, James G., William Maybury, and Richard Sweet. The Mesa Programming Language, Version 5.0. Xerox PARC, Palo Alto, California, CSL 79-3, April 1979.

Stroustrup, Bjarne. The C++ Programming Language, 2nd ed. Addison-Wesley, Reading, Massachusetts, 1991, reprinted with corrections January 1994, ISBN 0-201-53992-6.

Unicode Consortium, The. The Unicode Standard, Version 12.1.0. Mountain View, California, 2019, ISBN 978-1-936213-25-2.

Chapter 3: Lexical Structure

3.10 Literals

3.10.2 Floating-Point Literals

A floating-point literal has the following parts: a whole-number part, a decimal or hexadecimal point (represented by an ASCII period character), a fraction part, an exponent, and a type suffix.

A floating-point literal may be expressed in decimal (base 10) or hexadecimal (base 16).

For decimal floating-point literals, at least one digit (in either the whole number or the fraction part) and either a decimal point, an exponent, or a float type suffix are required. All other parts are optional. The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer.

For hexadecimal floating-point literals, at least one digit is required (in either the whole number or the fraction part), and the exponent is mandatory, and the float type suffix is optional. The exponent is indicated by the ASCII letter p or P followed by an optionally signed integer.

Underscores are allowed as separators between digits that denote the whole-number part, and between digits that denote the fraction part, and between digits that denote the exponent.

FloatingPointLiteral:: DecimalFloatingPointLiteral; HexadecimalFloatingPointLiteral
DecimalFloatingPointLiteral:: Digits . [Digits] [ExponentPart] [FloatTypeSuffix]; . Digits [ExponentPart] [FloatTypeSuffix]; Digits ExponentPart [FloatTypeSuffix]; Digits [ExponentPart] FloatTypeSuffix
ExponentPart:: ExponentIndicator SignedInteger
ExponentIndicator:: (one of); e E
SignedInteger:: [Sign] Digits
Sign:: (one of); + -
FloatTypeSuffix:: (one of); f F d D
HexadecimalFloatingPointLiteral:: HexSignificand BinaryExponent [FloatTypeSuffix]
HexSignificand:: HexNumeral [.]; 0 x [HexDigits] . HexDigits; 0 X [HexDigits] . HexDigits
BinaryExponent:: BinaryExponentIndicator SignedInteger
BinaryExponentIndicator:: (one of); p P

A floating-point literal is of type float if it is suffixed with an ASCII letter F or f; otherwise its type is double and it can optionally be suffixed with an ASCII letter D or d (4.2.3).

The elements of the types float and double are those values that can be represented using the IEEE 754 32-bit ~~single-precision~~binary32 and 64-bit ~~double-precision~~binary64 binary floating-point formats, respectively.

In the 1985 edition of the IEEE 754 standard, the binary32 format was known as single and the binary64 format was known as double.

The details of proper input conversion from a Unicode string representation of a floating-point number to the internal IEEE 754 binary floating-point representation are described for the methods valueOf of class Float and class Double of the package java.lang.

The largest positive finite literal of type float is 3.4028235e38f.

The smallest positive finite non-zero literal of type float is 1.40e-45f.

The largest positive finite literal of type double is 1.7976931348623157e308.

The smallest positive finite non-zero literal of type double is 4.9e-324.
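
As a quick check of the four limits above, the following sketch compares each literal with the corresponding MAX_VALUE or MIN_VALUE constant of class Float or Double. The class name LiteralLimits and its method names are hypothetical, chosen only for illustration.

```java
class LiteralLimits {
    // The float limits quoted above, compared with the predefined constants.
    static boolean floatLimitsMatch() {
        return 3.4028235e38f == Float.MAX_VALUE
            && 1.40e-45f == Float.MIN_VALUE;
    }

    // The double limits quoted above, compared with the predefined constants.
    static boolean doubleLimitsMatch() {
        return 1.7976931348623157e308 == Double.MAX_VALUE
            && 4.9e-324 == Double.MIN_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("float limits match:  " + floatLimitsMatch());
        System.out.println("double limits match: " + doubleLimitsMatch());
    }
}
```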

It is a compile-time error if a non-zero floating-point literal is too large, so that on rounded conversion to its internal representation, it becomes an IEEE 754 infinity.

A program can represent infinities without producing a compile-time error by using constant expressions such as 1f/0f or -1d/0d or by using the predefined constants POSITIVE_INFINITY and NEGATIVE_INFINITY of the classes Float and Double.

It is a compile-time error if a non-zero floating-point literal is too small, so that, on rounded conversion to its internal representation, it becomes a zero.

A compile-time error does not occur if a non-zero floating-point literal has a small value that, on rounded conversion to its internal representation, becomes a non-zero ~~denormalized~~subnormal number.

Predefined constants representing Not-a-Number values are defined in the classes Float and Double as Float.NaN and Double.NaN.
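
The rules above can be illustrated with a small sketch: infinities obtained from constant expressions, a literal that rounds to a subnormal value without a compile-time error, and the predefined NaN constants. The class SpecialValues, the constant TINY, and the helper isSubnormal are hypothetical names used only for this illustration.

```java
class SpecialValues {
    // 1e-310 is smaller than Double.MIN_NORMAL, so it is stored as a
    // nonzero subnormal value; this is not a compile-time error.
    static final double TINY = 1e-310;

    // Hypothetical helper: true for nonzero values below the smallest normal value.
    static boolean isSubnormal(double d) {
        return d != 0.0 && Math.abs(d) < Double.MIN_NORMAL;
    }

    public static void main(String[] args) {
        // Infinities from constant expressions rather than from literals.
        System.out.println(1f / 0f == Float.POSITIVE_INFINITY);
        System.out.println(-1d / 0d == Double.NEGATIVE_INFINITY);
        System.out.println(isSubnormal(TINY));
        System.out.println(Float.isNaN(Float.NaN) && Double.isNaN(Double.NaN));
    }
}
```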

Examples of float literals:
1e1f    2.f    .3f    0f    3.14f    6.022137e+23f
Examples of double literals:
1e1    2.    .3    0.0    3.14    1e-9d    1e137
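
The decimal examples above have hexadecimal and underscore-separated counterparts. The following sketch, with the hypothetical class name LiteralForms, shows a few literal forms permitted by the grammar and the values they denote.

```java
class LiteralForms {
    static final double HEX_THREE = 0x1.8p1;     // 1.5 * 2^1
    static final float  HEX_HALF  = 0x1p-1f;     // 1.0 * 2^-1, of type float
    static final double GROUPED   = 1_000.000_5; // underscores between digits
    static final float  NO_FRAC   = 2.f;         // the fraction part may be empty

    public static void main(String[] args) {
        System.out.println(HEX_THREE == 3.0);
        System.out.println(HEX_HALF == 0.5f);
        System.out.println(GROUPED == 1000.0005);
        System.out.println(NO_FRAC == 2.0f);
    }
}
```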

Chapter 4: Types, Values, and Variables

4.2 Primitive Types and Values

A primitive type is predefined by the Java programming language and named by its reserved keyword (3.9):

PrimitiveType:: {Annotation} NumericType; {Annotation} boolean
NumericType:: IntegralType; FloatingPointType
IntegralType:: (one of); byte short int long char
FloatingPointType:: (one of); float double

Primitive values do not share state with other primitive values.

The numeric types are the integral types and the floating-point types.

The integral types are byte, short, int, and long, whose values are 8-bit, 16-bit, 32-bit and 64-bit signed two's-complement integers, respectively, and char, whose values are 16-bit unsigned integers representing UTF-16 code units (3.1).

The floating-point types are float, whose values include the ~~32-bit~~binary32 IEEE 754 floating-point numbers, and double, whose values include the ~~64-bit~~binary64 IEEE 754 floating-point numbers.

The boolean type has exactly two values: true and false.

4.2.3 Floating-Point Types, Formats, and Values

The floating-point types are float and double, which are conceptually associated with the ~~single-precision 32-bit~~binary32 and ~~double-precision 64-bit~~binary64 formats for IEEE 754 values and operations, as specified in the IEEE 754 Standard (1.7) ~~IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985 (IEEE, New York)~~.

Versions of the Java programming language prior to Java SE 15 used the 1985 edition of the IEEE 754 floating-point standard. An upgrade to the 2019 edition of the IEEE 754 standard occurred in Java SE 15. In the 1985 edition of the IEEE 754 standard, the binary32 format was known as single and the binary64 format was known as double.

The IEEE 754 standard includes not only positive and negative numbers that consist of a sign and magnitude, but also positive and negative zeros, positive and negative infinities, and special Not-a-Number values (hereafter abbreviated NaN). A NaN value is used to represent the result of certain invalid operations such as dividing zero by zero. NaN constants of both float and double type are predefined as Float.NaN and Double.NaN.

Every implementation of the Java programming language is required to support two standard sets of floating-point values, called the float value set and the double value set. In addition, an implementation of the Java programming language may support either or both of two extended-exponent floating-point value sets, called the float-extended-exponent value set and the double-extended-exponent value set. These extended-exponent value sets may, under certain circumstances, be used instead of the standard value sets to represent the values of expressions of type float or double (5.1.13, 15.4).

The finite nonzero values of any floating-point value set can all be expressed in the form s ⋅ m ⋅ 2^(e - N + 1), where s is +1 or -1, m is a positive integer less than 2^N, and e is an integer between E_min = -(2^(K-1) - 2) and E_max = 2^(K-1) - 1, inclusive, and where N and K are parameters that depend on the value set. Some values can be represented in this form in more than one way; for example, supposing that a value v in a value set might be represented in this form using certain values for s, m, and e, then if it happened that m were even and e were less than 2^(K-1), one could halve m and increase e by 1 to produce a second representation for the same value v. A representation in this form is called normalized if m ≥ 2^(N-1); otherwise the representation is said to be ~~denormalized~~subnormal. If a value in a value set cannot be represented in such a way that m ≥ 2^(N-1), then the value is said to be a ~~denormalized~~subnormal value, because ~~it has no normalized representation~~its magnitude is below the magnitude of the smallest normalized value in the format.
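
The form described above can be made concrete for the double value set (N = 53, E_min = -1022). The following is a minimal sketch, not part of the specification: the class ValueSetForm and its methods decompose and recompose are hypothetical, and the sketch assumes a finite nonzero argument. It extracts s, m, and e from the bits of a double and rebuilds the value as s ⋅ m ⋅ 2 raised to the power e - N + 1.

```java
import java.util.Arrays;

class ValueSetForm {
    static final int N = 53;         // significand precision of the double value set
    static final int E_MIN = -1022;  // equals Double.MIN_EXPONENT

    // Returns {s, m, e} for a finite nonzero v (assumption of this sketch).
    static long[] decompose(double v) {
        long bits = Double.doubleToLongBits(v);
        long s = (bits < 0) ? -1 : 1;                 // sign bit is bit 63
        long fraction = bits & 0x000FFFFFFFFFFFFFL;   // low 52 bits
        int exp = Math.getExponent(v);                // MIN_EXPONENT - 1 for subnormals
        if (exp < E_MIN) {
            // Subnormal: no implicit leading bit, e is pinned at E_min.
            return new long[] { s, fraction, E_MIN };
        }
        // Normalized: restore the implicit leading bit, so m >= 2^(N-1).
        return new long[] { s, fraction | (1L << (N - 1)), exp };
    }

    // Rebuilds s * m * 2^(e - N + 1); exact because m < 2^N.
    static double recompose(long[] sme) {
        return sme[0] * Math.scalb((double) sme[1], (int) sme[2] - N + 1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decompose(1.0)));
        System.out.println(Arrays.toString(decompose(Double.MIN_VALUE)));
        System.out.println(recompose(decompose(-6.5)));
    }
}
```

In this sketch a value is normalized exactly when the returned m is at least 2^(N-1), which matches the definition in the text.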

The constraints on the parameters N and K (and on the derived parameters E_min and E_max) for the two required and two optional floating-point value sets are summarized in Table 4.2.3-A.

::: {.table #jls-4.2.3-140-A}

Table 4.2.3-A. Floating-point value set parameters

:::
Parameter	float	float-extended-exponent	double	double-extended-exponent
N	24	24	53	53
K	8	≥ 11	11	≥ 15
E_max	+127	≥ +1023	+1023	≥ +16383
E_min	-126	≤ -1022	-1022	≤ -16382

Where one or both extended-exponent value sets are supported by an implementation, then for each supported extended-exponent value set there is a specific implementation-dependent constant K, whose value is constrained by Table 4.2.3-A; this value K in turn dictates the values for E_min and E_max.

Each of the four value sets includes not only the finite nonzero values that are ascribed to it above, but also NaN values and the four values positive zero, negative zero, positive infinity, and negative infinity.

Note that the constraints in Table 4.2.3-A are designed so that every element of the float value set is necessarily also an element of the float-extended-exponent value set, the double value set, and the double-extended-exponent value set. Likewise, each element of the double value set is necessarily also an element of the double-extended-exponent value set. Each extended-exponent value set has a larger range of exponent values than the corresponding standard value set, but does not have more precision.

The elements of the float value set are exactly the values that can be represented using the ~~single~~binary32 floating-point format defined in the IEEE 754 standard. The elements of the double value set are exactly the values that can be represented using the ~~double~~binary64 floating-point format defined in the IEEE 754 standard. Note, however, that the elements of the float-extended-exponent and double-extended-exponent value sets defined here do not correspond to the values that can be represented using IEEE 754 ~~single~~binary32 extended and ~~double~~binary64 extended formats, respectively.

The float, float-extended-exponent, double, and double-extended-exponent value sets are not types. It is always correct for an implementation of the Java programming language to use an element of the float value set to represent a value of type float; however, it may be permissible in certain regions of code for an implementation to use an element of the float-extended-exponent value set instead. Similarly, it is always correct for an implementation to use an element of the double value set to represent a value of type double; however, it may be permissible in certain regions of code for an implementation to use an element of the double-extended-exponent value set instead.

Except for NaN, floating-point values are ordered; arranged from smallest to largest, they are negative infinity, negative finite nonzero values, positive and negative zero, positive finite nonzero values, and positive infinity.

IEEE 754 allows multiple distinct NaN values for each of its ~~single~~binary32 and ~~double~~binary64 floating-point formats. While each hardware architecture returns a particular bit pattern for NaN when a new NaN is generated, a programmer can also create NaNs with different bit patterns to encode, for example, retrospective diagnostic information.

For the most part, the Java SE Platform treats NaN values of a given type as though collapsed into a single canonical value, and hence this specification normally refers to an arbitrary NaN as though to a canonical value.

However, version 1.3 of the Java SE Platform introduced methods enabling the programmer to distinguish between NaN values: the Float.floatToRawIntBits and Double.doubleToRawLongBits methods. The interested reader is referred to the specifications for the Float and Double classes for more information.
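
A minimal sketch of the distinction follows. It builds a NaN from a bit pattern other than the canonical one and inspects it with both the collapsing method Float.floatToIntBits and the raw method Float.floatToRawIntBits. The class name NaNBits and the chosen bit pattern 0x7fc00001 are illustrative assumptions; whether the raw pattern survives unchanged can depend on the platform, so the sketch prints it rather than relying on it.

```java
class NaNBits {
    // A quiet NaN whose low-order payload bit is set (illustrative choice).
    static float customNaN() {
        return Float.intBitsToFloat(0x7fc00001);
    }

    public static void main(String[] args) {
        float n = customNaN();
        System.out.println(Float.isNaN(n));
        // floatToIntBits collapses every NaN to the canonical pattern 0x7fc00000.
        System.out.println(Integer.toHexString(Float.floatToIntBits(n)));
        // floatToRawIntBits reports the bit pattern actually held by the value.
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(n)));
    }
}
```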

Positive zero and negative zero compare equal; thus the result of the expression 0.0==-0.0 is true and the result of 0.0>-0.0 is false. But other operations can distinguish positive and negative zero; for example, 1.0/0.0 has the value positive infinity, while the value of 1.0/-0.0 is negative infinity.

NaN is unordered, so:

The numerical comparison operators <, <=, >, and >= return false if either or both operands are NaN (15.20.1).

In particular, (x<y) == !(x>=y) will be false if x or y is NaN.
The equality operator == returns false if either operand is NaN.
The inequality operator != returns true if either operand is NaN (15.21.1).

In particular, x!=x is true if and only if x is NaN.
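
The comparison rules for signed zeros and NaN can be sketched as follows. The class ZeroAndNaN and its method names are hypothetical and serve only to illustrate the behavior described above.

```java
class ZeroAndNaN {
    // True only when x is NaN, since NaN is the one value unequal to itself.
    static boolean isNaNByComparison(double x) {
        return x != x;
    }

    // The identity (x<y) == !(x>=y) holds unless an operand is NaN.
    static boolean orderingIdentityHolds(double x, double y) {
        return (x < y) == !(x >= y);
    }

    public static void main(String[] args) {
        System.out.println(0.0 == -0.0);   // zeros compare equal
        System.out.println(1.0 / 0.0);     // positive infinity
        System.out.println(1.0 / -0.0);    // negative infinity
        System.out.println(isNaNByComparison(0.0 / 0.0));
        System.out.println(orderingIdentityHolds(1.0, Double.NaN));
    }
}
```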

4.2.4 Floating-Point Operations

The Java programming language provides a number of operators that act on floating-point values:

The comparison operators, which result in a value of type boolean:
- The numerical comparison operators <, <=, >, and >= (15.20.1)
- The numerical equality operators == and != (15.21.1)
The numerical operators, which result in a value of type float or double:
- The unary plus and minus operators + and - (15.15.3, 15.15.4)
- The multiplicative operators *, /, and % (15.17)
- The additive operators + and - (15.18.2)
- The increment operator ++, both prefix (15.15.1) and postfix (15.14.2)
- The decrement operator --, both prefix (15.15.2) and postfix (15.14.3)
The conditional operator ? : (15.25)
The cast operator (15.16), which can convert from a floating-point value to a value of any specified numeric type
The string concatenation operator + (15.18.1), which, when given a String operand and a floating-point operand, will convert the floating-point operand to a String representing its value in decimal form (without information loss), and then produce a newly created String by concatenating the two strings

Other useful constructors, methods, and constants are predefined in the classes Float, Double, and Math.

If at least one of the operands to a binary operator is of floating-point type, then the operation is a floating-point operation, even if the other is integral.

If at least one of the operands to a numerical operator is of type double, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double. If the other operand is not a double, it is first widened (5.1.5) to type double by numeric promotion (5.6).

Otherwise, the operation is carried out using 32-bit floating-point arithmetic, and the result of the numerical operator is a value of type float. (If the other operand is not a float, it is first widened to type float by numeric promotion.)
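
The promotion rules above determine the type of the result, which the following sketch makes visible by boxing each result and inspecting its class. The class PromotionDemo and its methods are hypothetical names used only for illustration.

```java
class PromotionDemo {
    // int + float: the int is widened to float, so the result is a float.
    static Object intPlusFloat(int i, float f) { return i + f; }

    // long + float: the long is widened to float, so the result is a float.
    static Object longPlusFloat(long l, float f) { return l + f; }

    // float + double: the float is widened to double, so the result is a double.
    static Object floatPlusDouble(float f, double d) { return f + d; }

    public static void main(String[] args) {
        System.out.println(intPlusFloat(1, 2.5f).getClass().getSimpleName());
        System.out.println(longPlusFloat(1L, 2.5f).getClass().getSimpleName());
        System.out.println(floatPlusDouble(2.5f, 1.0).getClass().getSimpleName());
    }
}
```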

Any value of a floating-point type may be cast to or from any numeric type. There are no casts between floating-point types and the type boolean.

See 4.2.5 for an idiom to convert floating-point expressions to boolean.

Operators on floating-point numbers behave as specified by IEEE 754 (with the exception of the remainder operator (15.17.3)). In particular, the Java programming language requires support of IEEE 754 ~~denormalized~~subnormal floating-point numbers and gradual underflow, which make it easier to prove desirable properties of particular numerical algorithms. Floating-point operations do not "flush to zero" if the calculated result is a ~~denormalized~~subnormal number.
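
One property that gradual underflow preserves is that the difference of two distinct values is never zero. The sketch below, using the hypothetical class GradualUnderflow, shows a result landing in the subnormal range instead of being flushed to zero.

```java
class GradualUnderflow {
    static double halve(double d) { return d / 2; }

    static double gap(double x, double y) { return x - y; }

    public static void main(String[] args) {
        // Half of the smallest normal value is a nonzero subnormal value.
        System.out.println(halve(Double.MIN_NORMAL));
        // Two adjacent values near the bottom of the normal range differ
        // by a subnormal amount rather than by zero.
        double a = Double.MIN_NORMAL;
        double b = Math.nextUp(a);
        System.out.println(gap(b, a) == Double.MIN_VALUE);
        // Only below the smallest subnormal does the result round to zero.
        System.out.println(halve(Double.MIN_VALUE));
    }
}
```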

Floating-point arithmetic is an approximation to real arithmetic. While there are an infinite number of real numbers, a particular floating-point format only has a finite number of values. A rounding policy is a function used in floating-point arithmetic to map from a real number to a floating-point value in a format. For real numbers in the representable range of a floating-point format, a continuous segment of the real number line is mapped to a single floating-point value. The real number whose value is numerically equal to a floating-point value is mapped to that floating-point value. For example, the real number 1.5 gets mapped to the floating-point value '1.5' in a given format.

The Java programming language defines two rounding policies, as follows:

The round to nearest rounding policy applies to all floating-point operations except converting to an integer value and remainder. Under the round to nearest rounding policy, inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen.

The round to nearest rounding policy corresponds to the default rounding-direction attribute for binary arithmetic in IEEE 754, roundTiesToEven.

The roundTiesToEven rounding-direction attribute was known as the "round to nearest" rounding mode in the 1985 edition of IEEE 754. The name of the rounding policy in the Java programming language is drawn from the name of this rounding mode.
The round toward zero rounding policy applies when converting a floating-point value to an integer value (5.1.3) and to floating-point remainder (15.17.3). Under the round toward zero rounding policy, inexact results are rounded to the nearest representable value that is not greater in magnitude than the infinitely precise result. For converting to integer, the round toward zero rounding policy is equivalent to truncation where fractional significand bits are discarded.

The round toward zero rounding policy corresponds to the roundTowardZero rounding-direction attribute in IEEE 754.

The roundTowardZero rounding-direction attribute was known as the "round toward zero" rounding mode in the 1985 edition of IEEE 754. The name of the rounding policy in the Java programming language is drawn from the name of this rounding mode.
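
The two rounding policies can be observed directly. In the sketch below, the hypothetical class RoundingPolicies adds values that fall exactly halfway between two adjacent doubles near 1.0, where the spacing is 2^-52, to show ties resolving to the even neighbor; it then shows truncation in a cast to int and in the remainder operator.

```java
class RoundingPolicies {
    static double addToOne(double x) { return 1.0 + x; }

    static int truncate(double d) { return (int) d; }

    static double rem(double a, double b) { return a % b; }

    public static void main(String[] args) {
        // 1 + 2^-53 lies halfway between 1.0 and 1.0 + 2^-52; the tie goes
        // to 1.0, whose least significant bit is zero.
        System.out.println(addToOne(0x1p-53) == 1.0);
        // 1 + 1.5 * 2^-52 lies halfway between 1.0 + 2^-52 and 1.0 + 2^-51;
        // the tie goes to 1.0 + 2^-51.
        System.out.println(addToOne(0x1.8p-52) == 1.0 + 0x1p-51);
        // Round toward zero: the fraction is discarded regardless of sign.
        System.out.println(truncate(2.9) + " " + truncate(-2.9));
        // Remainder uses a truncated quotient, so the sign follows the dividend.
        System.out.println(rem(-5.5, 2.0));
    }
}
```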

The Java programming language requires that floating-point arithmetic behave as if every floating-point ~~operator~~operation rounded its floating-point result to the result precision. The rounding policy used for the arithmetic operations is round to nearest, except for converting a floating-point value to an integer and remainder where round toward zero is used instead.

Inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen. ~~This is the IEEE 754 standard's default rounding mode known as round to nearest.~~

The Java programming language uses round toward zero when converting a floating value to an integer (5.1.3), which acts, in this case, as though the number were truncated, discarding the mantissa bits. ~~Rounding toward zero chooses as its result the format's value closest to and no greater in magnitude than the infinitely precise result.~~

A floating-point operation that overflows produces a signed infinity.

A floating-point operation that underflows produces a ~~denormalized~~subnormal value or a signed zero.

A floating-point operation that has no unique mathematically ~~definite~~defined result produces NaN.

All numeric operations with NaN as an operand produce NaN as a result.

A floating-point operator can throw an exception (11) for the following reasons:

Any floating-point operator can throw a NullPointerException if unboxing conversion (5.1.8) of a null reference is required.
The increment and decrement operators ++ (15.14.2, 15.15.1) and -- (15.14.3, 15.15.2) can throw an OutOfMemoryError if boxing conversion (5.1.7) is required and there is not sufficient memory available to perform the conversion.

:::example

Example 4.2.4-1. Floating-point Operations

class Test {
    public static void main(String[] args) {
        // An example of overflow:
        double d = 1e308;
        System.out.print("overflow produces infinity: ");
        System.out.println(d + "*10==" + d*10);
        // An example of gradual underflow:
        d = 1e-305 * Math.PI;
        System.out.print("gradual underflow: " + d + "\n   ");
        for (int i = 0; i < 4; i++)
            System.out.print(" " + (d /= 100000));
        System.out.println();
        // An example of NaN:
        System.out.print("0.0/0.0 is Not-a-Number: ");
        d = 0.0/0.0;
        System.out.println(d);
        // An example of inexact results and rounding:
        System.out.print("inexact results with float:");
        for (int i = 0; i < 100; i++) {
            float z = 1.0f / i;
            if (z * i != 1.0f)
                System.out.print(" " + i);
        }
        System.out.println();
        // Another example of inexact results and rounding:
        System.out.print("inexact results with double:");
        for (int i = 0; i < 100; i++) {
            double z = 1.0 / i;
            if (z * i != 1.0)
                System.out.print(" " + i);
        }
        System.out.println();
        // An example of cast to integer rounding:
        System.out.print("cast to int rounds toward 0: ");
        d = 12345.6;
        System.out.println((int)d + " " + (int)(-d));
    }
}

This program produces the output:

overflow produces infinity: 1.0E308*10==Infinity
gradual underflow: 3.141592653589793E-305
    3.1415926535898E-310 3.141592653E-315 3.142E-320 0.0
0.0/0.0 is Not-a-Number: NaN
inexact results with float: 0 41 47 55 61 82 83 94 97
inexact results with double: 0 49 98
cast to int rounds toward 0: 12345 -12345

This example demonstrates, among other things, that gradual underflow can result in a gradual loss of precision.

The results when i is 0 involve division by zero, so that z becomes positive infinity, and z * 0 is NaN, which is not equal to 1.0.

:::

Chapter 5: Conversions and Contexts

5.1 Kinds of Conversion

5.1.2 Widening Primitive Conversion

19 specific conversions on primitive types are called the widening primitive conversions:

byte to short, int, long, float, or double
short to int, long, float, or double
char to int, long, float, or double
int to long, float, or double
long to float or double
float to double

A widening primitive conversion does not lose information about the overall magnitude of a numeric value in the following cases, where the numeric value is preserved exactly:

from an integral type to another integral type
from byte, short, or char to a floating point type
from int to double
from float to double in a strictfp expression (15.4)

A widening primitive conversion from float to double that is not strictfp may lose information about the overall magnitude of the converted value.

A widening primitive conversion from int to float, or from long to float, or from long to double, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using ~~IEEE 754 round-to-nearest mode~~the round to nearest rounding policy (4.2.4).
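
The loss of precision can be seen with integers just above 2^24, where adjacent float values are 2 apart, and just above 2^53, where adjacent double values are 2 apart. The class WideningPrecision and its methods are hypothetical names for this sketch.

```java
class WideningPrecision {
    static float toFloat(int i) { return i; }     // widening int to float

    static double toDouble(long l) { return l; }  // widening long to double

    public static void main(String[] args) {
        // 16777217 = 2^24 + 1 is a tie and rounds to the even neighbor 2^24.
        System.out.println((int) toFloat(16777217));
        // 16777219 is a tie between 16777218 and 16777220; 16777220 is even.
        System.out.println((int) toFloat(16777219));
        // 2^53 + 1 is not representable as a double and rounds to 2^53.
        System.out.println((long) toDouble((1L << 53) + 1));
    }
}
```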

A widening conversion of a signed integer value to an integral type T simply sign-extends the two's-complement representation of the integer value to fill the wider format.

A widening conversion of a char to an integral type T zero-extends the representation of the char value to fill the wider format.
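
A short sketch of the difference between sign extension and zero extension, using the hypothetical class ExtensionDemo:

```java
class ExtensionDemo {
    static int widen(byte b) { return b; }  // sign-extends

    static int widen(char c) { return c; }  // zero-extends

    public static void main(String[] args) {
        byte b = (byte) 0x80;  // bit pattern 1000_0000
        char c = '\uFFFF';     // all sixteen bits set
        System.out.println(widen(b));  // sign bit is copied into the upper bits
        System.out.println(widen(c));  // upper bits are filled with zeros
    }
}
```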

Despite the fact that loss of precision may occur, a widening primitive conversion never results in a run-time exception (11.1.1).

5.1.3 Narrowing Primitive Conversion

22 specific conversions on primitive types are called the narrowing primitive conversions:

short to byte or char
char to byte or short
int to byte, short, or char
long to byte, short, char, or int
float to byte, short, char, int, or long
double to byte, short, char, int, long, or float

A narrowing primitive conversion may lose information about the overall magnitude of a numeric value and may also lose precision and range.

A narrowing primitive conversion from double to float ~~is governed by the IEEE 754 rounding rules~~uses the round to nearest rounding policy (4.2.4). This conversion can lose precision and can also lose range, resulting in a float zero from a nonzero double and a float infinity from a finite double. A double NaN is converted to a float NaN and a double infinity is converted to the same-signed float infinity.
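
The outcomes listed above can be sketched as follows; the class DoubleToFloat and its method narrow are hypothetical names for illustration.

```java
class DoubleToFloat {
    static float narrow(double d) { return (float) d; }

    public static void main(String[] args) {
        System.out.println(narrow(1e40));        // finite double, too large for float
        System.out.println(narrow(1e-50));       // nonzero double, too small for float
        System.out.println(narrow(-1e-50));      // the sign of the zero is kept
        System.out.println(narrow(Double.NaN));
        // Loss of precision: 0.1 does not survive a round trip through float.
        System.out.println((double) narrow(0.1) == 0.1);
    }
}
```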

A narrowing conversion of a signed integer to an integral type T simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the sign of the resulting value to differ from the sign of the input value.

A narrowing conversion of a char to an integral type T likewise simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the resulting value to be a negative number, even though chars represent 16-bit unsigned integer values.
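
The bit-discarding behavior of these integral narrowing conversions is sketched below with the hypothetical class IntegralNarrowing.

```java
class IntegralNarrowing {
    static byte toByte(int i) { return (byte) i; }

    static short toShort(int i) { return (short) i; }

    static short toShort(char c) { return (short) c; }

    public static void main(String[] args) {
        System.out.println(toByte(200));        // low 8 bits 1100_1000 read as signed
        System.out.println(toByte(0x1234));     // only the low byte 0x34 remains
        System.out.println(toShort(70000));     // 70000 - 65536
        System.out.println(toShort('\uFFFF'));  // an unsigned char becomes negative
    }
}
```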

A narrowing conversion of a floating-point number to an integral type T takes two steps:

In the first step, the floating-point number is converted either to a long, if T is long, or to an int, if T is byte, short, char, or int, as follows:
- If the floating-point number is NaN (4.2.3), the result of the first step of the conversion is an int or long 0.
- Otherwise, if the floating-point number is not an infinity, the floating-point value is rounded to an integer value V, rounding toward zero using ~~IEEE 754 round-toward-zero mode~~the round toward zero rounding policy ~~(4.2.3)~~(4.2.4). Then there are two cases:
  1. If T is long, and this integer value can be represented as a long, then the result of the first step is the long value V.
  2. Otherwise, if this integer value can be represented as an int, then the result of the first step is the int value V.
- Otherwise, one of the following two cases must be true:
  1. The value must be too small (a negative value of large magnitude or negative infinity), and the result of the first step is the smallest representable value of type int or long.
  2. The value must be too large (a positive value of large magnitude or positive infinity), and the result of the first step is the largest representable value of type int or long.
In the second step:
- If T is int or long, the result of the conversion is the result of the first step.
- If T is byte, char, or short, the result of the conversion is the result of a narrowing conversion to type T (5.1.3) of the result of the first step.
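
The two-step procedure above can be traced with a few casts. The class FloatToIntegral and its methods are hypothetical names; the comments describe each step as the text defines it.

```java
class FloatToIntegral {
    static int toInt(double d) { return (int) d; }

    static long toLong(double d) { return (long) d; }

    static byte toByte(double d) { return (byte) d; }

    static char toChar(float f) { return (char) f; }

    public static void main(String[] args) {
        System.out.println(toInt(Double.NaN));                 // NaN becomes 0
        System.out.println(toInt(-2.9));                       // rounds toward zero
        System.out.println(toInt(1e20));                       // too large: largest int
        System.out.println(toLong(Double.NEGATIVE_INFINITY));  // too small: smallest long
        // Step 1 gives the int 300; step 2 keeps the low 8 bits.
        System.out.println(toByte(300.7));
        // Step 1 gives Integer.MAX_VALUE; step 2 keeps the low 8 bits, all ones.
        System.out.println(toByte(1e20));
        // Step 1 gives the int -1; step 2 keeps the low 16 bits as an unsigned char.
        System.out.println((int) toChar(-1.5f));
    }
}
```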

Despite the fact that overflow, underflow, or other loss of information may occur, a narrowing primitive conversion never results in a run-time exception (11.1.1).

5.1.13 Value Set Conversion

Value set conversion is the process of mapping a floating-point value from one value set to another without changing its type.

Within an expression that is not FP-strict (15.4), value set conversion provides choices to an implementation of the Java programming language:

If the value is an element of the float-extended-exponent value set, then the implementation may, at its option, map the value to the nearest element of the float value set. This conversion may result in overflow (in which case the value is replaced by an infinity of the same sign) or underflow (in which case the value may lose precision because it is replaced by a ~~denormalized~~subnormal number or zero of the same sign).
If the value is an element of the double-extended-exponent value set, then the implementation may, at its option, map the value to the nearest element of the double value set. This conversion may result in overflow (in which case the value is replaced by an infinity of the same sign) or underflow (in which case the value may lose precision because it is replaced by a ~~denormalized~~subnormal number or zero of the same sign).

Within an FP-strict expression (15.4), value set conversion does not provide any choices; every implementation must behave in the same way:

If the value is of type float and is not an element of the float value set, then the implementation must map the value to the nearest element of the float value set. This conversion may result in overflow or underflow.
If the value is of type double and is not an element of the double value set, then the implementation must map the value to the nearest element of the double value set. This conversion may result in overflow or underflow.

Within an FP-strict expression, mapping values from the float-extended-exponent value set or double-extended-exponent value set is necessary only when a method is invoked whose declaration is not FP-strict and the implementation has chosen to represent the result of the method invocation as an element of an extended-exponent value set.

Whether in FP-strict code or code that is not FP-strict, value set conversion always leaves unchanged any value whose type is neither float nor double.

Chapter 15: Expressions

15.4 FP-strict Expressions

If the type of an expression is float or double, then there is a question as to what value set (4.2.3) the value of the expression is drawn from. This is governed by the rules of value set conversion (5.1.13); these rules in turn depend on whether or not the expression is FP-strict.

Every constant expression (15.29) is FP-strict.

If an expression is not a constant expression, then consider all the class declarations, interface declarations, and method declarations that contain the expression. If any such declaration bears the strictfp modifier (8.1.1.3, 8.4.3.5, 9.1.1.2), then the expression is FP-strict.

If a class, interface, or method, X, is declared strictfp, then X and any class, interface, method, constructor, instance initializer, static initializer, or variable initializer within X are said to be FP-strict.

Note that an annotation's element value (9.7) is always FP-strict, because it is always a constant expression.

It follows that an expression is not FP-strict if and only if it is not a constant expression and it does not appear within any declaration that has the strictfp modifier.

Within an FP-strict expression, all intermediate values must be elements of the float value set or the double value set, implying that the results of all FP-strict expressions must be those predicted by IEEE 754 arithmetic on operands represented using ~~single~~binary32 and ~~double~~binary64 formats.

Within an expression that is not FP-strict, some leeway is granted for an implementation to use an extended exponent range to represent intermediate results; the net effect, roughly speaking, is that a calculation might produce "the correct answer" in situations where exclusive use of the float value set or double value set might result in overflow or underflow.
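As an illustration of the rule that constant expressions are FP-strict, the following sketch evaluates a product that overflows the double value set. The class and field names are hypothetical and not part of the specification. On an implementation that never uses the extended-exponent value sets, equivalent non-constant code would give the same result.

```java
class FpStrictConstantDemo {
    // Double.MAX_VALUE is a constant variable, so this initializer is a
    // constant expression and is therefore FP-strict.
    static final double ROUND_TRIP = Double.MAX_VALUE * 2.0 / 2.0;

    public static void main(String[] args) {
        // The intermediate product must be an element of the double value
        // set, so it overflows to Infinity; dividing by 2.0 cannot undo that.
        System.out.println(ROUND_TRIP);                     // Infinity
        System.out.println(ROUND_TRIP == Double.MAX_VALUE); // false
    }
}
```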

15.17 Multiplicative Operators

15.17.1 Multiplication Operator `*`

The binary * operator performs multiplication, producing the product of its operands.

Multiplication is a commutative operation if the operand expressions have no side effects.

Integer multiplication is associative when the operands are all of the same type.

Floating-point multiplication is not associative.

If an integer multiplication overflows, then the result is the low-order bits of the mathematical product as represented in some sufficiently large two's-complement format. As a result, if overflow occurs, then the sign of the result may not be the same as the sign of the mathematical product of the two operand values.
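A minimal sketch of the two preceding points, using hypothetical helper names: an overflowing int product keeps only the low-order 32 bits of the mathematical product, and regrouping a floating-point product can change the result.

```java
class MultiplicationOverflowDemo {
    static int multiply(int a, int b) { return a * b; }

    static double groupLeft(double a, double b, double c) { return (a * b) * c; }

    static double groupRight(double a, double b, double c) { return a * (b * c); }

    public static void main(String[] args) {
        // The mathematical product 4294967294 does not fit in an int.
        long exact = (long) Integer.MAX_VALUE * 2;
        System.out.println(multiply(Integer.MAX_VALUE, 2)); // -2
        System.out.println((int) exact);                    // -2 (low-order 32 bits)

        // (1e308 * 10.0) overflows before the multiplication by 0.1,
        // whereas (10.0 * 0.1) rounds to 1.0 and no overflow occurs.
        System.out.println(groupLeft(1e308, 10.0, 0.1));    // Infinity
        System.out.println(groupRight(1e308, 10.0, 0.1));   // 1.0E308
    }
}
```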

The result of a floating-point multiplication is determined by the rules of IEEE 754 arithmetic:

If either operand is NaN, the result is NaN.
If the result is not NaN, the sign of the result is positive if both operands have the same sign, and negative if the operands have different signs.
Multiplication of an infinity by a zero results in NaN.
Multiplication of an infinity by a finite value results in a signed infinity. The sign is determined by the rule stated above.
In the remaining cases, where neither an infinity nor NaN is involved, the exact mathematical product is computed. A floating-point value set is then chosen:
- If the multiplication expression is FP-strict (15.4):
  - If the type of the multiplication expression is float, then the float value set must be chosen.
  - If the type of the multiplication expression is double, then the double value set must be chosen.
- If the multiplication expression is not FP-strict:
  - If the type of the multiplication expression is float, then either the float value set or the float-extended-exponent value set may be chosen, at the whim of the implementation.
  - If the type of the multiplication expression is double, then either the double value set or the double-extended-exponent value set may be chosen, at the whim of the implementation.
Next, a value must be chosen from the chosen value set to represent the product.

If the magnitude of the product is too large to represent, we say the operation overflows; the result is then an infinity of appropriate sign.

Otherwise, the product is rounded to the nearest value in the chosen value set using ~~IEEE 754 round-to-nearest mode~~the round to nearest rounding policy. The Java programming language requires support of gradual underflow ~~as defined by IEEE 754~~ (4.2.4).

Despite the fact that overflow, underflow, or loss of information may occur, evaluation of a multiplication operator * never throws a run-time exception.
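The floating-point multiplication rules above can be exercised with a short program. This is only a sketch; the class name and the helper `times` are hypothetical.

```java
class FpMultiplicationDemo {
    static double times(double a, double b) { return a * b; }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;

        System.out.println(times(Double.NaN, 2.0));       // NaN
        System.out.println(times(inf, 0.0));              // NaN
        System.out.println(times(inf, -3.0));             // -Infinity
        System.out.println(times(-2.0, -3.0));            // 6.0
        System.out.println(times(-0.0, 3.0));             // -0.0
        System.out.println(times(Double.MAX_VALUE, 2.0)); // Infinity (overflow)

        // Gradual underflow: half of the smallest normal value is a
        // nonzero subnormal value rather than zero.
        double sub = times(Double.MIN_NORMAL, 0.5);
        System.out.println(sub > 0.0 && sub < Double.MIN_NORMAL); // true

        // Half of the smallest subnormal value lies exactly between zero and
        // that value; round to nearest picks zero.
        System.out.println(times(Double.MIN_VALUE, 0.5)); // 0.0
    }
}
```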

15.17.2 Division Operator `/`

The binary / operator performs division, producing the quotient of its operands. The left-hand operand is the dividend and the right-hand operand is the divisor.

Integer division rounds toward 0. That is, the quotient produced for operands n and d that are integers after binary numeric promotion (5.6) is an integer value q whose magnitude is as large as possible while satisfying |d ⋅ q| ≤ |n|. Moreover, q is positive when |n| ≥ |d| and n and d have the same sign, but q is negative when |n| ≥ |d| and n and d have opposite signs.

There is one special case that does not satisfy this rule: if the dividend is the negative integer of largest possible magnitude for its type, and the divisor is -1, then integer overflow occurs and the result is equal to the dividend. Despite the overflow, no exception is thrown in this case. On the other hand, if the value of the divisor in an integer division is 0, then an ArithmeticException is thrown.
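The integer division behavior described above, including the overflow case and the zero divisor, might be checked as follows. The names `divide` and `throwsOnZeroDivisor` are hypothetical helpers introduced for this sketch.

```java
class IntegerDivisionDemo {
    static int divide(int n, int d) { return n / d; }

    // Returns true if dividing n by zero throws ArithmeticException.
    static boolean throwsOnZeroDivisor(int n) {
        try {
            divide(n, 0);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Integer division rounds toward 0.
        System.out.println(divide(7, 2));   // 3
        System.out.println(divide(-7, 2));  // -3
        System.out.println(divide(7, -2));  // -3

        // The one overflow case: the result equals the dividend.
        System.out.println(divide(Integer.MIN_VALUE, -1)); // -2147483648

        System.out.println(throwsOnZeroDivisor(1));        // true
    }
}
```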

The result of a floating-point division is determined by the rules of IEEE 754 arithmetic:

If either operand is NaN, the result is NaN.
If the result is not NaN, the sign of the result is positive if both operands have the same sign, and negative if the operands have different signs.
Division of an infinity by an infinity results in NaN.
Division of an infinity by a finite value results in a signed infinity. The sign is determined by the rule stated above.
Division of a finite value by an infinity results in a signed zero. The sign is determined by the rule stated above.
Division of a zero by a zero results in NaN; division of zero by any other finite value results in a signed zero. The sign is determined by the rule stated above.
Division of a nonzero finite value by a zero results in a signed infinity. The sign is determined by the rule stated above.
In the remaining cases, where neither an infinity nor NaN is involved, the exact mathematical quotient is computed. A floating-point value set is then chosen:
- If the division expression is FP-strict (15.4):
  - If the type of the division expression is float, then the float value set must be chosen.
  - If the type of the division expression is double, then the double value set must be chosen.
- If the division expression is not FP-strict:
  - If the type of the division expression is float, then either the float value set or the float-extended-exponent value set may be chosen, at the whim of the implementation.
  - If the type of the division expression is double, then either the double value set or the double-extended-exponent value set may be chosen, at the whim of the implementation.
Next, a value must be chosen from the chosen value set to represent the quotient.

If the magnitude of the quotient is too large to represent, we say the operation overflows; the result is then an infinity of appropriate sign.

Otherwise, the quotient is rounded to the nearest value in the chosen value set using ~~IEEE 754 round-to-nearest mode~~the round to nearest rounding policy. The Java programming language requires support of gradual underflow ~~as defined by IEEE 754~~ (4.2.4).

Despite the fact that overflow, underflow, division by zero, or loss of information may occur, evaluation of a floating-point division operator / never throws a run-time exception.
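A brief sketch of the floating-point division rules, one line per rule. The class name and the helper `over` are hypothetical.

```java
class FpDivisionDemo {
    static double over(double n, double d) { return n / d; }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;

        System.out.println(over(Double.NaN, 1.0));       // NaN
        System.out.println(over(inf, inf));              // NaN
        System.out.println(over(inf, -2.0));             // -Infinity
        System.out.println(over(1.0, -inf));             // -0.0
        System.out.println(over(0.0, 0.0));              // NaN
        System.out.println(over(0.0, -5.0));             // -0.0
        System.out.println(over(-1.0, 0.0));             // -Infinity
        System.out.println(over(Double.MAX_VALUE, 0.5)); // Infinity (overflow)

        // An ordinary quotient is rounded to the nearest double value.
        System.out.println(over(1.0, 3.0));              // 0.3333333333333333
    }
}
```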

15.17.3 Remainder Operator `%`

The binary % operator is said to yield the remainder of its operands from an implied division; the left-hand operand is the dividend and the right-hand operand is the divisor.

In C and C++, the remainder operator accepts only integral operands, but in the Java programming language, it also accepts floating-point operands.

The remainder operation for operands that are integers after binary numeric promotion (5.6) produces a result value such that (a/b)*b+(a%b) is equal to a.

This identity holds even in the special case that the dividend is the negative integer of largest possible magnitude for its type and the divisor is -1 (the remainder is 0).

It follows from this rule that the result of the remainder operation can be negative only if the dividend is negative, and can be positive only if the dividend is positive. Moreover, the magnitude of the result is always less than the magnitude of the divisor.

If the value of the divisor for an integer remainder operator is 0, then an ArithmeticException is thrown.

:::example

Example 15.17.3-1. Integer Remainder Operator

class Test1 {
    public static void main(String[] args) {
        int a = 5%3;  // 2
        int b = 5/3;  // 1
        System.out.println("5%3 produces " + a +
                           " (note that 5/3 produces " + b + ")");

        int c = 5%(-3);  // 2
        int d = 5/(-3);  // -1
        System.out.println("5%(-3) produces " + c +
                           " (note that 5/(-3) produces " + d + ")");

        int e = (-5)%3;  // -2
        int f = (-5)/3;  // -1
        System.out.println("(-5)%3 produces " + e +
                           " (note that (-5)/3 produces " + f + ")");

        int g = (-5)%(-3);  // -2
        int h = (-5)/(-3);  // 1
        System.out.println("(-5)%(-3) produces " + g +
                           " (note that (-5)/(-3) produces " + h + ")");
    }
}

This program produces the output:

5%3 produces 2 (note that 5/3 produces 1)
5%(-3) produces 2 (note that 5/(-3) produces -1)
(-5)%3 produces -2 (note that (-5)/3 produces -1)
(-5)%(-3) produces -2 (note that (-5)/(-3) produces 1)

:::
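The identity (a/b)*b+(a%b) == a can also be checked directly, including the special case of the most negative dividend with divisor -1. This is a sketch only; the class and method names are hypothetical, and the sample operands are arbitrary.

```java
class IntegerRemainderIdentityDemo {
    // Checks the identity that defines the integer remainder.
    static boolean identityHolds(int a, int b) {
        return (a / b) * b + (a % b) == a;
    }

    // Checks the identity over a small set of sample operands.
    static boolean allIdentitiesHold() {
        int[] dividends = {5, -5, 0, Integer.MAX_VALUE, Integer.MIN_VALUE};
        int[] divisors = {3, -3, 1, -1};
        boolean all = true;
        for (int a : dividends) {
            for (int b : divisors) {
                all &= identityHolds(a, b);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(allIdentitiesHold());         // true
        System.out.println(Integer.MIN_VALUE % -1);      // 0
        System.out.println(identityHolds(Integer.MIN_VALUE, -1)); // true
    }
}
```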

The result of a floating-point remainder operation as computed by the % operator is not the same as that produced by the remainder operation defined by IEEE 754, due to a different choice in rounding policy (4.2.4). The IEEE 754 remainder operation computes the remainder from a rounding division (an implied division using the round to nearest rounding policy) rather than from a truncating division (an implied division using the round toward zero rounding policy), and so its behavior is not analogous to that of the usual integer remainder operator. Instead, the Java programming language defines % on floating-point operands to behave in a manner analogous to that of the integer remainder operator, with an implied division using the round toward zero rounding policy; this may be compared with the C library function fmod. The IEEE 754 remainder operation may be computed by the library routine Math.IEEEremainder or StrictMath.IEEEremainder.
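The difference between the two remainder operations can be seen by comparing % with Math.IEEEremainder on the same operands. The following is a sketch; the helper names `truncating` and `rounding` are hypothetical.

```java
class RemainderComparisonDemo {
    // Remainder from an implied division that rounds toward zero.
    static double truncating(double n, double d) { return n % d; }

    // Remainder from an implied division that rounds to nearest.
    static double rounding(double n, double d) { return Math.IEEEremainder(n, d); }

    public static void main(String[] args) {
        // 5/3 truncates to 1 but rounds to 2.
        System.out.println(truncating(5.0, 3.0));  // 2.0
        System.out.println(rounding(5.0, 3.0));    // -1.0

        System.out.println(truncating(-5.0, 3.0)); // -2.0
        System.out.println(rounding(-5.0, 3.0));   // 1.0

        // 7/2 = 3.5 is a tie; round to nearest picks the even integer 4.
        System.out.println(truncating(7.0, 2.0));  // 1.0
        System.out.println(rounding(7.0, 2.0));    // -1.0

        // 4/3 truncates and rounds to the same integer, so the results agree.
        System.out.println(truncating(4.0, 3.0));  // 1.0
        System.out.println(rounding(4.0, 3.0));    // 1.0
    }
}
```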

The result of a floating-point remainder operation is determined by ~~the~~these rules ~~of IEEE 754 arithmetic~~, which match the rules for IEEE 754 remainder except for how the implied division is computed:

If either operand is NaN, the result is NaN.
If the result is not NaN, the sign of the result equals the sign of the dividend.
If the dividend is an infinity, or the divisor is a zero, or both, the result is NaN.
If the dividend is finite and the divisor is an infinity, the result equals the dividend.
If the dividend is a zero and the divisor is finite, the result equals the dividend.
In the remaining cases, where neither an infinity, nor a zero, nor NaN is involved, the floating-point remainder r from the division of a dividend n by a divisor d is defined by the mathematical relation r = n - (d ⋅ q) where q is an integer that is negative only if n/d is negative and positive only if n/d is positive, and whose magnitude is as large as possible without exceeding the magnitude of the true mathematical quotient of n and d.

Evaluation of a floating-point remainder operator % never throws a run-time exception, even if the right-hand operand is zero. Overflow, underflow, or loss of precision cannot occur.

:::example

Example 15.17.3-2. Floating-Point Remainder Operator

class Test2 {
    public static void main(String[] args) {
        double a = 5.0%3.0;  // 2.0
        System.out.println("5.0%3.0 produces " + a);

        double b = 5.0%(-3.0);  // 2.0
        System.out.println("5.0%(-3.0) produces " + b);

        double c = (-5.0)%3.0;  // -2.0
        System.out.println("(-5.0)%3.0 produces " + c);

        double d = (-5.0)%(-3.0);  // -2.0
        System.out.println("(-5.0)%(-3.0) produces " + d);
    }
}

This program produces the output:

5.0%3.0 produces 2.0
5.0%(-3.0) produces 2.0
(-5.0)%3.0 produces -2.0
(-5.0)%(-3.0) produces -2.0

:::
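The special-case rules for the floating-point % operator (NaN, infinite, and zero operands) can be sketched in the same way. The class name and the helper `rem` are hypothetical.

```java
class FpRemainderSpecialCasesDemo {
    static double rem(double n, double d) { return n % d; }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;

        System.out.println(rem(Double.NaN, 1.0)); // NaN
        System.out.println(rem(inf, 2.0));        // NaN (infinite dividend)
        System.out.println(rem(5.0, 0.0));        // NaN (zero divisor, no exception)
        System.out.println(rem(5.0, inf));        // 5.0 (result equals the dividend)
        System.out.println(rem(-0.0, 3.0));       // -0.0 (result equals the dividend)

        // In the general case the sign of the result follows the dividend.
        System.out.println(rem(5.5, 2.0));        // 1.5
        System.out.println(rem(-5.5, 2.0));       // -1.5
        System.out.println(rem(5.5, -2.0));       // 1.5
    }
}
```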

15.18 Additive Operators

15.18.2 Additive Operators (`+` and `-`) for Numeric Types

The binary + operator performs addition when applied to two operands of numeric type, producing the sum of the operands.

The binary - operator performs subtraction, producing the difference of two numeric operands.

Binary numeric promotion is performed on the operands (5.6).

Note that binary numeric promotion performs value set conversion (5.1.13) and may perform unboxing conversion (5.1.8).

The type of an additive expression on numeric operands is the promoted type of its operands.

If this promoted type is int or long, then integer arithmetic is performed.

If this promoted type is float or double, then floating-point arithmetic is performed.

Addition is a commutative operation if the operand expressions have no side effects.

Integer addition is associative when the operands are all of the same type.

Floating-point addition is not associative.

If an integer addition overflows, then the result is the low-order bits of the mathematical sum as represented in some sufficiently large two's-complement format. If overflow occurs, then the sign of the result is not the same as the sign of the mathematical sum of the two operand values.
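A minimal sketch of integer addition overflow and of the non-associativity of floating-point addition. The helper names are hypothetical.

```java
class AdditionOverflowDemo {
    static int add(int a, int b) { return a + b; }

    static double groupLeft(double a, double b, double c) { return (a + b) + c; }

    static double groupRight(double a, double b, double c) { return a + (b + c); }

    public static void main(String[] args) {
        // On overflow the sign of the result differs from that of the
        // mathematical sum.
        System.out.println(add(Integer.MAX_VALUE, 1));   // -2147483648
        System.out.println(add(Integer.MIN_VALUE, -1));  // 2147483647

        // Regrouping changes where rounding happens, and so the result.
        System.out.println(groupLeft(0.1, 0.2, 0.3));    // 0.6000000000000001
        System.out.println(groupRight(0.1, 0.2, 0.3));   // 0.6
    }
}
```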

The result of a floating-point addition is determined using the following rules of IEEE 754 arithmetic:

If either operand is NaN, the result is NaN.
The sum of two infinities of opposite sign is NaN.
The sum of two infinities of the same sign is the infinity of that sign.
The sum of an infinity and a finite value is equal to the infinite operand.
The sum of two zeros of opposite sign is positive zero.
The sum of two zeros of the same sign is the zero of that sign.
The sum of a zero and a nonzero finite value is equal to the nonzero operand.
The sum of two nonzero finite values of the same magnitude and opposite sign is positive zero.
In the remaining cases, where neither an infinity, nor a zero, nor NaN is involved, and the operands have the same sign or have different magnitudes, the exact mathematical sum is computed. A floating-point value set is then chosen:
- If the addition expression is FP-strict (15.4):
  - If the type of the addition expression is float, then the float value set must be chosen.
  - If the type of the addition expression is double, then the double value set must be chosen.
- If the addition expression is not FP-strict:
  - If the type of the addition expression is float, then either the float value set or the float-extended-exponent value set may be chosen, at the whim of the implementation.
  - If the type of the addition expression is double, then either the double value set or the double-extended-exponent value set may be chosen, at the whim of the implementation.
Next, a value must be chosen from the chosen value set to represent the sum.

If the magnitude of the sum is too large to represent, we say the operation overflows; the result is then an infinity of appropriate sign.

Otherwise, the sum is rounded to the nearest value in the chosen value set using ~~IEEE 754 round-to-nearest mode~~the round to nearest rounding policy. The Java programming language requires support of gradual underflow ~~as defined by IEEE 754~~ (4.2.4).
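The floating-point addition rules above can be illustrated one per line. This is a sketch; the class name and the helper `plus` are hypothetical.

```java
class FpAdditionDemo {
    static double plus(double a, double b) { return a + b; }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;

        System.out.println(plus(Double.NaN, 1.0)); // NaN
        System.out.println(plus(inf, -inf));       // NaN
        System.out.println(plus(inf, inf));        // Infinity
        System.out.println(plus(-inf, 1.0));       // -Infinity
        System.out.println(plus(0.0, -0.0));       // 0.0
        System.out.println(plus(-0.0, -0.0));      // -0.0
        System.out.println(plus(-0.0, 5.0));       // 5.0
        System.out.println(plus(3.5, -3.5));       // 0.0

        // Overflow produces an infinity of the appropriate sign.
        System.out.println(plus(Double.MAX_VALUE, Double.MAX_VALUE)); // Infinity

        // 1 + 2^-53 lies exactly halfway between 1.0 and the next double;
        // round to nearest resolves the tie to 1.0.
        System.out.println(plus(1.0, 0x1.0p-53));  // 1.0
    }
}
```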

The binary - operator performs subtraction when applied to two operands of numeric type, producing the difference of its operands; the left-hand operand is the minuend and the right-hand operand is the subtrahend.

For both integer and floating-point subtraction, it is always the case that a-b produces the same result as a+(-b).

Note that, for integer values, subtraction from zero is the same as negation. However, for floating-point operands, subtraction from zero is not the same as negation, because if x is +0.0, then 0.0-x is +0.0, but -x is -0.0.
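The distinction between subtraction from zero and negation is visible only in the sign of a zero result, which == does not detect. The sketch below compares bit patterns instead; the helper names are hypothetical.

```java
class SubtractionVersusNegationDemo {
    static double subtractFromZero(double x) { return 0.0 - x; }

    static double negate(double x) { return -x; }

    // Distinguishes -0.0 from +0.0 by comparing raw bit patterns.
    static boolean isNegativeZero(double v) {
        return Double.doubleToRawLongBits(v) == Double.doubleToRawLongBits(-0.0);
    }

    public static void main(String[] args) {
        System.out.println(subtractFromZero(0.0));                // 0.0
        System.out.println(negate(0.0));                          // -0.0

        // The == operator treats the two zeros as equal.
        System.out.println(subtractFromZero(0.0) == negate(0.0)); // true
        System.out.println(isNegativeZero(subtractFromZero(0.0))); // false
        System.out.println(isNegativeZero(negate(0.0)));          // true

        // a-b produces the same result as a+(-b).
        System.out.println(5.0 - 3.0 == 5.0 + (-3.0));            // true
    }
}
```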

Despite the fact that overflow, underflow, or loss of information may occur, evaluation of a numeric additive operator never throws a run-time exception.
