This document describes changes to the Java Language Specification in support of JDK-7074799 to use the floating-point terminology of IEEE 754-2019 rather than continuing to use the obsolete terminology of IEEE 754-1985.
Changes are described with respect to existing sections of the JLS. New text is indicated **like this** and deleted text is indicated ~~like this~~. Explanation and discussion, as needed, is set aside in grey boxes.
Chapter 1: Introduction
1.7 References
Apple Computer. Dylan Reference Manual. Apple Computer Inc., Cupertino, California. September 29, 1995.
Bobrow, Daniel G., Linda G. DeMichiel, Richard P. Gabriel, Sonya E. Keene, Gregor Kiczales, and David A. Moon. Common Lisp Object System Specification, X3J13 Document 88-002R, June 1988; appears as Chapter 28 of Steele, Guy. Common Lisp: The Language, 2nd ed. Digital Press, 1990, ISBN 1-55558-041-6, 770-864.
Ellis, Margaret A., and Bjarne Stroustrup. The Annotated C++ Reference Manual. Addison-Wesley, Reading, Massachusetts, 1990, reprinted with corrections October 1992, ISBN 0-201-51459-1.
Goldberg, Adele and Robson, David. Smalltalk-80: The Language. Addison-Wesley, Reading, Massachusetts, 1989, ISBN 0-201-13688-0.
Harbison, Samuel. Modula-3. Prentice Hall, Englewood Cliffs, New Jersey, 1992, ISBN 0-13-596396.
Hoare, C. A. R. Hints on Programming Language Design. Stanford University Computer Science Department Technical Report No. CS-73-403, December 1973. Reprinted in SIGACT/SIGPLAN Symposium on Principles of Programming Languages. Association for Computing Machinery, New York, October 1973.
IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std. 754-~~1985~~**2019**. Available from Global Engineering Documents, 15 Inverness Way East, Englewood, Colorado 80112-5704 USA; 800-854-7179.
Kernighan, Brian W., and Dennis M. Ritchie. The C Programming Language, 2nd ed. Prentice Hall, Englewood Cliffs, New Jersey, 1988, ISBN 0-13-110362-8.
Madsen, Ole Lehrmann, Birger Møller-Pedersen, and Kristen Nygaard. Object-Oriented Programming in the Beta Programming Language. Addison-Wesley, Reading, Massachusetts, 1993, ISBN 0-201-62430-3.
Mitchell, James G., William Maybury, and Richard Sweet. The Mesa Programming Language, Version 5.0. Xerox PARC, Palo Alto, California, CSL 79-3, April 1979.
Stroustrup, Bjarne. The C++ Programming Language, 2nd ed. Addison-Wesley, Reading, Massachusetts, 1991, reprinted with corrections January 1994, ISBN 0-201-53992-6.
Unicode Consortium, The. The Unicode Standard, Version 12.1.0. Mountain View, California, 2019, ISBN 978-1-936213-25-2.
Chapter 3: Lexical Structure
3.10 Literals
3.10.2 Floating-Point Literals
A floating-point literal has the following parts: a whole-number part, a decimal or hexadecimal point (represented by an ASCII period character), a fraction part, an exponent, and a type suffix.
A floating-point literal may be expressed in decimal (base 10) or hexadecimal (base 16).
For decimal floating-point literals, at least one digit (in either the whole number or the fraction part) and either a decimal point, an exponent, or a float type suffix are required. All other parts are optional. The exponent, if present, is indicated by the ASCII letter e
or E
followed by an optionally signed integer.
For hexadecimal floating-point literals, at least one digit is required (in either the whole number or the fraction part), and the exponent is mandatory, and the float type suffix is optional. The exponent is indicated by the ASCII letter p
or P
followed by an optionally signed integer.
Underscores are allowed as separators between digits that denote the whole-number part, and between digits that denote the fraction part, and between digits that denote the exponent.
- FloatingPointLiteral:
- DecimalFloatingPointLiteral
- HexadecimalFloatingPointLiteral
- DecimalFloatingPointLiteral:
- Digits . [Digits] [ExponentPart] [FloatTypeSuffix]
- . Digits [ExponentPart] [FloatTypeSuffix]
- Digits ExponentPart [FloatTypeSuffix]
- Digits [ExponentPart] FloatTypeSuffix
- ExponentPart:
- ExponentIndicator SignedInteger
- ExponentIndicator:
- (one of)
e E
- SignedInteger:
- [Sign] Digits
- Sign:
- (one of)
+ -
- FloatTypeSuffix:
- (one of)
f F d D
- HexadecimalFloatingPointLiteral:
- HexSignificand BinaryExponent [FloatTypeSuffix]
- HexSignificand:
- HexNumeral [.]
- 0 x [HexDigits] . HexDigits
- 0 X [HexDigits] . HexDigits
- BinaryExponent:
- BinaryExponentIndicator SignedInteger
- BinaryExponentIndicator:
- (one of)
p P
A floating-point literal is of type float
if it is suffixed with an ASCII letter F
or f
; otherwise its type is double
and it can optionally be suffixed with an ASCII letter D
or d
(4.2.3).
The elements of the types float
and double
are those values that can be represented using the IEEE 754 ~~32-bit single-precision~~**binary32** and ~~64-bit double-precision~~**binary64** binary floating-point formats, respectively.
In the 1985 edition of the IEEE 754 standard, the binary32 format was known as single and binary64 format was known as double.
The details of proper input conversion from a Unicode string representation of a floating-point number to the internal IEEE 754 binary floating-point representation are described for the methods valueOf of class Float and class Double of the package java.lang.
The largest positive finite literal of type float
is 3.4028235e38f
.
The smallest positive finite non-zero literal of type float
is 1.40e-45f
.
The largest positive finite literal of type double
is 1.7976931348623157e308
.
The smallest positive finite non-zero literal of type double
is 4.9e-324
.
It is a compile-time error if a non-zero floating-point literal is too large, so that on rounded conversion to its internal representation, it becomes an IEEE 754 infinity.
A program can represent infinities without producing a compile-time error by using constant expressions such as 1f/0f
or -1d/0d
or by using the predefined constants POSITIVE_INFINITY
and NEGATIVE_INFINITY
of the classes Float
and Double
.
It is a compile-time error if a non-zero floating-point literal is too small, so that, on rounded conversion to its internal representation, it becomes a zero.
A compile-time error does not occur if a non-zero floating-point literal has a small value that, on rounded conversion to its internal representation, becomes a non-zero ~~denormalized~~**subnormal** number.
Predefined constants representing Not-a-Number values are defined in the classes Float
and Double
as Float.NaN
and Double.NaN
.
Examples of float literals:
1e1f 2.f .3f 0f 3.14f 6.022137e+23f
Examples of double literals:
1e1 2. .3 0.0 3.14 1e-9d 1e137
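The literal forms and limits described in this section can be checked with a short program. The following is only an illustrative sketch: the class name LiteralForms and its field names are invented for this example and do not appear in the specification.

```java
class LiteralForms {
    // Hexadecimal form: significand 0x1.8 (1.5 in decimal) scaled by 2^1.
    static final double HEX = 0x1.8p1;
    // Underscores may separate digits in the whole-number, fraction, and exponent parts.
    static final double GROUPED = 1_000.000_1e1_0;
    // A small literal that rounds to a non-zero subnormal double, so it is accepted.
    static final double TINY = 1e-320;

    public static void main(String[] args) {
        System.out.println(HEX == 3.0);
        System.out.println(GROUPED == 1000.0001e10);
        System.out.println(TINY > 0.0 && TINY < Double.MIN_NORMAL);
        // The largest and smallest positive literals of each type.
        System.out.println(3.4028235e38f == Float.MAX_VALUE);
        System.out.println(1.4e-45f == Float.MIN_VALUE);
        System.out.println(1.7976931348623157e308 == Double.MAX_VALUE);
        System.out.println(4.9e-324 == Double.MIN_VALUE);
        // Infinities come from constant expressions, not from literals.
        System.out.println(1f / 0f == Float.POSITIVE_INFINITY);
        System.out.println(-1d / 0d == Double.NEGATIVE_INFINITY);
    }
}
```

Every comparison in this sketch evaluates to true. A literal such as 1e39f or 1e-400 would instead be rejected at compile time, because it would round to an infinity or to zero.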
Chapter 4: Types, Values, and Variables
4.2 Primitive Types and Values
A primitive type is predefined by the Java programming language and named by its reserved keyword (3.9):
- PrimitiveType:
- {Annotation} NumericType
- {Annotation}
boolean
- NumericType:
- IntegralType
- FloatingPointType
- IntegralType:
- (one of)
byte
short
int
long
char
- FloatingPointType:
- (one of)
float
double
Primitive values do not share state with other primitive values.
The numeric types are the integral types and the floating-point types.
The integral types are byte
, short
, int
, and long
, whose values are 8-bit, 16-bit, 32-bit and 64-bit signed two's-complement integers, respectively, and char
, whose values are 16-bit unsigned integers representing UTF-16 code units (3.1).
The floating-point types are float
, whose values include the ~~32-bit~~**binary32** IEEE 754 floating-point numbers, and double
, whose values include the ~~64-bit~~**binary64** IEEE 754 floating-point numbers.
The boolean
type has exactly two values: true
and false
.
4.2.3 Floating-Point Types, Formats, and Values
The floating-point types are float
and double
, which are conceptually associated with the ~~single-precision 32-bit~~**binary32** and ~~double-precision 64-bit~~**binary64** formats for IEEE 754 values and operations, as specified in **the IEEE 754 Standard (1.7)** ~~IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Standard 754-1985 (IEEE, New York)~~.
Versions of the Java programming language prior to Java SE 15 used the 1985 edition of the IEEE 754 floating-point standard. An upgrade to the 2019 edition of the IEEE 754 standard occurred in Java SE 15. In the 1985 edition of the IEEE 754 standard, the binary32 format was known as single and binary64 format was known as double.
The IEEE 754 standard includes not only positive and negative numbers that consist of a sign and magnitude, but also positive and negative zeros, positive and negative infinities, and special Not-a-Number values (hereafter abbreviated NaN). A NaN value is used to represent the result of certain invalid operations such as dividing zero by zero. NaN constants of both float
and double
type are predefined as Float.NaN
and Double.NaN
.
Every implementation of the Java programming language is required to support two standard sets of floating-point values, called the float value set and the double value set. In addition, an implementation of the Java programming language may support either or both of two extended-exponent floating-point value sets, called the float-extended-exponent value set and the double-extended-exponent value set. These extended-exponent value sets may, under certain circumstances, be used instead of the standard value sets to represent the values of expressions of type float
or double
(5.1.13, 15.4).
The finite nonzero values of any floating-point value set can all be expressed in the form s ⋅ m ⋅ 2^(e\ -\ N\ +\ 1)^, where s is +1 or -1, m is a positive integer less than 2^N^, and e is an integer between Emin = -(2^K-1^-2) and Emax = 2^K-1^-1, inclusive, and where N and K are parameters that depend on the value set. Some values can be represented in this form in more than one way; for example, supposing that a value v in a value set might be represented in this form using certain values for s, m, and e, then if it happened that m were even and e were less than 2^K-1^, one could halve m and increase e by 1 to produce a second representation for the same value v. A representation in this form is called normalized if m ≥ 2^N-1^; otherwise the representation is said to be ~~denormalized~~**subnormal**. If a value in a value set cannot be represented in such a way that m ≥ 2^N-1^, then the value is said to be a ~~denormalized~~**subnormal** value, because ~~it has no normalized representation~~**its magnitude is below the magnitude of the smallest normalized value in the format**.
The constraints on the parameters N and K (and on the derived parameters Emin and Emax) for the two required and two optional floating-point value sets are summarized in Table 4.2.3-A.
::: {.table #jls-4.2.3-140-A}
Table 4.2.3-A. Floating-point value set parameters
Parameter | float | float-extended-exponent | double | double-extended-exponent |
---|---|---|---|---|
N | 24 | 24 | 53 | 53 |
K | 8 | ≥ 11 | 11 | ≥ 15 |
Emax | +127 | ≥ +1023 | +1023 | ≥ +16383 |
Emin | -126 | ≤ -1022 | -1022 | ≤ -16382 |
:::
Where one or both extended-exponent value sets are supported by an implementation, then for each supported extended-exponent value set there is a specific implementation-dependent constant K, whose value is constrained by Table 4.2.3-A; this value K in turn dictates the values for Emin and Emax.
Each of the four value sets includes not only the finite nonzero values that are ascribed to it above, but also NaN values and the four values positive zero, negative zero, positive infinity, and negative infinity.
Note that the constraints in Table 4.2.3-A are designed so that every element of the float value set is necessarily also an element of the float-extended-exponent value set, the double value set, and the double-extended-exponent value set. Likewise, each element of the double value set is necessarily also an element of the double-extended-exponent value set. Each extended-exponent value set has a larger range of exponent values than the corresponding standard value set, but does not have more precision.
The elements of the float value set are exactly the values that can be represented using the ~~single~~**binary32** floating-point format defined in the IEEE 754 standard. The elements of the double value set are exactly the values that can be represented using the ~~double~~**binary64** floating-point format defined in the IEEE 754 standard. Note, however, that the elements of the float-extended-exponent and double-extended-exponent value sets defined here do not correspond to the values that can be represented using IEEE 754 ~~single~~**binary32** extended and ~~double~~**binary64** extended formats, respectively.
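The parameters N, Emin, and Emax for the float and double value sets can be related to constants in the classes Float and Double. The sketch below is illustrative only: the class ValueSetParameters and its helper methods are hypothetical names, and the value of N is copied by hand from Table 4.2.3-A because the platform API does not expose it as a constant.

```java
class ValueSetParameters {
    // N from Table 4.2.3-A (copied by hand; an assumption of this sketch).
    static final int N_FLOAT = 24;
    static final int N_DOUBLE = 53;

    // Smallest positive value: m = 1 and e = Emin, giving 2^(Emin - N + 1).
    static float smallestFloat() {
        return Math.scalb(1.0f, Float.MIN_EXPONENT - N_FLOAT + 1);
    }

    static double smallestDouble() {
        return Math.scalb(1.0, Double.MIN_EXPONENT - N_DOUBLE + 1);
    }

    // Largest finite value: m = 2^N - 1 and e = Emax.
    static float largestFloat() {
        float m = (float) ((1 << N_FLOAT) - 1);
        return Math.scalb(m, Float.MAX_EXPONENT - N_FLOAT + 1);
    }

    // A non-zero value is subnormal when its magnitude is below the smallest normal value.
    static boolean isSubnormal(float x) {
        return x != 0.0f && Math.abs(x) < Float.MIN_NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(smallestFloat() == Float.MIN_VALUE);
        System.out.println(smallestDouble() == Double.MIN_VALUE);
        System.out.println(largestFloat() == Float.MAX_VALUE);
        System.out.println(isSubnormal(Float.MIN_VALUE));
        System.out.println(isSubnormal(Float.MIN_NORMAL));
    }
}
```

The first four lines print true and the last prints false, since Float.MIN_NORMAL is itself the smallest normalized value.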
The float, float-extended-exponent, double, and double-extended-exponent value sets are not types. It is always correct for an implementation of the Java programming language to use an element of the float value set to represent a value of type float
; however, it may be permissible in certain regions of code for an implementation to use an element of the float-extended-exponent value set instead. Similarly, it is always correct for an implementation to use an element of the double value set to represent a value of type double
; however, it may be permissible in certain regions of code for an implementation to use an element of the double-extended-exponent value set instead.
Except for NaN, floating-point values are ordered; arranged from smallest to largest, they are negative infinity, negative finite nonzero values, positive and negative zero, positive finite nonzero values, and positive infinity.
IEEE 754 allows multiple distinct NaN values for each of its ~~single~~**binary32** and ~~double~~**binary64** floating-point formats. While each hardware architecture returns a particular bit pattern for NaN when a new NaN is generated, a programmer can also create NaNs with different bit patterns to encode, for example, retrospective diagnostic information.
For the most part, the Java SE Platform treats NaN values of a given type as though collapsed into a single canonical value, and hence this specification normally refers to an arbitrary NaN as though to a canonical value.
However, version 1.3 of the Java SE Platform introduced methods enabling the programmer to distinguish between NaN values: the Float.floatToRawIntBits and Double.doubleToRawLongBits methods. The interested reader is referred to the specifications for the Float and Double classes for more information.
Positive zero and negative zero compare equal; thus the result of the expression 0.0==-0.0
is true
and the result of 0.0>-0.0
is false. But other operations can distinguish positive and negative zero; for example, 1.0/0.0
has the value positive infinity, while the value of 1.0/-0.0
is negative infinity.
NaN is unordered, so:
- The numerical comparison operators <, <=, >, and >= return false if either or both operands are NaN (15.20.1).
  In particular, (x<y) == !(x>=y) will be false if x or y is NaN.
- The equality operator == returns false if either operand is NaN.
- The inequality operator != returns true if either operand is NaN (15.21.1).
  In particular, x!=x is true if and only if x is NaN.
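The ordering rules for signed zeros and NaN can be observed directly. The following sketch is offered only as an illustration; the class ZeroAndNaN and its two helper methods are names made up for this example.

```java
class ZeroAndNaN {
    // x != x is true if and only if x is NaN.
    static boolean isNaN(double x) {
        return x != x;
    }

    // The identity (x<y) == !(x>=y) holds for ordered operands only.
    static boolean complementHolds(double x, double y) {
        return (x < y) == !(x >= y);
    }

    public static void main(String[] args) {
        // The two zeros compare equal but are distinguishable by division.
        System.out.println(0.0 == -0.0);
        System.out.println(1.0 / 0.0 == Double.POSITIVE_INFINITY);
        System.out.println(1.0 / -0.0 == Double.NEGATIVE_INFINITY);

        double nan = 0.0 / 0.0;
        System.out.println(isNaN(nan));
        System.out.println(nan == nan);                 // false: NaN is unordered
        System.out.println(complementHolds(1.0, 2.0));  // true for ordered operands
        System.out.println(complementHolds(nan, 2.0));  // false when an operand is NaN
    }
}
```

Because == treats every NaN as unequal to itself, code that needs to test for NaN would normally use x != x or the library methods Float.isNaN and Double.isNaN.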
4.2.4 Floating-Point Operations
The Java programming language provides a number of operators that act on floating-point values:
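- The comparison operators, which result in a value of type boolean:
  - The numerical comparison operators <, <=, >, and >= (15.20.1)
  - The numerical equality operators == and != (15.21.1)
- The numerical operators, which result in a value of type float or double:
  - The unary plus and minus operators + and - (15.15.3, 15.15.4)
  - The multiplicative operators *, /, and % (15.17)
  - The additive operators + and - (15.18.2)
  - The increment operator ++, both prefix (15.15.1) and postfix (15.14.2)
  - The decrement operator --, both prefix (15.15.2) and postfix (15.14.3)
- The conditional operator ? : (15.25)
- The cast operator (15.16), which can convert from a floating-point value to a value of any specified numeric type
- The string concatenation operator + (15.18.1), which, when given a String operand and a floating-point operand, will convert the floating-point operand to a String representing its value in decimal form (without information loss), and then produce a newly created String by concatenating the two strings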
Other useful constructors, methods, and constants are predefined in the classes Float
, Double
, and Math
.
If at least one of the operands to a binary operator is of floating-point type, then the operation is a floating-point operation, even if the other is integral.
If at least one of the operands to a numerical operator is of type double
, then the operation is carried out using 64-bit floating-point arithmetic, and the result of the numerical operator is a value of type double
. If the other operand is not a double
, it is first widened (5.1.5) to type double
by numeric promotion (5.6).
Otherwise, the operation is carried out using 32-bit floating-point arithmetic, and the result of the numerical operator is a value of type float
. (If the other operand is not a float
, it is first widened to type float
by numeric promotion.)
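The effect of numeric promotion on the result type can be made visible by boxing the result of each operation. This is a minimal sketch; the class PromotionDemo and the helper typeOf are hypothetical names used only here.

```java
class PromotionDemo {
    // Boxing the argument reveals the static type chosen for the expression.
    static String typeOf(Object boxed) {
        return boxed.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        int i = 1;
        long l = 1L;
        float f = 2.0f;
        double d = 2.0;

        System.out.println(typeOf(i + f)); // int is widened to float
        System.out.println(typeOf(l + f)); // long is widened to float
        System.out.println(typeOf(f + d)); // float is widened to double
        System.out.println(typeOf(i * d)); // int is widened to double
        System.out.println(typeOf(i + l)); // no floating-point operand: long
    }
}
```

The program prints Float, Float, Double, Double, and Long, one per line.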
Any value of a floating-point type may be cast to or from any numeric type. There are no casts between floating-point types and the type boolean
.
See 4.2.5 for an idiom to convert floating-point expressions to
boolean
.
Operators on floating-point numbers behave as specified by IEEE 754 (with the exception of the remainder operator (15.17.3)). In particular, the Java programming language requires support of IEEE 754 ~~denormalized~~**subnormal** floating-point numbers and gradual underflow, which make it easier to prove desirable properties of particular numerical algorithms. Floating-point operations do not "flush to zero" if the calculated result is a ~~denormalized~~**subnormal** number.
Floating-point arithmetic is an approximation to real arithmetic. While there are an infinite number of real numbers, a particular floating-point format only has a finite number of values. A rounding policy is a function used in floating-point arithmetic to map from a real number to a floating-point value in a format. For real numbers in the representable range of a floating-point format, a continuous segment of the real number line is mapped to a single floating-point value. The real number whose value is numerically equal to a floating-point value is mapped to that floating-point value. For example, the real number 1.5 gets mapped to the floating-point value '1.5' in a given format.
The Java programming language defines two rounding policies, as follows:
The round to nearest rounding policy applies to all floating-point operations except converting to an integer value and remainder. Under the round to nearest rounding policy, inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen.
The round to nearest rounding policy corresponds to the default rounding-direction attribute for binary arithmetic in IEEE 754, roundTiesToEven.
The roundTiesToEven rounding-direction attribute was known as the "round to nearest" rounding mode in the 1985 edition of IEEE 754. The name of the rounding policy in the Java programming language is drawn from the name of this rounding mode.
The round toward zero rounding policy applies when converting a floating-point value to an integer value (5.1.3) and remainder (15.17.3). Under the round toward zero rounding policy, inexact results are rounded to the nearest representable value that is not greater in magnitude than the infinitely precise result. For converting to integer, the round toward zero rounding policy is equivalent to truncation where fractional significand bits are discarded.
The round toward zero rounding policy corresponds to the roundTowardZero rounding-direction attribute in IEEE 754.
The roundTowardZero rounding-direction attribute was known as the "round toward zero" rounding mode in the 1985 edition of IEEE 754. The name of the rounding policy in the Java programming language is drawn from the name of this rounding mode.
The Java programming language requires that floating-point arithmetic behave as if every floating-point ~~operator~~**operation** rounded its floating-point result to the result precision. The rounding policy used for the arithmetic operations is round to nearest, except for converting a floating-point value to an integer and remainder where round toward zero is used instead.
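The two rounding policies can be contrasted with a few conversions and operations. The sketch below is illustrative only; the class name RoundingPolicies and its helper methods are assumptions of this example. It relies on the fact that a float has a 24-bit significand, so odd integers just above 2^24 fall exactly halfway between two representable values.

```java
class RoundingPolicies {
    // Integer-to-float conversion uses round to nearest, with ties going to the even significand.
    static float toFloat(int i) {
        return (float) i;
    }

    // Floating-point-to-integer conversion uses round toward zero.
    static int toInt(double d) {
        return (int) d;
    }

    public static void main(String[] args) {
        // 2^24 + 1 is a tie between 2^24 and 2^24 + 2; the even neighbour is 2^24.
        System.out.println(toFloat(16777217) == 16777216f);
        // 2^24 + 3 is a tie between 2^24 + 2 and 2^24 + 4; the even neighbour is 2^24 + 4.
        System.out.println(toFloat(16777219) == 16777220f);
        // Arithmetic results are rounded to nearest as well.
        System.out.println(0.1 + 0.2 == 0.30000000000000004);

        // Casts to int discard the fraction, whatever the sign.
        System.out.println(toInt(2.9) + " " + toInt(-2.9));
        // The % operator truncates the implied quotient; Math.IEEEremainder rounds it to nearest.
        System.out.println((5.5 % 2.0) + " " + Math.IEEEremainder(5.5, 2.0));
    }
}
```

The first three lines print true; the last two print "2 -2" and "1.5 -0.5".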
Inexact results must be rounded to the representable value nearest to the infinitely precise result; if the two nearest representable values are equally near, the one with its least significant bit zero is chosen. This is the IEEE 754 standard's default rounding mode known as round to nearest.
The Java programming language uses round toward zero when converting a floating value to an integer (5.1.3), which acts, in this case, as though the number were truncated, discarding the mantissa bits. Rounding toward zero chooses as its result the format's value closest to and no greater in magnitude than the infinitely precise result.
A floating-point operation that overflows produces a signed infinity.
A floating-point operation that underflows produces a ~~denormalized~~**subnormal** value or a signed zero.
A floating-point operation that has no unique mathematically ~~definite~~**defined** result produces NaN.
All numeric operations with NaN as an operand produce NaN as a result.
A floating-point operator can throw an exception (11) for the following reasons:
- Any floating-point operator can throw a NullPointerException if unboxing conversion (5.1.8) of a null reference is required.
- The increment and decrement operators ++ (15.14.2, 15.15.1) and -- (15.14.3, 15.15.2) can throw an OutOfMemoryError if boxing conversion (5.1.7) is required and there is not sufficient memory available to perform the conversion.
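The first of these cases is easy to reproduce, and it contrasts with division by zero, which yields an infinity and throws nothing. The following is a small sketch under that reading; the class UnboxingFailure and its methods are hypothetical names.

```java
class UnboxingFailure {
    // Adding to a null Double requires unboxing, which throws NullPointerException.
    static boolean nullOperandThrows() {
        Double boxed = null;
        try {
            double sum = boxed + 1.0;
            return false; // not reached
        } catch (NullPointerException expected) {
            return true;
        }
    }

    // Floating-point division by zero produces an infinity rather than an exception.
    static double divideByZero(double x) {
        return x / 0.0;
    }

    public static void main(String[] args) {
        System.out.println(nullOperandThrows());
        System.out.println(divideByZero(1.0));
    }
}
```

This prints true followed by Infinity.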
:::example
Example 4.2.4-1. Floating-point Operations
    class Test {
        public static void main(String[] args) {
            // An example of overflow:
            double d = 1e308;
            System.out.print("overflow produces infinity: ");
            System.out.println(d + "*10==" + d*10);
            // An example of gradual underflow:
            d = 1e-305 * Math.PI;
            System.out.print("gradual underflow: " + d + "\n ");
            for (int i = 0; i < 4; i++)
                System.out.print(" " + (d /= 100000));
            System.out.println();
            // An example of NaN:
            System.out.print("0.0/0.0 is Not-a-Number: ");
            d = 0.0/0.0;
            System.out.println(d);
            // An example of inexact results and rounding:
            System.out.print("inexact results with float:");
            for (int i = 0; i < 100; i++) {
                float z = 1.0f / i;
                if (z * i != 1.0f)
                    System.out.print(" " + i);
            }
            System.out.println();
            // Another example of inexact results and rounding:
            System.out.print("inexact results with double:");
            for (int i = 0; i < 100; i++) {
                double z = 1.0 / i;
                if (z * i != 1.0)
                    System.out.print(" " + i);
            }
            System.out.println();
            // An example of cast to integer rounding:
            System.out.print("cast to int rounds toward 0: ");
            d = 12345.6;
            System.out.println((int)d + " " + (int)(-d));
        }
    }
This program produces the output:
overflow produces infinity: 1.0E308*10==Infinity
gradual underflow: 3.141592653589793E-305
3.1415926535898E-310 3.141592653E-315 3.142E-320 0.0
0.0/0.0 is Not-a-Number: NaN
inexact results with float: 0 41 47 55 61 82 83 94 97
inexact results with double: 0 49 98
cast to int rounds toward 0: 12345 -12345
This example demonstrates, among other things, that gradual underflow can result in a gradual loss of precision.
The results when i
is 0
involve division by zero, so that z
becomes positive infinity, and z * 0
is NaN, which is not equal to 1.0
.
:::
Chapter 5: Conversions and Contexts
5.1 Kinds of Conversion
5.1.2 Widening Primitive Conversion
19 specific conversions on primitive types are called the widening primitive conversions:
- byte to short, int, long, float, or double
- short to int, long, float, or double
- char to int, long, float, or double
- int to long, float, or double
- long to float or double
- float to double
A widening primitive conversion does not lose information about the overall magnitude of a numeric value in the following cases, where the numeric value is preserved exactly:
- from an integral type to another integral type
- from byte, short, or char to a floating point type
- from int to double
- from float to double in a strictfp expression (15.4)
A widening primitive conversion from float
to double
that is not strictfp
may lose information about the overall magnitude of the converted value.
A widening primitive conversion from int
to float
, or from long
to float
, or from long
to double
, may result in loss of precision - that is, the result may lose some of the least significant bits of the value. In this case, the resulting floating-point value will be a correctly rounded version of the integer value, using ~~IEEE 754 round-to-nearest mode~~**the round to nearest rounding policy** (4.2.4).
A widening conversion of a signed integer value to an integral type T simply sign-extends the two's-complement representation of the integer value to fill the wider format.
A widening conversion of a char
to an integral type T zero-extends the representation of the char
value to fill the wider format.
Despite the fact that loss of precision may occur, a widening primitive conversion never results in a run-time exception (11.1.1).
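The loss of precision described above can be measured by converting back to the integral type. The sketch below is illustrative; the class WideningLoss and its methods are names invented for this example.

```java
class WideningLoss {
    // Difference between an int and its float approximation, after converting back.
    static int floatError(int value) {
        float approx = value; // widening primitive conversion, round to nearest
        return value - (int) approx;
    }

    // Round trip of a long through double.
    static long throughDouble(long value) {
        double approx = value; // widening primitive conversion, round to nearest
        return (long) approx;
    }

    public static void main(String[] args) {
        // 1234567890 needs 31 bits, but a float significand holds only 24.
        System.out.println(floatError(1234567890));
        // Every int is exactly representable as a double.
        System.out.println((int) (double) Integer.MAX_VALUE == Integer.MAX_VALUE);
        // 2^53 + 1 needs 54 bits, one more than a double significand holds.
        long big = (1L << 53) + 1;
        System.out.println(throughDouble(big) == (1L << 53));
    }
}
```

This prints -46, true, and true: the float nearest to 1234567890 is 1234567936, and 2^53 + 1 rounds to 2^53.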
5.1.3 Narrowing Primitive Conversion
22 specific conversions on primitive types are called the narrowing primitive conversions:
- short to byte or char
- char to byte or short
- int to byte, short, or char
- long to byte, short, char, or int
- float to byte, short, char, int, or long
- double to byte, short, char, int, long, or float
A narrowing primitive conversion may lose information about the overall magnitude of a numeric value and may also lose precision and range.
A narrowing primitive conversion from double
to float
~~is governed by the IEEE 754 rounding rules~~**uses the round to nearest rounding policy (4.2.4)**. This conversion can lose precision, but also lose range, resulting in a float
zero from a nonzero double
and a float
infinity from a finite double
. A double
NaN is converted to a float
NaN and a double
infinity is converted to the same-signed float
infinity.
A narrowing conversion of a signed integer to an integral type T simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the sign of the resulting value to differ from the sign of the input value.
A narrowing conversion of a char
to an integral type T likewise simply discards all but the n lowest order bits, where n is the number of bits used to represent type T. In addition to a possible loss of information about the magnitude of the numeric value, this may cause the resulting value to be a negative number, even though chars represent 16-bit unsigned integer values.
A narrowing conversion of a floating-point number to an integral type T takes two steps:
- In the first step, the floating-point number is converted either to a long, if T is long, or to an int, if T is byte, short, char, or int, as follows:
  - If the floating-point number is NaN (4.2.3), the result of the first step of the conversion is an int or long 0.
  - Otherwise, if the floating-point number is not an infinity, the floating-point value is rounded to an integer value V, rounding toward zero using ~~IEEE 754 round-toward-zero mode~~**the round toward zero rounding policy** ~~(4.2.3)~~**(4.2.4)**. Then there are two cases:
    - If T is long, and this integer value can be represented as a long, then the result of the first step is the long value V.
    - Otherwise, if this integer value can be represented as an int, then the result of the first step is the int value V.
  - Otherwise, one of the following two cases must be true:
    - The value must be too small (a negative value of large magnitude or negative infinity), and the result of the first step is the smallest representable value of type int or long.
    - The value must be too large (a positive value of large magnitude or positive infinity), and the result of the first step is the largest representable value of type int or long.
- In the second step:
  - If T is int or long, the result of the conversion is the result of the first step.
  - If T is byte, char, or short, the result of the conversion is the result of a narrowing conversion to type T (5.1.3) of the result of the first step.
Despite the fact that overflow, underflow, or other loss of information may occur, a narrowing primitive conversion never results in a run-time exception (11.1.1).
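The two-step conversion and the range-losing conversion from double to float can be exercised as follows. This is a sketch for illustration only; the class NarrowingDemo and its helper methods are hypothetical names.

```java
class NarrowingDemo {
    // Two-step narrowing: first to int (round toward zero, saturating), then to byte.
    static byte toByte(double d) {
        return (byte) d;
    }

    // Same first step, followed by a narrowing from int to short.
    static short toShort(float f) {
        return (short) f;
    }

    public static void main(String[] args) {
        // NaN converts to zero; out-of-range values saturate at the int or long limits.
        System.out.println((int) Float.NaN);
        System.out.println((int) 1e20 == Integer.MAX_VALUE);
        System.out.println((long) Float.NEGATIVE_INFINITY == Long.MIN_VALUE);

        // The second step keeps only the low-order bits of the int from the first step.
        System.out.println(toByte(300.7));                    // 300 keeps its low 8 bits
        System.out.println(toShort(Float.POSITIVE_INFINITY)); // low 16 bits of Integer.MAX_VALUE
        System.out.println(toShort(Float.NEGATIVE_INFINITY)); // low 16 bits of Integer.MIN_VALUE

        // double to float can lose range in both directions.
        System.out.println((float) 1e40);
        System.out.println((float) 1e-50);
    }
}
```

The output is 0, true, true, 44, -1, 0, Infinity, and 0.0. None of these conversions throws an exception.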
5.1.13 Value Set Conversion
Value set conversion is the process of mapping a floating-point value from one value set to another without changing its type.
Within an expression that is not FP-strict (15.4), value set conversion provides choices to an implementation of the Java programming language:
- If the value is an element of the float-extended-exponent value set, then the implementation may, at its option, map the value to the nearest element of the float value set. This conversion may result in overflow (in which case the value is replaced by an infinity of the same sign) or underflow (in which case the value may lose precision because it is replaced by a ~~denormalized~~**subnormal** number or zero of the same sign).
- If the value is an element of the double-extended-exponent value set, then the implementation may, at its option, map the value to the nearest element of the double value set. This conversion may result in overflow (in which case the value is replaced by an infinity of the same sign) or underflow (in which case the value may lose precision because it is replaced by a ~~denormalized~~**subnormal** number or zero of the same sign).
Within an FP-strict expression (15.4), value set conversion does not provide any choices; every implementation must behave in the same way:
- If the value is of type float and is not an element of the float value set, then the implementation must map the value to the nearest element of the float value set. This conversion may result in overflow or underflow.
- If the value is of type double and is not an element of the double value set, then the implementation must map the value to the nearest element of the double value set. This conversion may result in overflow or underflow.
Within an FP-strict expression, mapping values from the float-extended-exponent value set or double-extended-exponent value set is necessary only when a method is invoked whose declaration is not FP-strict and the implementation has chosen to represent the result of the method invocation as an element of an extended-exponent value set.
Whether in FP-strict code or code that is not FP-strict, value set conversion always leaves unchanged any value whose type is neither float
nor double
.
Chapter 15: Expressions
15.4 FP-strict Expressions
If the type of an expression is float
or double
, then there is a question as to what value set (4.2.3) the value of the expression is drawn from. This is governed by the rules of value set conversion (5.1.13); these rules in turn depend on whether or not the expression is FP-strict.
Every constant expression (15.29) is FP-strict.
If an expression is not a constant expression, then consider all the class declarations, interface declarations, and method declarations that contain the expression. If any such declaration bears the strictfp
modifier (8.1.1.3, 8.4.3.5, 9.1.1.2), then the expression is FP-strict.
If a class, interface, or method, X, is declared strictfp
, then X and any class, interface, method, constructor, instance initializer, static initializer, or variable initializer within X is said to be FP-strict.
Note that an annotation's element value (9.7) is always FP-strict, because it is always a constant expression.
It follows that an expression is not FP-strict if and only if it is not a constant expression and it does not appear within any declaration that has the strictfp
modifier.
Within an FP-strict expression, all intermediate values must be elements of the float value set or the double value set, implying that the results of all FP-strict expressions must be those predicted by IEEE 754 arithmetic on operands represented using ~~single~~**binary32** and ~~double~~**binary64** formats.
Within an expression that is not FP-strict, some leeway is granted for an implementation to use an extended exponent range to represent intermediate results; the net effect, roughly speaking, is that a calculation might produce "the correct answer" in situations where exclusive use of the float value set or double value set might result in overflow or underflow.
15.17 Multiplicative Operators
15.17.1 Multiplication Operator *
The binary * operator performs multiplication, producing the product of its operands.
Multiplication is a commutative operation if the operand expressions have no side effects.
Integer multiplication is associative when the operands are all of the same type.
Floating-point multiplication is not associative.
If an integer multiplication overflows, then the result is the low-order bits of the mathematical product as represented in some sufficiently large two's-complement format. As a result, if overflow occurs, then the sign of the result may not be the same as the sign of the mathematical product of the two operand values.
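The integer overflow rule can be illustrated with a small sketch. The class and helper names are hypothetical; the comparison against a long product is just one way to observe "the low-order bits of the mathematical product".

```java
class IntMultiplyOverflowDemo {
    // Hypothetical helper: a single int multiplication on non-constant operands.
    static int times(int a, int b) {
        return a * b;
    }

    // True if the int product equals the low-order 32 bits of the exact product.
    static boolean keepsLowOrderBits(int a, int b) {
        return times(a, b) == (int) ((long) a * (long) b);
    }

    public static void main(String[] args) {
        // The mathematical product is positive, but the wrapped result is negative.
        System.out.println(times(Integer.MAX_VALUE, 2));
        System.out.println(keepsLowOrderBits(Integer.MAX_VALUE, 2));
        // Associativity holds for same-typed integer operands even when overflow occurs.
        int left = times(times(123456789, 987654321), 42);
        int right = times(123456789, times(987654321, 42));
        System.out.println(left == right);
    }
}
```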
The result of a floating-point multiplication is determined by the rules of IEEE 754 arithmetic:
If either operand is NaN, the result is NaN.
If the result is not NaN, the sign of the result is positive if both operands have the same sign, and negative if the operands have different signs.
Multiplication of an infinity by a zero results in NaN.
Multiplication of an infinity by a finite value results in a signed infinity. The sign is determined by the rule stated above.
In the remaining cases, where neither an infinity nor NaN is involved, the exact mathematical product is computed. A floating-point value set is then chosen:
If the multiplication expression is FP-strict (15.4):
If the type of the multiplication expression is float, then the float value set must be chosen.
If the type of the multiplication expression is double, then the double value set must be chosen.
If the multiplication expression is not FP-strict:
If the type of the multiplication expression is float, then either the float value set or the float-extended-exponent value set may be chosen, at the whim of the implementation.
If the type of the multiplication expression is double, then either the double value set or the double-extended-exponent value set may be chosen, at the whim of the implementation.
Next, a value must be chosen from the chosen value set to represent the product.
If the magnitude of the product is too large to represent, we say the operation overflows; the result is then an infinity of appropriate sign.
Otherwise, the product is rounded to the nearest value in the chosen value set using the round to nearest rounding policy. The Java programming language requires support of gradual underflow (4.2.4).
Despite the fact that overflow, underflow, or loss of information may occur, evaluation of a multiplication operator * never throws a run-time exception.
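The floating-point multiplication rules above can be exercised with a short sketch. The class and helper names are hypothetical, and the sketch assumes that each product is represented in the double value set, as it would be in FP-strict code.

```java
class FloatMultiplyDemo {
    // Hypothetical helper: a single double multiplication on non-constant operands.
    static double mul(double a, double b) {
        return a * b;
    }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;
        System.out.println(mul(Double.NaN, 2.0));         // a NaN operand gives NaN
        System.out.println(mul(inf, 0.0));                // infinity times zero gives NaN
        System.out.println(mul(inf, -2.0));               // different signs give a negative infinity
        System.out.println(mul(Double.MAX_VALUE, 2.0));   // overflow gives an infinity
        // Gradual underflow: half of the smallest normal value is a nonzero subnormal.
        System.out.println(mul(Double.MIN_NORMAL, 0.5) > 0.0);
        // Half of the smallest subnormal rounds to zero under round to nearest.
        System.out.println(mul(Double.MIN_VALUE, 0.5) == 0.0);
        // Not associative: the grouping decides whether an intermediate overflow occurs.
        System.out.println(mul(mul(Double.MAX_VALUE, 2.0), 0.5));
        System.out.println(mul(Double.MAX_VALUE, mul(2.0, 0.5)));
    }
}
```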
15.17.2 Division Operator /
The binary / operator performs division, producing the quotient of its operands. The left-hand operand is the dividend and the right-hand operand is the divisor.
Integer division rounds toward 0. That is, the quotient produced for operands n and d that are integers after binary numeric promotion (5.6) is an integer value q whose magnitude is as large as possible while satisfying |d ⋅ q| ≤ |n|. Moreover, q is positive when |n| ≥ |d| and n and d have the same sign, but q is negative when |n| ≥ |d| and n and d have opposite signs.
There is one special case that does not satisfy this rule: if the dividend is the negative integer of largest possible magnitude for its type, and the divisor is -1, then integer overflow occurs and the result is equal to the dividend. Despite the overflow, no exception is thrown in this case. On the other hand, if the value of the divisor in an integer division is 0, then an ArithmeticException is thrown.
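The integer division rules, including the overflow special case and the zero divisor, might be checked as follows. The class and helper names are hypothetical.

```java
class IntDivisionDemo {
    // Hypothetical helper: a single int division on non-constant operands.
    static int div(int n, int d) {
        return n / d;
    }

    // Reports whether dividing n by zero raises ArithmeticException.
    static boolean throwsOnZeroDivisor(int n) {
        try {
            div(n, 0);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(div(7, 2));    // rounds toward 0
        System.out.println(div(-7, 2));   // rounds toward 0, not toward negative infinity
        // Special case: the result equals the dividend and no exception is thrown.
        System.out.println(div(Integer.MIN_VALUE, -1) == Integer.MIN_VALUE);
        System.out.println(throwsOnZeroDivisor(1));
    }
}
```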
The result of a floating-point division is determined by the rules of IEEE 754 arithmetic:
If either operand is NaN, the result is NaN.
If the result is not NaN, the sign of the result is positive if both operands have the same sign, and negative if the operands have different signs.
Division of an infinity by an infinity results in NaN.
Division of an infinity by a finite value results in a signed infinity. The sign is determined by the rule stated above.
Division of a finite value by an infinity results in a signed zero. The sign is determined by the rule stated above.
Division of a zero by a zero results in NaN; division of zero by any other finite value results in a signed zero. The sign is determined by the rule stated above.
Division of a nonzero finite value by a zero results in a signed infinity. The sign is determined by the rule stated above.
In the remaining cases, where neither an infinity nor NaN is involved, the exact mathematical quotient is computed. A floating-point value set is then chosen:
If the division expression is FP-strict (15.4):
If the type of the division expression is float, then the float value set must be chosen.
If the type of the division expression is double, then the double value set must be chosen.
If the division expression is not FP-strict:
If the type of the division expression is float, then either the float value set or the float-extended-exponent value set may be chosen, at the whim of the implementation.
If the type of the division expression is double, then either the double value set or the double-extended-exponent value set may be chosen, at the whim of the implementation.
Next, a value must be chosen from the chosen value set to represent the quotient.
If the magnitude of the quotient is too large to represent, we say the operation overflows; the result is then an infinity of appropriate sign.
Otherwise, the quotient is rounded to the nearest value in the chosen value set using the round to nearest rounding policy. The Java programming language requires support of gradual underflow (4.2.4).
Despite the fact that overflow, underflow, division by zero, or loss of information may occur, evaluation of a floating-point division operator / never throws a run-time exception.
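A brief sketch of the floating-point division rules follows. The class and helper names are hypothetical; Double.compare is used only because it distinguishes negative zero from positive zero, which == does not.

```java
class FloatDivisionDemo {
    // Hypothetical helper: a single double division on non-constant operands.
    static double div(double n, double d) {
        return n / d;
    }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;
        System.out.println(div(1.0, 0.0));    // nonzero finite by zero: signed infinity
        System.out.println(div(-1.0, 0.0));   // sign follows the sign rule
        System.out.println(div(0.0, 0.0));    // zero by zero: NaN
        System.out.println(div(inf, inf));    // infinity by infinity: NaN
        System.out.println(div(1.0, inf));    // finite by infinity: signed zero
        // The quotient of operands with different signs is a negative zero.
        System.out.println(Double.compare(div(-1.0, inf), -0.0) == 0);
    }
}
```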
15.17.3 Remainder Operator %
The binary % operator is said to yield the remainder of its operands from an implied division; the left-hand operand is the dividend and the right-hand operand is the divisor.
In C and C++, the remainder operator accepts only integral operands, but in the Java programming language, it also accepts floating-point operands.
The remainder operation for operands that are integers after binary numeric promotion (5.6) produces a result value such that (a/b)*b+(a%b) is equal to a.
This identity holds even in the special case that the dividend is the negative integer of largest possible magnitude for its type and the divisor is -1 (the remainder is 0).
It follows from this rule that the result of the remainder operation can be negative only if the dividend is negative, and can be positive only if the dividend is positive. Moreover, the magnitude of the result is always less than the magnitude of the divisor.
If the value of the divisor for an integer remainder operator is 0, then an ArithmeticException is thrown.
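The identity (a/b)*b+(a%b) == a and the two special cases described above can be checked with a small sketch. The class and helper names are hypothetical.

```java
class IntRemainderIdentityDemo {
    // Hypothetical helper: a single int remainder on non-constant operands.
    static int rem(int a, int b) {
        return a % b;
    }

    // Checks the identity (a/b)*b + (a%b) == a for a nonzero divisor b.
    static boolean identityHolds(int a, int b) {
        return (a / b) * b + (a % b) == a;
    }

    // Reports whether a remainder with a zero divisor raises ArithmeticException.
    static boolean throwsOnZeroDivisor(int a) {
        try {
            rem(a, 0);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(identityHolds(5, -3));
        System.out.println(identityHolds(-5, 3));
        // The identity survives the overflowing division Integer.MIN_VALUE / -1.
        System.out.println(identityHolds(Integer.MIN_VALUE, -1));
        System.out.println(rem(Integer.MIN_VALUE, -1));
        System.out.println(throwsOnZeroDivisor(5));
    }
}
```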
:::example
Example 15.17.3-1. Integer Remainder Operator
class Test1 {
    public static void main(String[] args) {
        int a = 5%3;  // 2
        int b = 5/3;  // 1
        System.out.println("5%3 produces " + a +
                           " (note that 5/3 produces " + b + ")");
        int c = 5%(-3);  // 2
        int d = 5/(-3);  // -1
        System.out.println("5%(-3) produces " + c +
                           " (note that 5/(-3) produces " + d + ")");
        int e = (-5)%3;  // -2
        int f = (-5)/3;  // -1
        System.out.println("(-5)%3 produces " + e +
                           " (note that (-5)/3 produces " + f + ")");
        int g = (-5)%(-3);  // -2
        int h = (-5)/(-3);  // 1
        System.out.println("(-5)%(-3) produces " + g +
                           " (note that (-5)/(-3) produces " + h + ")");
    }
}
This program produces the output:
5%3 produces 2 (note that 5/3 produces 1)
5%(-3) produces 2 (note that 5/(-3) produces -1)
(-5)%3 produces -2 (note that (-5)/3 produces -1)
(-5)%(-3) produces -2 (note that (-5)/(-3) produces 1)
:::
The result of a floating-point remainder operation as computed by the % operator is not the same as that produced by the remainder operation defined by IEEE 754, due to a different choice in rounding policy (4.2.4). The IEEE 754 remainder operation computes the remainder from a rounding division, an implied division using the round to nearest rounding policy, not a truncating division, an implied division using the round toward zero rounding policy, and so its behavior is not analogous to that of the usual integer remainder operator. Instead, the Java programming language defines % on floating-point operations to behave in a manner analogous to that of the integer remainder operator, with an implied division using the round toward zero rounding policy; this may be compared with the C library function fmod. The IEEE 754 remainder operation may be computed by the library routine Math.IEEEremainder or StrictMath.IEEEremainder.
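The contrast between the two remainder operations can be seen on a single pair of operands. In the sketch below the class and helper names are hypothetical; Math.IEEEremainder is the library routine named above.

```java
class RemainderPolicyDemo {
    // The % operator: implied division rounds toward zero.
    static double truncating(double n, double d) {
        return n % d;
    }

    // The IEEE 754 remainder: implied division rounds to nearest.
    static double ieee(double n, double d) {
        return Math.IEEEremainder(n, d);
    }

    public static void main(String[] args) {
        // 5.0/3.0 is about 1.67. Rounding toward zero gives q = 1, so r = 5.0 - 3.0 = 2.0.
        System.out.println(truncating(5.0, 3.0));
        // Rounding to nearest gives q = 2, so r = 5.0 - 6.0 = -1.0.
        System.out.println(ieee(5.0, 3.0));
    }
}
```

Note that the result of % takes the sign of the dividend, whereas the IEEE 754 remainder here is negative even though both operands are positive.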
The result of a floating-point remainder operation is determined by these rules of IEEE 754 arithmetic, which match the rules for IEEE 754 remainder other than how the implied division is computed:
If either operand is NaN, the result is NaN.
If the result is not NaN, the sign of the result equals the sign of the dividend.
If the dividend is an infinity, or the divisor is a zero, or both, the result is NaN.
If the dividend is finite and the divisor is an infinity, the result equals the dividend.
If the dividend is a zero and the divisor is finite, the result equals the dividend.
In the remaining cases, where neither an infinity, nor a zero, nor NaN is involved, the floating-point remainder r from the division of a dividend n by a divisor d is defined by the mathematical relation r = n - (d ⋅ q) where q is an integer that is negative only if n/d is negative and positive only if n/d is positive, and whose magnitude is as large as possible without exceeding the magnitude of the true mathematical quotient of n and d.
Evaluation of a floating-point remainder operator % never throws a run-time exception, even if the right-hand operand is zero. Overflow, underflow, or loss of precision cannot occur.
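The special cases in the rules above, which the example that follows does not cover, can be sketched as shown here. The class and helper names are hypothetical.

```java
class FloatRemainderSpecialCasesDemo {
    // Hypothetical helper: a single double remainder on non-constant operands.
    static double rem(double n, double d) {
        return n % d;
    }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;
        System.out.println(rem(Double.NaN, 3.0));  // a NaN operand gives NaN
        System.out.println(rem(inf, 3.0));         // infinite dividend gives NaN
        System.out.println(rem(5.0, 0.0));         // zero divisor gives NaN, with no exception
        System.out.println(rem(5.0, inf));         // finite dividend, infinite divisor: the dividend
        // A zero dividend is returned unchanged, including its sign.
        System.out.println(Double.compare(rem(-0.0, 3.0), -0.0) == 0);
        // General case: q = 2 for 5.5 / 2.0, so r = 5.5 - 2.0 * 2 = 1.5.
        System.out.println(rem(5.5, 2.0));
    }
}
```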
:::example
Example 15.17.3-2. Floating-Point Remainder Operator
class Test2 {
    public static void main(String[] args) {
        double a = 5.0%3.0;  // 2.0
        System.out.println("5.0%3.0 produces " + a);
        double b = 5.0%(-3.0);  // 2.0
        System.out.println("5.0%(-3.0) produces " + b);
        double c = (-5.0)%3.0;  // -2.0
        System.out.println("(-5.0)%3.0 produces " + c);
        double d = (-5.0)%(-3.0);  // -2.0
        System.out.println("(-5.0)%(-3.0) produces " + d);
    }
}
This program produces the output:
5.0%3.0 produces 2.0
5.0%(-3.0) produces 2.0
(-5.0)%3.0 produces -2.0
(-5.0)%(-3.0) produces -2.0
:::
15.18 Additive Operators
15.18.2 Additive Operators (+ and -) for Numeric Types
The binary + operator performs addition when applied to two operands of numeric type, producing the sum of the operands.
The binary - operator performs subtraction, producing the difference of two numeric operands.
Binary numeric promotion is performed on the operands (5.6).
Note that binary numeric promotion performs value set conversion (5.1.13) and may perform unboxing conversion (5.1.8).
The type of an additive expression on numeric operands is the promoted type of its operands.
If this promoted type is int or long, then integer arithmetic is performed.
If this promoted type is float or double, then floating-point arithmetic is performed.
Addition is a commutative operation if the operand expressions have no side effects.
Integer addition is associative when the operands are all of the same type.
Floating-point addition is not associative.
If an integer addition overflows, then the result is the low-order bits of the mathematical sum as represented in some sufficiently large two's-complement format. If overflow occurs, then the sign of the result is not the same as the sign of the mathematical sum of the two operand values.
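Integer addition overflow can be observed with a minimal sketch. The class and helper names are hypothetical.

```java
class IntAdditionOverflowDemo {
    // Hypothetical helper: a single int addition on non-constant operands.
    static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        // The mathematical sum is positive; the wrapped result is negative.
        System.out.println(add(Integer.MAX_VALUE, 1) == Integer.MIN_VALUE);
        // The mathematical sum is negative; the wrapped result is positive.
        System.out.println(add(Integer.MIN_VALUE, -1) == Integer.MAX_VALUE);
        // Associativity holds for same-typed integer operands even when overflow occurs.
        int left = add(add(Integer.MAX_VALUE, 1), -1);
        int right = add(Integer.MAX_VALUE, add(1, -1));
        System.out.println(left == right);
    }
}
```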
The result of a floating-point addition is determined using the following rules of IEEE 754 arithmetic:
If either operand is NaN, the result is NaN.
The sum of two infinities of opposite sign is NaN.
The sum of two infinities of the same sign is the infinity of that sign.
The sum of an infinity and a finite value is equal to the infinite operand.
The sum of two zeros of opposite sign is positive zero.
The sum of two zeros of the same sign is the zero of that sign.
The sum of a zero and a nonzero finite value is equal to the nonzero operand.
The sum of two nonzero finite values of the same magnitude and opposite sign is positive zero.
In the remaining cases, where neither an infinity, nor a zero, nor NaN is involved, and the operands have the same sign or have different magnitudes, the exact mathematical sum is computed. A floating-point value set is then chosen:
If the addition expression is FP-strict (15.4):
If the type of the addition expression is float, then the float value set must be chosen.
If the type of the addition expression is double, then the double value set must be chosen.
If the addition expression is not FP-strict:
If the type of the addition expression is float, then either the float value set or the float-extended-exponent value set may be chosen, at the whim of the implementation.
If the type of the addition expression is double, then either the double value set or the double-extended-exponent value set may be chosen, at the whim of the implementation.
Next, a value must be chosen from the chosen value set to represent the sum.
If the magnitude of the sum is too large to represent, we say the operation overflows; the result is then an infinity of appropriate sign.
Otherwise, the sum is rounded to the nearest value in the chosen value set using the round to nearest rounding policy. The Java programming language requires support of gradual underflow (4.2.4).
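The floating-point addition rules above, together with the non-associativity noted earlier, can be sketched as follows. The class and helper names are hypothetical, and the sketch assumes that each sum is represented in the double value set, as it would be in FP-strict code.

```java
class FloatAdditionDemo {
    // Hypothetical helper: a single double addition on non-constant operands.
    static double add(double a, double b) {
        return a + b;
    }

    // True if x is a zero with the same sign as expectedZero (== cannot tell them apart).
    static boolean isZeroLike(double x, double expectedZero) {
        return Double.compare(x, expectedZero) == 0;
    }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;
        double max = Double.MAX_VALUE;
        System.out.println(add(inf, -inf));                      // opposite infinities: NaN
        System.out.println(add(inf, 1.0));                       // infinity plus finite: infinity
        System.out.println(isZeroLike(add(0.0, -0.0), 0.0));     // opposite zeros: positive zero
        System.out.println(isZeroLike(add(-0.0, -0.0), -0.0));   // same-sign zeros keep the sign
        System.out.println(isZeroLike(add(1.5, -1.5), 0.0));     // exact cancellation: positive zero
        // Not associative: the grouping decides whether an intermediate overflow occurs.
        System.out.println(add(add(max, max), -max));
        System.out.println(add(max, add(max, -max)));
    }
}
```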
The binary - operator performs subtraction when applied to two operands of numeric type, producing the difference of its operands; the left-hand operand is the minuend and the right-hand operand is the subtrahend.
For both integer and floating-point subtraction, it is always the case that a-b produces the same result as a+(-b).
Note that, for integer values, subtraction from zero is the same as negation. However, for floating-point operands, subtraction from zero is not the same as negation, because if x is +0.0, then 0.0-x is +0.0, but -x is -0.0.
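The distinction between subtraction from zero and negation is visible only in the sign of a zero result, so the sketch below inspects the sign with Double.compare. The class and helper names are hypothetical.

```java
class SubtractVersusNegateDemo {
    // Subtraction from (positive) zero.
    static double zeroMinus(double x) {
        return 0.0 - x;
    }

    // Unary negation.
    static double negate(double x) {
        return -x;
    }

    public static void main(String[] args) {
        double x = 0.0;
        // 0.0 - (+0.0) is the sum of two zeros of opposite sign: positive zero.
        System.out.println(Double.compare(zeroMinus(x), 0.0) == 0);
        // -(+0.0) is negative zero.
        System.out.println(Double.compare(negate(x), -0.0) == 0);
        // For a nonzero operand the two agree.
        System.out.println(zeroMinus(2.5) == negate(2.5));
        // For integers, subtraction from zero is the same as negation.
        int i = 7;
        System.out.println(0 - i == -i);
    }
}
```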
Despite the fact that overflow, underflow, or loss of information may occur, evaluation of a numeric additive operator never throws a run-time exception.