This document describes changes to the Java Language Specification to clarify the use of context when applying the lexical grammar, particularly in the identification of contextual keywords (formerly described as "restricted identifiers" and "restricted keywords").
We've experimented with two approaches for these keywords since Java SE 9. The first is to determine whether a character sequence is a keyword by describing where it appears in terms of the syntactic grammar. The second is to treat the keyword as an Identifier token, but later reference it literally, just like a keyword, in the syntactic grammar.
As we introduce new forms of contextual keywords like non-sealed, the second approach no longer works (non-sealed is a sequence of tokens, not a single identifier). So this revision standardizes on the first: contextual keywords are identified based on where they will appear in the syntactic grammar. See 3.9 for further discussion.
Changes are described with respect to existing sections of the JLS. New text is indicated like this and deleted text is indicated like this. Explanation and discussion, as needed, is set aside in grey boxes.
Chapter 3: Lexical Structure
3.2 Lexical Translations
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes (3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (3.4).
3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (3.5) which, after white space (3.6) and comments (3.7) are discarded, comprise the tokens (3.5) that are the terminal symbols of the syntactic grammar (2.3).
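As an illustration of step 1, the following Java sketch translates Unicode escapes in a raw character sequence. It is a simplified model, not the algorithm of any particular compiler: the names UnicodeEscapeSketch, hexValue, and translate are hypothetical, and malformed escapes are reported with an exception where a compiler would report a compile-time error. The sketch assumes that a backslash is eligible to begin an escape only when it is preceded by an even number of contiguous raw backslashes (3.3).

```java
final class UnicodeEscapeSketch {

    /** Returns the value of an ASCII hexadecimal digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Replaces each eligible Unicode escape in raw with the UTF-16 code unit it denotes. */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // contiguous raw backslashes immediately before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean startsEscape = c == '\\' && backslashes % 2 == 0
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // the backslash may be followed by one or more 'u' characters
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int unit = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                unit = unit * 16 + digit;
            }
            out.append((char) unit);
            backslashes = 0; // assumption: a translated character is not a raw backslash
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw input: one backslash, then u0041bc. The escape is translated.
        System.out.println(translate("\\u0041bc"));
        // Raw input: two backslashes, then u1234. Nothing is translated.
        System.out.println(translate("\\\\u1234"));
    }
}
```

Under these assumptions, the second call leaves its seven raw characters untouched, because the second backslash is preceded by an odd number of backslashes and so cannot begin an escape.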
The In general, the longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >. Some exceptions exist, as described in 3.3 and 3.5.
The input characters a--b are tokenized (3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.
Without the rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
The previous assertion of just one exception was too strong.
In Step 1, the character sequence \\u1234 is treated as 7 distinct characters, not two (3.3). This is appropriate, but represents another exception to the "longest possible translation" rule.
In Step 3, contextual keywords may sometimes be hyphenated (though none are currently), and in some contexts the hyphen would be treated as a distinct token.
It's better to leave discussion about treatment of ambiguities to each section.
The discussion about > characters has moved to 3.12.
3.5 Input Elements and Tokens
The input characters and line terminators that result from Unicode escape processing (3.3) and then input line recognition (3.4) are reduced to a sequence of input elements.
- Input:
- {InputElement} [Sub]
- InputElement:
- WhiteSpace
- Comment
- Token
- Token:
- Identifier
- Keyword
- Literal
- Separator
- Operator
- Sub:
- the ASCII SUB character, also known as "control-Z"
Those input elements that are not white space or comments are tokens. The tokens are the terminal symbols of the syntactic grammar (2.3).
The Input production is ambiguous, meaning that, for some sequences of input characters and line terminators, there is more than one way to match the Input production to the sequence. Ambiguities are resolved as follows:
- Depending on context, as described in 3.9, a number of specific character sequences are sometimes interpreted as contextual keywords, and sometimes treated as other non-keyword tokens.
- Depending on context, as described in 3.12, the input character > is sometimes interpreted as an operator, and sometimes combined with adjacent characters (as in >>) to form a different operator.
- In every other circumstance, the longest matching token is always preferred, even if the result does not ultimately make a correct program while another lexical translation would.
White space (3.6) and comments (3.7) can serve to separate tokens that, if adjacent, might be tokenized in another manner. For example, the ASCII characters - and = in the input can form the operator token -= (3.12) only if there is no intervening white space or comment.
So the input characters staticvoid are interpreted as a single identifier token, while the input characters static void (with an ASCII SP character in between c and v) are interpreted as a pair of keyword tokens, static and void, separated by whitespace.
Similarly, the input characters a--b are interpreted as a, --, and b, which is not part of any grammatically correct program, even though the interpretation a, -, -, and b could be part of a grammatically correct program. The input characters a- -b (with an ASCII SP character in between the two - characters), on the other hand, are interpreted as a, -, -, and b.
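The longest-match behaviour described above can be observed in ordinary Java code. The following is a minimal sketch; the class and method names are hypothetical and exist only for illustration.

```java
final class LongestMatchSketch {

    /** The space forces the tokens a, -, -, b: a binary minus applied to a unary minus. */
    static int subtractNegated(int a, int b) {
        return a - -b;
    }

    /** The longest match yields the tokens a, --, -, b: post-decrement, then subtraction. */
    static int postDecrementThenSubtract(int a, int b) {
        return a---b;
    }

    /** staticvoid is a single Identifier token, so it can name a local variable. */
    static int identifierNotKeywords() {
        int staticvoid = 7;
        return staticvoid;
    }

    // Writing "return a--b;" would not compile: the longest-match rule yields
    // the tokens a, --, b, and no expression has that shape.
}
```

Here subtractNegated(5, 3) evaluates to 8, and postDecrementThenSubtract(5, 3) evaluates to 2 because the post-decrement produces the old value of a.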
As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
Consider two tokens x and y in the resulting input stream. If x precedes y, then we say that x is to the left of y and that y is to the right of x.
For example, in this simple piece of code:
class Empty {
}
we say that the } token is to the right of the { token, even though it appears, in this two-dimensional representation, downward and to the left of the { token. This convention about the use of the words left and right allows us to speak, for example, of the right-hand operand of a binary operator or of the left-hand side of an assignment.
3.8 Identifiers
An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
- Identifier:
- IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
- IdentifierChars:
- JavaLetter {JavaLetterOrDigit}
- JavaLetter:
- any Unicode character that is a "Java letter"
- JavaLetterOrDigit:
- any Unicode character that is a "Java letter-or-digit"
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.
A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.
The "Java letters" include uppercase and lowercase ASCII Latin letters A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical reasons, the ASCII dollar sign ($, or \u0024) and underscore (_, or \u005f). The dollar sign should be used only in mechanically generated source code or, rarely, to access pre-existing names on legacy systems. The underscore may be used in identifiers formed of two or more characters, but it cannot be used as a one-character identifier due to being a keyword.
The "Java digits" include the ASCII digits 0-9 (\u0030-\u0039).
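The IdentifierChars production can be checked directly with the two Character methods named above. The following is a minimal sketch; the class and method names are hypothetical, and the check deliberately stops at IdentifierChars, so it does not exclude keywords, boolean literals, or the null literal.

```java
final class IdentifierCharsSketch {

    /** Returns true if s matches IdentifierChars: a Java letter, then Java letters or digits. */
    static boolean isIdentifierChars(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so that supplementary characters are handled as one unit.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With this sketch, i3 and MAX_VALUE are accepted while 3i and a-b are rejected. The single character _ is also accepted, since the keyword exclusion belongs to the Identifier production rather than to IdentifierChars.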
Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.
An identifier cannot have the same spelling (Unicode character sequence) as a keyword (3.9), boolean literal (3.10.3), or the null literal (3.10.8), or a compile-time error occurs.
A sequence of input characters does not represent an identifier if (in a particular context) it represents a keyword (3.9), a boolean literal (3.10.3), or the null literal (3.10.8).
Two problems with the former phrasing:
"Have the same spelling" is unclear about whether it takes context into account. Looking purely at their characters, certain identifiers can, in fact, have the same spelling as certain keywords.
A programmer accidentally using a keyword where they intended to use an identifier won't necessarily get a compile-time error; it's also possible that the program is just interpreted differently than expected. (E.g., in a class body, Public Foo() { ... } is a method declaration, while public Foo() { ... } is a constructor declaration.)
Two identifiers are the same only if, after ignoring characters that are ignorable, the identifiers have the same Unicode character for each letter or digit. An ignorable character is a character for which the method Character.isIdentifierIgnorable(int) returns true. Identifiers that have the same external appearance may yet be different.
For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (Α, \u0391), CYRILLIC SMALL LETTER A (а, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (𝒂, \ud835\udc82) are all different.
Unicode composite characters are different from their canonical equivalent decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) is different from a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) in identifiers. See The Unicode Standard, Section 3.11 "Normalization Forms".
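The identity rule for identifiers can be modelled by dropping ignorable characters and then comparing what remains, with no Unicode normalization. The following is a minimal sketch; the class and method names are hypothetical.

```java
final class IdentifierEqualitySketch {

    /** Removes every character for which Character.isIdentifierIgnorable(int) returns true. */
    static String stripIgnorable(String identifier) {
        StringBuilder kept = new StringBuilder();
        identifier.codePoints()
                .filter(cp -> !Character.isIdentifierIgnorable(cp))
                .forEach(kept::appendCodePoint);
        return kept.toString();
    }

    /** Two identifiers are the same if they match code point for code point once ignorables are gone. */
    static boolean sameIdentifier(String left, String right) {
        return stripIgnorable(left).equals(stripIgnorable(right));
    }
}
```

Because no normalization is applied, the Latin and Greek capital letters compare as different, as do the precomposed and decomposed forms of A with an acute accent. An identifier containing an ignorable control character compares equal to the same identifier without it.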
Examples of identifiers are:
- String
- i3
- αρετη
- MAX_VALUE
- isLetterOrDigit
The identifiers var and yield are restricted identifiers because they are not allowed in some contexts.
In this revised approach, var and yield are contextual keywords. We keep the same restrictions on uses of var and yield, but the term "restricted identifier" no longer works: in contexts where they are "restricted", they may not be identifiers at all.
In some contexts, to facilitate the recognition of contextual keywords (3.9), the syntactic grammar disallows certain identifiers by defining a production in terms of a subset of identifiers. These subsets are defined as follows:
A type identifier is an identifier that is not the character sequence var or the character sequence yield.
- TypeIdentifier:
- Identifier but not var or yield
Type identifiers are A TypeIdentifier is used in certain contexts involving the declaration or use of types. For example, the name of a class must be a TypeIdentifier, so it is illegal to declare a class named var or yield (8.1).
An unqualified method identifier is an identifier that is not the character sequence yield.
- UnqualifiedMethodIdentifier:
- Identifier but not yield
This restriction allows yield to be used in a yield statement (14.21) and still also be used as a (qualified) method name for compatibility reasons.
An UnqualifiedMethodIdentifier is used when referencing a method with a single identifier. An invocation of a method named yield must be qualified, to distinguish the invocation from a yield statement.
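The effect of these two subsets can be seen in a short example. This is a minimal sketch with hypothetical class and method names; it uses only forms that remain legal under the restrictions, and shows the disallowed forms in comments.

```java
final class ContextualIdentifierSketch {

    /** A method may still be declared with the name yield. */
    static int yield(int x) {
        return x + 1;
    }

    /** The invocation is qualified, so yield is matched as an ordinary Identifier. */
    static int callQualified(int x) {
        return ContextualIdentifierSketch.yield(x);
        // An unqualified "yield(x)" here would need an UnqualifiedMethodIdentifier,
        // which excludes yield.
    }

    /** A variable name is an Identifier, not a TypeIdentifier, so var is permitted here. */
    static int varAsVariableName() {
        int var = 41;
        return var + 1;
    }

    // Declaring "class var { }" would be rejected: a class name must be a TypeIdentifier.
}
```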
The formal term "type identifier" is only used once outside of this section (6.1), and "unqualified method identifier" is never used. It's simpler just to stick with the grammar production names.
The revised discussion about UnqualifiedMethodIdentifier mimics the TypeIdentifier discussion, describing where the restriction applies rather than getting into the design motivation. (Design motivation comes from the new introductory sentence, above.)
3.9 Keywords
51 character sequences, formed from ASCII letters characters, are reserved for use as keywords and cannot be used as identifiers (3.8). Another 12 character sequences, also formed from ASCII characters, may be interpreted as keywords, depending on the context in which they appear.
Note that _ is not an ASCII letter. Neither is -, which is expected to appear in some contextual keywords like non-sealed.
- Keyword:
- ReservedKeyword
- ContextualKeyword
- ReservedKeyword:
- (one of)
abstract continue for new switch
assert default if package synchronized
boolean do goto private this
break double implements protected throw
byte else import public throws
case enum instanceof return transient
catch extends int short try
char final interface static void
class finally long strictfp volatile
const float native super while
_ (underscore)
- ContextualKeyword:
- (one of)
exports opens to var
module provides transitive with
open requires uses yield
The keywords const and goto are reserved, even though they are not currently used. This may allow a Java compiler to produce better error messages if these C++ keywords incorrectly appear in programs. The keyword _ (underscore) is reserved for possible future use in parameter declarations.
A character sequence matching a contextual keyword is not treated as a keyword if any part of the sequence can be combined with the immediately preceding or following characters to form a different token.
So the character sequence openmodule is interpreted as a single identifier rather than two contextual keywords, even at the start of a ModuleDeclaration. If two keywords are intended, they must be separated by whitespace or a comment.
Any other character sequence matching a contextual keyword is treated as a keyword if and only if it appears in one of the following contexts of the syntactic grammar:
- For open and module, when appearing as specified as a terminal in a ModuleDeclaration (7.7)
- For requires, exports, opens, uses, provides, to, and with, when appearing as specified as a terminal in a ModuleDirective
- For transitive, when appearing as specified as a terminal in a RequiresModifier
  The directive requires transitive; does not make use of RequiresModifier, and so in this case transitive is interpreted as an identifier.
- For var, when appearing as specified as a terminal in a LocalVariableType (14.4) or a LambdaParameterType (15.27.1)
  In many other contexts, attempting to use the character sequence var as an identifier will cause an error, because var is not a valid TypeIdentifier (3.8).
- For yield, when appearing as specified as a terminal in a YieldStatement (14.21)
  In many other contexts, attempting to use the character sequence yield as an identifier will cause an error, because yield is neither a valid TypeIdentifier nor a valid UnqualifiedMethodIdentifier.
While these rules depend on details of the syntactic grammar, a compiler for the Java Programming Language can implement them without fully parsing the input program. For example, a heuristic could be used to track the contextual state of the tokenizer, as long as the heuristic guarantees that valid uses of contextual keywords are tokenized as keywords, and valid uses of identifiers are tokenized as identifiers. Or the compiler could always tokenize a contextual keyword as an identifier, leaving it to later stages to recognize special uses of these identifiers.
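One such heuristic, for the single case of transitive in a requires directive, is one token of lookahead: treat transitive as a keyword unless the next token is a separator. The following sketch is an illustration only and is not taken from any compiler. The names RequiresDirectiveSketch, isSeparator, and transitiveIsKeyword are hypothetical, and the input is assumed to be the already-split tokens of one well-formed requires directive.

```java
final class RequiresDirectiveSketch {

    /** Returns true if token is one of the separators of the lexical grammar (3.11). */
    private static boolean isSeparator(String token) {
        switch (token) {
            case "(": case ")": case "{": case "}": case "[": case "]":
            case ";": case ",": case ".": case "...": case "@": case "::":
                return true;
            default:
                return false;
        }
    }

    /**
     * Decides whether the token at index is the modifier transitive.
     * Assumption: directiveTokens holds one well-formed requires directive.
     */
    static boolean transitiveIsKeyword(String[] directiveTokens, int index) {
        if (!directiveTokens[index].equals("transitive")) {
            return false;
        }
        // Followed by a separator (such as ; or .) it is part of a module name instead.
        return index + 1 < directiveTokens.length
                && !isSeparator(directiveTokens[index + 1]);
    }
}
```

For the directive requires transitive; the heuristic reports an identifier, and for requires transitive transitive.foo; it reports a keyword followed by an identifier, which agrees with the "when appearing as specified" rule above.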
We're being intentionally vague about how an implementation might disambiguate. The above note should be enough to make clear that this is an implementation choice, not something we're going to specify explicitly. Still, as the language evolves, designers will need to be careful about the implementation impact of new contextual keywords in certain contexts.
A variety of character sequences are sometimes assumed, incorrectly, to be keywords:
- true and false are not keywords, but rather boolean literals (3.10.3).
- null is not a keyword, but rather the null literal (3.10.8).
- var and yield are not keywords, but rather restricted identifiers (3.8). var has special meaning as the type of a local variable declaration (14.4, 14.14.1, 14.14.2, 14.20.3) and the type of a lambda formal parameter (15.27.1). yield has special meaning in a yield statement (14.21). All invocations of a method named yield must be qualified so as to be distinguished from a yield statement.
A further ten character sequences are restricted keywords: open, module, requires, transitive, exports, opens, to, uses, provides, and with. These character sequences are tokenized as keywords solely where they appear as terminals in the ModuleDeclaration, ModuleDirective, and RequiresModifier productions (7.7). They are tokenized as identifiers everywhere else, for compatibility with programs written before the introduction of restricted keywords. There is one exception: immediately to the right of the character sequence requires in the ModuleDirective production, the character sequence transitive is tokenized as a keyword unless it is followed by a separator, in which case it is tokenized as an identifier.
These rules are captured in the new rules, above.
The rephrasing "when appearing as specified" avoids the need for a special exception for transitive: as noted, if the grammar isn't looking for a RequiresModifier, then transitive isn't in an appropriate context to be treated as a keyword.
3.12 Operators
38 tokens, formed from ASCII characters, are the operators.
- Operator:
- (one of)
= > < ! ~ ? : ->
== >= <= != && || ++ --
+ - * / & | ^ % << >> >>>
+= -= *= /= &= |= ^= %= <<= >>= >>>=
When the character > appears in a type context (4.11), that is, as part of a Type or an UnannType in the syntactic grammar (4.1, 8.3), it is always treated as the > operator, even when it could be combined with an adjacent > character to form a different operator.
So the character sequence List<List<String>>, when appearing as a Type, ends with two > operators.
Without this rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized (3.5) as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
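The same characters therefore play two roles depending on context, as the following minimal sketch shows. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AngleBracketSketch {

    /** In a Type, the closing >> is two > tokens, one for each type argument list. */
    static List<List<String>> nested() {
        List<List<String>> outer = new ArrayList<>();
        outer.add(Arrays.asList("a", "b"));
        return outer;
    }

    /** In an expression, >> is a single token: the signed right shift operator. */
    static int signedShift(int x) {
        return x >> 2;
    }

    /** In an expression, >>> is a single token: the unsigned right shift operator. */
    static int unsignedShift(int x) {
        return x >>> 28;
    }
}
```

For example, signedShift(-16) evaluates to -4 and unsignedShift(-16) evaluates to 15, while the return type of nested() is parsed as a list of lists rather than as a shift expression.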
Chapter 6: Names
6.1 Declarations
A declaration introduces an entity into a program and includes an identifier (3.8) that can be used in a name to refer to this entity. The identifier is constrained to be a type identifier TypeIdentifier when the entity being introduced is a class, interface, or type parameter.
...
6.5 Determining the Meaning of a Name
The meaning of a name depends on the context in which it is used. The determination of the meaning of a name requires three steps:
First, context causes a name syntactically to fall into one of seven categories: ModuleName, PackageName, TypeName, ExpressionName, MethodName, PackageOrTypeName, or AmbiguousName.
TypeName and MethodName are less expressive than the other five categories, because they are denoted with TypeIdentifier and UnqualifiedMethodIdentifier (3.8), respectively.
The former excludes the character sequences var and yield (3.8), and the latter excludes the character sequence yield.
This sentence is bound to become out of sync as new contextual keywords are added. It's repeating the discussion from 3.8. A cross-reference is sufficient.
Second, a name that is initially classified by its context as an AmbiguousName or as a PackageOrTypeName is then reclassified to be a PackageName, TypeName, or ExpressionName.
Third, the resulting category then dictates the final determination of the meaning of the name (or a compile-time error if the name has no meaning).
- ModuleName:
- Identifier
- ModuleName . Identifier
- PackageName:
- Identifier
- PackageName . Identifier
- TypeName:
- TypeIdentifier
- PackageOrTypeName . TypeIdentifier
- PackageOrTypeName:
- Identifier
- PackageOrTypeName . Identifier
- ExpressionName:
- Identifier
- AmbiguousName . Identifier
- MethodName:
- UnqualifiedMethodIdentifier
- AmbiguousName:
- Identifier
- AmbiguousName . Identifier
The use of context helps to minimize name conflicts between entities of different kinds. Such conflicts will be rare if the naming conventions described in 6.1 are followed. Nevertheless, conflicts may arise unintentionally as types developed by different programmers or different organizations evolve. For example, types, methods, and fields may have the same name. It is always possible to distinguish between a method and a field with the same name, since the context of a use always tells whether a method is intended.
6.5.7 Meaning of Method Names
6.5.7.1 Simple Method Names
A simple method name appears in the context of a method invocation expression (15.12). The simple method name consists of a single Identifier UnqualifiedMethodIdentifier which specifies the name of the method to be invoked. The rules of method invocation require that the Identifier UnqualifiedMethodIdentifier either denotes a method that is in scope at the point of the method invocation, or denotes a method imported by a single-static-import declaration or static-import-on-demand declaration (7.5.3, 7.5.4).
Example 6.5.7.1-1. Simple Method Names
The following program demonstrates the role of scoping when determining which method to invoke.
class Super {
    void f2(String s) {}
    void f3(String s) {}
    void f3(int i1, int i2) {}
}
class Test {
    void f1(int i) {}
    void f2(int i) {}
    void f3(int i) {}
    void m() {
        new Super() {
            {
                f1(0); // OK, resolves to Test.f1(int)
                f2(0); // compile-time error
                f3(0); // compile-time error
            }
        };
    }
}
For the invocation f1(0), only one method named f1 is in scope. It is the method Test.f1(int), whose declaration is in scope throughout the body of Test including the anonymous class declaration. 15.12.1 chooses to search in class Test since the anonymous class declaration has no member named f1. Eventually, Test.f1(int) is resolved.
For the invocation f2(0), two methods named f2 are in scope. First, the declaration of the method Super.f2(String) is in scope throughout the anonymous class declaration. Second, the declaration of the method Test.f2(int) is in scope throughout the body of Test including the anonymous class declaration. (Note that neither declaration shadows the other, because at the point where each is declared, the other is not in scope.) 15.12.1 chooses to search in class Super because it has a member named f2. However, Super.f2(String) is not applicable to f2(0), so a compile-time error occurs. Note that class Test is not searched.
For the invocation f3(0), three methods named f3 are in scope. First and second, the declarations of the methods Super.f3(String) and Super.f3(int,int) are in scope throughout the anonymous class declaration. Third, the declaration of the method Test.f3(int) is in scope throughout the body of Test including the anonymous class declaration. 15.12.1 chooses to search in class Super because it has a member named f3. However, Super.f3(String) and Super.f3(int,int) are not applicable to f3(0), so a compile-time error occurs. Note that class Test is not searched.
Choosing to search a nested class's superclass hierarchy before the lexically enclosing scope is called the "comb rule" (15.12.1).
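A compilable variant of the example can make the contrast concrete. The sketch below is not part of the specification text: it wraps the classes in a hypothetical CombRuleSketch class, has the methods return strings so that the chosen method is observable, and reaches Test.f2(int) through the qualified form Test.this.f2(0), which is not a simple method name and so is not subject to the search described above.

```java
final class CombRuleSketch {

    static class Super {
        String f2(String s) { return "Super.f2(String)"; }
    }

    static class Test {
        String f1(int i) { return "Test.f1(int)"; }
        String f2(int i) { return "Test.f2(int)"; }

        String[] m() {
            final String[] results = new String[2];
            new Super() {
                {
                    // The anonymous class has no member named f1, so Test is searched.
                    results[0] = f1(0);
                    // A simple f2(0) would search only Super; qualifying with Test.this
                    // names the enclosing instance explicitly.
                    results[1] = Test.this.f2(0);
                }
            };
            return results;
        }
    }

    public static void main(String[] args) {
        for (String result : new Test().m()) {
            System.out.println(result);
        }
    }
}
```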