This specification is not final and is subject to change. Use is subject to license terms.

Contextual Keywords

Changes to the Java® Language Specification • Version 16-internal+0-adhoc.gbierman.20201104

This document describes changes to the Java Language Specification to clarify the use of context when applying the lexical grammar, particularly in the identification of contextual keywords (formerly described as "restricted identifiers" and "restricted keywords").

We've experimented with two approaches for these keywords since Java SE 9. The first is to determine whether a character sequence is a keyword by describing where it appears in terms of the syntactic grammar. The second is to treat the keyword as an Identifier token, but later reference it literally, just like a keyword, in the syntactic grammar.

As we introduce new forms of contextual keywords like non-sealed, the second approach no longer works (non-sealed is a sequence of tokens, not a single identifier). So this revision standardizes on the first—contextual keywords are identified based on where they will appear in the syntactic grammar. See 3.9 for further discussion.

Changes are described with respect to existing sections of the JLS. New text is indicated like this and deleted text is indicated like this. Explanation and discussion, as needed, is set aside in grey boxes.

Chapter 3: Lexical Structure

3.2 Lexical Translations

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

  1. A translation of Unicode escapes (3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

  2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (3.4).

  3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (3.5) which, after white space (3.6) and comments (3.7) are discarded, comprise the tokens (3.5) that are the terminal symbols of the syntactic grammar (2.3).
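As an illustration of step 1, the following minimal sketch contrasts a Unicode escape with the same characters preceded by an extra backslash, a case that 3.3 excludes from translation. The class and method names are hypothetical and are not part of the specification.

```java
final class UnicodeEscapeDemo {
    // The six source characters of the escape below are translated to the
    // single character 'A' before the string literal is tokenized.
    static String translated() {
        return "\u0041";
    }

    // Here the backslash before 'u' is itself preceded by a backslash, so no
    // Unicode escape is recognized; the literal holds a backslash and "u0041".
    static String notTranslated() {
        return "\\u0041";
    }

    public static void main(String[] args) {
        System.out.println(translated().length());     // 1
        System.out.println(notTranslated().length());  // 6
    }
}
```

Because Unicode escapes are translated everywhere in the source, including inside comments, the comments in this sketch deliberately avoid spelling out an escape.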

The In general, the longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There is one exception: if lexical translation occurs in a type context (4.11) and the input stream has two or more consecutive > characters that are followed by a non-> character, then each > character must be translated to the token for the numerical comparison operator >. Some exceptions exist, as described in 3.3 and 3.5.

The input characters a--b are tokenized (3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.

Without the rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.

The previous assertion of just one exception was too strong.

In Step 1, the character sequence \\u1234 is treated as 7 distinct characters, not two (3.3). This is appropriate, but represents another exception to the "longest possible translation" rule.

In Step 3, contextual keywords may sometimes be hyphenated (though none are currently), and in some contexts the hyphen would be treated as a distinct token.

It's better to leave discussion about treatment of ambiguities to each section.

The discussion about > characters has moved to 3.12.

3.5 Input Elements and Tokens

The input characters and line terminators that result from Unicode escape processing (3.3) and then input line recognition (3.4) are reduced to a sequence of input elements.

Input:
{InputElement} [Sub]
InputElement:
WhiteSpace
Comment
Token
Token:
Identifier
Keyword
Literal
Separator
Operator
Sub:
the ASCII SUB character, also known as "control-Z"

Those input elements that are not white space or comments are tokens. The tokens are the terminal symbols of the syntactic grammar (2.3).

The Input production is ambiguous, meaning that, for some sequences of input characters and line terminators, there is more than one way to match the Input production to the sequence. Ambiguities are resolved as follows:

White space (3.6) and comments (3.7) can serve to separate tokens that, if adjacent, might be tokenized in another manner. For example, the ASCII characters - and = in the input can form the operator token -= (3.12) only if there is no intervening white space or comment.

So the input characters staticvoid are interpreted as a single identifier token, while the input characters static void (with an ASCII SP character in between c and v) are interpreted as a pair of keyword tokens, static and void, separated by whitespace.

Similarly, the input characters a--b are interpreted as a, --, and b, which is not part of any grammatically correct program, even though the interpretation a, -, -, and b could be part of a grammatically correct program. The input characters a- -b (with an ASCII SP character in between the two - characters), on the other hand, are interpreted as a, -, -, and b.
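The effect of white space on tokenization can be sketched in a small program. The class and method names below are hypothetical; the point is only that the placement of the space decides whether -- is read as one token or as two - tokens.

```java
final class TokenizationDemo {
    // With a space between the two '-' characters, the input is tokenized as
    // a, -, -, b: binary minus applied to the unary negation of b.
    static int minusNegated(int a, int b) {
        return a - -b;
    }

    // Here "--" is a single token (postfix decrement): the old value of a is
    // used as the left operand, and then b is subtracted from it.
    static int decrementThenSubtract(int a, int b) {
        return a-- - b;
    }

    // Writing a--b with no space would be tokenized as a, --, b and rejected
    // by the compiler, so it is not shown as live code.

    public static void main(String[] args) {
        System.out.println(minusNegated(5, 3));          // 8
        System.out.println(decrementThenSubtract(5, 3)); // 2
    }
}
```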

As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.

Consider two tokens x and y in the resulting input stream. If x precedes y, then we say that x is to the left of y and that y is to the right of x.

For example, in this simple piece of code:

class Empty {
}

we say that the } token is to the right of the { token, even though it appears, in this two-dimensional representation, downward and to the left of the { token. This convention about the use of the words left and right allows us to speak, for example, of the right-hand operand of a binary operator or of the left-hand side of an assignment.

3.8 Identifiers

An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.

Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
JavaLetter {JavaLetterOrDigit}
JavaLetter:
any Unicode character that is a "Java letter"
JavaLetterOrDigit:
any Unicode character that is a "Java letter-or-digit"

A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.

A "Java letter-or-digit" is a character for which the method Character.isJavaIdentifierPart(int) returns true.

The "Java letters" include uppercase and lowercase ASCII Latin letters A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical reasons, the ASCII dollar sign ($, or \u0024) and underscore (_, or \u005f). The dollar sign should be used only in mechanically generated source code or, rarely, to access pre-existing names on legacy systems. The underscore may be used in identifiers formed of two or more characters, but it cannot be used as a one-character identifier due to being a keyword.

The "Java digits" include the ASCII digits 0-9 (\u0030-\u0039).
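The IdentifierChars production can be checked at run time with the two Character methods named above. This is a minimal sketch with a hypothetical helper, isIdentifierChars; it tests only the character classes and does not exclude keywords, boolean literals, or the null literal.

```java
final class IdentifierCharsDemo {
    // Returns true if s matches IdentifierChars: a Java letter followed by
    // zero or more Java letters or digits. Keywords and literals are NOT
    // excluded here, so "class" also returns true.
    static boolean isIdentifierChars(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Iterate by code point so supplementary characters are handled.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifierChars("MAX_VALUE")); // true
        System.out.println(isIdentifierChars("3i"));        // false
    }
}
```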

Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean. This allows programmers to use identifiers in their programs that are written in their native languages.

An identifier cannot have the same spelling (Unicode character sequence) as a keyword (3.9), boolean literal (3.10.3), or the null literal (3.10.8), or a compile-time error occurs.

A sequence of input characters does not represent an identifier if (in a particular context) it represents a keyword (3.9), a boolean literal (3.10.3), or the null literal (3.10.8).

Two problems with the former phrasing:

Two identifiers are the same only if, after ignoring characters that are ignorable, the identifiers have the same Unicode character for each letter or digit. An ignorable character is a character for which the method Character.isIdentifierIgnorable(int) returns true. Identifiers that have the same external appearance may yet be different.

For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (Α, \u0391), CYRILLIC SMALL LETTER A (а, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (𝒂, \ud835\udc82) are all different.

Unicode composite characters are different from their canonical equivalent decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) is different from a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) in identifiers. See The Unicode Standard, Section 3.11 "Normalization Forms".
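The comparison rule for identifiers can be approximated as follows. This is a sketch under stated assumptions, not a description of how any compiler implements the rule: the helper names stripIgnorable and sameIdentifier are hypothetical, and the comparison is a plain code-point comparison after ignorable characters are removed, with no normalization.

```java
final class IdentifierEqualityDemo {
    // Removes every code point for which isIdentifierIgnorable returns true.
    static String stripIgnorable(String id) {
        StringBuilder sb = new StringBuilder();
        id.codePoints()
          .filter(cp -> !Character.isIdentifierIgnorable(cp))
          .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Two spellings denote the same identifier only if they agree character
    // by character once ignorable characters are dropped.
    static boolean sameIdentifier(String x, String y) {
        return stripIgnorable(x).equals(stripIgnorable(y));
    }

    public static void main(String[] args) {
        // U+0001 is an ISO control character and is ignorable in identifiers.
        String withControl = "a" + (char) 0x0001 + "b";
        System.out.println(sameIdentifier("ab", withControl));     // true
        // Composed A-acute versus A followed by a combining acute accent.
        System.out.println(sameIdentifier("\u00c1", "A\u0301"));   // false
        // Latin capital A versus Greek capital alpha.
        System.out.println(sameIdentifier("A", "\u0391"));         // false
    }
}
```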

Examples of identifiers are:

The identifiers var and yield are restricted identifiers because they are not allowed in some contexts.

In this revised approach, var and yield are contextual keywords. We keep the same restrictions on uses of var and yield, but the term "restricted identifier" no longer works—in contexts where they are "restricted", they may not be identifiers at all.

In some contexts, to facilitate the recognition of contextual keywords (3.9), the syntactic grammar disallows certain identifiers by defining a production in terms of a subset of identifiers. These subsets are defined as follows:

A type identifier is an identifier that is not the character sequence var or the character sequence yield.

TypeIdentifier:
Identifier but not var or yield

Type identifiers are A TypeIdentifier is used in certain contexts involving the declaration or use of types. For example, the name of a class must be a TypeIdentifier, so it is illegal to declare a class named var or yield (8.1).
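A short sketch of the distinction, assuming a compiler for Java SE 10 or later (the class and method names are hypothetical): var remains usable where an ordinary Identifier is expected, such as a variable name, but not where a TypeIdentifier is required.

```java
final class VarNameDemo {
    static int compute() {
        // Legal: a local variable name is an Identifier, so var is allowed.
        int var = 40;
        // The first var stands in for the variable's type; the second is the
        // variable declared on the previous line.
        var sum = var + 2;
        return sum;
    }

    // Not legal, and therefore left as a comment: a class name must be a
    // TypeIdentifier, which excludes var and yield.
    // class var { }

    public static void main(String[] args) {
        System.out.println(compute()); // 42
    }
}
```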

An unqualified method identifier is an identifier that is not the character sequence yield.

UnqualifiedMethodIdentifier:
Identifier but not yield

This restriction allows yield to be used in a yield statement (14.21) and still also be used as a (qualified) method name for compatibility reasons.

An UnqualifiedMethodIdentifier is used when referencing a method with a single identifier. An invocation of a method named yield must be qualified, to distinguish the invocation from a yield statement.
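The following sketch, which assumes a compiler for Java SE 14 or later and uses hypothetical names, shows a method named yield being invoked through a qualified name alongside yield statements in a switch expression.

```java
final class YieldDemo {
    // Declaring a method named yield is permitted.
    static int yield(int x) {
        return x + 1;
    }

    // The invocation is qualified with the class name; an unqualified
    // yield(1) would not be accepted as a method invocation.
    static int viaQualifiedName() {
        return YieldDemo.yield(1);
    }

    static int viaSwitch(int k) {
        return switch (k) {
            // Here yield begins a yield statement.
            case 0 -> { yield 10; }
            // A yield statement whose operand is a qualified call to the
            // method named yield.
            default -> { yield YieldDemo.yield(k); }
        };
    }

    public static void main(String[] args) {
        System.out.println(viaQualifiedName()); // 2
        System.out.println(viaSwitch(0));       // 10
        System.out.println(viaSwitch(5));       // 6
    }
}
```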

The formal term "type identifier" is only used once outside of this section (6.1), and "unqualified method identifier" is never used. It's simpler just to stick with the grammar production names.

The revised discussion about UnqualifiedMethodIdentifier mimics the TypeIdentifier discussion, describing where the restriction applies rather than getting into the design motivation. (Design motivation comes from the new introductory sentence, above.)

3.9 Keywords

51 character sequences, formed from ASCII letters characters, are reserved for use as keywords and cannot be used as identifiers (3.8). Another 12 character sequences, also formed from ASCII characters, may be interpreted as keywords, depending on the context in which they appear.

Note that _ is not an ASCII letter. Neither is -, which is expected to appear in some contextual keywords like non-sealed.

Keyword:
ReservedKeyword
ContextualKeyword
ReservedKeyword:
(one of)
abstract continue for new switch
assert default if package synchronized
boolean do goto private this
break double implements protected throw
byte else import public throws
case enum instanceof return transient
catch extends int short try
char final interface static void
class finally long strictfp volatile
const float native super while
_ (underscore)
ContextualKeyword:
(one of)
exports opens to var
module provides transitive with
open requires uses yield

The keywords const and goto are reserved, even though they are not currently used. This may allow a Java compiler to produce better error messages if these C++ keywords incorrectly appear in programs. The keyword _ (underscore) is reserved for possible future use in parameter declarations.

A character sequence matching a contextual keyword is not treated as a keyword if any part of the sequence can be combined with the immediately preceding or following characters to form a different token.

So the character sequence openmodule is interpreted as a single identifier rather than two contextual keywords, even at the start of a ModuleDeclaration. If two keywords are intended, they must be separated by whitespace or a comment.

Any other character sequence matching a contextual keyword is treated as a keyword if and only if it appears in one of the following contexts of the syntactic grammar:

While these rules depend on details of the syntactic grammar, a compiler for the Java Programming Language can implement them without fully parsing the input program. For example, a heuristic could be used to track the contextual state of the tokenizer, as long as the heuristic guarantees that valid uses of contextual keywords are tokenized as keywords, and valid uses of identifiers are tokenized as identifiers. Or the compiler could always tokenize a contextual keyword as an identifier, leaving it to later stages to recognize special uses of these identifiers.
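Whichever strategy a compiler uses, the observable result is that these character sequences remain ordinary identifiers outside the contexts where they act as keywords. A minimal sketch, with hypothetical class and method names, using several of the module-related sequences as local variable names:

```java
final class ContextualNamesDemo {
    // None of these declarations appears in a module declaration, so each
    // name is tokenized as an identifier rather than as a keyword.
    static int sum() {
        int open = 1, module = 2, requires = 3, exports = 4;
        int to = 5, with = 6, uses = 7, provides = 8;
        return open + module + requires + exports + to + with + uses + provides;
    }

    public static void main(String[] args) {
        System.out.println(sum()); // 36
    }
}
```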

We're being intentionally vague about how an implementation might disambiguate. The above note should be enough to make clear that this is an implementation choice, not something we're going to specify explicitly. Still, as the language evolves, designers will need to be careful about the implementation impact of new contextual keywords in certain contexts.

A variety of character sequences are sometimes assumed, incorrectly, to be keywords:

A further ten character sequences are restricted keywords: open, module, requires, transitive, exports, opens, to, uses, provides, and with. These character sequences are tokenized as keywords solely where they appear as terminals in the ModuleDeclaration, ModuleDirective, and RequiresModifier productions (7.7). They are tokenized as identifiers everywhere else, for compatibility with programs written before the introduction of restricted keywords. There is one exception: immediately to the right of the character sequence requires in the ModuleDirective production, the character sequence transitive is tokenized as a keyword unless it is followed by a separator, in which case it is tokenized as an identifier.

These rules are captured in the new rules, above.

The rephrasing "when appearing as specified" avoids the need for a special exception for transitive: as noted, if the grammar isn't looking for a RequiresModifier, then transitive isn't in an appropriate context to be treated as a keyword.

3.12 Operators

38 tokens, formed from ASCII characters, are the operators.

Operator:
(one of)
= > < ! ~ ? : ->
== >= <= != && || ++ --
+ - * / & | ^ % << >> >>>
+= -= *= /= &= |= ^= %= <<= >>= >>>=

When the character > appears in a type context (4.11)—that is, as part of a Type or an UnannType in the syntactic grammar (4.1, 8.3)—it is always treated as the > operator, even when it could be combined with an adjacent > character to form a different operator.

So the character sequence List<List<String>>, when appearing as a Type, ends with two > operators.

Without this rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized (3.5) as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
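The two readings of consecutive > characters can be seen side by side in ordinary code. In this sketch (class and method names are hypothetical), the same characters close two type argument lists in one place and form shift operators in another.

```java
import java.util.ArrayList;
import java.util.List;

final class AngleBracketDemo {
    static int nestedSize() {
        // In this type, the two trailing > characters are two separate
        // tokens, each closing one type argument list.
        List<List<String>> nested = new ArrayList<>();
        nested.add(new ArrayList<>());
        return nested.size();
    }

    // In an expression, >> is a single token: the signed right shift.
    static int signedShift(int x, int n) {
        return x >> n;
    }

    // Likewise, >>> is a single token: the unsigned right shift.
    static int unsignedShift(int x, int n) {
        return x >>> n;
    }

    public static void main(String[] args) {
        System.out.println(nestedSize());           // 1
        System.out.println(signedShift(-16, 2));    // -4
        System.out.println(unsignedShift(-16, 28)); // 15
    }
}
```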

Chapter 6: Names

6.1 Declarations

A declaration introduces an entity into a program and includes an identifier (3.8) that can be used in a name to refer to this entity. The identifier is constrained to be a type identifier TypeIdentifier when the entity being introduced is a class, interface, or type parameter.

...

6.5 Determining the Meaning of a Name

The meaning of a name depends on the context in which it is used. The determination of the meaning of a name requires three steps:

ModuleName:
Identifier
ModuleName . Identifier
PackageName:
Identifier
PackageName . Identifier
TypeName:
TypeIdentifier
PackageOrTypeName . TypeIdentifier
PackageOrTypeName:
Identifier
PackageOrTypeName . Identifier
ExpressionName:
Identifier
AmbiguousName . Identifier
MethodName:
UnqualifiedMethodIdentifier
AmbiguousName:
Identifier
AmbiguousName . Identifier

The use of context helps to minimize name conflicts between entities of different kinds. Such conflicts will be rare if the naming conventions described in 6.1 are followed. Nevertheless, conflicts may arise unintentionally as types developed by different programmers or different organizations evolve. For example, types, methods, and fields may have the same name. It is always possible to distinguish between a method and a field with the same name, since the context of a use always tells whether a method is intended.
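A small sketch of the last point, with hypothetical names: a field and a method share the name size, and the presence or absence of an argument list at each use determines which one is meant.

```java
final class SameNameDemo {
    // A field named size.
    int size = 3;

    // A method with the same name; inside it, the bare name size is an
    // expression name and refers to the field.
    int size() {
        return size * 2;
    }

    static int[] both() {
        SameNameDemo d = new SameNameDemo();
        // d.size denotes the field; d.size() denotes the method.
        return new int[] { d.size, d.size() };
    }

    public static void main(String[] args) {
        int[] r = both();
        System.out.println(r[0]); // 3
        System.out.println(r[1]); // 6
    }
}
```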

6.5.7 Meaning of Method Names

6.5.7.1 Simple Method Names

A simple method name appears in the context of a method invocation expression (15.12). The simple method name consists of a single Identifier UnqualifiedMethodIdentifier which specifies the name of the method to be invoked. The rules of method invocation require that the Identifier UnqualifiedMethodIdentifier either denotes a method that is in scope at the point of the method invocation, or denotes a method imported by a single-static-import declaration or static-import-on-demand declaration (7.5.3, 7.5.4).

Example 6.5.7.1-1. Simple Method Names

The following program demonstrates the role of scoping when determining which method to invoke.

class Super {
    void f2(String s)       {}
    void f3(String s)       {}
    void f3(int i1, int i2) {}
}

class Test {
    void f1(int i) {}
    void f2(int i) {}
    void f3(int i) {}

    void m() {
        new Super() {
            {
                f1(0);  // OK, resolves to Test.f1(int)
                f2(0);  // compile-time error
                f3(0);  // compile-time error
            }
        };
    }
}

For the invocation f1(0), only one method named f1 is in scope. It is the method Test.f1(int), whose declaration is in scope throughout the body of Test including the anonymous class declaration. 15.12.1 chooses to search in class Test since the anonymous class declaration has no member named f1. Eventually, Test.f1(int) is resolved.

For the invocation f2(0), two methods named f2 are in scope. First, the declaration of the method Super.f2(String) is in scope throughout the anonymous class declaration. Second, the declaration of the method Test.f2(int) is in scope throughout the body of Test including the anonymous class declaration. (Note that neither declaration shadows the other, because at the point where each is declared, the other is not in scope.) 15.12.1 chooses to search in class Super because it has a member named f2. However, Super.f2(String) is not applicable to f2(0), so a compile-time error occurs. Note that class Test is not searched.

For the invocation f3(0), three methods named f3 are in scope. First and second, the declarations of the methods Super.f3(String) and Super.f3(int,int) are in scope throughout the anonymous class declaration. Third, the declaration of the method Test.f3(int) is in scope throughout the body of Test including the anonymous class declaration. 15.12.1 chooses to search in class Super because it has a member named f3. However, Super.f3(String) and Super.f3(int,int) are not applicable to f3(0), so a compile-time error occurs. Note that class Test is not searched.

Choosing to search a nested class's superclass hierarchy before the lexically enclosing scope is called the "comb rule" (15.12.1).