Updating URI support for RFC 3986 and RFC 3987 in the JDK

Motivation

java.net.URI claims conformance to RFC 2396 (Uniform Resource Identifiers (URI): Generic Syntax) and RFC 2732 (Format for Literal IPv6 Addresses in URLs), with deviations. These two RFCs have been obsoleted by RFC 3986, which is now an Internet Standard. In addition, java.net.URI supports non-US-ASCII characters in URIs, a capability that was later formalized by RFC 3987. Because java.net.URI predates both RFC 3986 and RFC 3987, its documentation makes no mention of how java.net.URI deviates from these two RFCs.

Many bugs and much confusion could be avoided if java.net.URI were upgraded to support the newer RFCs.

Historical consideration

An attempt was made in the past to upgrade java.net.URI to support RFC 3986 and RFC 3987. This caused many JCK test failures and compatibility issues, and the change had to be backed out (see JDK-6394131). To reduce the risk of incompatibilities arising again, several approaches have been explored.

Significant differences between RFC 2396/RFC 2732 and RFC 3986/RFC 3987 which impact compatibility in java.net.URI

RFC 3986, Appendix D has an exhaustive list of differences between RFC 2396 and RFC 3986. Because java.net.URI supports RFC 2396 and RFC 2732 with deviations, this list is not an exact match for java.net.URI, but it is still very useful as a starting point to understand the impact on java.net.URI.

The most significant differences however are:

The table below shows the differences in legal characters accepted for each component, for java.net.URI versus RFC 3986:

    +-----------+------------------------------------------------------+-------------------------------------------------------+-------------------+
    | component | java.net.URI (current implementation)                | RFC 3986                                              | Diff              |
    +-----------+------------------------------------------------------+-------------------------------------------------------+-------------------+
    | scheme    | [A-Za-z0-9] + "+-."                                  | [A-Za-z0-9] + "+-."                                   | same              |
    | user      | [A-Za-z0-9] + "-_.!~*'()" + ";:&=+$," + %-enc        | [A-Za-z0-9] + "-._~" + "!$&'()*+,;=" + ":" + %-enc    | same              |
    | reg_name  | <user> + "@"                                         | <user> without ':'                                    | old = new + "@:"  |
    | path      | [A-Za-z0-9] + "-_.!~*'()" + ":@&=+$," + ";/" + %-enc | [A-Za-z0-9] + "-._~" + "!$&'()*+,;=" + ":@/" + %-enc  | same              |
    | opaque    | [A-Za-z0-9] + "-_.!~*'()" + ";/?:@&=+$,[]" + %-enc   | [A-Za-z0-9] + "-._~" + "!$&'()*+,;=" + ":@/" + %-enc  | old = new + "[]?" |
    | query     | [A-Za-z0-9] + "-_.!~*'()" + ";/?:@&=+$,[]" + %-enc   | [A-Za-z0-9] + "-._~" + "!$&'()*+,;=" + ":@/?" + %-enc | old = new + "[]"  |
    | fragment  | [A-Za-z0-9] + "-_.!~*'()" + ";/?:@&=+$,[]" + %-enc   | [A-Za-z0-9] + "-._~" + "!$&'()*+,;=" + ":@/?" + %-enc | old = new + "[]"  |
    +-----------+------------------------------------------------------+-------------------------------------------------------+-------------------+

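Basically, comparing the current implementation of java.net.URI with what RFC 3986 mandates, the Diff column shows that java.net.URI additionally accepts "@" and ":" in reg_name, "[", "]" and "?" in opaque parts, and "[" and "]" in query and fragment. The scheme, user and path components accept the same characters.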
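
The bracket entries in the table can be probed against a running JDK. The following is a minimal sketch, not part of either prototype: the class name BracketProbe and the sample strings are made up for illustration, and only the existing java.net.URI constructor is used.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative probe: which sample strings does the current java.net.URI parser accept? */
class BracketProbe {

    /** Returns true if java.net.URI parses the string without throwing. */
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "[" and "]" in the query and fragment: accepted by java.net.URI,
        // although RFC 3986 does not allow them there.
        System.out.println("query with []:    " + parses("http://h/p?a[0]=1"));
        System.out.println("fragment with []: " + parses("http://h/p#a[0]"));
        // "[" and "]" in the path: rejected by java.net.URI, as in RFC 3986.
        System.out.println("path with []:     " + parses("http://h/a[0]"));
    }
}
```

A stricter RFC 3986 parser would reject the first two strings as well, which is one way existing code could break after an upgrade.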

At this point a table showing some concrete examples might be the best way to visualize what the impact on java.net.URI could be if it were updated to conform to the newer RFCs.
If java.net.URI were updated to support RFC 3986 / RFC 3987, we could choose not to support some of these differences and list them as new deviations. Others would be more awkward to justify.

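Differences between RFC 2396 and RFC 3986 and impacts on java.net.URI

    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | Examples                       | RFC 2396 / java.net.URI                            | RFC 3986 / 3987                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | Parsing authority              |                                                    |                                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | Parsing rules                  | authority = server | reg_name                      | authority = [userinfo "@"] host [":" port]   |
    |                                | server = [[userinfo "@"] hostport]                 | host = IP-literal / IPv4address / reg-name   |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://example.com:-1/foo/"   | URI.getHost() = null                               | URISyntaxException (port = -1)               |
    |                                | URI.getAuthority() = "example.com:-1"              |                                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://1:2:3/foo/"            | URI.getHost() = null                               | URISyntaxException                           |
    |                                | URI.getAuthority() = "1:2:3"                       | (illegal character in port number)           |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://u@v@w/foo/"            | URI.getHost() = null                               | URISyntaxException                           |
    |                                | URI.getAuthority() = "u@v@w"                       | (illegal character in hostname)              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://"                      | URISyntaxException                                 | URI.getScheme() = "http"                     |
    |                                | (Expected authority at index 7)                    | URI.getAuthority() = ""                      |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "//"                           | URISyntaxException                                 | URI.getAuthority() = ""                      |
    |                                | (Expected authority at index 2)                    |                                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://u@x_y.com:42/foo/"     | URI.getHost() = null                               | URI.getHost() = "x_y.com"                    |
    |                                | URI.getUserInfo() = null                           | URI.getUserInfo() = "u"                      |
    |                                | URI.getPort() = -1                                 | URI.getPort() = 42                           |
    |                                | URI.getAuthority() = "u@x_y.com:42"                | URI.getAuthority() = "u@x_y.com:42"          |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://%41%42%43.com/foo/"    | URI.getHost() = null                               | URI.getHost() = "ABC.com"                    |
    |                                | URI.getAuthority() = "ABC.com"                     | URI.getAuthority() = "ABC.com"               |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "file:///foo"                  | URI.getAuthority() = null                          | URI.getAuthority() = ""                      |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | Parsing path                   |                                                    |                                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "about:"                       | URISyntaxException (empty path)                    | URI.getScheme() = "about"                    |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "mailto:x.y@z.com"             | URI.getPath() = null                               | URI.getPath() = "x.y@z.com"                  |
    |                                | URI.getSchemeSpecificPart() = "x.y@z.com"          | URI.getSchemeSpecificPart() (obsolete)       |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "urn:isbn:096139210?x"         | URI.getPath() = null                               | URI.getPath() = "isbn:096139210"             |
    |                                | URI.getQuery() = null                              | URI.getQuery() = "x"                         |
    |                                | URI.getSchemeSpecificPart() = "isbn:096139210?x"   | URI.getSchemeSpecificPart() (obsolete)       |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://?hmmm"                 | URI.getAuthority() = null                          | URI.getAuthority() = ""                      |
    |                                | URI.getPath() = ""                                 | URI.getPath() = ""                           |
    |                                | URI.getQuery() = "hmmm"                            | URI.getQuery() = "hmmm"                      |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http://#hmmm"                 | URI.getAuthority() = null                          | URI.getAuthority() = ""                      |
    |                                | URI.getPath() = ""                                 | URI.getPath() = ""                           |
    |                                | URI.getFragment() = "hmmm"                         | URI.getFragment() = "hmmm"                   |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http:?hmmm"                   | URI.getQuery() = null                              | URI.getQuery() = "hmmm"                      |
    |                                | URI.isOpaque() = true                              | URI.isOpaque() = false                       |
    |                                | URI.getPath() = null                               | URI.getPath() = ""                           |
    |                                | URI.getSchemeSpecificPart() = "?hmmm"              | URI.getSchemeSpecificPart() (obsolete)       |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "http:#hmmm"                   | URISyntaxException                                 | URI.getFragment() = "hmmm"                   |
    |                                | (Expected scheme-specific part at index 5)         | URI.isOpaque() = false                       |
    |                                |                                                    | URI.getPath() = ""                           |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | Normalization                  |                                                    |                                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "s://h/a/../../b"              | "s://h/../b"                                       | "s://h/b"                                    |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | Resolution                     |                                                    |                                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "s://h/a/c".resolve("../../b") | "s://h/../b"                                       | "s://h/b"                                    |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "s://h/a/c".resolve("")        | "s://h/a/"                                         | "s://h/a/c"                                  |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "s://h/a/c".resolve("?x=y")    | "s://h/a/?x=y"                                     | "s://h/a/c?x=y"                              |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "s://h/a/c".resolve("#x=y")    | "s://h/a/c#x=y"                                    | "s://h/a/c#x=y" (same)                       |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+
    | "s://h/a/c".resolve("/././x")  | "s://h/././x" (bug?)                               | "s://h/x"                                    |
    +--------------------------------+----------------------------------------------------+----------------------------------------------+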

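Some rows of the table can be checked against a running JDK. The sketch below is illustrative only: the class TableRowsProbe and its helpers are hypothetical names, describe() reports what the existing java.net.URI accessors return, and removeDotSegments() is a straightforward transcription of the remove_dot_segments steps in RFC 3986, section 5.2.4, not code taken from either prototype.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative check of a few table rows against the running JDK. */
class TableRowsProbe {

    /** Summarizes what java.net.URI reports for a string, or the parse failure. */
    static String describe(String s) {
        try {
            URI u = new URI(s);
            return "host=" + u.getHost()
                    + " authority=" + u.getAuthority()
                    + " path=" + u.getPath()
                    + " opaque=" + u.isOpaque();
        } catch (URISyntaxException e) {
            return "URISyntaxException: " + e.getReason();
        }
    }

    /** remove_dot_segments, following steps A to E of RFC 3986 section 5.2.4. */
    static String removeDotSegments(String path) {
        StringBuilder in = new StringBuilder(path);
        StringBuilder out = new StringBuilder();
        while (in.length() > 0) {
            String s = in.toString();
            if (s.startsWith("../")) {                    // A
                in.delete(0, 3);
            } else if (s.startsWith("./")) {              // A
                in.delete(0, 2);
            } else if (s.startsWith("/./")) {             // B: "/./" becomes "/"
                in.delete(0, 2);
            } else if (s.equals("/.")) {                  // B
                in.replace(0, 2, "/");
            } else if (s.startsWith("/../")) {            // C: also drop last output segment
                in.delete(0, 3);
                removeLastSegment(out);
            } else if (s.equals("/..")) {                 // C
                in.replace(0, 3, "/");
                removeLastSegment(out);
            } else if (s.equals(".") || s.equals("..")) { // D
                in.setLength(0);
            } else {                                      // E: move first segment to output
                int next = s.indexOf('/', 1);
                int end = next < 0 ? s.length() : next;
                out.append(s, 0, end);
                in.delete(0, end);
            }
        }
        return out.toString();
    }

    /** Removes the last segment and its preceding "/" (if any) from the output buffer. */
    private static void removeLastSegment(StringBuilder out) {
        int i = out.lastIndexOf("/");
        out.setLength(i < 0 ? 0 : i);
    }

    public static void main(String[] args) {
        String[] samples = {"http://u@x_y.com:42/foo/", "file:///foo", "mailto:x.y@z.com", "http://"};
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
        String path = "/a/../../b";
        System.out.println("URI.normalize(): " + URI.create("s://h" + path).normalize().getPath());
        System.out.println("RFC 3986 5.2.4:  " + removeDotSegments(path));
    }
}
```

The last two lines print the two sides of the Normalization row: the current implementation keeps the leading ".." segment, while the RFC 3986 algorithm drops it.
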
The table above shows the challenge of providing support for the newer RFCs in a backward-compatible way. There are also further differences in the way that java.net.URI and RFC 3987 restrict non-ASCII characters. The following section lists the alternatives that have been considered.

Possible Solutions

Several different solutions have been explored. Two of them have been prototyped.

  1. Do nothing. Just stay with the old RFCs (what we have now)
  2. Update URI in a major version, force old code to adapt (rejected: basically what has been attempted before, and which led to a backout)
  3. Same as above, but add a big switch, e.g. a system property, to select conformance to the new or old RFC (rejected).
  4. Same as above, but on a per-URI instance basis, depending on how the URI is constructed (rejected: too many combinatorial issues, see below).
  5. Add a new subclass of java.net.URI (rejected: URI is final but has public constructors, and we can't remove the final keyword. Even if we could, this would still present major compatibility risks when passing the subclass to old code expecting the superclass behavior, and we would stumble on the same issues as with the previous bullet).
  6. Leave java.net.URI alone, and add a new class e.g. java.net.IRI, which would implement the new RFC. To ease migration and possible future evolution, also introduce an abstract common ancestor to both classes, e.g. java.net.ResourceIdentifier (prototyped).
  7. Do not change the behavior of java.net.URI, but document its deviations from RFC 3986 / RFC 3987 instead of documenting its deviations from RFC 2396 / RFC 2732. Where possible, add new APIs to bring java.net.URI closer to RFC 3986 / RFC 3987. Only change the behavior of existing methods marginally, when regressions are unlikely (prototyped).

Prototypes

Prototype 1: Introducing a new public java.net.IRI class

This solution has been prototyped. The prototype is in reasonable shape but may still require more work. Here is a high-level description of the prototype:

Performance and footprint considerations:

Testing:

Compatibility:

List of issues logged against java.net.URI that have been fixed in (or are no longer applicable to) java.net.IRI

Note: the following issues below should probably be closed as Won't Fix:

The following issues still need investigation/fixing

Prototype 2: Re-wording java.net.URI API documentation, adding new methods for new behaviors

This second solution has been prototyped too. The prototype is in reasonable shape but may still require more work. Here is a high-level description of the prototype.

Claiming conformance to RFC 3986 / RFC 3987, even with deviations, still requires the addition of some new methods to the API in order to make the claim acceptable. This prototype thus comprises the following:

Performance and footprint considerations:

Testing:

Compatibility:

Notes

Note on per-URI instances and combinatorial complexity in Solution 4

The idea was to change java.net.URI to delegate everything to a wrapped instance of a package-private URIImpl class with two implementations: one that is basically a clone of the current java.net.URI, and another that implements the behavior of the newer RFCs. The old constructors and factories would instantiate the old implementation, and new factory methods would create instances of URI wrapping the new implementation. A new public enum/boolean accessor could be added to tell which flavor of the RFCs an instance of URI conforms to.
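
The shape of that design can be sketched in a few lines. Everything below is hypothetical and written only to illustrate the delegation idea: DualUri stands in for java.net.URI, Impl for the package-private URIImpl, and createRfc3986 for a new factory method. The "new" implementation merely splits components with the regular expression given in RFC 3986, Appendix B, and does no validation.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a java.net.URI that delegates to one of two parsers. */
final class DualUri {

    /** Which RFC generation an instance follows (the enum accessor mentioned above). */
    enum Flavor { RFC2396, RFC3986 }

    /** Hypothetical internal contract; a real one would cover the whole URI API. */
    interface Impl {
        Flavor flavor();
        String authority();
        String path();
    }

    private final Impl impl;

    private DualUri(Impl impl) { this.impl = impl; }

    /** Plays the role of the existing constructors and factories: old behavior. */
    static DualUri create(String s) { return new DualUri(new LegacyImpl(s)); }

    /** Plays the role of a new factory method: newer RFC behavior. */
    static DualUri createRfc3986(String s) { return new DualUri(new Rfc3986Impl(s)); }

    Flavor flavor() { return impl.flavor(); }
    String getAuthority() { return impl.authority(); }
    String getPath() { return impl.path(); }

    /** Old flavor: reuses today's java.net.URI unchanged. */
    private static final class LegacyImpl implements Impl {
        private final URI uri;
        LegacyImpl(String s) { this.uri = URI.create(s); }
        public Flavor flavor() { return Flavor.RFC2396; }
        public String authority() { return uri.getAuthority(); }
        public String path() { return uri.getPath(); }
    }

    /** New flavor: component split using the RFC 3986 Appendix B regular expression. */
    private static final class Rfc3986Impl implements Impl {
        private static final Pattern COMPONENTS = Pattern.compile(
                "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");
        private final Matcher m;
        Rfc3986Impl(String s) {
            m = COMPONENTS.matcher(s);
            if (!m.matches()) {
                throw new IllegalArgumentException(s);
            }
        }
        public Flavor flavor() { return Flavor.RFC3986; }
        public String authority() { return m.group(4); }
        public String path() { return m.group(5); }
    }

    public static void main(String[] args) {
        DualUri[] uris = { create("file:///foo"), createRfc3986("file:///foo") };
        for (DualUri u : uris) {
            System.out.println(u.flavor() + ": authority=" + u.getAuthority() + " path=" + u.getPath());
        }
    }
}
```

For "file:///foo" the old flavor reports a null authority and the new flavor an empty one, so two instances of the same public type built from the same string already disagree on an accessor.
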
This idea was rejected for the following reasons:


Last modified: Tue Feb 19 16:13:02 GMT 2019