
src/java.base/share/classes/java/net/URI.java





  66  * <h3> URI syntax and components </h3>
  67  *
  68  * At the highest level a URI reference (hereinafter simply "URI") in string
  69  * form has the syntax
  70  *
  71  * <blockquote>
  72  * [<i>scheme</i><b>{@code :}</b>]<i>scheme-specific-part</i>[<b>{@code #}</b><i>fragment</i>]
  73  * </blockquote>
  74  *
  75  * where square brackets [...] delineate optional components and the characters
  76  * <b>{@code :}</b> and <b>{@code #}</b> stand for themselves.
  77  *
  78  * <p> An <i>absolute</i> URI specifies a scheme; a URI that is not absolute is
  79  * said to be <i>relative</i>.  URIs are also classified according to whether
  80  * they are <i>opaque</i> or <i>hierarchical</i>.
  81  *
  82  * <p> An <i>opaque</i> URI is an absolute URI whose scheme-specific part does
  83  * not begin with a slash character ({@code '/'}).  Opaque URIs are not
  84  * subject to further parsing.  Some examples of opaque URIs are:
  85  *
  86  * <blockquote><table cellpadding=0 cellspacing=0 summary="layout">
  87  * <tr><td>{@code mailto:java-net@java.sun.com}<td></tr>
  88  * <tr><td>{@code news:comp.lang.java}<td></tr>
  89  * <tr><td>{@code urn:isbn:096139210x}</td></tr>
  90  * </table></blockquote>
  91  *
  92  * <p> A <i>hierarchical</i> URI is either an absolute URI whose
  93  * scheme-specific part begins with a slash character, or a relative URI, that
  94  * is, a URI that does not specify a scheme.  Some examples of hierarchical
  95  * URIs are:
  96  *
  97  * <blockquote>
  98  * {@code http://example.com/languages/java/}<br>
  99  * {@code sample/a/index.html#28}<br>
 100  * {@code ../../demo/b/index.html}<br>
 101  * {@code file:///~/calendar}
 102  * </blockquote>
 103  *
 104  * <p> A hierarchical URI is subject to further parsing according to the syntax
 105  *
 106  * <blockquote>
 107  * [<i>scheme</i><b>{@code :}</b>][<b>{@code //}</b><i>authority</i>][<i>path</i>][<b>{@code ?}</b><i>query</i>][<b>{@code #}</b><i>fragment</i>]
 108  * </blockquote>
 109  *
 110  * where the characters <b>{@code :}</b>, <b>{@code /}</b>,


 115  * <p> The authority component of a hierarchical URI is, if specified, either
 116  * <i>server-based</i> or <i>registry-based</i>.  A server-based authority
 117  * parses according to the familiar syntax
 118  *
 119  * <blockquote>
 120  * [<i>user-info</i><b>{@code @}</b>]<i>host</i>[<b>{@code :}</b><i>port</i>]
 121  * </blockquote>
 122  *
 123  * where the characters <b>{@code @}</b> and <b>{@code :}</b> stand for
 124  * themselves.  Nearly all URI schemes currently in use are server-based.  An
 125  * authority component that does not parse in this way is considered to be
 126  * registry-based.
 127  *
 128  * <p> The path component of a hierarchical URI is itself said to be absolute
 129  * if it begins with a slash character ({@code '/'}); otherwise it is
 130  * relative.  The path of a hierarchical URI that is either absolute or
 131  * specifies an authority is always absolute.
 132  *
 133  * <p> All told, then, a URI instance has the following nine components:
 134  *
 135  * <blockquote><table summary="Describes the components of a URI:scheme,scheme-specific-part,authority,user-info,host,port,path,query,fragment">


 136  * <tr><th><i>Component</i></th><th><i>Type</i></th></tr>


 137  * <tr><td>scheme</td><td>{@code String}</td></tr>
 138  * <tr><td>scheme-specific-part&nbsp;&nbsp;&nbsp;&nbsp;</td><td>{@code String}</td></tr>
 139  * <tr><td>authority</td><td>{@code String}</td></tr>
 140  * <tr><td>user-info</td><td>{@code String}</td></tr>
 141  * <tr><td>host</td><td>{@code String}</td></tr>
 142  * <tr><td>port</td><td>{@code int}</td></tr>
 143  * <tr><td>path</td><td>{@code String}</td></tr>
 144  * <tr><td>query</td><td>{@code String}</td></tr>
 145  * <tr><td>fragment</td><td>{@code String}</td></tr>

 146  * </table></blockquote>
 147  *
 148  * In a given instance any particular component is either <i>undefined</i> or
 149  * <i>defined</i> with a distinct value.  Undefined string components are
 150  * represented by {@code null}, while undefined integer components are
 151  * represented by {@code -1}.  A string component may be defined to have the
 152  * empty string as its value; this is not equivalent to that component being
 153  * undefined.
 154  *
 155  * <p> Whether a particular component is or is not defined in an instance
 156  * depends upon the type of the URI being represented.  An absolute URI has a
 157  * scheme component.  An opaque URI has a scheme, a scheme-specific part, and
 158  * possibly a fragment, but has no other components.  A hierarchical URI always
 159  * has a path (though it may be empty) and a scheme-specific-part (which at
 160  * least contains the path), and may have any of the other components.  If the
 161  * authority component is present and is server-based then the host component
 162  * will be defined and the user-information and port components may be defined.
 163  *
 164  *
 165  * <h4> Operations on URI instances </h4>


 231  * <blockquote>
 232  * {@code http://example.com/languages/java/sample/a/index.html#28}
 233  * </blockquote>
 234  *
 235  * against the base URI
 236  *
 237  * <blockquote>
 238  * {@code http://example.com/languages/java/}
 239  * </blockquote>
 240  *
 241  * yields the relative URI {@code sample/a/index.html#28}.
 242  *
 243  *
 244  * <h4> Character categories </h4>
 245  *
 246  * RFC&nbsp;2396 specifies precisely which characters are permitted in the
 247  * various components of a URI reference.  The following categories, most of
 248  * which are taken from that specification, are used below to describe these
 249  * constraints:
 250  *
 251  * <blockquote><table cellspacing=2 summary="Describes categories alpha,digit,alphanum,unreserved,punct,reserved,escaped,and other">


 252  *   <tr><th valign=top><i>alpha</i></th>
 253  *       <td>The US-ASCII alphabetic characters,
 254  *        {@code 'A'}&nbsp;through&nbsp;{@code 'Z'}
 255  *        and {@code 'a'}&nbsp;through&nbsp;{@code 'z'}</td></tr>
 256  *   <tr><th valign=top><i>digit</i></th>
 257  *       <td>The US-ASCII decimal digit characters,
 258  *       {@code '0'}&nbsp;through&nbsp;{@code '9'}</td></tr>
 259  *   <tr><th valign=top><i>alphanum</i></th>
 260  *       <td>All <i>alpha</i> and <i>digit</i> characters</td></tr>
 261  *   <tr><th valign=top><i>unreserved</i>&nbsp;&nbsp;&nbsp;&nbsp;</th>
 262  *       <td>All <i>alphanum</i> characters together with those in the string
 263  *        {@code "_-!.~'()*"}</td></tr>
 264  *   <tr><th valign=top><i>punct</i></th>
 265  *       <td>The characters in the string {@code ",;:$&+="}</td></tr>
 266  *   <tr><th valign=top><i>reserved</i></th>
 267  *       <td>All <i>punct</i> characters together with those in the string
 268  *        {@code "?/[]@"}</td></tr>
 269  *   <tr><th valign=top><i>escaped</i></th>
 270  *       <td>Escaped octets, that is, triplets consisting of the percent
 271  *           character ({@code '%'}) followed by two hexadecimal digits
 272  *           ({@code '0'}-{@code '9'}, {@code 'A'}-{@code 'F'}, and
 273  *           {@code 'a'}-{@code 'f'})</td></tr>
 274  *   <tr><th valign=top><i>other</i></th>
 275  *       <td>The Unicode characters that are not in the US-ASCII character set,
 276  *           are not control characters (according to the {@link
 277  *           java.lang.Character#isISOControl(char) Character.isISOControl}
 278  *           method), and are not space characters (according to the {@link
 279  *           java.lang.Character#isSpaceChar(char) Character.isSpaceChar}
 280  *           method)&nbsp;&nbsp;<i>(<b>Deviation from RFC 2396</b>, which is
 281  *           limited to US-ASCII)</i></td></tr>

 282  * </table></blockquote>
 283  *
 284  * <p><a id="legal-chars"></a> The set of all legal URI characters consists of
 285  * the <i>unreserved</i>, <i>reserved</i>, <i>escaped</i>, and <i>other</i>
 286  * characters.
 287  *
 288  *
 289  * <h4> Escaped octets, quotation, encoding, and decoding </h4>
 290  *
 291  * RFC 2396 allows escaped octets to appear in the user-info, path, query, and
 292  * fragment components.  Escaping serves two purposes in URIs:
 293  *
 294  * <ul>
 295  *
 296  *   <li><p> To <i>encode</i> non-US-ASCII characters when a URI is required to
 297  *   conform strictly to RFC&nbsp;2396 by not containing any <i>other</i>
 298  *   characters.  </p></li>
 299  *
 300  *   <li><p> To <i>quote</i> characters that are otherwise illegal in a
 301  *   component.  The user-info, path, query, and fragment components differ




  66  * <h3> URI syntax and components </h3>
  67  *
  68  * At the highest level a URI reference (hereinafter simply "URI") in string
  69  * form has the syntax
  70  *
  71  * <blockquote>
  72  * [<i>scheme</i><b>{@code :}</b>]<i>scheme-specific-part</i>[<b>{@code #}</b><i>fragment</i>]
  73  * </blockquote>
  74  *
  75  * where square brackets [...] delineate optional components and the characters
  76  * <b>{@code :}</b> and <b>{@code #}</b> stand for themselves.
  77  *
  78  * <p> An <i>absolute</i> URI specifies a scheme; a URI that is not absolute is
  79  * said to be <i>relative</i>.  URIs are also classified according to whether
  80  * they are <i>opaque</i> or <i>hierarchical</i>.
  81  *
  82  * <p> An <i>opaque</i> URI is an absolute URI whose scheme-specific part does
  83  * not begin with a slash character ({@code '/'}).  Opaque URIs are not
  84  * subject to further parsing.  Some examples of opaque URIs are:
  85  *
  86  * <blockquote><ul style="list-style-type:none">
  87  * <li>{@code mailto:java-net@java.sun.com}</li>
  88  * <li>{@code news:comp.lang.java}</li>
  89  * <li>{@code urn:isbn:096139210x}</li>
  90  * </ul></blockquote>
  91  *
  92  * <p> A <i>hierarchical</i> URI is either an absolute URI whose
  93  * scheme-specific part begins with a slash character, or a relative URI, that
  94  * is, a URI that does not specify a scheme.  Some examples of hierarchical
  95  * URIs are:
  96  *
  97  * <blockquote>
  98  * {@code http://example.com/languages/java/}<br>
  99  * {@code sample/a/index.html#28}<br>
 100  * {@code ../../demo/b/index.html}<br>
 101  * {@code file:///~/calendar}
 102  * </blockquote>
 103  *
 104  * <p> A hierarchical URI is subject to further parsing according to the syntax
 105  *
 106  * <blockquote>
 107  * [<i>scheme</i><b>{@code :}</b>][<b>{@code //}</b><i>authority</i>][<i>path</i>][<b>{@code ?}</b><i>query</i>][<b>{@code #}</b><i>fragment</i>]
 108  * </blockquote>
 109  *
 110  * where the characters <b>{@code :}</b>, <b>{@code /}</b>,


 115  * <p> The authority component of a hierarchical URI is, if specified, either
 116  * <i>server-based</i> or <i>registry-based</i>.  A server-based authority
 117  * parses according to the familiar syntax
 118  *
 119  * <blockquote>
 120  * [<i>user-info</i><b>{@code @}</b>]<i>host</i>[<b>{@code :}</b><i>port</i>]
 121  * </blockquote>
 122  *
 123  * where the characters <b>{@code @}</b> and <b>{@code :}</b> stand for
 124  * themselves.  Nearly all URI schemes currently in use are server-based.  An
 125  * authority component that does not parse in this way is considered to be
 126  * registry-based.
 127  *
 128  * <p> The path component of a hierarchical URI is itself said to be absolute
 129  * if it begins with a slash character ({@code '/'}); otherwise it is
 130  * relative.  The path of a hierarchical URI that is either absolute or
 131  * specifies an authority is always absolute.
 132  *
 133  * <p> All told, then, a URI instance has the following nine components:
 134  *
 135  * <blockquote><table class="borderless">
 136  * <caption style="display:none">Describes the components of a URI:scheme,scheme-specific-part,authority,user-info,host,port,path,query,fragment</caption>
 137  * <thead>
 138  * <tr><th><i>Component</i></th><th><i>Type</i></th></tr>
 139  * </thead>
 140  * <tbody>
 141  * <tr><td>scheme</td><td>{@code String}</td></tr>
 142  * <tr><td>scheme-specific-part&nbsp;&nbsp;&nbsp;&nbsp;</td><td>{@code String}</td></tr>
 143  * <tr><td>authority</td><td>{@code String}</td></tr>
 144  * <tr><td>user-info</td><td>{@code String}</td></tr>
 145  * <tr><td>host</td><td>{@code String}</td></tr>
 146  * <tr><td>port</td><td>{@code int}</td></tr>
 147  * <tr><td>path</td><td>{@code String}</td></tr>
 148  * <tr><td>query</td><td>{@code String}</td></tr>
 149  * <tr><td>fragment</td><td>{@code String}</td></tr>
 150  * </tbody>
 151  * </table></blockquote>
 152  *
 153  * In a given instance any particular component is either <i>undefined</i> or
 154  * <i>defined</i> with a distinct value.  Undefined string components are
 155  * represented by {@code null}, while undefined integer components are
 156  * represented by {@code -1}.  A string component may be defined to have the
 157  * empty string as its value; this is not equivalent to that component being
 158  * undefined.
 159  *
 160  * <p> Whether a particular component is or is not defined in an instance
 161  * depends upon the type of the URI being represented.  An absolute URI has a
 162  * scheme component.  An opaque URI has a scheme, a scheme-specific part, and
 163  * possibly a fragment, but has no other components.  A hierarchical URI always
 164  * has a path (though it may be empty) and a scheme-specific-part (which at
 165  * least contains the path), and may have any of the other components.  If the
 166  * authority component is present and is server-based then the host component
 167  * will be defined and the user-information and port components may be defined.
 168  *
 169  *
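Reviewer's illustration (not part of the patch): the component rules described above can be observed with the public accessors of `java.net.URI`. This is a minimal sketch; the class name `UriComponentsSketch` and the helper `describe` are hypothetical names chosen for the example, and the URIs are the ones already quoted in the comment.

```java
import java.net.URI;

public class UriComponentsSketch {

    /** Collects the components that the class comment says may be undefined. */
    static String describe(URI u) {
        return "scheme=" + u.getScheme()
             + " ssp=" + u.getSchemeSpecificPart()
             + " authority=" + u.getAuthority()
             + " host=" + u.getHost()
             + " port=" + u.getPort()
             + " path=" + u.getPath()
             + " query=" + u.getQuery()
             + " fragment=" + u.getFragment()
             + " opaque=" + u.isOpaque()
             + " absolute=" + u.isAbsolute();
    }

    public static void main(String[] args) {
        // Opaque: only scheme, scheme-specific part and (optionally) fragment are defined.
        System.out.println(describe(URI.create("mailto:java-net@java.sun.com")));
        // Hierarchical and absolute, with a server-based authority but no port.
        System.out.println(describe(URI.create("http://example.com/languages/java/")));
        // Hierarchical and relative: no scheme, but a path and a fragment.
        System.out.println(describe(URI.create("sample/a/index.html#28")));
        // Hierarchical with an empty path: the path is defined (""), not null.
        System.out.println(describe(URI.create("http://example.com")));
    }
}
```

Undefined string components show up as `null` and the undefined port as `-1`, which is the distinction the paragraph draws between "undefined" and "defined as the empty string".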
 170  * <h4> Operations on URI instances </h4>


 236  * <blockquote>
 237  * {@code http://example.com/languages/java/sample/a/index.html#28}
 238  * </blockquote>
 239  *
 240  * against the base URI
 241  *
 242  * <blockquote>
 243  * {@code http://example.com/languages/java/}
 244  * </blockquote>
 245  *
 246  * yields the relative URI {@code sample/a/index.html#28}.
 247  *
 248  *
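Reviewer's illustration (not part of the patch): the relativization example above can be reproduced with `URI.relativize`, and `URI.resolve` acts as its inverse here. A minimal sketch; the class name `UriRelativizeSketch` and the helper `relativeTo` are hypothetical.

```java
import java.net.URI;

public class UriRelativizeSketch {

    /** Relativizes target against base, both given in string form. */
    static URI relativeTo(String base, String target) {
        return URI.create(base).relativize(URI.create(target));
    }

    public static void main(String[] args) {
        String base = "http://example.com/languages/java/";
        String target = "http://example.com/languages/java/sample/a/index.html#28";

        // Relativizing the target against the base gives the relative URI from the comment.
        URI relative = relativeTo(base, target);
        System.out.println(relative);

        // Resolving that relative URI against the same base recovers the target.
        System.out.println(URI.create(base).resolve(relative));
    }
}
```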
 249  * <h4> Character categories </h4>
 250  *
 251  * RFC&nbsp;2396 specifies precisely which characters are permitted in the
 252  * various components of a URI reference.  The following categories, most of
 253  * which are taken from that specification, are used below to describe these
 254  * constraints:
 255  *
 256  * <blockquote><table>
 257  * <caption style="display:none">Describes categories alpha,digit,alphanum,unreserved,punct,reserved,escaped,and other</caption>
 258  *   <tbody>
 259  *   <tr><th valign=top><i>alpha</i></th>
 260  *       <td>The US-ASCII alphabetic characters,
 261  *        {@code 'A'}&nbsp;through&nbsp;{@code 'Z'}
 262  *        and {@code 'a'}&nbsp;through&nbsp;{@code 'z'}</td></tr>
 263  *   <tr><th valign=top><i>digit</i></th>
 264  *       <td>The US-ASCII decimal digit characters,
 265  *       {@code '0'}&nbsp;through&nbsp;{@code '9'}</td></tr>
 266  *   <tr><th valign=top><i>alphanum</i></th>
 267  *       <td>All <i>alpha</i> and <i>digit</i> characters</td></tr>
 268  *   <tr><th valign=top><i>unreserved</i>&nbsp;&nbsp;&nbsp;&nbsp;</th>
 269  *       <td>All <i>alphanum</i> characters together with those in the string
 270  *        {@code "_-!.~'()*"}</td></tr>
 271  *   <tr><th valign=top><i>punct</i></th>
 272  *       <td>The characters in the string {@code ",;:$&+="}</td></tr>
 273  *   <tr><th valign=top><i>reserved</i></th>
 274  *       <td>All <i>punct</i> characters together with those in the string
 275  *        {@code "?/[]@"}</td></tr>
 276  *   <tr><th valign=top><i>escaped</i></th>
 277  *       <td>Escaped octets, that is, triplets consisting of the percent
 278  *           character ({@code '%'}) followed by two hexadecimal digits
 279  *           ({@code '0'}-{@code '9'}, {@code 'A'}-{@code 'F'}, and
 280  *           {@code 'a'}-{@code 'f'})</td></tr>
 281  *   <tr><th valign=top><i>other</i></th>
 282  *       <td>The Unicode characters that are not in the US-ASCII character set,
 283  *           are not control characters (according to the {@link
 284  *           java.lang.Character#isISOControl(char) Character.isISOControl}
 285  *           method), and are not space characters (according to the {@link
 286  *           java.lang.Character#isSpaceChar(char) Character.isSpaceChar}
 287  *           method)&nbsp;&nbsp;<i>(<b>Deviation from RFC 2396</b>, which is
 288  *           limited to US-ASCII)</i></td></tr>
 289  * </tbody>
 290  * </table></blockquote>
 291  *
 292  * <p><a id="legal-chars"></a> The set of all legal URI characters consists of
 293  * the <i>unreserved</i>, <i>reserved</i>, <i>escaped</i>, and <i>other</i>
 294  * characters.
 295  *
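Reviewer's illustration (not part of the patch): the character categories can be probed by checking which strings the single-argument constructor accepts. A minimal sketch, assuming the sample URIs below, which are invented for the example; the class name `UriLegalCharsSketch` and the helper `parses` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriLegalCharsSketch {

    /** Returns true if the string is accepted by the single-argument URI constructor. */
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+00E9 is non-ASCII and neither a control nor a space character, so it is "other".
        System.out.println(parses("http://example.com/caf\u00e9"));
        // A space is in none of the legal categories.
        System.out.println(parses("http://example.com/a b"));
        // '%' must be followed by two hexadecimal digits to form an escaped octet.
        System.out.println(parses("http://example.com/%zz"));
        // toASCIIString() encodes "other" characters as escaped octets of their UTF-8 bytes.
        System.out.println(URI.create("http://example.com/caf\u00e9").toASCIIString());
    }
}
```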
 296  *
 297  * <h4> Escaped octets, quotation, encoding, and decoding </h4>
 298  *
 299  * RFC 2396 allows escaped octets to appear in the user-info, path, query, and
 300  * fragment components.  Escaping serves two purposes in URIs:
 301  *
 302  * <ul>
 303  *
 304  *   <li><p> To <i>encode</i> non-US-ASCII characters when a URI is required to
 305  *   conform strictly to RFC&nbsp;2396 by not containing any <i>other</i>
 306  *   characters.  </p></li>
 307  *
 308  *   <li><p> To <i>quote</i> characters that are otherwise illegal in a
 309  *   component.  The user-info, path, query, and fragment components differ

