
src/java.base/share/classes/java/util/regex/Pattern.java





  62  * boolean b = m.{@link Matcher#matches matches}();</pre></blockquote>
  63  *
  64  * <p> A {@link #matches matches} method is defined by this class as a
  65  * convenience for when a regular expression is used just once.  This method
  66  * compiles an expression and matches an input sequence against it in a single
  67  * invocation.  The statement
  68  *
  69  * <blockquote><pre>
  70  * boolean b = Pattern.matches("a*b", "aaaaab");</pre></blockquote>
  71  *
  72  * is equivalent to the three statements above, though for repeated matches it
  73  * is less efficient since it does not allow the compiled pattern to be reused.
  74  *
  75  * <p> Instances of this class are immutable and are safe for use by multiple
  76  * concurrent threads.  Instances of the {@link Matcher} class are not safe for
  77  * such use.
  78  *
  79  *
  80  * <h3><a id="sum">Summary of regular-expression constructs</a></h3>
  81  *
  82  * <table border="0" cellpadding="1" cellspacing="0"
  83  *  summary="Regular expression constructs, and what they match">
  84  *
  85  * <tr style="text-align:left">
  86  * <th style="text-align:left" id="construct">Construct</th>
  87  * <th style="text-align:left" id="matches">Matches</th>
  88  * </tr>


  89  *
  90  * <tr><th>&nbsp;</th></tr>
  91  * <tr style="text-align:left"><th colspan="2" id="characters">Characters</th></tr>
  92  *
  93  * <tr><td style="vertical-align:top" headers="construct characters"><i>x</i></td>
  94  *     <td headers="matches">The character <i>x</i></td></tr>
  95  * <tr><td style="vertical-align:top" headers="construct characters">{@code \\}</td>
  96  *     <td headers="matches">The backslash character</td></tr>
  97  * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>n</i></td>
  98  *     <td headers="matches">The character with octal value {@code 0}<i>n</i>
  99  *         (0&nbsp;{@code <=}&nbsp;<i>n</i>&nbsp;{@code <=}&nbsp;7)</td></tr>
 100  * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>nn</i></td>
 101  *     <td headers="matches">The character with octal value {@code 0}<i>nn</i>
 102  *         (0&nbsp;{@code <=}&nbsp;<i>n</i>&nbsp;{@code <=}&nbsp;7)</td></tr>
 103  * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>mnn</i></td>
 104  *     <td headers="matches">The character with octal value {@code 0}<i>mnn</i>
 105  *         (0&nbsp;{@code <=}&nbsp;<i>m</i>&nbsp;{@code <=}&nbsp;3,
 106  *         0&nbsp;{@code <=}&nbsp;<i>n</i>&nbsp;{@code <=}&nbsp;7)</td></tr>
 107  * <tr><td style="vertical-align:top" headers="construct characters">{@code \x}<i>hh</i></td>
 108  *     <td headers="matches">The character with hexadecimal&nbsp;value&nbsp;{@code 0x}<i>hh</i></td></tr>


 366  *     <td headers="matches">Nothing, but turns match flags <a href="#CASE_INSENSITIVE">i</a>
 367  * <a href="#UNIX_LINES">d</a> <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a>
 368  * <a href="#UNICODE_CASE">u</a> <a href="#COMMENTS">x</a> <a href="#UNICODE_CHARACTER_CLASS">U</a>
 369  * on - off</td></tr>
 370  * <tr><td style="vertical-align:top" headers="construct special"><code>(?idmsux-idmsux:</code><i>X</i>{@code )}&nbsp;&nbsp;</td>
 371  *     <td headers="matches"><i>X</i>, as a <a href="#cg">non-capturing group</a> with the
 372  *         given flags <a href="#CASE_INSENSITIVE">i</a> <a href="#UNIX_LINES">d</a>
 373  * <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a> <a href="#UNICODE_CASE">u</a>
 374  * <a href="#COMMENTS">x</a> on - off</td></tr>
 375  * <tr><td style="vertical-align:top" headers="construct special">{@code (?=}<i>X</i>{@code )}</td>
 376  *     <td headers="matches"><i>X</i>, via zero-width positive lookahead</td></tr>
 377  * <tr><td style="vertical-align:top" headers="construct special">{@code (?!}<i>X</i>{@code )}</td>
 378  *     <td headers="matches"><i>X</i>, via zero-width negative lookahead</td></tr>
 379  * <tr><td style="vertical-align:top" headers="construct special">{@code (?<=}<i>X</i>{@code )}</td>
 380  *     <td headers="matches"><i>X</i>, via zero-width positive lookbehind</td></tr>
 381  * <tr><td style="vertical-align:top" headers="construct special">{@code (?<!}<i>X</i>{@code )}</td>
 382  *     <td headers="matches"><i>X</i>, via zero-width negative lookbehind</td></tr>
 383  * <tr><td style="vertical-align:top" headers="construct special">{@code (?>}<i>X</i>{@code )}</td>
 384  *     <td headers="matches"><i>X</i>, as an independent, non-capturing group</td></tr>
 385  *

 386  * </table>
 387  *
 388  * <hr>
 389  *
 390  *
 391  * <h3><a id="bs">Backslashes, escapes, and quoting</a></h3>
 392  *
 393  * <p> The backslash character ({@code '\'}) serves to introduce escaped
 394  * constructs, as defined in the table above, as well as to quote characters
 395  * that otherwise would be interpreted as unescaped constructs.  Thus the
 396  * expression {@code \\} matches a single backslash and <code>\{</code> matches a
 397  * left brace.
 398  *
 399  * <p> It is an error to use a backslash prior to any alphabetic character that
 400  * does not denote an escaped construct; these are reserved for future
 401  * extensions to the regular-expression language.  A backslash may be used
 402  * prior to a non-alphabetic character regardless of whether that character is
 403  * part of an unescaped construct.
 404  *
 405  * <p> Backslashes within string literals in Java source code are interpreted


 412  * <code>"\b"</code>, for example, matches a single backspace character when
 413  * interpreted as a regular expression, while {@code "\\b"} matches a
 414  * word boundary.  The string literal {@code "\(hello\)"} is illegal
 415  * and leads to a compile-time error; in order to match the string
 416  * {@code (hello)} the string literal {@code "\\(hello\\)"}
 417  * must be used.
 418  *
 419  * <h3><a id="cc">Character Classes</a></h3>
 420  *
 421  *    <p> Character classes may appear within other character classes, and
 422  *    may be composed by the union operator (implicit) and the intersection
 423  *    operator ({@code &&}).
 424  *    The union operator denotes a class that contains every character that is
 425  *    in at least one of its operand classes.  The intersection operator
 426  *    denotes a class that contains every character that is in both of its
 427  *    operand classes.
 428  *
 429  *    <p> The precedence of character-class operators is as follows, from
 430  *    highest to lowest:
 431  *
 432  *    <blockquote><table border="0" cellpadding="1" cellspacing="0"
 433  *                 summary="Precedence of character class operators.">

 434  *      <tr><th>1&nbsp;&nbsp;&nbsp;&nbsp;</th>
 435  *        <td>Literal escape&nbsp;&nbsp;&nbsp;&nbsp;</td>
 436  *        <td>{@code \x}</td></tr>
 437  *     <tr><th>2&nbsp;&nbsp;&nbsp;&nbsp;</th>
 438  *        <td>Grouping</td>
 439  *        <td>{@code [...]}</td></tr>
 440  *     <tr><th>3&nbsp;&nbsp;&nbsp;&nbsp;</th>
 441  *        <td>Range</td>
 442  *        <td>{@code a-z}</td></tr>
 443  *      <tr><th>4&nbsp;&nbsp;&nbsp;&nbsp;</th>
 444  *        <td>Union</td>
 445  *        <td>{@code [a-e][i-u]}</td></tr>
 446  *      <tr><th>5&nbsp;&nbsp;&nbsp;&nbsp;</th>
 447  *        <td>Intersection</td>
 448  *        <td>{@code [a-z&&[aeiou]]}</td></tr>

 449  *    </table></blockquote>
 450  *
 451  *    <p> Note that a different set of metacharacters is in effect inside
 452  *    a character class than outside a character class. For instance, the
 453  *    regular expression {@code .} loses its special meaning inside a
 454  *    character class, while the expression {@code -} becomes a
 455  *    range-forming metacharacter.
 456  *
 457  * <h3><a id="lt">Line terminators</a></h3>
 458  *
 459  * <p> A <i>line terminator</i> is a one- or two-character sequence that marks
 460  * the end of a line of the input character sequence.  The following are
 461  * recognized as line terminators:
 462  *
 463  * <ul>
 464  *
 465  *   <li> A newline (line feed) character&nbsp;({@code '\n'}),
 466  *
 467  *   <li> A carriage-return character followed immediately by a newline
 468  *   character&nbsp;({@code "\r\n"}),


 479  * <p>If {@link #UNIX_LINES} mode is activated, then the only line terminators
 480  * recognized are newline characters.
 481  *
 482  * <p> The regular expression {@code .} matches any character except a line
 483  * terminator unless the {@link #DOTALL} flag is specified.
 484  *
 485  * <p> By default, the regular expressions {@code ^} and {@code $} ignore
 486  * line terminators and only match at the beginning and the end, respectively,
 487  * of the entire input sequence. If {@link #MULTILINE} mode is activated then
 488  * {@code ^} matches at the beginning of input and after any line terminator
 489  * except at the end of input. When in {@link #MULTILINE} mode {@code $}
 490  * matches just before a line terminator or the end of the input sequence.
 491  *
 492  * <h3><a id="cg">Groups and capturing</a></h3>
 493  *
 494  * <h4><a id="gnumber">Group number</a></h4>
 495  * <p> Capturing groups are numbered by counting their opening parentheses from
 496  * left to right.  In the expression {@code ((A)(B(C)))}, for example, there
 497  * are four such groups: </p>
 498  *
 499  * <blockquote><table cellpadding=1 cellspacing=0 summary="Capturing group numberings">


 500  * <tr><th>1&nbsp;&nbsp;&nbsp;&nbsp;</th>
 501  *     <td>{@code ((A)(B(C)))}</td></tr>
 502  * <tr><th>2&nbsp;&nbsp;&nbsp;&nbsp;</th>
 503  *     <td>{@code (A)}</td></tr>
 504  * <tr><th>3&nbsp;&nbsp;&nbsp;&nbsp;</th>
 505  *     <td>{@code (B(C))}</td></tr>
 506  * <tr><th>4&nbsp;&nbsp;&nbsp;&nbsp;</th>
 507  *     <td>{@code (C)}</td></tr>

 508  * </table></blockquote>
 509  *
 510  * <p> Group zero always stands for the entire expression.
 511  *
 512  * <p> Capturing groups are so named because, during a match, each subsequence
 513  * of the input sequence that matches such a group is saved.  The captured
 514  * subsequence may be used later in the expression, via a back reference, and
 515  * may also be retrieved from the matcher once the match operation is complete.
 516  *
 517  * <h4><a id="groupname">Group name</a></h4>
 518  * <p>A capturing group can also be assigned a "name", a {@code named-capturing group},
 519  * and then be back-referenced later by the "name". Group names are composed of
 520  * the following characters. The first character must be a {@code letter}.
 521  *
 522  * <ul>
 523  *   <li> The uppercase letters {@code 'A'} through {@code 'Z'}
 524  *        (<code>'\u0041'</code>&nbsp;through&nbsp;<code>'\u005a'</code>),
 525  *   <li> The lowercase letters {@code 'a'} through {@code 'z'}
 526  *        (<code>'\u0061'</code>&nbsp;through&nbsp;<code>'\u007a'</code>),
 527  *   <li> The digits {@code '0'} through {@code '9'}


 624  *   <li> Ideographic
 625  *   <li> Letter
 626  *   <li> Lowercase
 627  *   <li> Uppercase
 628  *   <li> Titlecase
 629  *   <li> Punctuation
 630  *   <li> Control
 631  *   <li> White_Space
 632  *   <li> Digit
 633  *   <li> Hex_Digit
 634  *   <li> Join_Control
 635  *   <li> Noncharacter_Code_Point
 636  *   <li> Assigned
 637  * </ul>
 638  * <p>
 639  * The following <b>Predefined Character classes</b> and <b>POSIX character classes</b>
 640  * are in conformance with the recommendation of <i>Annex C: Compatibility Properties</i>
 641  * of <a href="http://www.unicode.org/reports/tr18/"><i>Unicode Regular Expressions
 642  * </i></a>, when the {@link #UNICODE_CHARACTER_CLASS} flag is specified.
 643  *
 644  * <table border="0" cellpadding="1" cellspacing="0"
 645  *  summary="predefined and posix character classes in Unicode mode">

 646  * <tr style="text-align:left">
 647  * <th style="text-align:left" id="predef_classes">Classes</th>
 648  * <th style="text-align:left" id="predef_matches">Matches</th>
 649  *</tr>


 650  * <tr><td>{@code \p{Lower}}</td>
 651  *     <td>A lowercase character: {@code \p{IsLowercase}}</td></tr>
 652  * <tr><td>{@code \p{Upper}}</td>
 653  *     <td>An uppercase character: {@code \p{IsUppercase}}</td></tr>
 654  * <tr><td>{@code \p{ASCII}}</td>
 655  *     <td>All ASCII: {@code [\x00-\x7F]}</td></tr>
 656  * <tr><td>{@code \p{Alpha}}</td>
 657  *     <td>An alphabetic character: {@code \p{IsAlphabetic}}</td></tr>
 658  * <tr><td>{@code \p{Digit}}</td>
 659  *     <td>A decimal digit character: {@code \p{IsDigit}}</td></tr>
 660  * <tr><td>{@code \p{Alnum}}</td>
 661  *     <td>An alphanumeric character: {@code [\p{IsAlphabetic}\p{IsDigit}]}</td></tr>
 662  * <tr><td>{@code \p{Punct}}</td>
 663  *     <td>A punctuation character: {@code \p{IsPunctuation}}</td></tr>
 664  * <tr><td>{@code \p{Graph}}</td>
 665  *     <td>A visible character: {@code [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]}</td></tr>
 666  * <tr><td>{@code \p{Print}}</td>
 667  *     <td>A printable character: {@code [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]}</td></tr>
 668  * <tr><td>{@code \p{Blank}}</td>
 669  *     <td>A space or a tab: {@code [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]}</td></tr>
 670  * <tr><td>{@code \p{Cntrl}}</td>
 671  *     <td>A control character: {@code \p{gc=Cc}}</td></tr>
 672  * <tr><td>{@code \p{XDigit}}</td>
 673  *     <td>A hexadecimal digit: {@code [\p{gc=Nd}\p{IsHex_Digit}]}</td></tr>
 674  * <tr><td>{@code \p{Space}}</td>
 675  *     <td>A whitespace character: {@code \p{IsWhite_Space}}</td></tr>
 676  * <tr><td>{@code \d}</td>
 677  *     <td>A digit: {@code \p{IsDigit}}</td></tr>
 678  * <tr><td>{@code \D}</td>
 679  *     <td>A non-digit: {@code [^\d]}</td></tr>
 680  * <tr><td>{@code \s}</td>
 681  *     <td>A whitespace character: {@code \p{IsWhite_Space}}</td></tr>
 682  * <tr><td>{@code \S}</td>
 683  *     <td>A non-whitespace character: {@code [^\s]}</td></tr>
 684  * <tr><td>{@code \w}</td>
 685  *     <td>A word character: {@code [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]}</td></tr>
 686  * <tr><td>{@code \W}</td>
 687  *     <td>A non-word character: {@code [^\w]}</td></tr>

 688  * </table>
 689  * <p>
 690  * <a id="jcc">
 691  * Categories that behave like the java.lang.Character
 692  * boolean is<i>methodname</i> methods (except for the deprecated ones) are
 693  * available through the same <code>\p{</code><i>prop</i><code>}</code> syntax where
 694  * the specified property has the name <code>java<i>methodname</i></code></a>.
 695  *
 696  * <h3> Comparison to Perl 5 </h3>
 697  *
 698  * <p>The {@code Pattern} engine performs traditional NFA-based matching
 699  * with ordered alternation as occurs in Perl 5.
 700  *
 701  * <p> Perl constructs not supported by this class: </p>
 702  *
 703  * <ul>
 704  *    <li><p> The backreference constructs, <code>\g{</code><i>n</i><code>}</code> for
 705  *    the <i>n</i><sup>th</sup> <a href="#cg">capturing group</a> and
 706  *    <code>\g{</code><i>name</i><code>}</code> for
 707  *    <a href="#groupname">named-capturing group</a>.


1190      *
1191      * <p> When there is a positive-width match at the beginning of the input
1192      * sequence then an empty leading substring is included at the beginning
1193      * of the resulting array. A zero-width match at the beginning, however,
1194      * never produces such an empty leading substring.
1195      *
1196      * <p> The {@code limit} parameter controls the number of times the
1197      * pattern is applied and therefore affects the length of the resulting
1198      * array.  If the limit <i>n</i> is greater than zero then the pattern
1199      * will be applied at most <i>n</i>&nbsp;-&nbsp;1 times, the array's
1200      * length will be no greater than <i>n</i>, and the array's last entry
1201      * will contain all input beyond the last matched delimiter.  If <i>n</i>
1202      * is non-positive then the pattern will be applied as many times as
1203      * possible and the array can have any length.  If <i>n</i> is zero then
1204      * the pattern will be applied as many times as possible, the array can
1205      * have any length, and trailing empty strings will be discarded.
1206      *
1207      * <p> The input {@code "boo:and:foo"}, for example, yields the following
1208      * results with these parameters:
1209      *
1210      * <blockquote><table cellpadding=1 cellspacing=0
1211      *              summary="Split examples showing regex, limit, and result">

1212      * <tr><th style="text-align:left"><i>Regex&nbsp;&nbsp;&nbsp;&nbsp;</i></th>
1213      *     <th style="text-align:left"><i>Limit&nbsp;&nbsp;&nbsp;&nbsp;</i></th>
1214      *     <th style="text-align:left"><i>Result&nbsp;&nbsp;&nbsp;&nbsp;</i></th></tr>


1215      * <tr><td style="text-align:center">:</td>
1216      *     <td style="text-align:center">2</td>
1217      *     <td>{@code { "boo", "and:foo" }}</td></tr>
1218      * <tr><td style="text-align:center">:</td>
1219      *     <td style="text-align:center">5</td>
1220      *     <td>{@code { "boo", "and", "foo" }}</td></tr>
1221      * <tr><td style="text-align:center">:</td>
1222      *     <td style="text-align:center">-2</td>
1223      *     <td>{@code { "boo", "and", "foo" }}</td></tr>
1224      * <tr><td style="text-align:center">o</td>
1225      *     <td style="text-align:center">5</td>
1226      *     <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1227      * <tr><td style="text-align:center">o</td>
1228      *     <td style="text-align:center">-2</td>
1229      *     <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1230      * <tr><td style="text-align:center">o</td>
1231      *     <td style="text-align:center">0</td>
1232      *     <td>{@code { "b", "", ":and:f" }}</td></tr>

1233      * </table></blockquote>
1234      *
1235      * @param  input
1236      *         The character sequence to be split
1237      *
1238      * @param  limit
1239      *         The result threshold, as described above
1240      *
1241      * @return  The array of strings computed by splitting the input
1242      *          around matches of this pattern
1243      */
1244     public String[] split(CharSequence input, int limit) {
1245         int index = 0;
1246         boolean matchLimited = limit > 0;
1247         ArrayList<String> matchList = new ArrayList<>();
1248         Matcher m = matcher(input);
1249 
1250         // Add segments before each match found
1251         while(m.find()) {
1252             if (!matchLimited || matchList.size() < limit - 1) {


1277         // Construct result
1278         int resultSize = matchList.size();
1279         if (limit == 0)
1280             while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
1281                 resultSize--;
1282         String[] result = new String[resultSize];
1283         return matchList.subList(0, resultSize).toArray(result);
1284     }
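
The effect of the `limit` parameter described in the Javadoc above can be checked with a short, self-contained sketch. The class name `SplitLimitDemo` and its `split` helper are hypothetical names introduced only for illustration; the inputs and expected results are the rows of the table in the comment.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Pattern.java.
final class SplitLimitDemo {

    // Illustrative helper: compiles the regex and splits the input with the given limit.
    static String[] split(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";

        // Positive limit: the pattern is applied at most limit - 1 times,
        // so the last entry holds the rest of the input.
        System.out.println(Arrays.toString(split(":", input, 2)));   // [boo, and:foo]

        // Negative limit: applied as often as possible, trailing empty strings kept.
        System.out.println(Arrays.toString(split("o", input, -2)));  // [b, , :and:f, , ]

        // Zero limit: applied as often as possible, trailing empty strings discarded.
        System.out.println(Arrays.toString(split("o", input, 0)));   // [b, , :and:f]
    }
}
```

A limit larger than the number of possible pieces, such as 5 with the regex `:`, simply yields every piece.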
1285 
1286     /**
1287      * Splits the given input sequence around matches of this pattern.
1288      *
1289      * <p> This method works as if by invoking the two-argument {@link
1290      * #split(java.lang.CharSequence, int) split} method with the given input
1291      * sequence and a limit argument of zero.  Trailing empty strings are
1292      * therefore not included in the resulting array. </p>
1293      *
1294      * <p> The input {@code "boo:and:foo"}, for example, yields the following
1295      * results with these expressions:
1296      *
1297      * <blockquote><table cellpadding=1 cellspacing=0
1298      *              summary="Split examples showing regex and result">

1299      * <tr><th style="text-align:left"><i>Regex&nbsp;&nbsp;&nbsp;&nbsp;</i></th>
1300      *     <th style="text-align:left"><i>Result</i></th></tr>


1301      * <tr><td style="text-align:center">:</td>
1302      *     <td>{@code { "boo", "and", "foo" }}</td></tr>
1303      * <tr><td style="text-align:center">o</td>
1304      *     <td>{@code { "b", "", ":and:f" }}</td></tr>

1305      * </table></blockquote>
1306      *
1307      *
1308      * @param  input
1309      *         The character sequence to be split
1310      *
1311      * @return  The array of strings computed by splitting the input
1312      *          around matches of this pattern
1313      */
1314     public String[] split(CharSequence input) {
1315         return split(input, 0);
1316     }
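
As a rough illustration of the delegation above, the following sketch compares the one-argument form with an explicit limit of zero. It also shows the leading-substring rule from the two-argument Javadoc. `SplitDefaultDemo` and its `split` helper are hypothetical names, and the inputs `":a"` and `"abc"` are assumptions chosen for the example, not taken from the source.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Pattern.java.
final class SplitDefaultDemo {

    // Illustrative helper: one-argument split, which behaves like split(input, 0).
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";

        // Trailing empty strings are dropped, exactly as with a limit of zero.
        String[] oneArg = split("o", input);
        String[] zeroLimit = Pattern.compile("o").split(input, 0);
        System.out.println(Arrays.toString(oneArg));            // [b, , :and:f]
        System.out.println(Arrays.equals(oneArg, zeroLimit));   // true

        // A positive-width match at the start yields an empty leading substring.
        System.out.println(Arrays.toString(split(":", ":a")));  // [, a]

        // A zero-width match at the start does not.
        System.out.println(Arrays.toString(split("", "abc")));  // [a, b, c]
    }
}
```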
1317 
1318     /**
1319      * Returns a literal pattern {@code String} for the specified
1320      * {@code String}.
1321      *
1322      * <p>This method produces a {@code String} that can be used to
1323      * create a {@code Pattern} that would match the string
1324      * {@code s} as if it were a literal pattern.</p> Metacharacters




  62  * boolean b = m.{@link Matcher#matches matches}();</pre></blockquote>
  63  *
  64  * <p> A {@link #matches matches} method is defined by this class as a
  65  * convenience for when a regular expression is used just once.  This method
  66  * compiles an expression and matches an input sequence against it in a single
  67  * invocation.  The statement
  68  *
  69  * <blockquote><pre>
  70  * boolean b = Pattern.matches("a*b", "aaaaab");</pre></blockquote>
  71  *
  72  * is equivalent to the three statements above, though for repeated matches it
  73  * is less efficient since it does not allow the compiled pattern to be reused.
  74  *
  75  * <p> Instances of this class are immutable and are safe for use by multiple
  76  * concurrent threads.  Instances of the {@link Matcher} class are not safe for
  77  * such use.
  78  *
  79  *
  80  * <h3><a id="sum">Summary of regular-expression constructs</a></h3>
  81  *
  82  * <table class="borderless">
  83  * <caption style="display:none">Regular expression constructs, and what they match</caption>
  84  * <thead>
  85  * <tr style="text-align:left">
  86  * <th style="text-align:left" id="construct">Construct</th>
  87  * <th style="text-align:left" id="matches">Matches</th>
  88  * </tr>
  89  * </thead>
  90  * <tbody>
  91  *
  92  * <tr><th>&nbsp;</th></tr>
  93  * <tr style="text-align:left"><th colspan="2" id="characters">Characters</th></tr>
  94  *
  95  * <tr><td style="vertical-align:top" headers="construct characters"><i>x</i></td>
  96  *     <td headers="matches">The character <i>x</i></td></tr>
  97  * <tr><td style="vertical-align:top" headers="construct characters">{@code \\}</td>
  98  *     <td headers="matches">The backslash character</td></tr>
  99  * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>n</i></td>
 100  *     <td headers="matches">The character with octal value {@code 0}<i>n</i>
 101  *         (0&nbsp;{@code <=}&nbsp;<i>n</i>&nbsp;{@code <=}&nbsp;7)</td></tr>
 102  * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>nn</i></td>
 103  *     <td headers="matches">The character with octal value {@code 0}<i>nn</i>
 104  *         (0&nbsp;{@code <=}&nbsp;<i>n</i>&nbsp;{@code <=}&nbsp;7)</td></tr>
 105  * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>mnn</i></td>
 106  *     <td headers="matches">The character with octal value {@code 0}<i>mnn</i>
 107  *         (0&nbsp;{@code <=}&nbsp;<i>m</i>&nbsp;{@code <=}&nbsp;3,
 108  *         0&nbsp;{@code <=}&nbsp;<i>n</i>&nbsp;{@code <=}&nbsp;7)</td></tr>
 109  * <tr><td style="vertical-align:top" headers="construct characters">{@code \x}<i>hh</i></td>
 110  *     <td headers="matches">The character with hexadecimal&nbsp;value&nbsp;{@code 0x}<i>hh</i></td></tr>


 368  *     <td headers="matches">Nothing, but turns match flags <a href="#CASE_INSENSITIVE">i</a>
 369  * <a href="#UNIX_LINES">d</a> <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a>
 370  * <a href="#UNICODE_CASE">u</a> <a href="#COMMENTS">x</a> <a href="#UNICODE_CHARACTER_CLASS">U</a>
 371  * on - off</td></tr>
 372  * <tr><td style="vertical-align:top" headers="construct special"><code>(?idmsux-idmsux:</code><i>X</i>{@code )}&nbsp;&nbsp;</td>
 373  *     <td headers="matches"><i>X</i>, as a <a href="#cg">non-capturing group</a> with the
 374  *         given flags <a href="#CASE_INSENSITIVE">i</a> <a href="#UNIX_LINES">d</a>
 375  * <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a> <a href="#UNICODE_CASE">u</a>
 376  * <a href="#COMMENTS">x</a> on - off</td></tr>
 377  * <tr><td style="vertical-align:top" headers="construct special">{@code (?=}<i>X</i>{@code )}</td>
 378  *     <td headers="matches"><i>X</i>, via zero-width positive lookahead</td></tr>
 379  * <tr><td style="vertical-align:top" headers="construct special">{@code (?!}<i>X</i>{@code )}</td>
 380  *     <td headers="matches"><i>X</i>, via zero-width negative lookahead</td></tr>
 381  * <tr><td style="vertical-align:top" headers="construct special">{@code (?<=}<i>X</i>{@code )}</td>
 382  *     <td headers="matches"><i>X</i>, via zero-width positive lookbehind</td></tr>
 383  * <tr><td style="vertical-align:top" headers="construct special">{@code (?<!}<i>X</i>{@code )}</td>
 384  *     <td headers="matches"><i>X</i>, via zero-width negative lookbehind</td></tr>
 385  * <tr><td style="vertical-align:top" headers="construct special">{@code (?>}<i>X</i>{@code )}</td>
 386  *     <td headers="matches"><i>X</i>, as an independent, non-capturing group</td></tr>
 387  *
 388  * </tbody>
 389  * </table>
 390  *
 391  * <hr>
 392  *
 393  *
 394  * <h3><a id="bs">Backslashes, escapes, and quoting</a></h3>
 395  *
 396  * <p> The backslash character ({@code '\'}) serves to introduce escaped
 397  * constructs, as defined in the table above, as well as to quote characters
 398  * that otherwise would be interpreted as unescaped constructs.  Thus the
 399  * expression {@code \\} matches a single backslash and <code>\{</code> matches a
 400  * left brace.
 401  *
 402  * <p> It is an error to use a backslash prior to any alphabetic character that
 403  * does not denote an escaped construct; these are reserved for future
 404  * extensions to the regular-expression language.  A backslash may be used
 405  * prior to a non-alphabetic character regardless of whether that character is
 406  * part of an unescaped construct.
 407  *
 408  * <p> Backslashes within string literals in Java source code are interpreted


 415  * <code>"\b"</code>, for example, matches a single backspace character when
 416  * interpreted as a regular expression, while {@code "\\b"} matches a
 417  * word boundary.  The string literal {@code "\(hello\)"} is illegal
 418  * and leads to a compile-time error; in order to match the string
 419  * {@code (hello)} the string literal {@code "\\(hello\\)"}
 420  * must be used.
 421  *
 422  * <h3><a id="cc">Character Classes</a></h3>
 423  *
 424  *    <p> Character classes may appear within other character classes, and
 425  *    may be composed by the union operator (implicit) and the intersection
 426  *    operator ({@code &&}).
 427  *    The union operator denotes a class that contains every character that is
 428  *    in at least one of its operand classes.  The intersection operator
 429  *    denotes a class that contains every character that is in both of its
 430  *    operand classes.
 431  *
 432  *    <p> The precedence of character-class operators is as follows, from
 433  *    highest to lowest:
 434  *
 435  *    <blockquote><table>
 436  *      <caption style="display:none">Precedence of character class operators.</caption>
 437  *      <tbody>
 438  *      <tr><th>1&nbsp;&nbsp;&nbsp;&nbsp;</th>
 439  *        <td>Literal escape&nbsp;&nbsp;&nbsp;&nbsp;</td>
 440  *        <td>{@code \x}</td></tr>
 441  *     <tr><th>2&nbsp;&nbsp;&nbsp;&nbsp;</th>
 442  *        <td>Grouping</td>
 443  *        <td>{@code [...]}</td></tr>
 444  *     <tr><th>3&nbsp;&nbsp;&nbsp;&nbsp;</th>
 445  *        <td>Range</td>
 446  *        <td>{@code a-z}</td></tr>
 447  *      <tr><th>4&nbsp;&nbsp;&nbsp;&nbsp;</th>
 448  *        <td>Union</td>
 449  *        <td>{@code [a-e][i-u]}</td></tr>
 450  *      <tr><th>5&nbsp;&nbsp;&nbsp;&nbsp;</th>
 451  *        <td>Intersection</td>
 452  *        <td>{@code [a-z&&[aeiou]]}</td></tr>
 453  *      </tbody>
 454  *    </table></blockquote>
 455  *
 456  *    <p> Note that a different set of metacharacters is in effect inside
 457  *    a character class than outside a character class. For instance, the
 458  *    regular expression {@code .} loses its special meaning inside a
 459  *    character class, while the expression {@code -} becomes a
 460  *    range-forming metacharacter.
 461  *
 462  * <h3><a id="lt">Line terminators</a></h3>
 463  *
 464  * <p> A <i>line terminator</i> is a one- or two-character sequence that marks
 465  * the end of a line of the input character sequence.  The following are
 466  * recognized as line terminators:
 467  *
 468  * <ul>
 469  *
 470  *   <li> A newline (line feed) character&nbsp;({@code '\n'}),
 471  *
 472  *   <li> A carriage-return character followed immediately by a newline
 473  *   character&nbsp;({@code "\r\n"}),


 484  * <p>If {@link #UNIX_LINES} mode is activated, then the only line terminators
 485  * recognized are newline characters.
 486  *
 487  * <p> The regular expression {@code .} matches any character except a line
 488  * terminator unless the {@link #DOTALL} flag is specified.
 489  *
 490  * <p> By default, the regular expressions {@code ^} and {@code $} ignore
 491  * line terminators and only match at the beginning and the end, respectively,
 492  * of the entire input sequence. If {@link #MULTILINE} mode is activated then
 493  * {@code ^} matches at the beginning of input and after any line terminator
 494  * except at the end of input. When in {@link #MULTILINE} mode {@code $}
 495  * matches just before a line terminator or the end of the input sequence.
 496  *
 497  * <h3><a id="cg">Groups and capturing</a></h3>
 498  *
 499  * <h4><a id="gnumber">Group number</a></h4>
 500  * <p> Capturing groups are numbered by counting their opening parentheses from
 501  * left to right.  In the expression {@code ((A)(B(C)))}, for example, there
 502  * are four such groups: </p>
 503  *
 504  * <blockquote><table>
 505  * <caption style="display:none">Capturing group numberings</caption>
 506  * <tbody>
 507  * <tr><th>1&nbsp;&nbsp;&nbsp;&nbsp;</th>
 508  *     <td>{@code ((A)(B(C)))}</td></tr>
 509  * <tr><th>2&nbsp;&nbsp;&nbsp;&nbsp;</th>
 510  *     <td>{@code (A)}</td></tr>
 511  * <tr><th>3&nbsp;&nbsp;&nbsp;&nbsp;</th>
 512  *     <td>{@code (B(C))}</td></tr>
 513  * <tr><th>4&nbsp;&nbsp;&nbsp;&nbsp;</th>
 514  *     <td>{@code (C)}</td></tr>
 515  * </tbody>
 516  * </table></blockquote>
 517  *
 518  * <p> Group zero always stands for the entire expression.
 519  *
 520  * <p> Capturing groups are so named because, during a match, each subsequence
 521  * of the input sequence that matches such a group is saved.  The captured
 522  * subsequence may be used later in the expression, via a back reference, and
 523  * may also be retrieved from the matcher once the match operation is complete.
 524  *
 525  * <h4><a id="groupname">Group name</a></h4>
 526  * <p>A capturing group can also be assigned a "name", a {@code named-capturing group},
 527  * and then be back-referenced later by the "name". Group names are composed of
 528  * the following characters. The first character must be a {@code letter}.
 529  *
 530  * <ul>
 531  *   <li> The uppercase letters {@code 'A'} through {@code 'Z'}
 532  *        (<code>'\u0041'</code>&nbsp;through&nbsp;<code>'\u005a'</code>),
 533  *   <li> The lowercase letters {@code 'a'} through {@code 'z'}
 534  *        (<code>'\u0061'</code>&nbsp;through&nbsp;<code>'\u007a'</code>),
 535  *   <li> The digits {@code '0'} through {@code '9'}


 632  *   <li> Ideographic
 633  *   <li> Letter
 634  *   <li> Lowercase
 635  *   <li> Uppercase
 636  *   <li> Titlecase
 637  *   <li> Punctuation
 638  *   <li> Control
 639  *   <li> White_Space
 640  *   <li> Digit
 641  *   <li> Hex_Digit
 642  *   <li> Join_Control
 643  *   <li> Noncharacter_Code_Point
 644  *   <li> Assigned
 645  * </ul>
 646  * <p>
 647  * The following <b>Predefined Character classes</b> and <b>POSIX character classes</b>
 648  * are in conformance with the recommendation of <i>Annex C: Compatibility Properties</i>
 649  * of <a href="http://www.unicode.org/reports/tr18/"><i>Unicode Regular
 650  * Expressions</i></a>, when the {@link #UNICODE_CHARACTER_CLASS} flag is specified.
 651  *
 652  * <table>
 653  * <caption style="display:none">Predefined and POSIX character classes in Unicode mode</caption>
 654  * <thead>
 655  * <tr style="text-align:left">
 656  * <th style="text-align:left" id="predef_classes">Classes</th>
 657  * <th style="text-align:left" id="predef_matches">Matches</th>
 658  * </tr>
 659  * </thead>
 660  * <tbody>
 661  * <tr><td>{@code \p{Lower}}</td>
 662  *     <td>A lowercase character: {@code \p{IsLowercase}}</td></tr>
 663  * <tr><td>{@code \p{Upper}}</td>
 664  *     <td>An uppercase character: {@code \p{IsUppercase}}</td></tr>
 665  * <tr><td>{@code \p{ASCII}}</td>
 666  *     <td>All ASCII: {@code [\x00-\x7F]}</td></tr>
 667  * <tr><td>{@code \p{Alpha}}</td>
 668  *     <td>An alphabetic character: {@code \p{IsAlphabetic}}</td></tr>
 669  * <tr><td>{@code \p{Digit}}</td>
 670  *     <td>A decimal digit character: {@code \p{IsDigit}}</td></tr>
 671  * <tr><td>{@code \p{Alnum}}</td>
 672  *     <td>An alphanumeric character: {@code [\p{IsAlphabetic}\p{IsDigit}]}</td></tr>
 673  * <tr><td>{@code \p{Punct}}</td>
 674  *     <td>A punctuation character: {@code \p{IsPunctuation}}</td></tr>
 675  * <tr><td>{@code \p{Graph}}</td>
 676  *     <td>A visible character: {@code [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]}</td></tr>
 677  * <tr><td>{@code \p{Print}}</td>
 678  *     <td>A printable character: {@code [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]}</td></tr>
 679  * <tr><td>{@code \p{Blank}}</td>
 680  *     <td>A space or a tab: {@code [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]}</td></tr>
 681  * <tr><td>{@code \p{Cntrl}}</td>
 682  *     <td>A control character: {@code \p{gc=Cc}}</td></tr>
 683  * <tr><td>{@code \p{XDigit}}</td>
 684  *     <td>A hexadecimal digit: {@code [\p{gc=Nd}\p{IsHex_Digit}]}</td></tr>
 685  * <tr><td>{@code \p{Space}}</td>
 686  *     <td>A whitespace character: {@code \p{IsWhite_Space}}</td></tr>
 687  * <tr><td>{@code \d}</td>
 688  *     <td>A digit: {@code \p{IsDigit}}</td></tr>
 689  * <tr><td>{@code \D}</td>
 690  *     <td>A non-digit: {@code [^\d]}</td></tr>
 691  * <tr><td>{@code \s}</td>
 692  *     <td>A whitespace character: {@code \p{IsWhite_Space}}</td></tr>
 693  * <tr><td>{@code \S}</td>
 694  *     <td>A non-whitespace character: {@code [^\s]}</td></tr>
 695  * <tr><td>{@code \w}</td>
 696  *     <td>A word character: {@code [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]}</td></tr>
 697  * <tr><td>{@code \W}</td>
 698  *     <td>A non-word character: {@code [^\w]}</td></tr>
 699  * </tbody>
 700  * </table>
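A minimal sketch of how the table above changes matching, assuming the sample strings shown; the class `UnicodeClassDemo` and its helper `matches` are hypothetical names used only for illustration. It compares `\w` and `\d` with and without the `UNICODE_CHARACTER_CLASS` flag.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
final class UnicodeClassDemo {

    // Compiles the regex with the given flags and tests the whole input.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9";   // "ete" with acute accents (non-ASCII letters)
        String digit = "\u0663";         // ARABIC-INDIC DIGIT THREE

        // Without the flag, \w and \d cover ASCII only.
        System.out.println(matches("\\w+", word, 0));   // false
        System.out.println(matches("\\d", digit, 0));   // false

        // With the flag, \w and \d follow the Unicode definitions in the table.
        int u = Pattern.UNICODE_CHARACTER_CLASS;
        System.out.println(matches("\\w+", word, u));   // true
        System.out.println(matches("\\d", digit, u));   // true
    }
}
```

The same mode can also be switched on inside the expression with the embedded flag `(?U)`.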
 701  * <p>
 702  * <a id="jcc">
 703  * Categories that behave like the java.lang.Character
 704  * boolean is<i>methodname</i> methods (except for the deprecated ones) are
 705  * available through the same <code>\p{</code><i>prop</i><code>}</code> syntax where
 706  * the specified property has the name <code>java<i>methodname</i></code></a>.
 707  *
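As a small illustration of the `java`<i>methodname</i> categories just described, the sketch below uses `\p{javaLowerCase}`, which corresponds to `Character.isLowerCase`. The class `JavaCategoryDemo` and its helper `allLowerCase` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
final class JavaCategoryDemo {

    // True when every character of s satisfies Character.isLowerCase.
    static boolean allLowerCase(String s) {
        return Pattern.matches("\\p{javaLowerCase}+", s);
    }

    public static void main(String[] args) {
        System.out.println(allLowerCase("abc"));     // true
        System.out.println(allLowerCase("aBc"));     // false: 'B' is not lowercase
        System.out.println(allLowerCase("\u00e9"));  // true: not limited to ASCII
    }
}
```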
 708  * <h3> Comparison to Perl 5 </h3>
 709  *
 710  * <p>The {@code Pattern} engine performs traditional NFA-based matching
 711  * with ordered alternation as occurs in Perl 5.
 712  *
 713  * <p> Perl constructs not supported by this class: </p>
 714  *
 715  * <ul>
 716  *    <li><p> The backreference constructs, <code>\g{</code><i>n</i><code>}</code> for
 717  *    the <i>n</i><sup>th</sup> <a href="#cg">capturing group</a> and
 718  *    <code>\g{</code><i>name</i><code>}</code> for
 719  *    <a href="#groupname">named-capturing group</a>.


1202      *
1203      * <p> When there is a positive-width match at the beginning of the input
1204      * sequence then an empty leading substring is included at the beginning
1205      * of the resulting array. A zero-width match at the beginning, however,
1206      * never produces such an empty leading substring.
1207      *
1208      * <p> The {@code limit} parameter controls the number of times the
1209      * pattern is applied and therefore affects the length of the resulting
1210      * array.  If the limit <i>n</i> is greater than zero then the pattern
1211      * will be applied at most <i>n</i>&nbsp;-&nbsp;1 times, the array's
1212      * length will be no greater than <i>n</i>, and the array's last entry
1213      * will contain all input beyond the last matched delimiter.  If <i>n</i>
1214      * is negative then the pattern will be applied as many times as
1215      * possible and the array can have any length.  If <i>n</i> is zero then
1216      * the pattern will be applied as many times as possible, the array can
1217      * have any length, and trailing empty strings will be discarded.
1218      *
1219      * <p> The input {@code "boo:and:foo"}, for example, yields the following
1220      * results with these parameters:
1221      *
1222      * <blockquote><table>
1223      * <caption>Split examples showing regex, limit, and result</caption>
1224      * <thead>
1225      * <tr><th style="text-align:left"><i>Regex&nbsp;&nbsp;&nbsp;&nbsp;</i></th>
1226      *     <th style="text-align:left"><i>Limit&nbsp;&nbsp;&nbsp;&nbsp;</i></th>
1227      *     <th style="text-align:left"><i>Result&nbsp;&nbsp;&nbsp;&nbsp;</i></th></tr>
1228      * </thead>
1229      * <tbody>
1230      * <tr><td style="text-align:center">:</td>
1231      *     <td style="text-align:center">2</td>
1232      *     <td>{@code { "boo", "and:foo" }}</td></tr>
1233      * <tr><td style="text-align:center">:</td>
1234      *     <td style="text-align:center">5</td>
1235      *     <td>{@code { "boo", "and", "foo" }}</td></tr>
1236      * <tr><td style="text-align:center">:</td>
1237      *     <td style="text-align:center">-2</td>
1238      *     <td>{@code { "boo", "and", "foo" }}</td></tr>
1239      * <tr><td style="text-align:center">o</td>
1240      *     <td style="text-align:center">5</td>
1241      *     <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1242      * <tr><td style="text-align:center">o</td>
1243      *     <td style="text-align:center">-2</td>
1244      *     <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1245      * <tr><td style="text-align:center">o</td>
1246      *     <td style="text-align:center">0</td>
1247      *     <td>{@code { "b", "", ":and:f" }}</td></tr>
1248      * </tbody>
1249      * </table></blockquote>
1250      *
1251      * @param  input
1252      *         The character sequence to be split
1253      *
1254      * @param  limit
1255      *         The result threshold, as described above
1256      *
1257      * @return  The array of strings computed by splitting the input
1258      *          around matches of this pattern
1259      */
1260     public String[] split(CharSequence input, int limit) {
1261         int index = 0;
1262         boolean matchLimited = limit > 0;
1263         ArrayList<String> matchList = new ArrayList<>();
1264         Matcher m = matcher(input);
1265 
1266         // Add segments before each match found
1267         while(m.find()) {
1268             if (!matchLimited || matchList.size() < limit - 1) {


1293         // Construct result
1294         int resultSize = matchList.size();
1295         if (limit == 0)
1296             while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
1297                 resultSize--;
1298         String[] result = new String[resultSize];
1299         return matchList.subList(0, resultSize).toArray(result);
1300     }
1301 
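The effect of the `limit` argument can be checked with a short sketch that replays the rows of the table in the Javadoc above. The class `SplitLimitDemo` and its helper `split` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
final class SplitLimitDemo {

    // Thin wrapper around Pattern.split(CharSequence, int).
    static String[] split(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        String[] regexes = {":", "o"};
        int[] limits = {2, 5, -2, 0};
        for (String regex : regexes) {
            for (int limit : limits) {
                // A positive limit caps the array length, a negative limit keeps
                // trailing empty strings, and zero discards them.
                System.out.println(regex + " " + limit + " -> "
                        + Arrays.toString(split(regex, input, limit)));
            }
        }
    }
}
```

With the regex `o`, a limit of `-2` keeps the two trailing empty strings, while a limit of `0` drops them.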
1302     /**
1303      * Splits the given input sequence around matches of this pattern.
1304      *
1305      * <p> This method works as if by invoking the two-argument {@link
1306      * #split(java.lang.CharSequence, int) split} method with the given input
1307      * sequence and a limit argument of zero.  Trailing empty strings are
1308      * therefore not included in the resulting array. </p>
1309      *
1310      * <p> The input {@code "boo:and:foo"}, for example, yields the following
1311      * results with these expressions:
1312      *
1313      * <blockquote><table>
1314      * <caption style="display:none">Split examples showing regex and result</caption>
1315      * <thead>
1316      * <tr><th style="text-align:left"><i>Regex&nbsp;&nbsp;&nbsp;&nbsp;</i></th>
1317      *     <th style="text-align:left"><i>Result</i></th></tr>
1318      * </thead>
1319      * <tbody>
1320      * <tr><td style="text-align:center">:</td>
1321      *     <td>{@code { "boo", "and", "foo" }}</td></tr>
1322      * <tr><td style="text-align:center">o</td>
1323      *     <td>{@code { "b", "", ":and:f" }}</td></tr>
1324      * </tbody>
1325      * </table></blockquote>
1326      *
1327      *
1328      * @param  input
1329      *         The character sequence to be split
1330      *
1331      * @return  The array of strings computed by splitting the input
1332      *          around matches of this pattern
1333      */
1334     public String[] split(CharSequence input) {
1335         return split(input, 0);
1336     }
1337 
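Because the one-argument method delegates to `split(input, 0)`, the rules for leading and trailing empty strings documented for the two-argument method apply here as well. The sketch below illustrates them, assuming Java 8 or later; the class `SplitEdgeDemo`, its helper `split`, and the sample inputs are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
final class SplitEdgeDemo {

    // Thin wrapper around Pattern.split(CharSequence).
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // Positive-width match at the start: an empty leading string is kept.
        System.out.println(Arrays.toString(split(":", ":a:b")));

        // Zero-width match at the start (a lookahead): no empty leading string.
        System.out.println(Arrays.toString(split("(?=[A-Z])", "FooBar")));

        // Trailing empty strings are discarded because the limit is zero.
        System.out.println(Arrays.toString(split(":", "a:b::")));
    }
}
```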
1338     /**
1339      * Returns a literal pattern {@code String} for the specified
1340      * {@code String}.
1341      *
1342      * <p>This method produces a {@code String} that can be used to
1343      * create a {@code Pattern} that would match the string
1344      * {@code s} as if it were a literal pattern.</p> Metacharacters

