62 * boolean b = m.{@link Matcher#matches matches}();</pre></blockquote>
63 *
64 * <p> A {@link #matches matches} method is defined by this class as a
65 * convenience for when a regular expression is used just once. This method
66 * compiles an expression and matches an input sequence against it in a single
67 * invocation. The statement
68 *
69 * <blockquote><pre>
70 * boolean b = Pattern.matches("a*b", "aaaaab");</pre></blockquote>
71 *
72 * is equivalent to the three statements above, though for repeated matches it
73 * is less efficient since it does not allow the compiled pattern to be reused.
74 *
75 * <p> Instances of this class are immutable and are safe for use by multiple
76 * concurrent threads. Instances of the {@link Matcher} class are not safe for
77 * such use.
78 *
79 *
80 * <h3><a id="sum">Summary of regular-expression constructs</a></h3>
81 *
82 * <table border="0" cellpadding="1" cellspacing="0"
83 * summary="Regular expression constructs, and what they match">
84 *
85 * <tr style="text-align:left">
86 * <th style="text-align:left" id="construct">Construct</th>
87 * <th style="text-align:left" id="matches">Matches</th>
88 * </tr>
89 *
90 * <tr><th> </th></tr>
91 * <tr style="text-align:left"><th colspan="2" id="characters">Characters</th></tr>
92 *
93 * <tr><td style="vertical-align:top" headers="construct characters"><i>x</i></td>
94 * <td headers="matches">The character <i>x</i></td></tr>
95 * <tr><td style="vertical-align:top" headers="construct characters">{@code \\}</td>
96 * <td headers="matches">The backslash character</td></tr>
97 * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>n</i></td>
98 * <td headers="matches">The character with octal value {@code 0}<i>n</i>
99 * (0 {@code <=} <i>n</i> {@code <=} 7)</td></tr>
100 * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>nn</i></td>
101 * <td headers="matches">The character with octal value {@code 0}<i>nn</i>
102 * (0 {@code <=} <i>n</i> {@code <=} 7)</td></tr>
103 * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>mnn</i></td>
104 * <td headers="matches">The character with octal value {@code 0}<i>mnn</i>
105 * (0 {@code <=} <i>m</i> {@code <=} 3,
106 * 0 {@code <=} <i>n</i> {@code <=} 7)</td></tr>
107 * <tr><td style="vertical-align:top" headers="construct characters">{@code \x}<i>hh</i></td>
108 * <td headers="matches">The character with hexadecimal value {@code 0x}<i>hh</i></td></tr>
366 * <td headers="matches">Nothing, but turns match flags <a href="#CASE_INSENSITIVE">i</a>
367 * <a href="#UNIX_LINES">d</a> <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a>
368 * <a href="#UNICODE_CASE">u</a> <a href="#COMMENTS">x</a> <a href="#UNICODE_CHARACTER_CLASS">U</a>
369 * on - off</td></tr>
370 * <tr><td style="vertical-align:top" headers="construct special"><code>(?idmsux-idmsux:</code><i>X</i>{@code )} </td>
371 * <td headers="matches"><i>X</i>, as a <a href="#cg">non-capturing group</a> with the
372 * given flags <a href="#CASE_INSENSITIVE">i</a> <a href="#UNIX_LINES">d</a>
373 * <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a> <a href="#UNICODE_CASE">u</a >
374 * <a href="#COMMENTS">x</a> on - off</td></tr>
375 * <tr><td style="vertical-align:top" headers="construct special">{@code (?=}<i>X</i>{@code )}</td>
376 * <td headers="matches"><i>X</i>, via zero-width positive lookahead</td></tr>
377 * <tr><td style="vertical-align:top" headers="construct special">{@code (?!}<i>X</i>{@code )}</td>
378 * <td headers="matches"><i>X</i>, via zero-width negative lookahead</td></tr>
379 * <tr><td style="vertical-align:top" headers="construct special">{@code (?<=}<i>X</i>{@code )}</td>
380 * <td headers="matches"><i>X</i>, via zero-width positive lookbehind</td></tr>
381 * <tr><td style="vertical-align:top" headers="construct special">{@code (?<!}<i>X</i>{@code )}</td>
382 * <td headers="matches"><i>X</i>, via zero-width negative lookbehind</td></tr>
383 * <tr><td style="vertical-align:top" headers="construct special">{@code (?>}<i>X</i>{@code )}</td>
384 * <td headers="matches"><i>X</i>, as an independent, non-capturing group</td></tr>
385 *
386 * </table>
387 *
388 * <hr>
389 *
390 *
391 * <h3><a id="bs">Backslashes, escapes, and quoting</a></h3>
392 *
393 * <p> The backslash character ({@code '\'}) serves to introduce escaped
394 * constructs, as defined in the table above, as well as to quote characters
395 * that otherwise would be interpreted as unescaped constructs. Thus the
396 * expression {@code \\} matches a single backslash and <code>\{</code> matches a
397 * left brace.
398 *
399 * <p> It is an error to use a backslash prior to any alphabetic character that
400 * does not denote an escaped construct; these are reserved for future
401 * extensions to the regular-expression language. A backslash may be used
402 * prior to a non-alphabetic character regardless of whether that character is
403 * part of an unescaped construct.
404 *
405 * <p> Backslashes within string literals in Java source code are interpreted
412 * <code>"\b"</code>, for example, matches a single backspace character when
413 * interpreted as a regular expression, while {@code "\\b"} matches a
414 * word boundary. The string literal {@code "\(hello\)"} is illegal
415 * and leads to a compile-time error; in order to match the string
416 * {@code (hello)} the string literal {@code "\\(hello\\)"}
417 * must be used.
418 *
419 * <h3><a id="cc">Character Classes</a></h3>
420 *
421 * <p> Character classes may appear within other character classes, and
422 * may be composed by the union operator (implicit) and the intersection
423 * operator ({@code &&}).
424 * The union operator denotes a class that contains every character that is
425 * in at least one of its operand classes. The intersection operator
426 * denotes a class that contains every character that is in both of its
427 * operand classes.
428 *
429 * <p> The precedence of character-class operators is as follows, from
430 * highest to lowest:
431 *
432 * <blockquote><table border="0" cellpadding="1" cellspacing="0"
433 * summary="Precedence of character class operators.">
434 * <tr><th>1 </th>
435 * <td>Literal escape </td>
436 * <td>{@code \x}</td></tr>
437 * <tr><th>2 </th>
438 * <td>Grouping</td>
439 * <td>{@code [...]}</td></tr>
440 * <tr><th>3 </th>
441 * <td>Range</td>
442 * <td>{@code a-z}</td></tr>
443 * <tr><th>4 </th>
444 * <td>Union</td>
445 * <td>{@code [a-e][i-u]}</td></tr>
446 * <tr><th>5 </th>
447 * <td>Intersection</td>
448 * <td>{@code [a-z&&[aeiou]]}</td></tr>
449 * </table></blockquote>
450 *
451 * <p> Note that a different set of metacharacters are in effect inside
452 * a character class than outside a character class. For instance, the
453 * regular expression {@code .} loses its special meaning inside a
454 * character class, while the expression {@code -} becomes a range
455 * forming metacharacter.
456 *
457 * <h3><a id="lt">Line terminators</a></h3>
458 *
459 * <p> A <i>line terminator</i> is a one- or two-character sequence that marks
460 * the end of a line of the input character sequence. The following are
461 * recognized as line terminators:
462 *
463 * <ul>
464 *
465 * <li> A newline (line feed) character ({@code '\n'}),
466 *
467 * <li> A carriage-return character followed immediately by a newline
468 * character ({@code "\r\n"}),
479 * <p>If {@link #UNIX_LINES} mode is activated, then the only line terminators
480 * recognized are newline characters.
481 *
482 * <p> The regular expression {@code .} matches any character except a line
483 * terminator unless the {@link #DOTALL} flag is specified.
484 *
485 * <p> By default, the regular expressions {@code ^} and {@code $} ignore
486 * line terminators and only match at the beginning and the end, respectively,
487 * of the entire input sequence. If {@link #MULTILINE} mode is activated then
488 * {@code ^} matches at the beginning of input and after any line terminator
489 * except at the end of input. When in {@link #MULTILINE} mode {@code $}
490 * matches just before a line terminator or the end of the input sequence.
491 *
492 * <h3><a id="cg">Groups and capturing</a></h3>
493 *
494 * <h4><a id="gnumber">Group number</a></h4>
495 * <p> Capturing groups are numbered by counting their opening parentheses from
496 * left to right. In the expression {@code ((A)(B(C)))}, for example, there
497 * are four such groups: </p>
498 *
499 * <blockquote><table cellpadding=1 cellspacing=0 summary="Capturing group numberings">
500 * <tr><th>1 </th>
501 * <td>{@code ((A)(B(C)))}</td></tr>
502 * <tr><th>2 </th>
503 * <td>{@code (A)}</td></tr>
504 * <tr><th>3 </th>
505 * <td>{@code (B(C))}</td></tr>
506 * <tr><th>4 </th>
507 * <td>{@code (C)}</td></tr>
508 * </table></blockquote>
509 *
510 * <p> Group zero always stands for the entire expression.
511 *
512 * <p> Capturing groups are so named because, during a match, each subsequence
513 * of the input sequence that matches such a group is saved. The captured
514 * subsequence may be used later in the expression, via a back reference, and
515 * may also be retrieved from the matcher once the match operation is complete.
516 *
517 * <h4><a id="groupname">Group name</a></h4>
518 * <p>A capturing group can also be assigned a "name", a {@code named-capturing group},
519 * and then be back-referenced later by the "name". Group names are composed of
520 * the following characters. The first character must be a {@code letter}.
521 *
522 * <ul>
523 * <li> The uppercase letters {@code 'A'} through {@code 'Z'}
524 * (<code>'\u0041'</code> through <code>'\u005a'</code>),
525 * <li> The lowercase letters {@code 'a'} through {@code 'z'}
526 * (<code>'\u0061'</code> through <code>'\u007a'</code>),
527 * <li> The digits {@code '0'} through {@code '9'}
624 * <li> Ideographic
625 * <li> Letter
626 * <li> Lowercase
627 * <li> Uppercase
628 * <li> Titlecase
629 * <li> Punctuation
630 * <Li> Control
631 * <li> White_Space
632 * <li> Digit
633 * <li> Hex_Digit
634 * <li> Join_Control
635 * <li> Noncharacter_Code_Point
636 * <li> Assigned
637 * </ul>
638 * <p>
639 * The following <b>Predefined Character classes</b> and <b>POSIX character classes</b>
640 * are in conformance with the recommendation of <i>Annex C: Compatibility Properties</i>
641 * of <a href="http://www.unicode.org/reports/tr18/"><i>Unicode Regular Expression
642 * </i></a>, when {@link #UNICODE_CHARACTER_CLASS} flag is specified.
643 *
644 * <table border="0" cellpadding="1" cellspacing="0"
645 * summary="predefined and posix character classes in Unicode mode">
646 * <tr style="text-align:left">
647 * <th style="text-align:left" id="predef_classes">Classes</th>
648 * <th style="text-align:left" id="predef_matches">Matches</th>
649 *</tr>
650 * <tr><td>{@code \p{Lower}}</td>
651 * <td>A lowercase character:{@code \p{IsLowercase}}</td></tr>
652 * <tr><td>{@code \p{Upper}}</td>
653 * <td>An uppercase character:{@code \p{IsUppercase}}</td></tr>
654 * <tr><td>{@code \p{ASCII}}</td>
655 * <td>All ASCII:{@code [\x00-\x7F]}</td></tr>
656 * <tr><td>{@code \p{Alpha}}</td>
657 * <td>An alphabetic character:{@code \p{IsAlphabetic}}</td></tr>
658 * <tr><td>{@code \p{Digit}}</td>
659 * <td>A decimal digit character:{@code p{IsDigit}}</td></tr>
660 * <tr><td>{@code \p{Alnum}}</td>
661 * <td>An alphanumeric character:{@code [\p{IsAlphabetic}\p{IsDigit}]}</td></tr>
662 * <tr><td>{@code \p{Punct}}</td>
663 * <td>A punctuation character:{@code p{IsPunctuation}}</td></tr>
664 * <tr><td>{@code \p{Graph}}</td>
665 * <td>A visible character: {@code [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]}</td></tr>
666 * <tr><td>{@code \p{Print}}</td>
667 * <td>A printable character: {@code [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]}</td></tr>
668 * <tr><td>{@code \p{Blank}}</td>
669 * <td>A space or a tab: {@code [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]}</td></tr>
670 * <tr><td>{@code \p{Cntrl}}</td>
671 * <td>A control character: {@code \p{gc=Cc}}</td></tr>
672 * <tr><td>{@code \p{XDigit}}</td>
673 * <td>A hexadecimal digit: {@code [\p{gc=Nd}\p{IsHex_Digit}]}</td></tr>
674 * <tr><td>{@code \p{Space}}</td>
675 * <td>A whitespace character:{@code \p{IsWhite_Space}}</td></tr>
676 * <tr><td>{@code \d}</td>
677 * <td>A digit: {@code \p{IsDigit}}</td></tr>
678 * <tr><td>{@code \D}</td>
679 * <td>A non-digit: {@code [^\d]}</td></tr>
680 * <tr><td>{@code \s}</td>
681 * <td>A whitespace character: {@code \p{IsWhite_Space}}</td></tr>
682 * <tr><td>{@code \S}</td>
683 * <td>A non-whitespace character: {@code [^\s]}</td></tr>
684 * <tr><td>{@code \w}</td>
685 * <td>A word character: {@code [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]}</td></tr>
686 * <tr><td>{@code \W}</td>
687 * <td>A non-word character: {@code [^\w]}</td></tr>
688 * </table>
689 * <p>
690 * <a id="jcc">
691 * Categories that behave like the java.lang.Character
692 * boolean is<i>methodname</i> methods (except for the deprecated ones) are
693 * available through the same <code>\p{</code><i>prop</i><code>}</code> syntax where
694 * the specified property has the name <code>java<i>methodname</i></code></a>.
695 *
696 * <h3> Comparison to Perl 5 </h3>
697 *
698 * <p>The {@code Pattern} engine performs traditional NFA-based matching
699 * with ordered alternation as occurs in Perl 5.
700 *
701 * <p> Perl constructs not supported by this class: </p>
702 *
703 * <ul>
704 * <li><p> The backreference constructs, <code>\g{</code><i>n</i><code>}</code> for
705 * the <i>n</i><sup>th</sup><a href="#cg">capturing group</a> and
706 * <code>\g{</code><i>name</i><code>}</code> for
707 * <a href="#groupname">named-capturing group</a>.
1190 *
1191 * <p> When there is a positive-width match at the beginning of the input
1192 * sequence then an empty leading substring is included at the beginning
1193 * of the resulting array. A zero-width match at the beginning however
1194 * never produces such empty leading substring.
1195 *
1196 * <p> The {@code limit} parameter controls the number of times the
1197 * pattern is applied and therefore affects the length of the resulting
1198 * array. If the limit <i>n</i> is greater than zero then the pattern
1199 * will be applied at most <i>n</i> - 1 times, the array's
1200 * length will be no greater than <i>n</i>, and the array's last entry
1201 * will contain all input beyond the last matched delimiter. If <i>n</i>
1202 * is non-positive then the pattern will be applied as many times as
1203 * possible and the array can have any length. If <i>n</i> is zero then
1204 * the pattern will be applied as many times as possible, the array can
1205 * have any length, and trailing empty strings will be discarded.
1206 *
1207 * <p> The input {@code "boo:and:foo"}, for example, yields the following
1208 * results with these parameters:
1209 *
1210 * <blockquote><table cellpadding=1 cellspacing=0
1211 * summary="Split examples showing regex, limit, and result">
1212 * <tr><th style="text-align:left"><i>Regex </i></th>
1213 * <th style="text-align:left"><i>Limit </i></th>
1214 * <th style="text-align:left"><i>Result </i></th></tr>
1215 * <tr><td style="text-align:center">:</td>
1216 * <td style="text-align:center">2</td>
1217 * <td>{@code { "boo", "and:foo" }}</td></tr>
1218 * <tr><td style="text-align:center">:</td>
1219 * <td style="text-align:center">5</td>
1220 * <td>{@code { "boo", "and", "foo" }}</td></tr>
1221 * <tr><td style="text-align:center">:</td>
1222 * <td style="text-align:center">-2</td>
1223 * <td>{@code { "boo", "and", "foo" }}</td></tr>
1224 * <tr><td style="text-align:center">o</td>
1225 * <td style="text-align:center">5</td>
1226 * <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1227 * <tr><td style="text-align:center">o</td>
1228 * <td style="text-align:center">-2</td>
1229 * <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1230 * <tr><td style="text-align:center">o</td>
1231 * <td style="text-align:center">0</td>
1232 * <td>{@code { "b", "", ":and:f" }}</td></tr>
1233 * </table></blockquote>
1234 *
1235 * @param input
1236 * The character sequence to be split
1237 *
1238 * @param limit
1239 * The result threshold, as described above
1240 *
1241 * @return The array of strings computed by splitting the input
1242 * around matches of this pattern
1243 */
1244 public String[] split(CharSequence input, int limit) {
1245 int index = 0;
1246 boolean matchLimited = limit > 0;
1247 ArrayList<String> matchList = new ArrayList<>();
1248 Matcher m = matcher(input);
1249
1250 // Add segments before each match found
1251 while(m.find()) {
1252 if (!matchLimited || matchList.size() < limit - 1) {
1277 // Construct result
1278 int resultSize = matchList.size();
1279 if (limit == 0)
1280 while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
1281 resultSize--;
1282 String[] result = new String[resultSize];
1283 return matchList.subList(0, resultSize).toArray(result);
1284 }
1285
1286 /**
1287 * Splits the given input sequence around matches of this pattern.
1288 *
1289 * <p> This method works as if by invoking the two-argument {@link
1290 * #split(java.lang.CharSequence, int) split} method with the given input
1291 * sequence and a limit argument of zero. Trailing empty strings are
1292 * therefore not included in the resulting array. </p>
1293 *
1294 * <p> The input {@code "boo:and:foo"}, for example, yields the following
1295 * results with these expressions:
1296 *
1297 * <blockquote><table cellpadding=1 cellspacing=0
1298 * summary="Split examples showing regex and result">
1299 * <tr><th style="text-align:left"><i>Regex </i></th>
1300 * <th style="text-align:left"><i>Result</i></th></tr>
1301 * <tr><td style="text-align:center">:</td>
1302 * <td>{@code { "boo", "and", "foo" }}</td></tr>
1303 * <tr><td style="text-align:center">o</td>
1304 * <td>{@code { "b", "", ":and:f" }}</td></tr>
1305 * </table></blockquote>
1306 *
1307 *
1308 * @param input
1309 * The character sequence to be split
1310 *
1311 * @return The array of strings computed by splitting the input
1312 * around matches of this pattern
1313 */
1314 public String[] split(CharSequence input) {
1315 return split(input, 0);
1316 }
1317
1318 /**
1319 * Returns a literal pattern {@code String} for the specified
1320 * {@code String}.
1321 *
1322 * <p>This method produces a {@code String} that can be used to
1323 * create a {@code Pattern} that would match the string
1324 * {@code s} as if it were a literal pattern.</p> Metacharacters
|
62 * boolean b = m.{@link Matcher#matches matches}();</pre></blockquote>
63 *
64 * <p> A {@link #matches matches} method is defined by this class as a
65 * convenience for when a regular expression is used just once. This method
66 * compiles an expression and matches an input sequence against it in a single
67 * invocation. The statement
68 *
69 * <blockquote><pre>
70 * boolean b = Pattern.matches("a*b", "aaaaab");</pre></blockquote>
71 *
72 * is equivalent to the three statements above, though for repeated matches it
73 * is less efficient since it does not allow the compiled pattern to be reused.
74 *
75 * <p> Instances of this class are immutable and are safe for use by multiple
76 * concurrent threads. Instances of the {@link Matcher} class are not safe for
77 * such use.
78 *
79 *
80 * <h3><a id="sum">Summary of regular-expression constructs</a></h3>
81 *
82 * <table class="borderless">
83 * <caption style="display:none">Regular expression constructs, and what they match</caption>
84 * <thead>
85 * <tr style="text-align:left">
86 * <th style="text-align:left" id="construct">Construct</th>
87 * <th style="text-align:left" id="matches">Matches</th>
88 * </tr>
89 * </thead>
90 * <tbody>
91 *
92 * <tr><th> </th></tr>
93 * <tr style="text-align:left"><th colspan="2" id="characters">Characters</th></tr>
94 *
95 * <tr><td style="vertical-align:top" headers="construct characters"><i>x</i></td>
96 * <td headers="matches">The character <i>x</i></td></tr>
97 * <tr><td style="vertical-align:top" headers="construct characters">{@code \\}</td>
98 * <td headers="matches">The backslash character</td></tr>
99 * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>n</i></td>
100 * <td headers="matches">The character with octal value {@code 0}<i>n</i>
101 * (0 {@code <=} <i>n</i> {@code <=} 7)</td></tr>
102 * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>nn</i></td>
103 * <td headers="matches">The character with octal value {@code 0}<i>nn</i>
104 * (0 {@code <=} <i>n</i> {@code <=} 7)</td></tr>
105 * <tr><td style="vertical-align:top" headers="construct characters">{@code \0}<i>mnn</i></td>
106 * <td headers="matches">The character with octal value {@code 0}<i>mnn</i>
107 * (0 {@code <=} <i>m</i> {@code <=} 3,
108 * 0 {@code <=} <i>n</i> {@code <=} 7)</td></tr>
109 * <tr><td style="vertical-align:top" headers="construct characters">{@code \x}<i>hh</i></td>
110 * <td headers="matches">The character with hexadecimal value {@code 0x}<i>hh</i></td></tr>
368 * <td headers="matches">Nothing, but turns match flags <a href="#CASE_INSENSITIVE">i</a>
369 * <a href="#UNIX_LINES">d</a> <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a>
370 * <a href="#UNICODE_CASE">u</a> <a href="#COMMENTS">x</a> <a href="#UNICODE_CHARACTER_CLASS">U</a>
371 * on - off</td></tr>
372 * <tr><td style="vertical-align:top" headers="construct special"><code>(?idmsux-idmsux:</code><i>X</i>{@code )} </td>
373 * <td headers="matches"><i>X</i>, as a <a href="#cg">non-capturing group</a> with the
374 * given flags <a href="#CASE_INSENSITIVE">i</a> <a href="#UNIX_LINES">d</a>
375 * <a href="#MULTILINE">m</a> <a href="#DOTALL">s</a> <a href="#UNICODE_CASE">u</a >
376 * <a href="#COMMENTS">x</a> on - off</td></tr>
377 * <tr><td style="vertical-align:top" headers="construct special">{@code (?=}<i>X</i>{@code )}</td>
378 * <td headers="matches"><i>X</i>, via zero-width positive lookahead</td></tr>
379 * <tr><td style="vertical-align:top" headers="construct special">{@code (?!}<i>X</i>{@code )}</td>
380 * <td headers="matches"><i>X</i>, via zero-width negative lookahead</td></tr>
381 * <tr><td style="vertical-align:top" headers="construct special">{@code (?<=}<i>X</i>{@code )}</td>
382 * <td headers="matches"><i>X</i>, via zero-width positive lookbehind</td></tr>
383 * <tr><td style="vertical-align:top" headers="construct special">{@code (?<!}<i>X</i>{@code )}</td>
384 * <td headers="matches"><i>X</i>, via zero-width negative lookbehind</td></tr>
385 * <tr><td style="vertical-align:top" headers="construct special">{@code (?>}<i>X</i>{@code )}</td>
386 * <td headers="matches"><i>X</i>, as an independent, non-capturing group</td></tr>
387 *
388 * </tbody>
389 * </table>
390 *
391 * <hr>
392 *
393 *
394 * <h3><a id="bs">Backslashes, escapes, and quoting</a></h3>
395 *
396 * <p> The backslash character ({@code '\'}) serves to introduce escaped
397 * constructs, as defined in the table above, as well as to quote characters
398 * that otherwise would be interpreted as unescaped constructs. Thus the
399 * expression {@code \\} matches a single backslash and <code>\{</code> matches a
400 * left brace.
401 *
402 * <p> It is an error to use a backslash prior to any alphabetic character that
403 * does not denote an escaped construct; these are reserved for future
404 * extensions to the regular-expression language. A backslash may be used
405 * prior to a non-alphabetic character regardless of whether that character is
406 * part of an unescaped construct.
407 *
408 * <p> Backslashes within string literals in Java source code are interpreted
415 * <code>"\b"</code>, for example, matches a single backspace character when
416 * interpreted as a regular expression, while {@code "\\b"} matches a
417 * word boundary. The string literal {@code "\(hello\)"} is illegal
418 * and leads to a compile-time error; in order to match the string
419 * {@code (hello)} the string literal {@code "\\(hello\\)"}
420 * must be used.
421 *
422 * <h3><a id="cc">Character Classes</a></h3>
423 *
424 * <p> Character classes may appear within other character classes, and
425 * may be composed by the union operator (implicit) and the intersection
426 * operator ({@code &&}).
427 * The union operator denotes a class that contains every character that is
428 * in at least one of its operand classes. The intersection operator
429 * denotes a class that contains every character that is in both of its
430 * operand classes.
431 *
432 * <p> The precedence of character-class operators is as follows, from
433 * highest to lowest:
434 *
435 * <blockquote><table>
436 * <caption style="display:none">Precedence of character class operators.</caption>
437 * <tbody>
438 * <tr><th>1 </th>
439 * <td>Literal escape </td>
440 * <td>{@code \x}</td></tr>
441 * <tr><th>2 </th>
442 * <td>Grouping</td>
443 * <td>{@code [...]}</td></tr>
444 * <tr><th>3 </th>
445 * <td>Range</td>
446 * <td>{@code a-z}</td></tr>
447 * <tr><th>4 </th>
448 * <td>Union</td>
449 * <td>{@code [a-e][i-u]}</td></tr>
450 * <tr><th>5 </th>
451 * <td>Intersection</td>
452 * <td>{@code [a-z&&[aeiou]]}</td></tr>
453 * </tbody>
454 * </table></blockquote>
455 *
456 * <p> Note that a different set of metacharacters are in effect inside
457 * a character class than outside a character class. For instance, the
458 * regular expression {@code .} loses its special meaning inside a
459 * character class, while the expression {@code -} becomes a range
460 * forming metacharacter.
461 *
462 * <h3><a id="lt">Line terminators</a></h3>
463 *
464 * <p> A <i>line terminator</i> is a one- or two-character sequence that marks
465 * the end of a line of the input character sequence. The following are
466 * recognized as line terminators:
467 *
468 * <ul>
469 *
470 * <li> A newline (line feed) character ({@code '\n'}),
471 *
472 * <li> A carriage-return character followed immediately by a newline
473 * character ({@code "\r\n"}),
484 * <p>If {@link #UNIX_LINES} mode is activated, then the only line terminators
485 * recognized are newline characters.
486 *
487 * <p> The regular expression {@code .} matches any character except a line
488 * terminator unless the {@link #DOTALL} flag is specified.
489 *
490 * <p> By default, the regular expressions {@code ^} and {@code $} ignore
491 * line terminators and only match at the beginning and the end, respectively,
492 * of the entire input sequence. If {@link #MULTILINE} mode is activated then
493 * {@code ^} matches at the beginning of input and after any line terminator
494 * except at the end of input. When in {@link #MULTILINE} mode {@code $}
495 * matches just before a line terminator or the end of the input sequence.
496 *
497 * <h3><a id="cg">Groups and capturing</a></h3>
498 *
499 * <h4><a id="gnumber">Group number</a></h4>
500 * <p> Capturing groups are numbered by counting their opening parentheses from
501 * left to right. In the expression {@code ((A)(B(C)))}, for example, there
502 * are four such groups: </p>
503 *
504 * <blockquote><table>
505 * <caption style="display:none">Capturing group numberings</caption>
506 * <tbody>
507 * <tr><th>1 </th>
508 * <td>{@code ((A)(B(C)))}</td></tr>
509 * <tr><th>2 </th>
510 * <td>{@code (A)}</td></tr>
511 * <tr><th>3 </th>
512 * <td>{@code (B(C))}</td></tr>
513 * <tr><th>4 </th>
514 * <td>{@code (C)}</td></tr>
515 * </tbody>
516 * </table></blockquote>
517 *
518 * <p> Group zero always stands for the entire expression.
519 *
520 * <p> Capturing groups are so named because, during a match, each subsequence
521 * of the input sequence that matches such a group is saved. The captured
522 * subsequence may be used later in the expression, via a back reference, and
523 * may also be retrieved from the matcher once the match operation is complete.
524 *
525 * <h4><a id="groupname">Group name</a></h4>
526 * <p>A capturing group can also be assigned a "name", a {@code named-capturing group},
527 * and then be back-referenced later by the "name". Group names are composed of
528 * the following characters. The first character must be a {@code letter}.
529 *
530 * <ul>
531 * <li> The uppercase letters {@code 'A'} through {@code 'Z'}
532 * (<code>'\u0041'</code> through <code>'\u005a'</code>),
533 * <li> The lowercase letters {@code 'a'} through {@code 'z'}
534 * (<code>'\u0061'</code> through <code>'\u007a'</code>),
535 * <li> The digits {@code '0'} through {@code '9'}
632 * <li> Ideographic
633 * <li> Letter
634 * <li> Lowercase
635 * <li> Uppercase
636 * <li> Titlecase
637 * <li> Punctuation
638 * <Li> Control
639 * <li> White_Space
640 * <li> Digit
641 * <li> Hex_Digit
642 * <li> Join_Control
643 * <li> Noncharacter_Code_Point
644 * <li> Assigned
645 * </ul>
646 * <p>
647 * The following <b>Predefined Character classes</b> and <b>POSIX character classes</b>
648 * are in conformance with the recommendation of <i>Annex C: Compatibility Properties</i>
649 * of <a href="http://www.unicode.org/reports/tr18/"><i>Unicode Regular Expression
650 * </i></a>, when {@link #UNICODE_CHARACTER_CLASS} flag is specified.
651 *
652 * <table>
653 * <caption style="display:none">predefined and posix character classes in Unicode mode</caption>
654 * <thead>
655 * <tr style="text-align:left">
656 * <th style="text-align:left" id="predef_classes">Classes</th>
657 * <th style="text-align:left" id="predef_matches">Matches</th>
658 * </tr>
659 * </thead>
660 * <tbody>
661 * <tr><td>{@code \p{Lower}}</td>
662 * <td>A lowercase character:{@code \p{IsLowercase}}</td></tr>
663 * <tr><td>{@code \p{Upper}}</td>
664 * <td>An uppercase character:{@code \p{IsUppercase}}</td></tr>
665 * <tr><td>{@code \p{ASCII}}</td>
666 * <td>All ASCII:{@code [\x00-\x7F]}</td></tr>
667 * <tr><td>{@code \p{Alpha}}</td>
668 * <td>An alphabetic character:{@code \p{IsAlphabetic}}</td></tr>
669 * <tr><td>{@code \p{Digit}}</td>
670 * <td>A decimal digit character:{@code p{IsDigit}}</td></tr>
671 * <tr><td>{@code \p{Alnum}}</td>
672 * <td>An alphanumeric character:{@code [\p{IsAlphabetic}\p{IsDigit}]}</td></tr>
673 * <tr><td>{@code \p{Punct}}</td>
674 * <td>A punctuation character:{@code p{IsPunctuation}}</td></tr>
675 * <tr><td>{@code \p{Graph}}</td>
676 * <td>A visible character: {@code [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]}</td></tr>
677 * <tr><td>{@code \p{Print}}</td>
678 * <td>A printable character: {@code [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]}</td></tr>
679 * <tr><td>{@code \p{Blank}}</td>
680 * <td>A space or a tab: {@code [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]}</td></tr>
681 * <tr><td>{@code \p{Cntrl}}</td>
682 * <td>A control character: {@code \p{gc=Cc}}</td></tr>
683 * <tr><td>{@code \p{XDigit}}</td>
684 * <td>A hexadecimal digit: {@code [\p{gc=Nd}\p{IsHex_Digit}]}</td></tr>
685 * <tr><td>{@code \p{Space}}</td>
686 * <td>A whitespace character:{@code \p{IsWhite_Space}}</td></tr>
687 * <tr><td>{@code \d}</td>
688 * <td>A digit: {@code \p{IsDigit}}</td></tr>
689 * <tr><td>{@code \D}</td>
690 * <td>A non-digit: {@code [^\d]}</td></tr>
691 * <tr><td>{@code \s}</td>
692 * <td>A whitespace character: {@code \p{IsWhite_Space}}</td></tr>
693 * <tr><td>{@code \S}</td>
694 * <td>A non-whitespace character: {@code [^\s]}</td></tr>
695 * <tr><td>{@code \w}</td>
696 * <td>A word character: {@code [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]}</td></tr>
697 * <tr><td>{@code \W}</td>
698 * <td>A non-word character: {@code [^\w]}</td></tr>
699 * </tbody>
700 * </table>
701 * <p>
702 * <a id="jcc">
703 * Categories that behave like the java.lang.Character
704 * boolean is<i>methodname</i> methods (except for the deprecated ones) are
705 * available through the same <code>\p{</code><i>prop</i><code>}</code> syntax where
706 * the specified property has the name <code>java<i>methodname</i></code></a>.
707 *
708 * <h3> Comparison to Perl 5 </h3>
709 *
710 * <p>The {@code Pattern} engine performs traditional NFA-based matching
711 * with ordered alternation as occurs in Perl 5.
712 *
713 * <p> Perl constructs not supported by this class: </p>
714 *
715 * <ul>
716 * <li><p> The backreference constructs, <code>\g{</code><i>n</i><code>}</code> for
717 * the <i>n</i><sup>th</sup><a href="#cg">capturing group</a> and
718 * <code>\g{</code><i>name</i><code>}</code> for
719 * <a href="#groupname">named-capturing group</a>.
1202 *
1203 * <p> When there is a positive-width match at the beginning of the input
1204 * sequence then an empty leading substring is included at the beginning
1205 * of the resulting array. A zero-width match at the beginning however
1206 * never produces such empty leading substring.
1207 *
1208 * <p> The {@code limit} parameter controls the number of times the
1209 * pattern is applied and therefore affects the length of the resulting
1210 * array. If the limit <i>n</i> is greater than zero then the pattern
1211 * will be applied at most <i>n</i> - 1 times, the array's
1212 * length will be no greater than <i>n</i>, and the array's last entry
1213 * will contain all input beyond the last matched delimiter. If <i>n</i>
1214 * is non-positive then the pattern will be applied as many times as
1215 * possible and the array can have any length. If <i>n</i> is zero then
1216 * the pattern will be applied as many times as possible, the array can
1217 * have any length, and trailing empty strings will be discarded.
1218 *
1219 * <p> The input {@code "boo:and:foo"}, for example, yields the following
1220 * results with these parameters:
1221 *
1222 * <blockquote><table>
1223 * <caption>Split examples showing regex, limit, and result</caption>
1224 * <thead>
1225 * <tr><th style="text-align:left"><i>Regex </i></th>
1226 * <th style="text-align:left"><i>Limit </i></th>
1227 * <th style="text-align:left"><i>Result </i></th></tr>
1228 * </thead>
1229 * <tbody>
1230 * <tr><td style="text-align:center">:</td>
1231 * <td style="text-align:center">2</td>
1232 * <td>{@code { "boo", "and:foo" }}</td></tr>
1233 * <tr><td style="text-align:center">:</td>
1234 * <td style="text-align:center">5</td>
1235 * <td>{@code { "boo", "and", "foo" }}</td></tr>
1236 * <tr><td style="text-align:center">:</td>
1237 * <td style="text-align:center">-2</td>
1238 * <td>{@code { "boo", "and", "foo" }}</td></tr>
1239 * <tr><td style="text-align:center">o</td>
1240 * <td style="text-align:center">5</td>
1241 * <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1242 * <tr><td style="text-align:center">o</td>
1243 * <td style="text-align:center">-2</td>
1244 * <td>{@code { "b", "", ":and:f", "", "" }}</td></tr>
1245 * <tr><td style="text-align:center">o</td>
1246 * <td style="text-align:center">0</td>
1247 * <td>{@code { "b", "", ":and:f" }}</td></tr>
1248 * </tbody>
1249 * </table></blockquote>
1250 *
1251 * @param input
1252 * The character sequence to be split
1253 *
1254 * @param limit
1255 * The result threshold, as described above
1256 *
1257 * @return The array of strings computed by splitting the input
1258 * around matches of this pattern
1259 */
1260 public String[] split(CharSequence input, int limit) {
1261 int index = 0;
1262 boolean matchLimited = limit > 0;
1263 ArrayList<String> matchList = new ArrayList<>();
1264 Matcher m = matcher(input);
1265
1266 // Add segments before each match found
1267 while(m.find()) {
1268 if (!matchLimited || matchList.size() < limit - 1) {
1293 // Construct result
1294 int resultSize = matchList.size();
1295 if (limit == 0)
1296 while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
1297 resultSize--;
1298 String[] result = new String[resultSize];
1299 return matchList.subList(0, resultSize).toArray(result);
1300 }
1301
1302 /**
1303 * Splits the given input sequence around matches of this pattern.
1304 *
1305 * <p> This method works as if by invoking the two-argument {@link
1306 * #split(java.lang.CharSequence, int) split} method with the given input
1307 * sequence and a limit argument of zero. Trailing empty strings are
1308 * therefore not included in the resulting array. </p>
1309 *
1310 * <p> The input {@code "boo:and:foo"}, for example, yields the following
1311 * results with these expressions:
1312 *
1313 * <blockquote><table>
1314 * <caption style="display:none">Split examples showing regex and result</caption>
1315 * <thead>
1316 * <tr><th style="text-align:left"><i>Regex </i></th>
1317 * <th style="text-align:left"><i>Result</i></th></tr>
1318 * </thead>
1319 * <tbody>
1320 * <tr><td style="text-align:center">:</td>
1321 * <td>{@code { "boo", "and", "foo" }}</td></tr>
1322 * <tr><td style="text-align:center">o</td>
1323 * <td>{@code { "b", "", ":and:f" }}</td></tr>
1324 * </tbody>
1325 * </table></blockquote>
1326 *
1327 *
1328 * @param input
1329 * The character sequence to be split
1330 *
1331 * @return The array of strings computed by splitting the input
1332 * around matches of this pattern
1333 */
1334 public String[] split(CharSequence input) {
1335 return split(input, 0);
1336 }
1337
1338 /**
1339 * Returns a literal pattern {@code String} for the specified
1340 * {@code String}.
1341 *
1342 * <p>This method produces a {@code String} that can be used to
1343 * create a {@code Pattern} that would match the string
1344 * {@code s} as if it were a literal pattern.</p> Metacharacters
|