--- old/src/share/classes/java/util/regex/Pattern.java 2011-04-28 15:33:12.334988133 -0700
+++ new/src/share/classes/java/util/regex/Pattern.java 2011-04-28 15:33:11.999231406 -0700
@@ -206,13 +206,15 @@
* This class is in conformance with Level 1 of Unicode Technical
- * Standard #18: Unicode Regular Expression Guidelines, plus RL2.1
+ * Standard #18: Unicode Regular Expression, plus RL2.1
* Canonical Equivalents.
- *
- * Unicode escape sequences such as \u2014 in Java source code
+ *
+ * Unicode escape sequences such as \u2014 in Java source code
* are processed as described in section 3.3 of
* The Java™ Language Specification.
- * Such escape sequences are also
- * implemented directly by the regular-expression parser so that Unicode
- * escapes can be used in expressions that are read from files or from the
- * keyboard. Thus the strings "\u2014" and "\\u2014",
- * while not equal, compile into the same pattern, which matches the character
- * with hexadecimal value 0x2014.
- *
- * A Unicode character can also be represented in a regular-expression by
- * using its hexadecimal code point value directly as described in construct
+ * Such escape sequences are also implemented directly by the regular-expression
+ * parser so that Unicode escapes can be used in expressions that are read from
+ * files or from the keyboard. Thus the strings "\u2014" and
+ * "\\u2014", while not equal, compile into the same pattern, which
+ * matches the character with hexadecimal value 0x2014.
+ *
+ * A Unicode character can also be represented in a regular-expression by
+ * using its Hex notation (hexadecimal code point value) directly as described in construct
* \x{...}, for example a supplementary character U+2011F
* can be specified as \x{2011F}, instead of two consecutive
* Unicode escape sequences of the surrogate pair
* \uD840\uDD1F.
- *
- *
- * Unicode scripts, blocks and categories are written with the \p and
- * \P constructs as in Perl. \p{prop} matches if
+ *
+ * Unicode scripts, blocks, categories and binary properties are written with
+ * the \p and \P constructs as in Perl.
+ * \p{prop} matches if
* the input has the property prop, while \P{prop}
* does not match if the input has that property.
*
- * Scripts are specified either with the prefix {@code Is}, as in
+ * Scripts, blocks, categories and binary properties can be used both inside
+ * and outside of a character class.
+ *
+ *
+ * Scripts are specified either with the prefix {@code Is}, as in
* {@code IsHiragana}, or by using the {@code script} keyword (or its short
* form {@code sc}) as in {@code script=Hiragana} or {@code sc=Hiragana}.
*
- * Blocks are specified with the prefix {@code In}, as in
+ * The script names supported by Pattern are the valid script names
+ * accepted and defined by
+ * {@link java.lang.Character.UnicodeScript#forName(String) UnicodeScript.forName}.
+ *
+ * Blocks are specified with the prefix {@code In}, as in
* {@code InMongolian}, or by using the keyword {@code block} (or its short
* form {@code blk}) as in {@code block=Mongolian} or {@code blk=Mongolian}.
*
- * Categories may be specified with the optional prefix {@code Is}:
+ * The block names supported by Pattern are the valid block names
+ * accepted and defined by
+ * {@link java.lang.Character.UnicodeBlock#forName(String) UnicodeBlock.forName}.
+ *
+ * Categories may be specified with the optional prefix {@code Is}:
* Both {@code \p{L}} and {@code \p{IsL}} denote the category of Unicode
* letters. Same as scripts and blocks, categories can also be specified
* by using the keyword {@code general_category} (or its short form
* {@code gc}) as in {@code general_category=Lu} or {@code gc=Lu}.
*
- * Scripts, blocks and categories can be used both inside and outside of a
- * character class.
- * The supported categories are those of
+ * The supported categories are those of
*
* The Unicode Standard in the version specified by the
* {@link java.lang.Character Character} class. The category names are those
* defined in the Standard, both normative and informative.
- * The script names supported by Pattern are the valid script names
- * accepted and defined by
- * {@link java.lang.Character.UnicodeScript#forName(String) UnicodeScript.forName}.
- * The block names supported by Pattern are the valid block names
- * accepted and defined by
- * {@link java.lang.Character.UnicodeBlock#forName(String) UnicodeBlock.forName}.
- * Categories that behave like the java.lang.Character
+ *
+ * Binary properties are specified with the prefix {@code Is}, as in
+ * {@code IsAlphabetic}. The supported binary properties by Pattern
+ * are
+ *
+ * Predefined Character classes and POSIX character classes are in
+ * conformance with the recommendation of Annex C: Compatibility Properties
+ * of Unicode Regular Expression, when {@link #UNICODE_CHARACTER_CLASS} flag is specified.
+ *
+ *
+ *
+
+
+ *
Classes | Matches |
---|---|
\p{Lower} | A lowercase character: \p{IsLowercase} |
\p{Upper} | An uppercase character: \p{IsUppercase} |
\p{ASCII} | All ASCII: [\x00-\x7F] |
\p{Alpha} | An alphabetic character: \p{IsAlphabetic} |
\p{Digit} | A decimal digit character: \p{IsDigit} |
\p{Alnum} | An alphanumeric character: [\p{IsAlphabetic}\p{IsDigit}] |
\p{Punct} | A punctuation character: \p{IsPunctuation} |
\p{Graph} | A visible character: [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}] |
\p{Print} | A printable character: [\p{Graph}\p{Blank}&&[^\p{Cntrl}]] |
\p{Blank} | A space or a tab: [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]] |
\p{Cntrl} | A control character: \p{gc=Cc} |
\p{XDigit} | A hexadecimal digit: [\p{gc=Nd}\p{IsHex_Digit}] |
\p{Space} | A whitespace character: \p{IsWhite_Space} |
\d | A digit: \p{IsDigit} |
\D | A non-digit: [^\d] |
\s | A whitespace character: \p{IsWhite_Space} |
\S | A non-whitespace character: [^\s] |
\w | A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}] |
\W | A non-word character: [^\w] |
+ *
+ * Categories that behave like the java.lang.Character
* boolean ismethodname methods (except for the deprecated ones) are
* available through the same \p{prop} syntax where
* the specified property has the name javamethodname.
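The property syntax documented in this hunk can be exercised with a short program. The following is a minimal sketch, assuming a JDK that already contains this change (Java 7 or later); the class name UnicodePropertyDemo and the helper fullMatch are hypothetical and are not part of the patch.

```java
import java.util.regex.Pattern;

class UnicodePropertyDemo {

    /** Returns true if the whole input matches the given regular expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String hiraganaA = "\u3042";   // HIRAGANA LETTER A
        String mongolianA = "\u1820";  // MONGOLIAN LETTER A
        String supplementary = new String(Character.toChars(0x2011F));

        // Hex notation for a supplementary code point.
        System.out.println(fullMatch("\\x{2011F}", supplementary));       // true

        // Scripts: Is prefix, script keyword, or its short form sc.
        System.out.println(fullMatch("\\p{IsHiragana}", hiraganaA));      // true
        System.out.println(fullMatch("\\p{script=Hiragana}", hiraganaA)); // true
        System.out.println(fullMatch("\\p{sc=Hiragana}", hiraganaA));     // true

        // Blocks: In prefix, block keyword, or its short form blk.
        System.out.println(fullMatch("\\p{InMongolian}", mongolianA));    // true
        System.out.println(fullMatch("\\p{blk=Mongolian}", mongolianA));  // true

        // Categories: optional Is prefix, or the gc keyword.
        System.out.println(fullMatch("\\p{Lu}", "A"));                    // true
        System.out.println(fullMatch("\\p{IsLu}", "A"));                  // true
        System.out.println(fullMatch("\\p{gc=Lu}", "A"));                 // true

        // Binary properties: Is prefix; \P negates the property.
        System.out.println(fullMatch("\\p{IsAlphabetic}", "a"));          // true
        System.out.println(fullMatch("\\P{IsAlphabetic}", "1"));          // true

        // Properties can also appear inside a character class.
        System.out.println(fullMatch("[\\p{IsHiragana}\\p{Lu}]+", hiraganaA + "A")); // true
    }
}
```

The helper uses Matcher.matches(), so each call checks the entire input rather than searching for a substring.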
@@ -796,6 +878,28 @@
*/
public static final int CANON_EQ = 0x80;
+ /**
+ * Enables the Unicode version of Predefined character classes and
+ * POSIX character classes.
+ *
+ * When this flag is specified then the (US-ASCII only)
+ * Predefined character classes and POSIX character classes
+ * are in conformance with
+ * Unicode Technical
+ * Standard #18: Unicode Regular Expression
+ * Annex C: Compatibility Properties.
+ *
+ * The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded
+ * flag expression (?U).
+ *
+ * The flag implies UNICODE_CASE, that is, it enables Unicode-aware case
+ * folding.
+ *
+ * Specifying this flag may impose a performance penalty.
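The effect of the new flag can be illustrated with a small comparison against the default (US-ASCII only) behavior. This is a minimal sketch, assuming a JDK that includes this change (Java 7 or later); the class name UnicodeCharacterClassDemo and the helper fullMatch are hypothetical.

```java
import java.util.regex.Pattern;

class UnicodeCharacterClassDemo {

    /** Compiles the regex with the given flags and checks the whole input. */
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";       // LATIN SMALL LETTER E WITH ACUTE
        String eAcuteUpper = "\u00C9";  // LATIN CAPITAL LETTER E WITH ACUTE
        String arabicThree = "\u0663";  // ARABIC-INDIC DIGIT THREE
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;

        // Predefined classes: ASCII only by default, Unicode-aware with the flag.
        System.out.println(fullMatch("\\w", 0, eAcute));            // false
        System.out.println(fullMatch("\\w", unicode, eAcute));      // true
        System.out.println(fullMatch("\\d", 0, arabicThree));       // false
        System.out.println(fullMatch("\\d", unicode, arabicThree)); // true

        // POSIX classes follow the same rule.
        System.out.println(fullMatch("\\p{Lower}", 0, eAcute));       // false
        System.out.println(fullMatch("\\p{Lower}", unicode, eAcute)); // true

        // The embedded flag expression (?U) enables the same mode.
        System.out.println(fullMatch("(?U)\\w", 0, eAcute));        // true

        // The flag implies UNICODE_CASE, so case-insensitive matching
        // folds non-ASCII letters as well.
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(fullMatch(eAcute, ci, eAcuteUpper));           // false
        System.out.println(fullMatch(eAcute, ci | unicode, eAcuteUpper)); // true
    }
}
```

Because the documentation warns of a possible performance penalty, a reasonable design choice is to enable the flag only for patterns that actually need non-ASCII class membership.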