--- old/src/share/classes/java/lang/Character.java 2011-04-28 01:11:08.363069538 -0700
+++ new/src/share/classes/java/lang/Character.java 2011-04-28 01:11:07.959019313 -0700
@@ -59,14 +59,14 @@
*
* The {@code char} data type (and therefore the value that a
* {@code Character} object encapsulates) are based on the
* original Unicode specification, which defined characters as
- * fixed-width 16-bit entities. The Unicode standard has since been
+ * fixed-width 16-bit entities. The Unicode Standard has since been
* changed to allow for characters whose representation requires more
* than 16 bits. The range of legal code points is now
* U+0000 to U+10FFFF, known as Unicode scalar value.
* (Refer to the
* definition of the U+n notation in the Unicode
- * standard.)
+ * Standard.)
*
* The set of characters from U+0000 to U+FFFF is
* sometimes referred to as the Basic Multilingual Plane (BMP).
@@ -5200,7 +5200,8 @@
*
* A character is lowercase if its general category type, provided
* by {@code Character.getType(ch)}, is
- * {@code LOWERCASE_LETTER}.
+ * {@code LOWERCASE_LETTER}, or it has contributory property
+ * Other_Lowercase as defined by the Unicode Standard.
*
* The following are examples of lowercase characters:
*
* A character is lowercase if its general category type, provided
* by {@link Character#getType getType(codePoint)}, is
- * {@code LOWERCASE_LETTER}.
+ * {@code LOWERCASE_LETTER}, or it has contributory property
+ * Other_Lowercase as defined by the Unicode Standard.
*
* The following are examples of lowercase characters:
*
* A character is uppercase if its general category type, provided by
- * {@code Character.getType(ch)}, is {@code UPPERCASE_LETTER}.
+ * {@code Character.getType(ch)}, is {@code UPPERCASE_LETTER},
+ * or it has contributory property Other_Uppercase as defined by the Unicode Standard.
*
* The following are examples of uppercase characters:
*
* A character is uppercase if its general category type, provided by
- * {@link Character#getType(int) getType(codePoint)}, is {@code UPPERCASE_LETTER}.
+ * {@link Character#getType(int) getType(codePoint)}, is {@code UPPERCASE_LETTER},
+ * or it has contributory property Other_Uppercase as defined by the Unicode Standard.
*
* The following are examples of uppercase characters:
*
+ * A character is considered to be alphabetic if its general category type,
+ * provided by {@link Character#getType(int) getType(codePoint)}, is any of
+ * the following:
+ *
@@ -6430,7 +6482,7 @@
/**
* Determines if the specified character is a Unicode space character.
* A character is considered to be a space character if and only if
- * it is specified to be a space character by the Unicode standard. This
+ * it is specified to be a space character by the Unicode Standard. This
* method returns true if the character's general category type is any of
* the following:
* This class is in conformance with Level 1 of Unicode Technical
- * Standard #18: Unicode Regular Expression Guidelines, plus RL2.1
+ * Standard #18: Unicode Regular Expressions, plus RL2.1
* Canonical Equivalents.
- *
- * Unicode escape sequences such as \u2014 in Java source code
+ *
+ * Unicode escape sequences such as \u2014 in Java source code
* are processed as described in section 3.3 of
* The Java™ Language Specification.
- * Such escape sequences are also
- * implemented directly by the regular-expression parser so that Unicode
- * escapes can be used in expressions that are read from files or from the
- * keyboard. Thus the strings "\u2014" and "\\u2014",
- * while not equal, compile into the same pattern, which matches the character
- * with hexadecimal value 0x2014.
- *
- * A Unicode character can also be represented in a regular-expression by
- * using its hexadecimal code point value directly as described in construct
+ * Such escape sequences are also implemented directly by the regular-expression
+ * parser so that Unicode escapes can be used in expressions that are read from
+ * files or from the keyboard. Thus the strings "\u2014" and
+ * "\\u2014", while not equal, compile into the same pattern, which
+ * matches the character with hexadecimal value 0x2014.
+ *
+ * A Unicode character can also be represented in a regular-expression by
+ * using its Hex notation (hexadecimal code point value) directly as described in construct
* \x{...}, for example a supplementary character U+2011F
* can be specified as \x{2011F}, instead of two consecutive
* Unicode escape sequences of the surrogate pair
* \uD840\uDD1F.
- *
- *
- * Unicode scripts, blocks and categories are written with the \p and
- * \P constructs as in Perl. \p{prop} matches if
+ *
+ * Unicode scripts, blocks, categories and binary properties are written with
+ * the \p and \P constructs as in Perl.
+ * \p{prop} matches if
* the input has the property prop, while \P{prop}
* does not match if the input has that property.
*
- * Scripts are specified either with the prefix {@code Is}, as in
+ * Scripts, blocks, categories and binary properties can be used both inside
+ * and outside of a character class.
+ *
+ *
+ * Scripts are specified either with the prefix {@code Is}, as in
* {@code IsHiragana}, or by using the {@code script} keyword (or its short
* form {@code sc})as in {@code script=Hiragana} or {@code sc=Hiragana}.
*
- * Blocks are specified with the prefix {@code In}, as in
+ * The script names supported by
+ * Blocks are specified with the prefix {@code In}, as in
* {@code InMongolian}, or by using the keyword {@code block} (or its short
* form {@code blk}) as in {@code block=Mongolian} or {@code blk=Mongolian}.
*
- * Categories may be specified with the optional prefix {@code Is}:
+ * The block names supported by
+ *
+ * Categories may be specified with the optional prefix {@code Is}:
* Both {@code \p{L}} and {@code \p{IsL}} denote the category of Unicode
* letters. Same as scripts and blocks, categories can also be specified
* by using the keyword {@code general_category} (or its short form
* {@code gc}) as in {@code general_category=Lu} or {@code gc=Lu}.
*
- * Scripts, blocks and categories can be used both inside and outside of a
- * character class.
- * The supported categories are those of
+ * The supported categories are those of
*
* The Unicode Standard in the version specified by the
* {@link java.lang.Character Character} class. The category names are those
* defined in the Standard, both normative and informative.
- * The script names supported by
- * Categories that behave like the java.lang.Character
+ *
+ * Binary properties are specified with the prefix {@code Is}, as in
+ * {@code IsAlphabetic}. The supported binary properties by
+ * Predefined Character classes and POSIX character classes are in
+ * conformance with the recommendation of Annex C: Compatibility Properties
+ * of Unicode Regular Expressions, when the {@link #UNICODE_CHARACTER_CLASS}
+ * flag is specified.
+ *
+ *
+ *
+ * Categories that behave like the java.lang.Character
* boolean ismethodname methods (except for the deprecated ones) are
* available through the same \p{prop} syntax where
* the specified property has the name javamethodname.
@@ -796,6 +878,28 @@
*/
public static final int CANON_EQ = 0x80;
+ /**
+ * Enables Unicode version of Predefined character classes and
+ * POSIX character classes.
+ *
+ * When this flag is specified then the (US-ASCII only)
+ * Predefined character classes and POSIX character classes
+ * are in conformance with
+ * Unicode Technical
+ * Standard #18: Unicode Regular Expressions,
+ * Annex C: Compatibility Properties.
+ *
+ * The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded
+ * flag expression (?U).
+ *
+ * The flag implies UNICODE_CASE, that is, it enables Unicode-aware case
+ * folding.
+ *
+ * Specifying this flag may impose a performance penalty.
@@ -5235,7 +5236,8 @@
*
@@ -5257,7 +5259,8 @@
* @since 1.5
*/
public static boolean isLowerCase(int codePoint) {
- return getType(codePoint) == Character.LOWERCASE_LETTER;
+ return getType(codePoint) == Character.LOWERCASE_LETTER ||
+ CharacterData.of(codePoint).isOtherLowercase(codePoint);
}
/**
@@ -5265,6 +5268,7 @@
*
@@ -5298,7 +5302,8 @@
* Determines if the specified character (Unicode code point) is an uppercase character.
*
@@ -5320,7 +5325,8 @@
* @since 1.5
*/
public static boolean isUpperCase(int codePoint) {
- return getType(codePoint) == Character.UPPERCASE_LETTER;
+ return getType(codePoint) == Character.UPPERCASE_LETTER ||
+ CharacterData.of(codePoint).isOtherUppercase(codePoint);
}
/**
@@ -5725,6 +5731,52 @@
}
/**
+ * Determines if the specified character (Unicode code point) is alphabetic.
+ *
+ * A character is considered to be alphabetic if its general category type,
+ * provided by {@link Character#getType(int) getType(codePoint)}, is any of
+ * the following:
+ *
+ * UPPERCASE_LETTER
+ * LOWERCASE_LETTER
+ * TITLECASE_LETTER
+ * MODIFIER_LETTER
+ * OTHER_LETTER
+ * LETTER_NUMBER
+ *
+ * or it has contributory property Other_Alphabetic as defined by the
+ * Unicode Standard.
+ *
+ * @param codePoint the character (Unicode code point) to be tested.
+ * @return true if the character is a Unicode alphabetic
+ * character, false otherwise.
+ * @since 1.7
+ */
+ public static boolean isAlphabetic(int codePoint) {
+ return (((((1 << Character.UPPERCASE_LETTER) |
+ (1 << Character.LOWERCASE_LETTER) |
+ (1 << Character.TITLECASE_LETTER) |
+ (1 << Character.MODIFIER_LETTER) |
+ (1 << Character.OTHER_LETTER) |
+ (1 << Character.LETTER_NUMBER)) >> getType(codePoint)) & 1) != 0) ||
+ CharacterData.of(codePoint).isOtherAlphabetic(codePoint);
+ }
+
+ /**
+ * Determines if the specified character (Unicode code point) is a CJKV
+ * (Chinese, Japanese, Korean and Vietnamese) ideograph, as defined by
+ * the Unicode Standard.
+ *
+ * @param codePoint the character (Unicode code point) to be tested.
+ * @return true if the character is a Unicode ideograph
+ * character, false otherwise.
+ * @since 1.7
+ */
+ public static boolean isIdeographic(int codePoint) {
+ return CharacterData.of(codePoint).isIdeographic(codePoint);
+ }
+
+ /**
* Determines if the specified character is
* permissible as the first character in a Java identifier.
*
@@ -6458,7 +6510,7 @@
* Determines if the specified character (Unicode code point) is a
* Unicode space character. A character is considered to be a
* space character if and only if it is specified to be a space
- * character by the Unicode standard. This method returns true if
+ * character by the Unicode Standard. This method returns true if
* the character's general category type is any of the following:
*
*
@@ -6908,7 +6960,7 @@
* @since 1.4
*/
static char[] toUpperCaseCharArray(int codePoint) {
- // As of Unicode 4.0, 1:M uppercasings only happen in the BMP.
+ // As of Unicode 6.0, 1:M uppercasings only happen in the BMP.
assert isBmpCodePoint(codePoint);
return CharacterData.of(codePoint).toUpperCaseCharArray(codePoint);
}
@@ -6941,7 +6993,7 @@
* Note: if the specified character is not assigned a name by
* the UnicodeData file (part of the Unicode Character
* Database maintained by the Unicode Consortium), the returned
- * name is the same as the result of expression
+ * name is the same as the result of expression.
*
* {@code
* Character.UnicodeBlock.of(codePoint).toString().replace('_', ' ')
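
The Character.java hunks above widen `isLowerCase` and `isUpperCase` to honor Other_Lowercase and Other_Uppercase, and add `isAlphabetic` and `isIdeographic`. The following is a minimal sketch, not part of the patch, that exercises those predicates on a few code points and mirrors the category bit mask used in `isAlphabetic`. The class name `CharacterPropertyDemo` and the helper `inLetterCategories` are hypothetical names chosen for illustration, and the sketch assumes a runtime that already contains this change (Java 7 or later).

```java
public class CharacterPropertyDemo {

    // Same six categories that isAlphabetic tests, packed one bit per category.
    static final int LETTER_MASK =
            (1 << Character.UPPERCASE_LETTER) |
            (1 << Character.LOWERCASE_LETTER) |
            (1 << Character.TITLECASE_LETTER) |
            (1 << Character.MODIFIER_LETTER)  |
            (1 << Character.OTHER_LETTER)     |
            (1 << Character.LETTER_NUMBER);

    // True if the general category alone qualifies the code point.
    // This deliberately ignores Other_Alphabetic, unlike Character.isAlphabetic.
    static boolean inLetterCategories(int codePoint) {
        return ((LETTER_MASK >> Character.getType(codePoint)) & 1) != 0;
    }

    // One-line summary of how the predicates classify a code point.
    static String describe(int codePoint) {
        return String.format(
                "U+%04X type=%d lower=%b upper=%b alphabetic=%b ideographic=%b byCategory=%b",
                codePoint,
                Character.getType(codePoint),
                Character.isLowerCase(codePoint),
                Character.isUpperCase(codePoint),
                Character.isAlphabetic(codePoint),
                Character.isIdeographic(codePoint),
                inLetterCategories(codePoint));
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0061, // LATIN SMALL LETTER A: category Ll
            0x2170, // SMALL ROMAN NUMERAL ONE: category Nl, Other_Lowercase
            0x2160, // ROMAN NUMERAL ONE: category Nl, Other_Uppercase
            0x0345, // COMBINING GREEK YPOGEGRAMMENI: category Mn, Other_Alphabetic
            0x4E00, // CJK UNIFIED IDEOGRAPH-4E00: category Lo, Ideographic
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

U+2170 and U+2160 are the interesting cases for the case predicates: their general category is LETTER_NUMBER, so they are reported as lowercase and uppercase only through the contributory properties. U+0345 shows the same effect for `isAlphabetic`, since its category (a non-spacing mark) is not in the mask.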
--- old/src/share/classes/java/lang/CharacterData.java 2011-04-28 01:11:18.993436621 -0700
+++ new/src/share/classes/java/lang/CharacterData.java 2011-04-28 01:11:18.637717073 -0700
@@ -46,10 +46,27 @@
int toUpperCaseEx(int ch) {
return toUpperCase(ch);
}
+
char[] toUpperCaseCharArray(int ch) {
return null;
}
+ boolean isOtherLowercase(int ch) {
+ return false;
+ }
+
+ boolean isOtherUppercase(int ch) {
+ return false;
+ }
+
+ boolean isOtherAlphabetic(int ch) {
+ return false;
+ }
+
+ boolean isIdeographic(int ch) {
+ return false;
+ }
+
// Character <= 0xff (basic latin) is handled by internal fast-path
// to avoid initializing large tables.
// Note: performance of this "fast-path" code may be sub-optimal
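
The four methods added to CharacterData above all return `false`, so a data table only needs to override the properties it actually carries. A minimal sketch of that default-false pattern follows. Every class here (`PropertyData`, `PlainData`, `RomanNumeralData`) is a hypothetical stand-in invented for illustration; none of them reflects how the real CharacterData subclasses store or look up their properties.

```java
public class DefaultFalseDemo {

    // Hypothetical stand-in for the base class: every property defaults to false.
    static abstract class PropertyData {
        boolean isOtherLowercase(int ch) { return false; }
        boolean isIdeographic(int ch)    { return false; }
    }

    // Hypothetical table with no knowledge of the new properties; inherits the defaults.
    static class PlainData extends PropertyData { }

    // Hypothetical table that overrides a single property for one small range.
    static class RomanNumeralData extends PropertyData {
        @Override
        boolean isOtherLowercase(int ch) {
            // SMALL ROMAN NUMERAL ONE .. SMALL ROMAN NUMERAL ONE THOUSAND
            return ch >= 0x2170 && ch <= 0x217F;
        }
    }

    // Hypothetical dispatcher, loosely analogous in shape to CharacterData.of(int).
    static PropertyData of(int ch) {
        return (ch >= 0x2160 && ch <= 0x218F) ? new RomanNumeralData() : new PlainData();
    }

    public static void main(String[] args) {
        System.out.println(of(0x2170).isOtherLowercase(0x2170)); // overridden answer
        System.out.println(of('a').isOtherLowercase('a'));       // inherited default
        System.out.println(of(0x2170).isIdeographic(0x2170));    // inherited default
    }
}
```

The design choice is that callers such as `Character.isLowerCase(int)` can invoke the property method unconditionally, without knowing which table handles the code point.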
--- old/src/share/classes/java/util/regex/Pattern.java 2011-04-28 01:11:22.062420232 -0700
+++ new/src/share/classes/java/util/regex/Pattern.java 2011-04-28 01:11:21.713634738 -0700
@@ -206,13 +206,15 @@
* Equivalent to java.lang.Character.isMirrored()
*
*
- * Classes for Unicode scripts, blocks and categories
+ * Classes for Unicode scripts, blocks, categories and binary properties
* \p{IsLatin}
- * A Latin script character (simple script)
+ * A Latin script character (script)
* \p{InGreek}
- * A character in the Greek block (simple block)
+ * A character in the Greek block (block)
* \p{Lu}
- * An uppercase letter (simple category)
+ * An uppercase letter (category)
+ * \p{IsAlphabetic}
+ * An alphabetic character (binary property)
* \p{Sc}
* A currency symbol
* \P{InGreek}
@@ -328,10 +330,11 @@
* X, as a named-capturing group
* (?:X)
* X, as a non-capturing group
- * (?idmsux-idmsux)
+ * (?idmsuxU-idmsuxU)
* Nothing, but turns match flags i
* d m s
- * u x on - off
+ * u x U
+ * on - off
* (?idmsux-idmsux:X)
* X, as a non-capturing group with the
* given flags i d
@@ -518,61 +521,140 @@
*
+ * The script names supported by Pattern are the valid script names
+ * accepted and defined by
+ * {@link java.lang.Character.UnicodeScript#forName(String) UnicodeScript.forName}.
+ *
+ * The block names supported by Pattern are the valid block names
+ * accepted and defined by
+ * {@link java.lang.Character.UnicodeBlock#forName(String) UnicodeBlock.forName}.
- * The script names supported by Pattern are the valid script names
- * accepted and defined by
- * {@link java.lang.Character.UnicodeScript#forName(String) UnicodeScript.forName}.
- * The block names supported by Pattern are the valid block names
- * accepted and defined by
- * {@link java.lang.Character.UnicodeBlock#forName(String) UnicodeBlock.forName}.
+ * The supported binary properties by Pattern
+ * are
+ *
+ *
+
+
+ *
+ *
+ *
+ *
+ * Classes
+ * Matches
+ *
+ * \p{Lower}
+ * A lowercase character:\p{IsLowercase}
+ * \p{Upper}
+ * An uppercase character:\p{IsUppercase}
+ * \p{ASCII}
+ * All ASCII:[\x00-\x7F]
+ * \p{Alpha}
+ * An alphabetic character:\p{IsAlphabetic}
+ * \p{Digit}
+ * A decimal digit character:\p{IsDigit}
+ * \p{Alnum}
+ * An alphanumeric character:[\p{IsAlphabetic}\p{IsDigit}]
+ * \p{Punct}
+ * A punctuation character:\p{IsPunctuation}
+ * \p{Graph}
+ * A visible character: [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]
+ * \p{Print}
+ * A printable character: [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]
+ * \p{Blank}
+ * A space or a tab: [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]
+ * \p{Cntrl}
+ * A control character: \p{gc=Cc}
+ * \p{XDigit}
+ * A hexadecimal digit: [\p{gc=Nd}\p{IsHex_Digit}]
+ * \p{Space}
+ * A whitespace character:\p{IsWhite_Space}
+ * \d
+ * A digit: \p{IsDigit}
+ * \D
+ * A non-digit: [^\d]
+ * \s
+ * A whitespace character: \p{IsWhite_Space}
+ * \S
+ * A non-whitespace character: [^\s]
+ * \w
+ * A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}]
+ * \W
+ * A non-word character: [^\w]
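
The table above describes how the predefined and POSIX classes are redefined when `UNICODE_CHARACTER_CLASS` is set. As a rough illustration, and not as part of the patch, the sketch below compares the default (US-ASCII) behavior with the flag and with the embedded `(?U)` expression, and also tries the `\p{IsAlphabetic}` and `\x{...}` constructs. The class name `UnicodeClassDemo` and its two helper methods are hypothetical; the sketch assumes a runtime that includes this change (Java 7 or later).

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {

    // Full-input match with default flags (predefined classes are US-ASCII only).
    static boolean plain(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Full-input match with the Unicode versions of the predefined classes.
    static boolean unicode(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9";   // French word with two accented letters
        String digit = "\u0661";         // ARABIC-INDIC DIGIT ONE, category Nd
        String supplementary = new String(Character.toChars(0x2011F));

        System.out.println(plain("\\w+", word));       // ASCII-only word class
        System.out.println(unicode("\\w+", word));     // flag passed to compile
        System.out.println(plain("(?U)\\w+", word));   // embedded flag expression
        System.out.println(plain("\\d", digit));
        System.out.println(unicode("\\d", digit));
        System.out.println(unicode("\\p{Lower}", "\u00e9"));

        // Binary properties and hex notation do not need the flag.
        System.out.println(plain("\\p{IsAlphabetic}+", word));
        System.out.println(plain("\\x{2011F}", supplementary));
    }
}
```

With the flag off, `\w` and `\d` reject the accented word and the Arabic-Indic digit; with the flag, or with `(?U)`, they accept them. The last two checks succeed either way.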