/*
 * Copyright (c) 2009, 2015, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

/**
 *******************************************************************************
 * Copyright (C) 1996-2014, International Business Machines Corporation and
 * others. All Rights Reserved.
 *******************************************************************************
 */

package sun.text.normalizer;

/**

 * The UCharacter class provides extensions to the java.lang.Character
 * class. These extensions provide support for more Unicode properties and,
 * together with the UTF16 class, provide support for supplementary
 * characters (those with code points above U+FFFF).
 * Each ICU release supports the latest version of Unicode available at that
 * time.
 *
 * Code points are represented in this API using ints. While it would be
 * more convenient in Java to have a separate primitive datatype for them,
 * ints suffice in the meantime.
 *
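The relationship between UTF-16 code units and int code points can be sketched as follows. This is a minimal illustration that uses only java.lang.Character and String rather than this class; the class name CodePointSketch and the helper combine are hypothetical names introduced for the example.

```java
final class CodePointSketch {
    // Hypothetical helper: combine a lead/trail surrogate pair into one
    // supplementary code point, rejecting anything that is not a valid pair.
    static int combine(char lead, char trail) {
        if (!Character.isHighSurrogate(lead) || !Character.isLowSurrogate(trail)) {
            throw new IllegalArgumentException("Illegal surrogate characters");
        }
        // Same arithmetic that Character.toCodePoint(lead, trail) performs.
        return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00"; // 'A' followed by U+1F600
        System.out.println(s.length());                      // UTF-16 code units: 3
        System.out.println(s.codePointCount(0, s.length())); // code points: 2
        System.out.println(Integer.toHexString(combine(s.charAt(1), s.charAt(2)))); // 1f600
    }
}
```

The explicit arithmetic is shown only to make the mapping visible; Character.toCodePoint performs the same computation but does not validate its arguments.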

 * To use this class, add the jar file named icu4j.jar to the class path,
 * since it contains the data files that supply the information used by
 * this file.
 * E.g. in Windows:
 *   set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar
 * Alternatively, copy the files uprops.dat and unames.icu from the icu4j
 * source subdirectory $ICU4J_SRC/src/com.ibm.icu.impl.data to your class
 * directory $ICU4J_CLASS/com.ibm.icu.impl.data.
 *

 * Aside from the additions for UTF-16 support, and the updated Unicode
 * properties, the main differences between UCharacter and Character are:
 *
 * Further detail on differences can be determined using the program
 * com.ibm.icu.dev.test.lang.UCharacterCompare
 *
 * In addition to Java compatibility functions, which calculate derived
 * properties, this API provides low-level access to the Unicode Character
 * Database.
 *
 * Unicode assigns each code point (not just assigned character) values for
 * many properties.
 * Most of them are simple boolean flags, or constants from a small
 * enumerated list.
 * For some properties, values are strings or other relatively more complex
 * types.
 *
 * For more information see
 * "About the Unicode Character Database" (http://www.unicode.org/ucd/)
 * and the ICU User Guide chapter on Properties
 * (http://www.icu-project.org/userguide/properties.html).
 *
 * There are also functions that provide easy migration from C/POSIX
 * functions like isblank(). Their use is generally discouraged because the
 * C/POSIX standards do not define their semantics beyond the ASCII range,
 * which means that different implementations exhibit very different
 * behavior. Instead, Unicode properties should be used directly.
 *
 * There are also only a few, broad C/POSIX character classes, and they tend
 * to be used for conflicting purposes. For example, the "isalpha()" class
 * is sometimes used to determine word boundaries, while a more sophisticated
 * approach would at least distinguish initial letters from continuation
 * characters (the latter including combining marks).
 * (In ICU, BreakIterator is the most sophisticated API for word boundaries.)
 * Another example: There is no "istitle()" class for titlecase characters.
 *
 * ICU 3.4 and later provides API access for all twelve C/POSIX character
 * classes. ICU implements them according to the Standard Recommendations in
 * Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions
 * (http://www.unicode.org/reports/tr18/#Compatibility_Properties).
 *
 * API access for C/POSIX character classes is as follows:
 *

{@code
 * - alpha:     isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
 * - lower:     isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
 * - upper:     isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
 * - punct:     ((1<
 * }

*

* The C/POSIX character classes are also available in UnicodeSet patterns, * using patterns like [:graph:] or \p{graph}. *
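As a rough illustration of why semantics beyond the ASCII range matter, the sketch below uses java.util.regex instead of ICU's UnicodeSet (an assumption made so that the example needs no ICU jar). By default, java.util.regex treats POSIX classes such as \p{Alpha} as US-ASCII only; with Pattern.UNICODE_CHARACTER_CLASS they follow the UTS #18 Annex C compatibility properties. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PosixClassSketch {
    // Default behavior: the POSIX class \p{Alpha} covers US-ASCII letters only.
    static boolean isAlphaAscii(String s) {
        return Pattern.compile("\\p{Alpha}+").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \p{Alpha} follows the Alphabetic property.
    static boolean isAlphaUnicode(String s) {
        return Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00E9"; // ends in U+00E9, a non-ASCII letter
        System.out.println(isAlphaAscii(word));   // false
        System.out.println(isAlphaUnicode(word)); // true
    }
}
```

The same input is accepted or rejected depending only on which definition of "alpha" is in force, which is the portability problem described above.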

* * There are several ICU (and Java) whitespace functions. * Comparison: *
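To give one concrete data point on the Java side only (the ICU functions are not shown here), the following sketch prints how java.lang.Character.isWhitespace and java.lang.Character.isSpaceChar classify a few code points. The class name WhitespaceSketch and the helper describe are hypothetical.

```java
final class WhitespaceSketch {
    // Hypothetical helper: summarize both Java whitespace tests for one code point.
    static String describe(int cp) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0009)); // TAB: a control code, not a Z* separator
        System.out.println(describe(0x0020)); // SPACE: both tests agree
        System.out.println(describe(0x00A0)); // NO-BREAK SPACE: excluded by isWhitespace
        System.out.println(describe(0x2028)); // LINE SEPARATOR: both tests agree
    }
}
```

TAB and NO-BREAK SPACE are the cases where the two functions disagree, in opposite directions.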

*

* This class is not subclassable. *

 * @author Syn Wee Quek
 * @stable ICU 2.1
 * @see com.ibm.icu.lang.UCharacterEnums
 */
public final class UCharacter {

    /**
     * Joining Group constants.
     * @see UProperty#JOINING_GROUP
     * @stable ICU 2.4
     */
    public static interface JoiningGroup {
        /**
         * @stable ICU 2.4
         */
        public static final int NO_JOINING_GROUP = 0;
    }

    /**
     * Numeric Type constants.
     * @see UProperty#NUMERIC_TYPE
     * @stable ICU 2.4
     */
    public static interface NumericType {
        /**
         * @stable ICU 2.4
         */
        public static final int NONE = 0;
        /**
         * @stable ICU 2.4
         */
        public static final int DECIMAL = 1;
        /**
         * @stable ICU 2.4
         */
        public static final int DIGIT = 2;
        /**
         * @stable ICU 2.4
         */
        public static final int NUMERIC = 3;
        /**
         * @stable ICU 2.4
         */
        public static final int COUNT = 4;
    }

    /**
     * Hangul Syllable Type constants.
     *
     * @see UProperty#HANGUL_SYLLABLE_TYPE
     * @stable ICU 2.6
     */
    public static interface HangulSyllableType {
        /**
         * @stable ICU 2.6
         */
        public static final int NOT_APPLICABLE = 0; /*[NA]*/ /*See note !!*/
        /**
         * @stable ICU 2.6
         */
        public static final int LEADING_JAMO = 1; /*[L]*/
        /**
         * @stable ICU 2.6
         */
        public static final int VOWEL_JAMO = 2; /*[V]*/
        /**
         * @stable ICU 2.6
         */
        public static final int TRAILING_JAMO = 3; /*[T]*/
        /**
         * @stable ICU 2.6
         */
        public static final int LV_SYLLABLE = 4; /*[LV]*/
        /**
         * @stable ICU 2.6
         */
        public static final int LVT_SYLLABLE = 5; /*[LVT]*/
        /**
         * @stable ICU 2.6
         */
        public static final int COUNT = 6;
    }

    // public data members -----------------------------------------------

    /**
     * The lowest Unicode code point value.
     * @stable ICU 2.1
     */
    public static final int MIN_VALUE = UTF16.CODEPOINT_MIN_VALUE;

    /**
     * The highest Unicode code point value (scalar value) according to the
     * Unicode Standard.
     * This is a 21-bit value (21 bits, rounded up).
     * Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE
     * @stable ICU 2.1
     */
    public static final int MAX_VALUE = UTF16.CODEPOINT_MAX_VALUE;

    // public methods ----------------------------------------------------

    /**
     * Returns the numeric value of a decimal digit code point.
     *
     * This method observes the semantics of
     * java.lang.Character.digit(). Note that this
     * will return positive values for code points for which isDigit
     * returns false, just like java.lang.Character.
     *
     * Semantic Change: In release 1.3.1 and
     * prior, this did not treat the European letters as having a
     * digit value, and also treated numeric letters and other numbers as
     * digits.
     * This has been changed to conform to the Java semantics.
     *
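Since the method is documented as following java.lang.Character.digit(), the semantics can be tried out with the JDK method directly. The sketch below is only an illustration built on java.lang.Character, not on this class; DigitSketch and parse are hypothetical names.

```java
final class DigitSketch {
    // Hypothetical helper: parse a string of digits in the given radix,
    // one code point at a time, using Character.digit for each value.
    static long parse(String s, int radix) {
        long result = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int value = Character.digit(cp, radix);
            if (value < 0) { // not a digit in this radix, or the radix is invalid
                throw new NumberFormatException("Not a digit in radix " + radix);
            }
            result = result * radix + value;
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Character.isDigit('f'));      // false
        System.out.println(Character.digit('f', 16));    // 15: Latin letters count in higher radices
        System.out.println(Character.digit('f', 10));    // -1: value too large for the radix
        System.out.println(Character.digit('7', 37));    // -1: invalid radix
        System.out.println(parse("ff", 16));             // 255
        System.out.println(parse("\u0663\u0662", 10));   // 32: ARABIC-INDIC DIGIT THREE, TWO
    }
}
```

The 'f' case shows the point made above: a code point can have a digit value in some radix even though isDigit returns false for it.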
     * A code point is a valid digit if and only if:
     *
     * @param ch the code point to query
     * @param radix the radix
     * @return the numeric value represented by the code point in the
     * specified radix, or -1 if the code point is not a decimal digit
     * or if its value is too large for the radix
     * @stable ICU 2.1
     */
    public static int digit(int ch, int radix) {
        if (2 <= radix && radix <= 36) {
            int value = digit(ch);
            if (value < 0) {
                // ch is not a decimal digit, try latin letters
                value = UCharacterProperty.getEuropeanDigit(ch);
            }
            return (value < radix) ? value : -1;
        } else {
            return -1; // invalid radix
        }
    }

    /**
     * Returns the numeric value of a decimal digit code point.
     *
     * This is a convenience overload of digit(int, int)
     * that provides a decimal radix.
     *
     * Semantic Change: In release 1.3.1 and prior, this
     * treated numeric letters and other numbers as digits. This has
     * been changed to conform to the Java semantics.
     * @param ch the code point to query
     * @return the numeric value represented by the code point,
     * or -1 if the code point is not a decimal digit or if its
     * value is too large for a decimal radix
     * @stable ICU 2.1
     */
    public static int digit(int ch) {
        return UCharacterProperty.INSTANCE.digit(ch);
    }

    /**
     * Returns a value indicating a code point's Unicode category.
     * Up-to-date Unicode implementation of java.lang.Character.getType()
     * except for the above mentioned code points that had their category
     * changed.
* Return results are constants from the interface * UCharacterCategory
* NOTE: the UCharacterCategory values are not compatible with * those returned by java.lang.Character.getType. UCharacterCategory values * match the ones used in ICU4C, while java.lang.Character type * values, though similar, skip the value 17.
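The numbering gap mentioned in the note can be sketched as follows. The example assumes that the ICU-style values are the java.lang.Character type values with the unused value 17 closed up; that assumption is consistent with the constants copied into this file (6, 7, 8 and a category count of 30) but is not spelled out here. CategorySketch and toContiguous are hypothetical names.

```java
final class CategorySketch {
    // Hypothetical helper: renumber a java.lang.Character type value so that
    // the unused value 17 is closed up (assumed ICU-style numbering).
    static int toContiguous(int javaType) {
        return javaType > 17 ? javaType - 1 : javaType;
    }

    public static void main(String[] args) {
        int javaType = Character.getType(0xE000);       // a Private Use code point
        System.out.println(javaType);                   // 18 (Character.PRIVATE_USE)
        System.out.println(toContiguous(javaType));     // 17
        System.out.println(toContiguous(Character.FINAL_QUOTE_PUNCTUATION)); // 29
    }
}
```

Values up to 16 are unchanged by the mapping, so the mark categories (6, 7, 8) are identical in both numberings.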

     * @param ch code point whose type is to be determined
     * @return category which is a value of UCharacterCategory
     * @stable ICU 2.1
     */
    public static int getType(int ch) {
        return UCharacterProperty.INSTANCE.getType(ch);
    }

    /**
     * Returns the bidirectional property of a code point.
     * For example, 0x0041 (letter A) has the LEFT_TO_RIGHT directional
     * property.
     * Result returned belongs to the interface
     * UCharacterDirection
     * @param ch the code point whose direction is to be determined
     * @return direction constant from UCharacterDirection.
     * @stable ICU 2.1
     */
    public static int getDirection(int ch) {
        return UBiDiProps.INSTANCE.getClass(ch);
    }

    /**
     * Maps the specified code point to a "mirror-image" code point.
     * For code points with the "mirrored" property, implementations sometimes
     * need a "poor man's" mapping to another code point such that the default
     * glyph may serve as the mirror-image of the default glyph of the
     * specified code point.
     * This is useful for text conversion to and from codepages with visual
     * order, and for displays without glyph selection capabilities.
     * @param ch code point whose mirror is to be retrieved
     * @return another code point that may serve as a mirror-image substitute,
     * or ch itself if there is no such mapping or ch does not have the
     * "mirrored" property
     * @stable ICU 2.1
     */
    public static int getMirror(int ch) {
        return UBiDiProps.INSTANCE.getMirror(ch);
    }

    /**
     * Maps the specified character to its paired bracket character.
     * For Bidi_Paired_Bracket_Type!=None, this is the same as getMirror(int).
     * Otherwise c itself is returned.
     * See http://www.unicode.org/reports/tr9/
     *
     * @param c the code point to be mapped
     * @return the paired bracket code point,
     * or c itself if there is no such mapping
     * (Bidi_Paired_Bracket_Type=None)
     *
     * @see UProperty#BIDI_PAIRED_BRACKET
     * @see UProperty#BIDI_PAIRED_BRACKET_TYPE
     * @see #getMirror(int)
     * @stable ICU 52
     */
    public static int getBidiPairedBracket(int c) {
        return UBiDiProps.INSTANCE.getPairedBracket(c);
    }

    /**
     * Returns the combining class of the argument code point.
     * @param ch code point whose combining class is to be retrieved
     * @return the combining class of the code point
     * @stable ICU 2.1
     */
    public static int getCombiningClass(int ch) {
        return Normalizer2.getNFDInstance().getCombiningClass(ch);
    }

    /**
     * Returns the version of Unicode data used.
     * @return the Unicode version number used
     * @stable ICU 2.1
     */
    public static VersionInfo getUnicodeVersion() {
        return UCharacterProperty.INSTANCE.m_unicodeVersion_;
    }

    /**
     * Returns a code point corresponding to the two UTF16 characters.
     * @param lead the lead char
     * @param trail the trail char
     * @return code point if surrogate characters are valid.
     * @exception IllegalArgumentException thrown when argument characters do
     * not form a valid code point
     * @stable ICU 2.1
     */
    public static int getCodePoint(char lead, char trail) {
        if (UTF16.isLeadSurrogate(lead) && UTF16.isTrailSurrogate(trail)) {
            return UCharacterProperty.getRawSupplementary(lead, trail);
        }
        throw new IllegalArgumentException("Illegal surrogate characters");
    }

    /**
     * Returns the "age" of the code point.

*

The "age" is the Unicode version when the code point was first * designated (as a non-character or for Private Use) or assigned a * character. *

This can be useful to avoid emitting code points to receiving * processes that do not accept newer characters.

*

The data is from the UCD file DerivedAge.txt.

     * @param ch The code point.
     * @return the Unicode version number
     * @stable ICU 2.6
     */
    public static VersionInfo getAge(int ch) {
        if (ch < MIN_VALUE || ch > MAX_VALUE) {
            throw new IllegalArgumentException("Codepoint out of bounds");
        }
        return UCharacterProperty.INSTANCE.getAge(ch);
    }

    /**
     * Returns the property value for a Unicode property type of a code point.
     * Also returns binary and mask property values.

*

Unicode, especially in version 3.2, defines many more properties than * the original set in UnicodeData.txt.

*

The properties APIs are intended to reflect Unicode properties as * defined in the Unicode Character Database (UCD) and Unicode Technical * Reports (UTR). For details about the properties see * http://www.unicode.org/.

*

For names of Unicode properties see the UCD file PropertyAliases.txt. *

*
     * Sample usage:
     * int ea = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH);
     * int ideo = UCharacter.getIntPropertyValue(c, UProperty.IDEOGRAPHIC);
     * boolean b = (ideo == 1);
     * 
     * @param ch code point to test.
     * @param type UProperty selector constant, identifies which
     *        property to check. Must be
     *        {@code UProperty.BINARY_START <= type < UProperty.BINARY_LIMIT} or
     *        {@code UProperty.INT_START <= type < UProperty.INT_LIMIT} or
     *        {@code UProperty.MASK_START <= type < UProperty.MASK_LIMIT}.
     * @return numeric value that is directly the property value or,
     *         for enumerated properties, corresponds to the numeric value of
     *         the enumerated constant of the respective property value
     *         enumeration type (cast to enum type if necessary).
     *         Returns 0 or 1 (for false / true) for binary Unicode properties.
     *         Returns a bit-mask for mask properties.
     *         Returns 0 if 'type' is out of bounds or if the Unicode version
     *         does not have data for the property at all, or not for this code
     *         point.
     * @see UProperty
     * @see #hasBinaryProperty
     * @see #getIntPropertyMinValue
     * @see #getIntPropertyMaxValue
     * @see #getUnicodeVersion
     * @stable ICU 2.4
     */
    // for BiDiBase.java
    public static int getIntPropertyValue(int ch, int type) {
        return UCharacterProperty.INSTANCE.getIntPropertyValue(ch, type);
    }

    // private constructor -----------------------------------------------

    /**
     * Private constructor to prevent instantiation
     */
    private UCharacter() { }

    /*
     * Copied from UCharacterEnums.java
     */

    /**
     * Character type Mn
     * @stable ICU 2.1
     */
    public static final byte NON_SPACING_MARK = 6;

    /**
     * Character type Me
     * @stable ICU 2.1
     */
    public static final byte ENCLOSING_MARK = 7;

    /**
     * Character type Mc
     * @stable ICU 2.1
     */
    public static final byte COMBINING_SPACING_MARK = 8;

    /**
     * Character type count
     * @stable ICU 2.1
     */
    public static final byte CHAR_CATEGORY_COUNT = 30;

    /**
     * Directional type R
     * @stable ICU 2.1
     */
    public static final int RIGHT_TO_LEFT = 1;

    /**
     * Directional type AL
     * @stable ICU 2.1
     */
    public static final int RIGHT_TO_LEFT_ARABIC = 13;
}