1 /* 2 * Copyright (c) 2005, 2015, Oracle and/or its affiliates. All rights reserved. 3 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. 4 * 5 * This code is free software; you can redistribute it and/or modify it 6 * under the terms of the GNU General Public License version 2 only, as 7 * published by the Free Software Foundation. Oracle designates this 8 * particular file as subject to the "Classpath" exception as provided 9 * by Oracle in the LICENSE file that accompanied this code. 10 * 11 * This code is distributed in the hope that it will be useful, but WITHOUT 12 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 13 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License 14 * version 2 for more details (a copy is included in the LICENSE file that 15 * accompanied this code). 16 * 17 * You should have received a copy of the GNU General Public License version 18 * 2 along with this work; if not, write to the Free Software Foundation, 19 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 20 * 21 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA 22 * or visit www.oracle.com if you need additional information or have any 23 * questions. 24 */ 25 26 /* 27 ******************************************************************************* 28 * Copyright (C) 1996-2015, International Business Machines Corporation and 29 * others. All Rights Reserved. 30 ******************************************************************************* 31 */ 32 package sun.text.normalizer; 33 34 import java.io.IOException; 35 import java.text.ParsePosition; 36 import java.util.ArrayList; 37 import java.util.TreeSet; 38 39 /** 40 * A mutable set of Unicode characters and multicharacter strings. 41 * Objects of this class represent <em>character classes</em> used 42 * in regular expressions. A character specifies a subset of Unicode 43 * code points. Legal code points are U+0000 to U+10FFFF, inclusive. 44 * 45 * Note: method freeze() will not only make the set immutable, but 46 * also makes important methods much higher performance: 47 * contains(c), containsNone(...), span(...), spanBack(...) etc. 48 * After the object is frozen, any subsequent call that wants to change 49 * the object will throw UnsupportedOperationException. 50 * 51 * <p>The UnicodeSet class is not designed to be subclassed. 52 * 53 * <p><code>UnicodeSet</code> supports two APIs. The first is the 54 * <em>operand</em> API that allows the caller to modify the value of 55 * a <code>UnicodeSet</code> object. It conforms to Java 2's 56 * <code>java.util.Set</code> interface, although 57 * <code>UnicodeSet</code> does not actually implement that 58 * interface. All methods of <code>Set</code> are supported, with the 59 * modification that they take a character range or single character 60 * instead of an <code>Object</code>, and they take a 61 * <code>UnicodeSet</code> instead of a <code>Collection</code>. The 62 * operand API may be thought of in terms of boolean logic: a boolean 63 * OR is implemented by <code>add</code>, a boolean AND is implemented 64 * by <code>retain</code>, a boolean XOR is implemented by 65 * <code>complement</code> taking an argument, and a boolean NOT is 66 * implemented by <code>complement</code> with no argument. In terms 67 * of traditional set theory function names, <code>add</code> is a 68 * union, <code>retain</code> is an intersection, <code>remove</code> 69 * is an asymmetric difference, and <code>complement</code> with no 70 * argument is a set complement with respect to the superset range 71 * <code>MIN_VALUE-MAX_VALUE</code> 72 * 73 * <p>The second API is the 74 * <code>applyPattern()</code>/<code>toPattern()</code> API from the 75 * <code>java.text.Format</code>-derived classes. Unlike the 76 * methods that add characters, add categories, and control the logic 77 * of the set, the method <code>applyPattern()</code> sets all 78 * attributes of a <code>UnicodeSet</code> at once, based on a 79 * string pattern. 80 * 81 * <p><b>Pattern syntax</b></p> 82 * 83 * Patterns are accepted by the constructors and the 84 * <code>applyPattern()</code> methods and returned by the 85 * <code>toPattern()</code> method. These patterns follow a syntax 86 * similar to that employed by version 8 regular expression character 87 * classes. Here are some simple examples: 88 * 89 * <blockquote> 90 * <table> 91 * <tr align="top"> 92 * <td nowrap valign="top" align="left"><code>[]</code></td> 93 * <td valign="top">No characters</td> 94 * </tr><tr align="top"> 95 * <td nowrap valign="top" align="left"><code>[a]</code></td> 96 * <td valign="top">The character 'a'</td> 97 * </tr><tr align="top"> 98 * <td nowrap valign="top" align="left"><code>[ae]</code></td> 99 * <td valign="top">The characters 'a' and 'e'</td> 100 * </tr> 101 * <tr> 102 * <td nowrap valign="top" align="left"><code>[a-e]</code></td> 103 * <td valign="top">The characters 'a' through 'e' inclusive, in Unicode code 104 * point order</td> 105 * </tr> 106 * <tr> 107 * <td nowrap valign="top" align="left"><code>[\\u4E01]</code></td> 108 * <td valign="top">The character U+4E01</td> 109 * </tr> 110 * <tr> 111 * <td nowrap valign="top" align="left"><code>[a{ab}{ac}]</code></td> 112 * <td valign="top">The character 'a' and the multicharacter strings "ab" and 113 * "ac"</td> 114 * </tr> 115 * <tr> 116 * <td nowrap valign="top" align="left"><code>[\p{Lu}]</code></td> 117 * <td valign="top">All characters in the general category Uppercase Letter</td> 118 * </tr> 119 * </table> 120 * </blockquote> 121 * 122 * Any character may be preceded by a backslash in order to remove any special 123 * meaning. White space characters, as defined by the Unicode Pattern_White_Space property, are 124 * ignored, unless they are escaped. 125 * 126 * <p>Property patterns specify a set of characters having a certain 127 * property as defined by the Unicode standard. Both the POSIX-like 128 * "[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized. For a 129 * complete list of supported property patterns, see the User's Guide 130 * for UnicodeSet at 131 * <a href="http://www.icu-project.org/userguide/unicodeSet.html"> 132 * http://www.icu-project.org/userguide/unicodeSet.html</a>. 133 * Actual determination of property data is defined by the underlying 134 * Unicode database as implemented by UCharacter. 135 * 136 * <p>Patterns specify individual characters, ranges of characters, and 137 * Unicode property sets. When elements are concatenated, they 138 * specify their union. To complement a set, place a '^' immediately 139 * after the opening '['. Property patterns are inverted by modifying 140 * their delimiters; "[:^foo]" and "\P{foo}". In any other location, 141 * '^' has no special meaning. 142 * 143 * <p>Ranges are indicated by placing two a '-' between two 144 * characters, as in "a-z". This specifies the range of all 145 * characters from the left to the right, in Unicode order. If the 146 * left character is greater than or equal to the 147 * right character it is a syntax error. If a '-' occurs as the first 148 * character after the opening '[' or '[^', or if it occurs as the 149 * last character before the closing ']', then it is taken as a 150 * literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same 151 * set of three characters, 'a', 'b', and '-'. 152 * 153 * <p>Sets may be intersected using the {@literal '&'} operator or the asymmetric 154 * set difference may be taken using the '-' operator, for example, 155 * "{@code [[:L:]&[\\u0000-\\u0FFF]]}" indicates the set of all Unicode letters 156 * with values less than 4096. Operators ({@literal '&'} and '|') have equal 157 * precedence and bind left-to-right. Thus 158 * "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to 159 * "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for 160 * difference; intersection is commutative. 161 * 162 * <table> 163 * <tr valign=top><td nowrap><code>[a]</code><td>The set containing 'a' 164 * <tr valign=top><td nowrap><code>[a-z]</code><td>The set containing 'a' 165 * through 'z' and all letters in between, in Unicode order 166 * <tr valign=top><td nowrap><code>[^a-z]</code><td>The set containing 167 * all characters but 'a' through 'z', 168 * that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF 169 * <tr valign=top><td nowrap><code>[[<em>pat1</em>][<em>pat2</em>]]</code> 170 * <td>The union of sets specified by <em>pat1</em> and <em>pat2</em> 171 * <tr valign=top><td nowrap><code>[[<em>pat1</em>]&[<em>pat2</em>]]</code> 172 * <td>The intersection of sets specified by <em>pat1</em> and <em>pat2</em> 173 * <tr valign=top><td nowrap><code>[[<em>pat1</em>]-[<em>pat2</em>]]</code> 174 * <td>The asymmetric difference of sets specified by <em>pat1</em> and 175 * <em>pat2</em> 176 * <tr valign=top><td nowrap><code>[:Lu:] or \p{Lu}</code> 177 * <td>The set of characters having the specified 178 * Unicode property; in 179 * this case, Unicode uppercase letters 180 * <tr valign=top><td nowrap><code>[:^Lu:] or \P{Lu}</code> 181 * <td>The set of characters <em>not</em> having the given 182 * Unicode property 183 * </table> 184 * 185 * <p><b>Warning</b>: you cannot add an empty string ("") to a UnicodeSet.</p> 186 * 187 * <p><b>Formal syntax</b></p> 188 * 189 * <blockquote> 190 * <table> 191 * <tr align="top"> 192 * <td nowrap valign="top" align="right"><code>pattern := </code></td> 193 * <td valign="top"><code>('[' '^'? item* ']') | 194 * property</code></td> 195 * </tr> 196 * <tr align="top"> 197 * <td nowrap valign="top" align="right"><code>item := </code></td> 198 * <td valign="top"><code>char | (char '-' char) | pattern-expr<br> 199 * </code></td> 200 * </tr> 201 * <tr align="top"> 202 * <td nowrap valign="top" align="right"><code>pattern-expr := </code></td> 203 * <td valign="top"><code>pattern | pattern-expr pattern | 204 * pattern-expr op pattern<br> 205 * </code></td> 206 * </tr> 207 * <tr align="top"> 208 * <td nowrap valign="top" align="right"><code>op := </code></td> 209 * <td valign="top"><code>'&' | '-'<br> 210 * </code></td> 211 * </tr> 212 * <tr align="top"> 213 * <td nowrap valign="top" align="right"><code>special := </code></td> 214 * <td valign="top"><code>'[' | ']' | '-'<br> 215 * </code></td> 216 * </tr> 217 * <tr align="top"> 218 * <td nowrap valign="top" align="right"><code>char := </code></td> 219 * <td valign="top"><em>any character that is not</em><code> special<br> 220 * | ('\\' </code><em>any character</em><code>)<br> 221 * | ('\u' hex hex hex hex)<br> 222 * </code></td> 223 * </tr> 224 * <tr align="top"> 225 * <td nowrap valign="top" align="right"><code>hex := </code></td> 226 * <td valign="top"><em>any character for which 227 * </em><code>Character.digit(c, 16)</code><em> 228 * returns a non-negative result</em></td> 229 * </tr> 230 * <tr> 231 * <td nowrap valign="top" align="right"><code>property := </code></td> 232 * <td valign="top"><em>a Unicode property set pattern</em></td> 233 * </tr> 234 * </table> 235 * <br> 236 * <table border="1"> 237 * <tr> 238 * <td>Legend: <table> 239 * <tr> 240 * <td nowrap valign="top"><code>a := b</code></td> 241 * <td width="20" valign="top"> </td> 242 * <td valign="top"><code>a</code> may be replaced by <code>b</code> </td> 243 * </tr> 244 * <tr> 245 * <td nowrap valign="top"><code>a?</code></td> 246 * <td valign="top"></td> 247 * <td valign="top">zero or one instance of <code>a</code><br> 248 * </td> 249 * </tr> 250 * <tr> 251 * <td nowrap valign="top"><code>a*</code></td> 252 * <td valign="top"></td> 253 * <td valign="top">one or more instances of <code>a</code><br> 254 * </td> 255 * </tr> 256 * <tr> 257 * <td nowrap valign="top"><code>a | b</code></td> 258 * <td valign="top"></td> 259 * <td valign="top">either <code>a</code> or <code>b</code><br> 260 * </td> 261 * </tr> 262 * <tr> 263 * <td nowrap valign="top"><code>'a'</code></td> 264 * <td valign="top"></td> 265 * <td valign="top">the literal string between the quotes </td> 266 * </tr> 267 * </table> 268 * </td> 269 * </tr> 270 * </table> 271 * </blockquote> 272 * <p>To iterate over contents of UnicodeSet, the following are available: 273 * <ul><li>{@link #ranges()} to iterate through the ranges</li> 274 * <li>{@link #strings()} to iterate through the strings</li> 275 * <li>{@link #iterator()} to iterate through the entire contents in a single loop. 276 * That method is, however, not particularly efficient, since it "boxes" each code point into a String. 277 * </ul> 278 * All of the above can be used in <b>for</b> loops. 279 * The {@link com.ibm.icu.text.UnicodeSetIterator UnicodeSetIterator} can also be used, but not in <b>for</b> loops. 280 * <p>To replace, count elements, or delete spans, see {@link com.ibm.icu.text.UnicodeSetSpanner UnicodeSetSpanner}. 281 * 282 * @author Alan Liu 283 * @stable ICU 2.0 284 */ 285 class UnicodeSet { 286 287 private static final int LOW = 0x000000; // LOW <= all valid values. ZERO for codepoints 288 private static final int HIGH = 0x110000; // HIGH > all valid values. 10000 for code units. 289 // 110000 for codepoints 290 291 /** 292 * Minimum value that can be stored in a UnicodeSet. 293 * @stable ICU 2.0 294 */ 295 public static final int MIN_VALUE = LOW; 296 297 /** 298 * Maximum value that can be stored in a UnicodeSet. 299 * @stable ICU 2.0 300 */ 301 public static final int MAX_VALUE = HIGH - 1; 302 303 private int len; // length used; list may be longer to minimize reallocs 304 private int[] list; // MUST be terminated with HIGH 305 private int[] rangeList; // internal buffer 306 private int[] buffer; // internal buffer 307 308 // NOTE: normally the field should be of type SortedSet; but that is missing a public clone!! 309 // is not private so that UnicodeSetIterator can get access 310 TreeSet<String> strings = new TreeSet<String>(); 311 312 /** 313 * The pattern representation of this set. This may not be the 314 * most economical pattern. It is the pattern supplied to 315 * applyPattern(), with variables substituted and whitespace 316 * removed. For sets constructed without applyPattern(), or 317 * modified using the non-pattern API, this string will be null, 318 * indicating that toPattern() must generate a pattern 319 * representation from the inversion list. 320 */ 321 322 private static final int START_EXTRA = 16; // initial storage. Must be >= 0 323 private static final int GROW_EXTRA = START_EXTRA; // extra amount for growth. Must be >= 0 324 325 private static UnicodeSet INCLUSION = null; 326 327 private volatile BMPSet bmpSet; // The set is frozen if bmpSet or stringSpan is not null. 328 private volatile UnicodeSetStringSpan stringSpan; 329 330 //---------------------------------------------------------------- 331 // Public API 332 //---------------------------------------------------------------- 333 334 /** 335 * Constructs an empty set. 336 * @stable ICU 2.0 337 */ 338 private UnicodeSet() { 339 list = new int[1 + START_EXTRA]; 340 list[len++] = HIGH; 341 } 342 343 /** 344 * Constructs a copy of an existing set. 345 * @stable ICU 2.0 346 */ 347 private UnicodeSet(UnicodeSet other) { 348 set(other); 349 } 350 351 /** 352 * Constructs a set containing the given range. If <code>end > 353 * start</code> then an empty set is created. 354 * 355 * @param start first character, inclusive, of range 356 * @param end last character, inclusive, of range 357 * @stable ICU 2.0 358 */ 359 public UnicodeSet(int start, int end) { 360 this(); 361 complement(start, end); 362 } 363 364 /** 365 * Constructs a set from the given pattern. See the class description 366 * for the syntax of the pattern language. Whitespace is ignored. 367 * @param pattern a string specifying what characters are in the set 368 * @exception java.lang.IllegalArgumentException if the pattern contains 369 * a syntax error. 370 * @stable ICU 2.0 371 */ 372 public UnicodeSet(String pattern) { 373 this(); 374 applyPattern(pattern, null); 375 } 376 377 /** 378 * Make this object represent the same set as <code>other</code>. 379 * @param other a <code>UnicodeSet</code> whose value will be 380 * copied to this object 381 * @stable ICU 2.0 382 */ 383 public UnicodeSet set(UnicodeSet other) { 384 checkFrozen(); 385 list = other.list.clone(); 386 len = other.len; 387 strings = new TreeSet<String>(other.strings); 388 return this; 389 } 390 391 /** 392 * Returns the number of elements in this set (its cardinality) 393 * Note than the elements of a set may include both individual 394 * codepoints and strings. 395 * 396 * @return the number of elements in this set (its cardinality). 397 * @stable ICU 2.0 398 */ 399 public int size() { 400 int n = 0; 401 int count = getRangeCount(); 402 for (int i = 0; i < count; ++i) { 403 n += getRangeEnd(i) - getRangeStart(i) + 1; 404 } 405 return n + strings.size(); 406 } 407 408 // for internal use, after checkFrozen has been called 409 private UnicodeSet add_unchecked(int start, int end) { 410 if (start < MIN_VALUE || start > MAX_VALUE) { 411 throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(start, 6)); 412 } 413 if (end < MIN_VALUE || end > MAX_VALUE) { 414 throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(end, 6)); 415 } 416 if (start < end) { 417 add(range(start, end), 2, 0); 418 } else if (start == end) { 419 add(start); 420 } 421 return this; 422 } 423 424 /** 425 * Adds the specified character to this set if it is not already 426 * present. If this set already contains the specified character, 427 * the call leaves this set unchanged. 428 * @stable ICU 2.0 429 */ 430 public final UnicodeSet add(int c) { 431 checkFrozen(); 432 return add_unchecked(c); 433 } 434 435 // for internal use only, after checkFrozen has been called 436 private final UnicodeSet add_unchecked(int c) { 437 if (c < MIN_VALUE || c > MAX_VALUE) { 438 throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(c, 6)); 439 } 440 441 // find smallest i such that c < list[i] 442 // if odd, then it is IN the set 443 // if even, then it is OUT of the set 444 int i = findCodePoint(c); 445 446 // already in set? 447 if ((i & 1) != 0) return this; 448 449 // HIGH is 0x110000 450 // assert(list[len-1] == HIGH); 451 452 // empty = [HIGH] 453 // [start_0, limit_0, start_1, limit_1, HIGH] 454 455 // [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH] 456 // ^ 457 // list[i] 458 459 // i == 0 means c is before the first range 460 461 if (c == list[i]-1) { 462 // c is before start of next range 463 list[i] = c; 464 // if we touched the HIGH mark, then add a new one 465 if (c == MAX_VALUE) { 466 ensureCapacity(len+1); 467 list[len++] = HIGH; 468 } 469 if (i > 0 && c == list[i-1]) { 470 // collapse adjacent ranges 471 472 // [..., start_k-1, c, c, limit_k, ..., HIGH] 473 // ^ 474 // list[i] 475 System.arraycopy(list, i+1, list, i-1, len-i-1); 476 len -= 2; 477 } 478 } 479 480 else if (i > 0 && c == list[i-1]) { 481 // c is after end of prior range 482 list[i-1]++; 483 // no need to chcek for collapse here 484 } 485 486 else { 487 // At this point we know the new char is not adjacent to 488 // any existing ranges, and it is not 10FFFF. 489 490 491 // [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH] 492 // ^ 493 // list[i] 494 495 // [..., start_k-1, limit_k-1, c, c+1, start_k, limit_k, ..., HIGH] 496 // ^ 497 // list[i] 498 499 // Don't use ensureCapacity() to save on copying. 500 // NOTE: This has no measurable impact on performance, 501 // but it might help in some usage patterns. 502 if (len+2 > list.length) { 503 int[] temp = new int[len + 2 + GROW_EXTRA]; 504 if (i != 0) System.arraycopy(list, 0, temp, 0, i); 505 System.arraycopy(list, i, temp, i+2, len-i); 506 list = temp; 507 } else { 508 System.arraycopy(list, i, list, i+2, len-i); 509 } 510 511 list[i] = c; 512 list[i+1] = c+1; 513 len += 2; 514 } 515 516 return this; 517 } 518 519 /** 520 * Adds the specified multicharacter to this set if it is not already 521 * present. If this set already contains the multicharacter, 522 * the call leaves this set unchanged. 523 * Thus {@code "ch" => {"ch"}} 524 * <br><b>Warning: you cannot add an empty string ("") to a UnicodeSet.</b> 525 * @param s the source string 526 * @return this object, for chaining 527 * @stable ICU 2.0 528 */ 529 public final UnicodeSet add(CharSequence s) { 530 checkFrozen(); 531 int cp = getSingleCP(s); 532 if (cp < 0) { 533 strings.add(s.toString()); 534 } else { 535 add_unchecked(cp, cp); 536 } 537 return this; 538 } 539 540 /** 541 * Utility for getting code point from single code point CharSequence. 542 * See the public UTF16.getSingleCodePoint() 543 * @return a code point IF the string consists of a single one. 544 * otherwise returns -1. 545 * @param s to test 546 */ 547 private static int getSingleCP(CharSequence s) { 548 if (s.length() < 1) { 549 throw new IllegalArgumentException("Can't use zero-length strings in UnicodeSet"); 550 } 551 if (s.length() > 2) return -1; 552 if (s.length() == 1) return s.charAt(0); 553 554 // at this point, len = 2 555 int cp = UTF16.charAt(s, 0); 556 if (cp > 0xFFFF) { // is surrogate pair 557 return cp; 558 } 559 return -1; 560 } 561 562 /** 563 * Complements the specified range in this set. Any character in 564 * the range will be removed if it is in this set, or will be 565 * added if it is not in this set. If {@code end > start} 566 * then an empty range is complemented, leaving the set unchanged. 567 * 568 * @param start first character, inclusive, of range to be removed 569 * from this set. 570 * @param end last character, inclusive, of range to be removed 571 * from this set. 572 * @stable ICU 2.0 573 */ 574 public UnicodeSet complement(int start, int end) { 575 checkFrozen(); 576 if (start < MIN_VALUE || start > MAX_VALUE) { 577 throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(start, 6)); 578 } 579 if (end < MIN_VALUE || end > MAX_VALUE) { 580 throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(end, 6)); 581 } 582 if (start <= end) { 583 xor(range(start, end), 2, 0); 584 } 585 return this; 586 } 587 588 /** 589 * Returns true if this set contains the given character. 590 * @param c character to be checked for containment 591 * @return true if the test condition is met 592 * @stable ICU 2.0 593 */ 594 public boolean contains(int c) { 595 if (c < MIN_VALUE || c > MAX_VALUE) { 596 throw new IllegalArgumentException("Invalid code point U+" + Utility.hex(c, 6)); 597 } 598 if (bmpSet != null) { 599 return bmpSet.contains(c); 600 } 601 if (stringSpan != null) { 602 return stringSpan.contains(c); 603 } 604 605 /* 606 // Set i to the index of the start item greater than ch 607 // We know we will terminate without length test! 608 int i = -1; 609 while (true) { 610 if (c < list[++i]) break; 611 } 612 */ 613 614 int i = findCodePoint(c); 615 616 return ((i & 1) != 0); // return true if odd 617 } 618 619 /** 620 * Returns the smallest value i such that c < list[i]. Caller 621 * must ensure that c is a legal value or this method will enter 622 * an infinite loop. This method performs a binary search. 623 * @param c a character in the range MIN_VALUE..MAX_VALUE 624 * inclusive 625 * @return the smallest integer i in the range 0..len-1, 626 * inclusive, such that c < list[i] 627 */ 628 private final int findCodePoint(int c) { 629 /* Examples: 630 findCodePoint(c) 631 set list[] c=0 1 3 4 7 8 632 === ============== =========== 633 [] [110000] 0 0 0 0 0 0 634 [\u0000-\u0003] [0, 4, 110000] 1 1 1 2 2 2 635 [\u0004-\u0007] [4, 8, 110000] 0 0 0 1 1 2 636 [:all:] [0, 110000] 1 1 1 1 1 1 637 */ 638 639 // Return the smallest i such that c < list[i]. Assume 640 // list[len - 1] == HIGH and that c is legal (0..HIGH-1). 641 if (c < list[0]) return 0; 642 // High runner test. c is often after the last range, so an 643 // initial check for this condition pays off. 644 if (len >= 2 && c >= list[len-2]) return len-1; 645 int lo = 0; 646 int hi = len - 1; 647 // invariant: c >= list[lo] 648 // invariant: c < list[hi] 649 for (;;) { 650 int i = (lo + hi) >>> 1; 651 if (i == lo) return hi; 652 if (c < list[i]) { 653 hi = i; 654 } else { 655 lo = i; 656 } 657 } 658 } 659 660 /** 661 * Retains only the elements in this set that are contained in the 662 * specified set. In other words, removes from this set all of 663 * its elements that are not contained in the specified set. This 664 * operation effectively modifies this set so that its value is 665 * the <i>intersection</i> of the two sets. 666 * 667 * @param c set that defines which elements this set will retain. 668 * @stable ICU 2.0 669 */ 670 public UnicodeSet retainAll(UnicodeSet c) { 671 checkFrozen(); 672 retain(c.list, c.len, 0); 673 strings.retainAll(c.strings); 674 return this; 675 } 676 677 /** 678 * Removes all of the elements from this set. This set will be 679 * empty after this call returns. 680 * @stable ICU 2.0 681 */ 682 public UnicodeSet clear() { 683 checkFrozen(); 684 list[0] = HIGH; 685 len = 1; 686 strings.clear(); 687 return this; 688 } 689 690 /** 691 * Iteration method that returns the number of ranges contained in 692 * this set. 693 * @see #getRangeStart 694 * @see #getRangeEnd 695 * @stable ICU 2.0 696 */ 697 public int getRangeCount() { 698 return len/2; 699 } 700 701 /** 702 * Iteration method that returns the first character in the 703 * specified range of this set. 704 * @exception ArrayIndexOutOfBoundsException if index is outside 705 * the range <code>0..getRangeCount()-1</code> 706 * @see #getRangeCount 707 * @see #getRangeEnd 708 * @stable ICU 2.0 709 */ 710 public int getRangeStart(int index) { 711 return list[index*2]; 712 } 713 714 /** 715 * Iteration method that returns the last character in the 716 * specified range of this set. 717 * @exception ArrayIndexOutOfBoundsException if index is outside 718 * the range <code>0..getRangeCount()-1</code> 719 * @see #getRangeStart 720 * @see #getRangeEnd 721 * @stable ICU 2.0 722 */ 723 public int getRangeEnd(int index) { 724 return (list[index*2 + 1] - 1); 725 } 726 727 //---------------------------------------------------------------- 728 // Implementation: Pattern parsing 729 //---------------------------------------------------------------- 730 731 /** 732 * Parses the given pattern, starting at the given position. The character 733 * at pattern.charAt(pos.getIndex()) must be '[', or the parse fails. 734 * Parsing continues until the corresponding closing ']'. If a syntax error 735 * is encountered between the opening and closing brace, the parse fails. 736 * Upon return from a successful parse, the ParsePosition is updated to 737 * point to the character following the closing ']', and an inversion 738 * list for the parsed pattern is returned. This method 739 * calls itself recursively to parse embedded subpatterns. 740 * 741 * @param pattern the string containing the pattern to be parsed. The 742 * portion of the string from pos.getIndex(), which must be a '[', to the 743 * corresponding closing ']', is parsed. 744 * @param pos upon entry, the position at which to being parsing. The 745 * character at pattern.charAt(pos.getIndex()) must be a '['. Upon return 746 * from a successful parse, pos.getIndex() is either the character after the 747 * closing ']' of the parsed pattern, or pattern.length() if the closing ']' 748 * is the last character of the pattern string. 749 * @return an inversion list for the parsed substring 750 * of <code>pattern</code> 751 * @exception java.lang.IllegalArgumentException if the parse fails. 752 */ 753 private UnicodeSet applyPattern(String pattern, 754 ParsePosition pos) { 755 if ("[:age=3.2:]".equals(pattern)) { 756 checkFrozen(); 757 VersionInfo version = VersionInfo.getInstance("3.2"); 758 applyFilter(new VersionFilter(version), UCharacterProperty.SRC_PROPSVEC); 759 } else { 760 throw new IllegalStateException("UnicodeSet.applyPattern(unexpected pattern " 761 + pattern + ")"); 762 } 763 764 return this; 765 } 766 767 //---------------------------------------------------------------- 768 // Implementation: Utility methods 769 //---------------------------------------------------------------- 770 771 private void ensureCapacity(int newLen) { 772 if (newLen <= list.length) return; 773 int[] temp = new int[newLen + GROW_EXTRA]; 774 System.arraycopy(list, 0, temp, 0, len); 775 list = temp; 776 } 777 778 private void ensureBufferCapacity(int newLen) { 779 if (buffer != null && newLen <= buffer.length) return; 780 buffer = new int[newLen + GROW_EXTRA]; 781 } 782 783 /** 784 * Assumes start <= end. 785 */ 786 private int[] range(int start, int end) { 787 if (rangeList == null) { 788 rangeList = new int[] { start, end+1, HIGH }; 789 } else { 790 rangeList[0] = start; 791 rangeList[1] = end+1; 792 } 793 return rangeList; 794 } 795 796 //---------------------------------------------------------------- 797 // Implementation: Fundamental operations 798 //---------------------------------------------------------------- 799 800 // polarity = 0, 3 is normal: x xor y 801 // polarity = 1, 2: x xor ~y == x === y 802 803 private UnicodeSet xor(int[] other, int otherLen, int polarity) { 804 ensureBufferCapacity(len + otherLen); 805 int i = 0, j = 0, k = 0; 806 int a = list[i++]; 807 int b; 808 if (polarity == 1 || polarity == 2) { 809 b = LOW; 810 if (other[j] == LOW) { // skip base if already LOW 811 ++j; 812 b = other[j]; 813 } 814 } else { 815 b = other[j++]; 816 } 817 // simplest of all the routines 818 // sort the values, discarding identicals! 819 while (true) { 820 if (a < b) { 821 buffer[k++] = a; 822 a = list[i++]; 823 } else if (b < a) { 824 buffer[k++] = b; 825 b = other[j++]; 826 } else if (a != HIGH) { // at this point, a == b 827 // discard both values! 828 a = list[i++]; 829 b = other[j++]; 830 } else { // DONE! 831 buffer[k++] = HIGH; 832 len = k; 833 break; 834 } 835 } 836 // swap list and buffer 837 int[] temp = list; 838 list = buffer; 839 buffer = temp; 840 return this; 841 } 842 843 // polarity = 0 is normal: x union y 844 // polarity = 2: x union ~y 845 // polarity = 1: ~x union y 846 // polarity = 3: ~x union ~y 847 848 private UnicodeSet add(int[] other, int otherLen, int polarity) { 849 ensureBufferCapacity(len + otherLen); 850 int i = 0, j = 0, k = 0; 851 int a = list[i++]; 852 int b = other[j++]; 853 // change from xor is that we have to check overlapping pairs 854 // polarity bit 1 means a is second, bit 2 means b is. 855 main: 856 while (true) { 857 switch (polarity) { 858 case 0: // both first; take lower if unequal 859 if (a < b) { // take a 860 // Back up over overlapping ranges in buffer[] 861 if (k > 0 && a <= buffer[k-1]) { 862 // Pick latter end value in buffer[] vs. list[] 863 a = max(list[i], buffer[--k]); 864 } else { 865 // No overlap 866 buffer[k++] = a; 867 a = list[i]; 868 } 869 i++; // Common if/else code factored out 870 polarity ^= 1; 871 } else if (b < a) { // take b 872 if (k > 0 && b <= buffer[k-1]) { 873 b = max(other[j], buffer[--k]); 874 } else { 875 buffer[k++] = b; 876 b = other[j]; 877 } 878 j++; 879 polarity ^= 2; 880 } else { // a == b, take a, drop b 881 if (a == HIGH) break main; 882 // This is symmetrical; it doesn't matter if 883 // we backtrack with a or b. - liu 884 if (k > 0 && a <= buffer[k-1]) { 885 a = max(list[i], buffer[--k]); 886 } else { 887 // No overlap 888 buffer[k++] = a; 889 a = list[i]; 890 } 891 i++; 892 polarity ^= 1; 893 b = other[j++]; polarity ^= 2; 894 } 895 break; 896 case 3: // both second; take higher if unequal, and drop other 897 if (b <= a) { // take a 898 if (a == HIGH) break main; 899 buffer[k++] = a; 900 } else { // take b 901 if (b == HIGH) break main; 902 buffer[k++] = b; 903 } 904 a = list[i++]; polarity ^= 1; // factored common code 905 b = other[j++]; polarity ^= 2; 906 break; 907 case 1: // a second, b first; if b < a, overlap 908 if (a < b) { // no overlap, take a 909 buffer[k++] = a; a = list[i++]; polarity ^= 1; 910 } else if (b < a) { // OVERLAP, drop b 911 b = other[j++]; polarity ^= 2; 912 } else { // a == b, drop both! 913 if (a == HIGH) break main; 914 a = list[i++]; polarity ^= 1; 915 b = other[j++]; polarity ^= 2; 916 } 917 break; 918 case 2: // a first, b second; if a < b, overlap 919 if (b < a) { // no overlap, take b 920 buffer[k++] = b; b = other[j++]; polarity ^= 2; 921 } else if (a < b) { // OVERLAP, drop a 922 a = list[i++]; polarity ^= 1; 923 } else { // a == b, drop both! 924 if (a == HIGH) break main; 925 a = list[i++]; polarity ^= 1; 926 b = other[j++]; polarity ^= 2; 927 } 928 break; 929 } 930 } 931 buffer[k++] = HIGH; // terminate 932 len = k; 933 // swap list and buffer 934 int[] temp = list; 935 list = buffer; 936 buffer = temp; 937 return this; 938 } 939 940 // polarity = 0 is normal: x intersect y 941 // polarity = 2: x intersect ~y == set-minus 942 // polarity = 1: ~x intersect y 943 // polarity = 3: ~x intersect ~y 944 945 private UnicodeSet retain(int[] other, int otherLen, int polarity) { 946 ensureBufferCapacity(len + otherLen); 947 int i = 0, j = 0, k = 0; 948 int a = list[i++]; 949 int b = other[j++]; 950 // change from xor is that we have to check overlapping pairs 951 // polarity bit 1 means a is second, bit 2 means b is. 952 main: 953 while (true) { 954 switch (polarity) { 955 case 0: // both first; drop the smaller 956 if (a < b) { // drop a 957 a = list[i++]; polarity ^= 1; 958 } else if (b < a) { // drop b 959 b = other[j++]; polarity ^= 2; 960 } else { // a == b, take one, drop other 961 if (a == HIGH) break main; 962 buffer[k++] = a; a = list[i++]; polarity ^= 1; 963 b = other[j++]; polarity ^= 2; 964 } 965 break; 966 case 3: // both second; take lower if unequal 967 if (a < b) { // take a 968 buffer[k++] = a; a = list[i++]; polarity ^= 1; 969 } else if (b < a) { // take b 970 buffer[k++] = b; b = other[j++]; polarity ^= 2; 971 } else { // a == b, take one, drop other 972 if (a == HIGH) break main; 973 buffer[k++] = a; a = list[i++]; polarity ^= 1; 974 b = other[j++]; polarity ^= 2; 975 } 976 break; 977 case 1: // a second, b first; 978 if (a < b) { // NO OVERLAP, drop a 979 a = list[i++]; polarity ^= 1; 980 } else if (b < a) { // OVERLAP, take b 981 buffer[k++] = b; b = other[j++]; polarity ^= 2; 982 } else { // a == b, drop both! 983 if (a == HIGH) break main; 984 a = list[i++]; polarity ^= 1; 985 b = other[j++]; polarity ^= 2; 986 } 987 break; 988 case 2: // a first, b second; if a < b, overlap 989 if (b < a) { // no overlap, drop b 990 b = other[j++]; polarity ^= 2; 991 } else if (a < b) { // OVERLAP, take a 992 buffer[k++] = a; a = list[i++]; polarity ^= 1; 993 } else { // a == b, drop both! 994 if (a == HIGH) break main; 995 a = list[i++]; polarity ^= 1; 996 b = other[j++]; polarity ^= 2; 997 } 998 break; 999 } 1000 } 1001 buffer[k++] = HIGH; // terminate 1002 len = k; 1003 // swap list and buffer 1004 int[] temp = list; 1005 list = buffer; 1006 buffer = temp; 1007 return this; 1008 } 1009 1010 private static final int max(int a, int b) { 1011 return (a > b) ? a : b; 1012 } 1013 1014 //---------------------------------------------------------------- 1015 // Generic filter-based scanning code 1016 //---------------------------------------------------------------- 1017 1018 private static interface Filter { 1019 boolean contains(int codePoint); 1020 } 1021 1022 private static final VersionInfo NO_VERSION = VersionInfo.getInstance(0, 0, 0, 0); 1023 1024 private static class VersionFilter implements Filter { 1025 VersionInfo version; 1026 VersionFilter(VersionInfo version) { this.version = version; } 1027 public boolean contains(int ch) { 1028 VersionInfo v = UCharacter.getAge(ch); 1029 // Reference comparison ok; VersionInfo caches and reuses 1030 // unique objects. 1031 return v != NO_VERSION && 1032 v.compareTo(version) <= 0; 1033 } 1034 } 1035 1036 private static synchronized UnicodeSet getInclusions(int src) { 1037 if (src != UCharacterProperty.SRC_PROPSVEC) { 1038 throw new IllegalStateException("UnicodeSet.getInclusions(unknown src "+src+")"); 1039 } 1040 1041 if (INCLUSION == null) { 1042 UnicodeSet incl = new UnicodeSet(); 1043 UCharacterProperty.INSTANCE.upropsvec_addPropertyStarts(incl); 1044 INCLUSION = incl; 1045 } 1046 return INCLUSION; 1047 } 1048 1049 /** 1050 * Generic filter-based scanning code for UCD property UnicodeSets. 1051 */ 1052 private UnicodeSet applyFilter(Filter filter, int src) { 1053 // Logically, walk through all Unicode characters, noting the start 1054 // and end of each range for which filter.contain(c) is 1055 // true. Add each range to a set. 1056 // 1057 // To improve performance, use an inclusions set which 1058 // encodes information about character ranges that are known 1059 // to have identical properties. 1060 // getInclusions(src) contains exactly the first characters of 1061 // same-value ranges for the given properties "source". 1062 1063 clear(); 1064 1065 int startHasProperty = -1; 1066 UnicodeSet inclusions = getInclusions(src); 1067 int limitRange = inclusions.getRangeCount(); 1068 1069 for (int j=0; j<limitRange; ++j) { 1070 // get current range 1071 int start = inclusions.getRangeStart(j); 1072 int end = inclusions.getRangeEnd(j); 1073 1074 // for all the code points in the range, process 1075 for (int ch = start; ch <= end; ++ch) { 1076 // only add to the unicodeset on inflection points -- 1077 // where the hasProperty value changes to false 1078 if (filter.contains(ch)) { 1079 if (startHasProperty < 0) { 1080 startHasProperty = ch; 1081 } 1082 } else if (startHasProperty >= 0) { 1083 add_unchecked(startHasProperty, ch-1); 1084 startHasProperty = -1; 1085 } 1086 } 1087 } 1088 if (startHasProperty >= 0) { 1089 add_unchecked(startHasProperty, 0x10FFFF); 1090 } 1091 1092 return this; 1093 } 1094 1095 /** 1096 * Is this frozen, according to the Freezable interface? 1097 * 1098 * @return value 1099 * @stable ICU 3.8 1100 */ 1101 public boolean isFrozen() { 1102 return (bmpSet != null || stringSpan != null); 1103 } 1104 1105 /** 1106 * Freeze this class, according to the Freezable interface. 1107 * 1108 * @return this 1109 * @stable ICU 4.4 1110 */ 1111 public UnicodeSet freeze() { 1112 if (!isFrozen()) { 1113 // Do most of what compact() does before freezing because 1114 // compact() will not work when the set is frozen. 1115 // Small modification: Don't shrink if the savings would be tiny (<=GROW_EXTRA). 1116 1117 // Delete buffer first to defragment memory less. 1118 buffer = null; 1119 if (list.length > (len + GROW_EXTRA)) { 1120 // Make the capacity equal to len or 1. 1121 // We don't want to realloc of 0 size. 1122 int capacity = (len == 0) ? 1 : len; 1123 int[] oldList = list; 1124 list = new int[capacity]; 1125 for (int i = capacity; i-- > 0;) { 1126 list[i] = oldList[i]; 1127 } 1128 } 1129 1130 // Optimize contains() and span() and similar functions. 1131 if (!strings.isEmpty()) { 1132 stringSpan = new UnicodeSetStringSpan(this, new ArrayList<String>(strings), UnicodeSetStringSpan.ALL); 1133 } 1134 if (stringSpan == null || !stringSpan.needsStringSpanUTF16()) { 1135 // Optimize for code point spans. 1136 // There are no strings, or 1137 // all strings are irrelevant for span() etc. because 1138 // all of each string's code points are contained in this set. 1139 // However, fully contained strings are relevant for spanAndCount(), 1140 // so we create both objects. 1141 bmpSet = new BMPSet(list, len); 1142 } 1143 } 1144 return this; 1145 } 1146 1147 /** 1148 * Span a string using this UnicodeSet. 1149 * <p>To replace, count elements, or delete spans, see {@link com.ibm.icu.text.UnicodeSetSpanner UnicodeSetSpanner}. 1150 * @param s The string to be spanned 1151 * @param spanCondition The span condition 1152 * @return the length of the span 1153 * @stable ICU 4.4 1154 */ 1155 public int span(CharSequence s, SpanCondition spanCondition) { 1156 return span(s, 0, spanCondition); 1157 } 1158 1159 /** 1160 * Span a string using this UnicodeSet. 1161 * If the start index is less than 0, span will start from 0. 1162 * If the start index is greater than the string length, span returns the string length. 1163 * <p>To replace, count elements, or delete spans, see {@link com.ibm.icu.text.UnicodeSetSpanner UnicodeSetSpanner}. 1164 * @param s The string to be spanned 1165 * @param start The start index that the span begins 1166 * @param spanCondition The span condition 1167 * @return the string index which ends the span (i.e. exclusive) 1168 * @stable ICU 4.4 1169 */ 1170 public int span(CharSequence s, int start, SpanCondition spanCondition) { 1171 int end = s.length(); 1172 if (start < 0) { 1173 start = 0; 1174 } else if (start >= end) { 1175 return end; 1176 } 1177 if (bmpSet != null) { 1178 // Frozen set without strings, or no string is relevant for span(). 1179 return bmpSet.span(s, start, spanCondition, null); 1180 } 1181 if (stringSpan != null) { 1182 return stringSpan.span(s, start, spanCondition); 1183 } else if (!strings.isEmpty()) { 1184 int which = spanCondition == SpanCondition.NOT_CONTAINED ? UnicodeSetStringSpan.FWD_UTF16_NOT_CONTAINED 1185 : UnicodeSetStringSpan.FWD_UTF16_CONTAINED; 1186 UnicodeSetStringSpan strSpan = new UnicodeSetStringSpan(this, new ArrayList<String>(strings), which); 1187 if (strSpan.needsStringSpanUTF16()) { 1188 return strSpan.span(s, start, spanCondition); 1189 } 1190 } 1191 1192 return spanCodePointsAndCount(s, start, spanCondition, null); 1193 } 1194 1195 /** 1196 * Same as span() but also counts the smallest number of set elements on any path across the span. 1197 * <p>To replace, count elements, or delete spans, see {@link com.ibm.icu.text.UnicodeSetSpanner UnicodeSetSpanner}. 1198 * @param outCount An output-only object (must not be null) for returning the count. 1199 * @return the limit (exclusive end) of the span 1200 */ 1201 public int spanAndCount(CharSequence s, int start, SpanCondition spanCondition, OutputInt outCount) { 1202 if (outCount == null) { 1203 throw new IllegalArgumentException("outCount must not be null"); 1204 } 1205 int end = s.length(); 1206 if (start < 0) { 1207 start = 0; 1208 } else if (start >= end) { 1209 return end; 1210 } 1211 if (stringSpan != null) { 1212 // We might also have bmpSet != null, 1213 // but fully-contained strings are relevant for counting elements. 1214 return stringSpan.spanAndCount(s, start, spanCondition, outCount); 1215 } else if (bmpSet != null) { 1216 return bmpSet.span(s, start, spanCondition, outCount); 1217 } else if (!strings.isEmpty()) { 1218 int which = spanCondition == SpanCondition.NOT_CONTAINED ? UnicodeSetStringSpan.FWD_UTF16_NOT_CONTAINED 1219 : UnicodeSetStringSpan.FWD_UTF16_CONTAINED; 1220 which |= UnicodeSetStringSpan.WITH_COUNT; 1221 UnicodeSetStringSpan strSpan = new UnicodeSetStringSpan(this, new ArrayList<String>(strings), which); 1222 return strSpan.spanAndCount(s, start, spanCondition, outCount); 1223 } 1224 1225 return spanCodePointsAndCount(s, start, spanCondition, outCount); 1226 } 1227 1228 private int spanCodePointsAndCount(CharSequence s, int start, 1229 SpanCondition spanCondition, OutputInt outCount) { 1230 // Pin to 0/1 values. 1231 boolean spanContained = (spanCondition != SpanCondition.NOT_CONTAINED); 1232 1233 int c; 1234 int next = start; 1235 int length = s.length(); 1236 int count = 0; 1237 do { 1238 c = Character.codePointAt(s, next); 1239 if (spanContained != contains(c)) { 1240 break; 1241 } 1242 ++count; 1243 next += Character.charCount(c); 1244 } while (next < length); 1245 if (outCount != null) { outCount.value = count; } 1246 return next; 1247 } 1248 1249 /** 1250 * Span a string backwards (from the fromIndex) using this UnicodeSet. 1251 * If the fromIndex is less than 0, spanBack will return 0. 1252 * If fromIndex is greater than the string length, spanBack will start from the string length. 1253 * <p>To replace, count elements, or delete spans, see {@link com.ibm.icu.text.UnicodeSetSpanner UnicodeSetSpanner}. 1254 * @param s The string to be spanned 1255 * @param fromIndex The index of the char (exclusive) that the string should be spanned backwards 1256 * @param spanCondition The span condition 1257 * @return The string index which starts the span (i.e. inclusive). 1258 * @stable ICU 4.4 1259 */ 1260 public int spanBack(CharSequence s, int fromIndex, SpanCondition spanCondition) { 1261 if (fromIndex <= 0) { 1262 return 0; 1263 } 1264 if (fromIndex > s.length()) { 1265 fromIndex = s.length(); 1266 } 1267 if (bmpSet != null) { 1268 // Frozen set without strings, or no string is relevant for spanBack(). 1269 return bmpSet.spanBack(s, fromIndex, spanCondition); 1270 } 1271 if (stringSpan != null) { 1272 return stringSpan.spanBack(s, fromIndex, spanCondition); 1273 } else if (!strings.isEmpty()) { 1274 int which = (spanCondition == SpanCondition.NOT_CONTAINED) 1275 ? UnicodeSetStringSpan.BACK_UTF16_NOT_CONTAINED 1276 : UnicodeSetStringSpan.BACK_UTF16_CONTAINED; 1277 UnicodeSetStringSpan strSpan = new UnicodeSetStringSpan(this, new ArrayList<String>(strings), which); 1278 if (strSpan.needsStringSpanUTF16()) { 1279 return strSpan.spanBack(s, fromIndex, spanCondition); 1280 } 1281 } 1282 1283 // Pin to 0/1 values. 1284 boolean spanContained = (spanCondition != SpanCondition.NOT_CONTAINED); 1285 1286 int c; 1287 int prev = fromIndex; 1288 do { 1289 c = Character.codePointBefore(s, prev); 1290 if (spanContained != contains(c)) { 1291 break; 1292 } 1293 prev -= Character.charCount(c); 1294 } while (prev > 0); 1295 return prev; 1296 } 1297 1298 /** 1299 * Clone a thawed version of this class, according to the Freezable interface. 1300 * @return the clone, not frozen 1301 * @stable ICU 4.4 1302 */ 1303 public UnicodeSet cloneAsThawed() { 1304 UnicodeSet result = new UnicodeSet(this); 1305 assert !result.isFrozen(); 1306 return result; 1307 } 1308 1309 // internal function 1310 private void checkFrozen() { 1311 if (isFrozen()) { 1312 throw new UnsupportedOperationException("Attempt to modify frozen object"); 1313 } 1314 } 1315 1316 /** 1317 * Argument values for whether span() and similar functions continue while the current character is contained vs. 1318 * not contained in the set. 1319 * <p> 1320 * The functionality is straightforward for sets with only single code points, without strings (which is the common 1321 * case): 1322 * <ul> 1323 * <li>CONTAINED and SIMPLE work the same. 1324 * <li>CONTAINED and SIMPLE are inverses of NOT_CONTAINED. 1325 * <li>span() and spanBack() partition any string the 1326 * same way when alternating between span(NOT_CONTAINED) and span(either "contained" condition). 1327 * <li>Using a 1328 * complemented (inverted) set and the opposite span conditions yields the same results. 1329 * </ul> 1330 * When a set contains multi-code point strings, then these statements may not be true, depending on the strings in 1331 * the set (for example, whether they overlap with each other) and the string that is processed. For a set with 1332 * strings: 1333 * <ul> 1334 * <li>The complement of the set contains the opposite set of code points, but the same set of strings. 1335 * Therefore, complementing both the set and the span conditions may yield different results. 1336 * <li>When starting spans 1337 * at different positions in a string (span(s, ...) vs. span(s+1, ...)) the ends of the spans may be different 1338 * because a set string may start before the later position. 1339 * <li>span(SIMPLE) may be shorter than 1340 * span(CONTAINED) because it will not recursively try all possible paths. For example, with a set which 1341 * contains the three strings "xy", "xya" and "ax", span("xyax", CONTAINED) will return 4 but span("xyax", 1342 * SIMPLE) will return 3. span(SIMPLE) will never be longer than span(CONTAINED). 1343 * <li>With either "contained" condition, span() and spanBack() may partition a string in different ways. For example, 1344 * with a set which contains the two strings "ab" and "ba", and when processing the string "aba", span() will yield 1345 * contained/not-contained boundaries of { 0, 2, 3 } while spanBack() will yield boundaries of { 0, 1, 3 }. 1346 * </ul> 1347 * Note: If it is important to get the same boundaries whether iterating forward or backward through a string, then 1348 * either only span() should be used and the boundaries cached for backward operation, or an ICU BreakIterator could 1349 * be used. 1350 * <p> 1351 * Note: Unpaired surrogates are treated like surrogate code points. Similarly, set strings match only on code point 1352 * boundaries, never in the middle of a surrogate pair. 1353 * 1354 * @stable ICU 4.4 1355 */ 1356 public enum SpanCondition { 1357 /** 1358 * Continues a span() while there is no set element at the current position. 1359 * Increments by one code point at a time. 1360 * Stops before the first set element (character or string). 1361 * (For code points only, this is like while contains(current)==false). 1362 * <p> 1363 * When span() returns, the substring between where it started and the position it returned consists only of 1364 * characters that are not in the set, and none of its strings overlap with the span. 1365 * 1366 * @stable ICU 4.4 1367 */ 1368 NOT_CONTAINED, 1369 1370 /** 1371 * Spans the longest substring that is a concatenation of set elements (characters or strings). 1372 * (For characters only, this is like while contains(current)==true). 1373 * <p> 1374 * When span() returns, the substring between where it started and the position it returned consists only of set 1375 * elements (characters or strings) that are in the set. 1376 * <p> 1377 * If a set contains strings, then the span will be the longest substring for which there 1378 * exists at least one non-overlapping concatenation of set elements (characters or strings). 1379 * This is equivalent to a POSIX regular expression for <code>(OR of each set element)*</code>. 1380 * (Java/ICU/Perl regex stops at the first match of an OR.) 1381 * 1382 * @stable ICU 4.4 1383 */ 1384 CONTAINED, 1385 1386 /** 1387 * Continues a span() while there is a set element at the current position. 1388 * Increments by the longest matching element at each position. 1389 * (For characters only, this is like while contains(current)==true). 1390 * <p> 1391 * When span() returns, the substring between where it started and the position it returned consists only of set 1392 * elements (characters or strings) that are in the set. 1393 * <p> 1394 * If a set only contains single characters, then this is the same as CONTAINED. 1395 * <p> 1396 * If a set contains strings, then the span will be the longest substring with a match at each position with the 1397 * longest single set element (character or string). 1398 * <p> 1399 * Use this span condition together with other longest-match algorithms, such as ICU converters 1400 * (ucnv_getUnicodeSet()). 1401 * 1402 * @stable ICU 4.4 1403 */ 1404 SIMPLE, 1405 } 1406 1407 }