Here is the proposed change to address the "broken canonical equivalent support" issue listed in JEP-111 [3].

CanonicalEquivalent:
https://bugs.openjdk.java.net/browse/JDK-4916384
https://bugs.openjdk.java.net/browse/JDK-4867170
https://bugs.openjdk.java.net/browse/JDK-6995635
https://bugs.openjdk.java.net/browse/JDK-6728861
https://bugs.openjdk.java.net/browse/JDK-6736245
https://bugs.openjdk.java.net/browse/JDK-7080302

The current regex CANON_EQ support is kind of broken, especially the implementation for the character class construct. It simply does not work as expected, as reported in the issues listed above.

How does it "work" now when the CANON_EQ flag is set? The current implementation

(1) first converts the whole pattern into Normalizer.Form.NFD form. For example, for the Greek extended character "\u1f80", we convert it to its NFD form as

    \u1f80 -> \u03b1\u0313\u0345

which has a base character \u03b1 (small alpha) followed by two non-spacing mark characters, \u0313 (combining comma above) and \u0345 (Greek iota below).

(2) then generates all the possible "permutations" of the characters inside the NFD string (based on the Unicode NFD/NFC normalization rules, the base character stays where it is, while the non-spacing mark characters can be in any order in text that is NOT normalized), which includes the possible new "combinations" of the individual characters.

    \u03b1\u0313\u0345
    \u03b1\u0345\u0313
    \u1f00\u0345  (new combination \u03b1\u0313 -> \u1f00)
    \u1fb3\u0313  (new combination \u03b1\u0345 -> \u1fb3)
    \u1f80

(3) finally constructs a pure group from the permutations, which will match any of the canonical equivalences, to replace the original single \u1f80

    (?:\u1f80|\u03b1\u0313\u0345|\u1f00\u0345|\u03b1\u0345\u0313|\u1fb3\u0313)

The resulting pure group, though it looks tedious, can match all the canonical equivalences (which are literally listed as the alternatives inside the pure group construct) of the Greek character \u1f80, especially in the "literal" case (a slice of characters).

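The decomposition in step (1) and the equivalence of the forms listed in step (2) can be checked outside the regex engine with java.text.Normalizer. The following is only an illustrative sketch, not code from the patch; the class name CanonEqForms and the helpers nfd and show are made up for this example.

```java
import java.text.Normalizer;

public class CanonEqForms {
    // Canonical decomposition (NFD) of the given string.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Render each UTF-16 code unit as U+XXXX so combining marks stay visible.
    static String show(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String target = nfd("\u1f80");
        // The five forms listed in step (2).
        String[] forms = {
            "\u03b1\u0313\u0345",
            "\u03b1\u0345\u0313",
            "\u1f00\u0345",
            "\u1fb3\u0313",
            "\u1f80"
        };
        for (String f : forms) {
            // Canonically equivalent strings share the same NFD form.
            System.out.println(show(f) + " -> " + show(nfd(f))
                    + "  same NFD as U+1F80: " + nfd(f).equals(target));
        }
    }
}
```

All five forms decompose to the same NFD string, which is why listing them as alternatives works for a literal.
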
For example, the pattern "A\u1f80B" matches successfully for input like

    "A\u03b1\u0313\u0345B"
    "A\u03b1\u0345\u0313B"
    "A\u1f00\u0345B"
    "A\u1fb3\u0313B"
    "A\u1f80B"

And it works fine even if you put it inside a character class construct [...]. The pattern "[\u1f80]" can successfully find its corresponding canonical equivalences in the above input strings.

But it starts to fail when you try a slightly more "complicated" character class. For example, the negation

    [^\u1f80A]

does not match A, but matches \u1f80 and all of its canonical equivalences. The range

    [\u1f80-\u1f82]

matches \u1f80, \u1f82 and their canonical equivalences, as expected, but it doesn't match \u1f81 (and its canonical equivalences).

The reason is that the current implementation converts the "character class" the same way as it does the "slice of text", so

    [^\u1f80A] --> (?:[^A]|\u03b1\u0313\u0345|\u1f00\u0345|\u1f80|\u03b1\u0345\u0313|\u1fb3\u0313|\u1f80)
    [\u1f80-\u1f82] --> (?:[-]|\u03b1\u0313\u0345|\u1f00\u0345|\u1f80|\u03b1\u0345\u0313|\u1fb3\u0313|\u1f80|\u03b1\u0313\u0300\u0345|...)

which really does not match what the original regex is meant to be. For a character class, you can't simply extract those composed characters and append them as alternations at the end.

The proposed changes here are

(1) instead of normalizing everything into NFD, normalize the character class part into NFC, as the "character class" really needs to match a "character", composed if possible. This can be interpreted as matching a grapheme cluster that can be normalized into an NFC form that matches this "character". For example, if you have a cluster "\u03b1\u0313\u0345" inside a character class, you really mean to match the character \u1f80 and/or its canonical equivalences.

(2) instead of trying to generate the permutations, create the alternation and put it into an appropriate place inside the class (logically), we now use a special "Node", the NFDCharProperty, to do the matching work.

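To illustrate why composing to NFC first makes ranges and negation behave, here is a rough sketch that treats a character class as a predicate over a single composed code point. It is not the NFDCharProperty implementation; the class NfcClassSketch and the helpers composed, inRange and notIn are hypothetical. As a simplification, a cluster that does not compose into exactly one code point is treated as a non-match.

```java
import java.text.Normalizer;

public class NfcClassSketch {
    // Hypothetical helper: NFC-normalize a cluster and return its single
    // code point, or -1 if it does not compose into exactly one code point.
    static int composed(String cluster) {
        String nfc = Normalizer.normalize(cluster, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length()) == 1 ? nfc.codePointAt(0) : -1;
    }

    // Models a range class [lo-hi] applied to the composed cluster.
    static boolean inRange(String cluster, int lo, int hi) {
        int cp = composed(cluster);
        return cp >= lo && cp <= hi;
    }

    // Models a negated class: the composed cluster must not be any excluded code point.
    static boolean notIn(String cluster, int... excluded) {
        int cp = composed(cluster);
        if (cp < 0) {
            return false; // simplification: uncomposable clusters never match
        }
        for (int e : excluded) {
            if (e == cp) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Range 1F80..1F82: a decomposed form of U+1F81 composes into the range.
        System.out.println(inRange("\u03b1\u0314\u0345", 0x1f80, 0x1f82));
        // Negation of {U+1F80, 'A'}: a reordered equivalent of U+1F80 is rejected.
        System.out.println(notIn("\u03b1\u0345\u0313", 0x1f80, 'A'));
        System.out.println(notIn("A", 0x1f80, 'A'));
        System.out.println(notIn("B", 0x1f80, 'A'));
    }
}
```

Because the test runs on the composed character, range and negation semantics stay those of an ordinary character class, with no alternation to splice in.
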
The NFDCharProperty tries to match a grapheme cluster (NFC greedily, then backtrack) against the character class. So for the character class

    [\u1f80] or [\u03b1\u0313\u0345]

the resulting "normalized" pattern is just

    [\u1f80]

and for the negation, the "normalized" pattern is a normal

    [^\u1f80]  for both [^\u1f80] and [^\u03b1\u0313\u0345]

When matching, for the input cluster "\u03b1\u0313\u0345", we NFC it first and then match it against the character class. If the input cluster has more trailing "non-spacing mark" characters than it should (so it might not be NFC-ed into something that matches [\u1f80]), we backtrack one character, then NFC again and try the match again.

It appears this approach fixes most of the troubles we have in character classes.

While the CANON_EQ support for character classes is the main issue we want to address this time, there are also a couple of other problems reported in the past that we are fixing together here.

(1) the current implementation only supports base+non_spacing_marks CANON_EQ; the canonical equivalence of decomposed Hangul (jamos) and composed Hangul syllables is NOT supported, for example "\u1100\u1161" vs "\uac00". Fixed.

(2) a character in the Unicode composition exclusion table does not match itself, as reported in JDK-6736245. A special composition sample:

    nfd(\u2adc) -> \u2add\u0338
    nfc(\u2add\u0338) -> \u2add\u0338  (NOT back to \u2adc)

Fixed.

(3) a regex syntax error/exception when compiling certain regexes. For example, "(\u00e9)" triggers

    Exception in thread "main" java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 11
    ((?:é)|é)|e)́)
               ^

Fixed.

(4) the canonical equivalence does not work for the property class:

    "\\p{IsGreek}" matches "\u1f80"
    "\\p{IsGreek}" but does not match "\u1f00\u0345\u0300"

Works as expected now.

Thanks,
Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
[2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039404.html
[3] http://openjdk.java.net/jeps/111