A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization.
The Unicode version that is supported by the implementation
Hangul character boundaries and properties
All Unicode whitespace characters.
The BOM (byte order mark) can also be treated as whitespace: it is a non-rendering character used to signal whether the text is little or big endian. Byte order is not an issue in UTF-8, so the BOM must be ignored.
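As a rough illustration of treating the BOM as whitespace, the sketch below strips leading and trailing Unicode whitespace plus the BOM using only core Ruby. The method name strip_unicode_whitespace is hypothetical and is not part of ActiveSupport; it also relies on Ruby's [[:space:]] class rather than the whitespace list this module defines.

```ruby
# Hypothetical helper, not part of ActiveSupport: strips Unicode whitespace
# and the BOM (U+FEFF) from both ends of a UTF-8 string.
def strip_unicode_whitespace(string)
  # [[:space:]] matches Unicode whitespace when the string is UTF-8;
  # the BOM is added explicitly because it is not classified as whitespace.
  string.gsub(/\A[[:space:]\uFEFF]+|[[:space:]\uFEFF]+\z/, '')
end

puts strip_unicode_whitespace("\uFEFF\u00A0 hi\u3000\n").inspect
```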
The default normalization used for operations that require normalization. It can be set to any of the normalizations in NORMALIZATION_FORMS.
Example:
ActiveSupport::Multibyte::Unicode.default_normalization_form = :c
Compose decomposed characters to the composed form.
# File lib/active_support/multibyte/unicode.rb, line 167
def compose_codepoints(codepoints)
  pos = 0
  eoa = codepoints.length - 1
  starter_pos = 0
  starter_char = codepoints[0]
  previous_combining_class = -1
  while pos < eoa
    pos += 1
    lindex = starter_char - HANGUL_LBASE
    # -- Hangul
    if 0 <= lindex and lindex < HANGUL_LCOUNT
      vindex = codepoints[starter_pos+1] - HANGUL_VBASE rescue vindex = -1
      if 0 <= vindex and vindex < HANGUL_VCOUNT
        tindex = codepoints[starter_pos+2] - HANGUL_TBASE rescue tindex = -1
        if 0 <= tindex and tindex < HANGUL_TCOUNT
          j = starter_pos + 2
          eoa -= 2
        else
          tindex = 0
          j = starter_pos + 1
          eoa -= 1
        end
        codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
      end
      starter_pos += 1
      starter_char = codepoints[starter_pos]
    # -- Other characters
    else
      current_char = codepoints[pos]
      current = database.codepoints[current_char]
      if current.combining_class > previous_combining_class
        if ref = database.composition_map[starter_char]
          composition = ref[current_char]
        else
          composition = nil
        end
        unless composition.nil?
          codepoints[starter_pos] = composition
          starter_char = composition
          codepoints.delete_at pos
          eoa -= 1
          pos -= 1
          previous_combining_class = -1
        else
          previous_combining_class = current.combining_class
        end
      else
        previous_combining_class = current.combining_class
      end
      if current.combining_class == 0
        starter_pos = pos
        starter_char = codepoints[pos]
      end
    end
  end
  codepoints
end
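The Hangul branch of the composition can be worked through in isolation. The following is a minimal sketch, assuming the standard Hangul constants from the Unicode Standard; the module HangulComposeSketch and its compose method are hypothetical names used only for illustration.

```ruby
# Illustration only: HangulComposeSketch is not part of ActiveSupport.
module HangulComposeSketch
  # Base codepoints and counts for algorithmic Hangul composition.
  SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
  VCOUNT, TCOUNT = 21, 28

  module_function

  # Combine a leading consonant, a vowel and an optional trailing consonant
  # into a single precomposed syllable codepoint.
  def compose(l, v, t = nil)
    lindex = l - LBASE
    vindex = v - VBASE
    tindex = t ? t - TBASE : 0
    (lindex * VCOUNT + vindex) * TCOUNT + tindex + SBASE
  end
end

# HIEUH + A + NIEUN: (18 * 21 + 0) * 28 + 4 + 0xAC00 = 0xD55C
syllable = HangulComposeSketch.compose(0x1112, 0x1161, 0x11AB)
puts format('U+%04X', syllable)   # U+D55C
```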
Decompose composed characters to the decomposed form.
# File lib/active_support/multibyte/unicode.rb, line 146
def decompose_codepoints(type, codepoints)
  codepoints.inject([]) do |decomposed, cp|
    # if it's a hangul syllable starter character
    if HANGUL_SBASE <= cp and cp < HANGUL_SLAST
      sindex = cp - HANGUL_SBASE
      ncp = [] # new codepoints
      ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
      ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
      tindex = sindex % HANGUL_TCOUNT
      ncp << (HANGUL_TBASE + tindex) unless tindex == 0
      decomposed.concat ncp
    # if the codepoint is decomposable in with the current decomposition type
    elsif (ncp = database.codepoints[cp].decomp_mapping) and (!database.codepoints[cp].decomp_type || type == :compatability)
      decomposed.concat decompose_codepoints(type, ncp.dup)
    else
      decomposed << cp
    end
  end
end
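The Hangul arithmetic in the decomposition is easier to follow with concrete numbers. This is a minimal sketch under the assumption of the standard Unicode Hangul constants; HangulDecomposeSketch and its decompose method are hypothetical names, and non-Hangul codepoints are simply passed through.

```ruby
# Illustration only: HangulDecomposeSketch is not part of ActiveSupport.
module HangulDecomposeSketch
  SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
  VCOUNT, TCOUNT = 21, 28
  NCOUNT = VCOUNT * TCOUNT   # 588 syllables per leading consonant
  SCOUNT = 19 * NCOUNT       # 11172 precomposed syllables in total

  module_function

  # Split a precomposed syllable into its jamo; other codepoints are
  # returned unchanged in a one-element array.
  def decompose(cp)
    sindex = cp - SBASE
    return [cp] unless (0...SCOUNT).cover?(sindex)

    jamo = [LBASE + sindex / NCOUNT, VBASE + (sindex % NCOUNT) / TCOUNT]
    tindex = sindex % TCOUNT
    jamo << TBASE + tindex unless tindex.zero?
    jamo
  end
end

p HangulDecomposeSketch.decompose(0xD55C)   # [4370, 4449, 4523]
```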
Reverse operation of g_unpack.
Example:
Unicode.g_pack(Unicode.g_unpack('क्षि')) # => 'क्षि'
# File lib/active_support/multibyte/unicode.rb, line 125
def g_pack(unpacked)
  (unpacked.flatten).pack('U*')
end
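The same flatten-and-pack step can be tried with core Ruby alone. The cluster list below is made-up sample data (the letters of "Cafe" followed by a combining acute accent), not output taken from the library.

```ruby
# Sample data: the last cluster holds "e" plus U+0301 COMBINING ACUTE ACCENT.
clusters = [[67], [97], [102], [101, 769]]

# Flatten the nested lists and pack the codepoints as UTF-8.
string = clusters.flatten.pack('U*')

puts string
puts string.encoding   # UTF-8
```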
Unpack the string at grapheme boundaries. Returns a list of character lists.
Example:
Unicode.g_unpack('क्षि') # => [[2325, 2381], [2359], [2367]]
Unicode.g_unpack('Café') # => [[67], [97], [102], [233]]
# File lib/active_support/multibyte/unicode.rb, line 91
def g_unpack(string)
  codepoints = u_unpack(string)
  unpacked = []
  pos = 0
  marker = 0
  eoc = codepoints.length
  while(pos < eoc)
    pos += 1
    previous = codepoints[pos-1]
    current = codepoints[pos]
    if (
        # CR X LF
        ( previous == database.boundary[:cr] and current == database.boundary[:lf] ) or
        # L X (L|V|LV|LVT)
        ( database.boundary[:l] === previous and in_char_class?(current, [:l,:v,:lv,:lvt]) ) or
        # (LV|V) X (V|T)
        ( in_char_class?(previous, [:lv,:v]) and in_char_class?(current, [:v,:t]) ) or
        # (LVT|T) X (T)
        ( in_char_class?(previous, [:lvt,:t]) and database.boundary[:t] === current ) or
        # X Extend
        (database.boundary[:extend] === current)
      )
    else
      unpacked << codepoints[marker..pos-1]
      marker = pos
    end
  end
  unpacked
end
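For comparison, a similar result can be approximated with core Ruby's String#grapheme_clusters (Ruby 2.5 and later). This is a sketch, not the library's algorithm: cluster_codepoints is a hypothetical name, and Ruby's built-in segmentation follows a newer Unicode version, so results for some scripts may differ from the examples above.

```ruby
# Hypothetical helper: split a string into grapheme clusters and return the
# codepoints of each cluster as a nested array.
def cluster_codepoints(string)
  string.grapheme_clusters.map { |cluster| cluster.unpack('U*') }
end

p cluster_codepoints("Caf\u00E9")     # [[67], [97], [102], [233]]
p cluster_codepoints("e\u0301\r\n")   # [[101, 769], [13, 10]]
```

The second call shows two of the rules from the listing: a combining mark stays attached to its base character, and CR LF is kept together as one cluster.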
Detect whether the codepoint is in a certain character class. Returns true when it’s in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
# File lib/active_support/multibyte/unicode.rb, line 82
def in_char_class?(codepoint, classes)
  classes.detect { |c| database.boundary[c] === codepoint } ? true : false
end
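The lookup relies on === working for both single codepoints and ranges. The sketch below uses a small, hand-written boundary table as an assumption; the real table comes from the library's Unicode database and covers more classes and codepoints. BOUNDARY_SKETCH and in_class_sketch? are hypothetical names.

```ruby
# Hypothetical, truncated boundary table: single codepoints for CR and LF,
# ranges for the Hangul Jamo leading (L), vowel (V) and trailing (T) classes.
BOUNDARY_SKETCH = {
  cr: 0x0D,
  lf: 0x0A,
  l:  0x1100..0x115F,
  v:  0x1160..0x11A7,
  t:  0x11A8..0x11FF
}.freeze

# Integer#=== tests equality and Range#=== tests membership, so one
# expression handles both kinds of table entry.
def in_class_sketch?(codepoint, classes)
  classes.any? { |c| BOUNDARY_SKETCH[c] === codepoint }
end

p in_class_sketch?(0x1100, [:l, :v])   # true
p in_class_sketch?(0x0A, [:cr])        # false
```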
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
string - The string to perform normalization on.
form - The form to normalize to. Should be one of the following: :c, :kc, :d, or :kd. Defaults to ActiveSupport::Multibyte::Unicode.default_normalization_form.
# File lib/active_support/multibyte/unicode.rb, line 283
def normalize(string, form=nil)
  form ||= @default_normalization_form
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = u_unpack(string)
  case form
    when :d
      reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :c
      compose_codepoints(reorder_characters(decompose_codepoints(:canonical, codepoints)))
    when :kd
      reorder_characters(decompose_codepoints(:compatability, codepoints))
    when :kc
      compose_codepoints(reorder_characters(decompose_codepoints(:compatability, codepoints)))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*')
end
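To see how the four forms differ, the sketch below runs one sample string through core Ruby's String#unicode_normalize (Ruby 2.2 and later) instead of this module. The FORM_MAP constant is a hypothetical mapping from the short form names used here to Ruby's :nfc, :nfd, :nfkc and :nfkd.

```ruby
# Hypothetical mapping from this module's form names to core Ruby's names.
FORM_MAP = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

# U+FB01 LATIN SMALL LIGATURE FI, then "e" with U+0301 COMBINING ACUTE ACCENT.
sample = "\uFB01e\u0301"

FORM_MAP.each do |short, ruby_form|
  puts "#{short}: #{sample.unicode_normalize(ruby_form).unpack('U*').inspect}"
end
# c:  [64257, 233]
# d:  [64257, 101, 769]
# kc: [102, 105, 233]
# kd: [102, 105, 101, 769]
```

The canonical forms keep the ligature intact, while the compatibility forms replace it with the plain letters "f" and "i".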
Re-order codepoints so the string becomes canonical.
# File lib/active_support/multibyte/unicode.rb, line 130
def reorder_characters(codepoints)
  length = codepoints.length- 1
  pos = 0
  while pos < length do
    cp1, cp2 = database.codepoints[codepoints[pos]], database.codepoints[codepoints[pos+1]]
    if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
      codepoints[pos..pos+1] = cp2.code, cp1.code
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  codepoints
end
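Canonical reordering swaps adjacent combining marks until their combining classes are in ascending order, and never moves a mark across a starter (class 0). A minimal sketch, assuming a two-entry combining class table in place of the full Unicode database; CCC_SKETCH and reorder_sketch are hypothetical names.

```ruby
# Hypothetical, truncated table of canonical combining classes.
# Codepoints that are not listed default to class 0 (starters).
CCC_SKETCH = Hash.new(0).update(
  0x0301 => 230,   # COMBINING ACUTE ACCENT (above)
  0x0323 => 220    # COMBINING DOT BELOW
)

# Swap adjacent marks that are out of order, stepping back after a swap so
# the moved mark is compared with its new left neighbour.
def reorder_sketch(codepoints)
  cps = codepoints.dup
  pos = 0
  while pos < cps.length - 1
    left, right = CCC_SKETCH[cps[pos]], CCC_SKETCH[cps[pos + 1]]
    if left > right && right > 0
      cps[pos], cps[pos + 1] = cps[pos + 1], cps[pos]
      pos -= 1 if pos > 0
    else
      pos += 1
    end
  end
  cps
end

p reorder_sketch([0x61, 0x0301, 0x0323])   # [97, 803, 769]
```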
Replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.
Passing true will forcibly tidy all bytes, assuming that the string’s encoding is entirely CP1252 or ISO-8859-1.
# File lib/active_support/multibyte/unicode.rb, line 228
def tidy_bytes(string, force = false)
  if force
    return string.unpack("C*").map do |b|
      tidy_byte(b)
    end.flatten.compact.pack("C*").unpack("U*").pack("U*")
  end

  bytes = string.unpack("C*")
  conts_expected = 0
  last_lead = 0

  bytes.each_index do |i|

    byte          = bytes[i]
    is_cont       = byte > 127 && byte < 192
    is_lead       = byte > 191 && byte < 245
    is_unused     = byte > 240
    is_restricted = byte > 244

    # Impossible or highly unlikely byte? Clean it.
    if is_unused || is_restricted
      bytes[i] = tidy_byte(byte)
    elsif is_cont
      # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
      conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
    else
      if conts_expected > 0
        # Expected continuation, but got ASCII or leading? Clean backwards up to
        # the leading byte.
        (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
        conts_expected = 0
      end
      if is_lead
        # Final byte is leading? Clean it.
        if i == bytes.length - 1
          bytes[i] = tidy_byte(bytes.last)
        else
          # Valid leading byte? Expect continuations determined by position of
          # first zero bit, with max of 3.
          conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
          last_lead = i
        end
      end
    end
  end
  bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end
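A comparable clean-up can be sketched with core Ruby's String#scrub (Ruby 2.1 and later): valid UTF-8 sequences are kept and every invalid byte sequence is reinterpreted as CP1252. This is an approximation under that assumption, not the byte-by-byte algorithm shown above, and tidy_sketch is a hypothetical name.

```ruby
# Hypothetical helper: keep valid UTF-8, convert stray bytes from CP1252.
def tidy_sketch(string)
  string.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    # Bytes that are undefined in CP1252 (such as 0x81) become U+FFFD.
    bad.encode(Encoding::UTF_8, Encoding::Windows_1252, undef: :replace)
  end
end

puts tidy_sketch("Caf\xE9".b)       # the 0xE9 byte becomes U+00E9
puts tidy_sketch("\x93hi\x94".b)    # CP1252 curly quotes become U+201C and U+201D
```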
Unpack the string at codepoint boundaries. Raises an EncodingError when the string is not valid UTF-8.
Example:
Unicode.u_unpack('Café') # => [67, 97, 102, 233]
# File lib/active_support/multibyte/unicode.rb, line 69
def u_unpack(string)
  begin
    string.unpack 'U*'
  rescue ArgumentError
    raise EncodingError, 'malformed UTF-8 character'
  end
end
Generated with the Darkfish Rdoc Generator 1.1.6.