Running an Oracle database with multibyte Unicode storage such as AL32UTF8 is, disregarding the char vs. byte column length topic, actually no different from the old days of single byte storage, e.g. in WE8MSWIN1252. However, any job that involves some sort of character conversion between character, decimal and hex representations does require at least a basic understanding of the available Unicode storage options and SQL functions in Oracle. To me, the main reason for the common problems is the mismatch imposed by Oracle's impure layout of SQL functions such as unistr with respect to the database and the national character set.
The following has been executed on 11gR2 on win64, using these database and national character set storage options:
SQL> select * from NLS_database_PARAMETERS where parameter like '%CHARACTERSET%';

PARAMETER                VALUE
------------------------ ----------
NLS_CHARACTERSET         AL32UTF8
NLS_NCHAR_CHARACTERSET   AL16UTF16
The necessary test data only needs to comprise two rows, featuring two characters, one below and one above the 7-bit ASCII code point border of (dec) 127 and (hex) 007F. That is, “LATIN SMALL LETTER A WITH DIAERESIS” and plain “LATIN SMALL LETTER A”, like this. Columns dec_val and hex_val become populated with the dec (number) and the hex (char) representation of the actual character.
create table pewe_cs (
  val     varchar2(4000),
  dec_val number,
  hex_val varchar2(4000)
);

insert into pewe_cs (val, dec_val, hex_val) values ('ä', ascii('ä'), asciistr('ä'));
insert into pewe_cs (val, dec_val, hex_val) values ('a', ascii('a'), asciistr('a'));
commit;
Upon an initial select, one may already stop in astonishment … in fact, one may ask: why is there a different hex notation for the two characters, and for what reason does a hex value of 00E4 correspond to a dec value of 50084??? First answer: well, asciistr just returns the character in question itself for any ASCII-like item of (dec) <= 127, and the hex representation otherwise, phew! Second answer: well(2), ascii returns the dec value with respect to the UTF-8 Unicode storage of the database, whereas asciistr delivers the hex value with respect to the overall UTF-16-based Unicode standard, phew(2)!
SQL> column val format a10
SQL> column dec_val format 9999999999
SQL> column hex_val format a10
SQL> select val, dec_val, hex_val from pewe_cs order by val;

VAL            DEC_VAL HEX_VAL
---------- ----------- ----------
a                   97 a
ä                50084 \00E4
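The arithmetic behind that 50084 can be replayed outside the database. The following Python snippet (just an illustration, not Oracle code) re-derives both columns for 'ä': Oracle's ascii reads the character's byte representation in the database character set as one big-endian integer, while asciistr reports the UTF-16 code point in hex.

```python
# ascii('ä') in an AL32UTF8 database: the UTF-8 bytes 0xC3 0xA4,
# read as a single big-endian integer.
utf8_bytes = 'ä'.encode('utf-8')           # b'\xc3\xa4'
dec_val = int.from_bytes(utf8_bytes, 'big')
print(dec_val)                             # 50084 (= 0xC3A4)

# asciistr('ä'), in contrast, reports the UTF-16 code point in hex:
print('\\%04X' % ord('ä'))                 # \00E4
```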
Want more quirks? There is another SQL function, namely unistr, that promises a reconversion counterpart to asciistr (at least imho). The next spell is, however, that unistr returns the character encoded in the national, and no longer in the original database, character set. Surrounding the expression with dump (see below) is one means to show the difference:
SQL> select val, dec_val, hex_val, unistr(hex_val) as uni,
  2         ascii(unistr(hex_val)) as dec_uni
  3  from pewe_cs order by val;

VAL            DEC_VAL HEX_VAL    UNI            DEC_UNI
---------- ----------- ---------- ---------- -----------
a                   97 a          a                   97
ä                50084 \00E4      ä                  228
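Again, the byte arithmetic can be checked outside Oracle. This Python sketch (illustrative only) mimics what ascii sees after unistr has switched to the national character set AL16UTF16:

```python
# unistr() hands back the character in the national character set
# AL16UTF16, where 'ä' occupies the two bytes 0x00 0xE4. Read as an
# integer, the leading zero byte vanishes and ascii() reports 228.
utf16_bytes = 'ä'.encode('utf-16-be')      # b'\x00\xe4'
print(int.from_bytes(utf16_bytes, 'big'))  # 228
```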
It has to be noted, though, that the hex-to-char and vice versa conversion using unistr does produce equivalent, i.e. comparable, values despite the different character sets. The same applies to directly comparing the round trip unistr(asciistr(val)) against the original val:
SQL> select unistr(asciistr(val)) as uni,
  2         decode(unistr(asciistr(val)), val, 'ok', 'fishy') as cmp_uni
  3  from pewe_cs order by val;

UNI        CMP_UNI
---------- ----------
a          ok
ä          ok
So what? There is a very comprehensive introduction to Unicode from Joel Spolsky, published as The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Read it (down to the end)! One technically full-fledged option is to follow the excellent approach introduced by Joseph Fuda and published as Unicode Code Points in SQL or PL/SQL. Some more insight into the Unicode trade has also been presented by Sanjay Jha in the thread item Look at DOC Id 69518.1 in Metalink for a detailed answer to this query. Another common, Google-pushed source is Flavio Casetta’s article ASCII, ASCIISTR, CHR, UNISTR: many ways of converting characters, which, on the contrary, only scratches the surface of the arising questions and does not tackle the problem of the underlying character encoding.
Another (poor man’s) approach is to explicitly convert the unistr result back to the database character set iff any further operation of ascii and possibly chr is necessary. Beware, however, that this employs a priori knowledge of the database character set in the call to convert and may not be portable:
SQL> select (case when (ascii(unistr(hex_val)) > 127)
  2         then chr(ascii(convert(unistr(hex_val), 'AL32UTF8')))
  3         else chr(ascii(unistr(hex_val)))
  4         end) as chr_dec_conv_uni
  5  from pewe_cs order by val;

CHR_DEC_CO
----------
a
ä
Last but not least, there is a SQL function called dump that shows the character set of some char value along with the hex (1016) or dec (1010) representation (an extension to utl_i18n.string_to_raw, which leaves the character set out). This also reveals that applying chr(ascii(...)) to a non-converted unistr result, as in dump(chr(ascii(unistr(hex_val))), 1016), produces Unicode garbage, showing UTF-16 code points but claiming UTF-8 encoding in column DUMP_CHR_DEC_UNI:
SQL> select val, dump(val, 1016) as dump_val, dump(unistr(hex_val), 1016) as dump_uni,
  2         dump(chr(ascii(unistr(hex_val))), 1016) as dump_chr_dec_uni
  3  from pewe_cs order by val;

VAL        DUMP_VAL               DUMP_UNI               DUMP_CHR_DEC_UNI
---------- ---------------------- ---------------------- ----------------------
a          Typ=1 Len=1 CharacterS Typ=1 Len=2 CharacterS Typ=1 Len=1 CharacterS
           et=AL32UTF8: 61        et=AL16UTF16: 0,61     et=AL32UTF8: 61

ä          Typ=1 Len=2 CharacterS Typ=1 Len=2 CharacterS Typ=1 Len=1 CharacterS
           et=AL32UTF8: c3,a4     et=AL16UTF16: 0,e4     et=AL32UTF8: e4
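These byte sequences can be re-derived outside the database as well. The following Python lines (illustrative only) reproduce all three dump columns for 'ä':

```python
# 'ä' in the database character set AL32UTF8 (dump_val: c3,a4) ...
print(['%x' % b for b in 'ä'.encode('utf-8')])         # ['c3', 'a4']
# ... in the national character set AL16UTF16 (dump_uni: 0,e4) ...
print(['%x' % b for b in 'ä'.encode('utf-16-be')])     # ['0', 'e4']
# ... and the lone byte that chr(228) leaves behind (dump_chr_dec_uni: e4),
# which is no valid UTF-8 sequence on its own.
print(['%x' % b for b in chr(228).encode('latin-1')])  # ['e4']
```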
For the curious, removing the dump call from column DUMP_CHR_DEC_UNI reliably produces the famous:
SQL> select val, chr(ascii(unistr(hex_val))) as chr_dec_uni from pewe_cs order by val;
ERROR:
ORA-29275: partial multibyte character
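That ORA-29275 has a direct analogue outside the database: the single byte 0xE4 produced by chr(228) is simply not a complete UTF-8 sequence. In Python (illustrative only):

```python
# Decoding the lone byte 0xE4 as UTF-8 fails, just like Oracle choking
# on the partial multibyte character produced by chr(228).
try:
    bytes([0xE4]).decode('utf-8')
except UnicodeDecodeError as exc:
    print('partial multibyte character:', exc.reason)
```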
Phew, you do remember that CLOB columns always use a UCS-2, aka UTF-16, encoding (see Storing Varying-Width Character Data in LOBs), such that all of the above may change again, don’t you, heehh?
have fun p