Just another discussion of unicode character conversion for oracle


Running an oracle in multibyte unicode storage like AL32UTF8, disregarding the char and byte column length topic, is actually no different from the old days single byte storage, e.g. in WE8MSWIN1252. However, any job that includes sort of character conversion in terms of character, decimal and hex reprasentations, does require at least a basic understanding of available unicode storage options and sql functions with oracle. To me, the main reason of common problems is the mismatch being imposed by oracle’s impure layout of the sql functions ascii, asciistr, chr, nchr and unistr concerning the database and the national characterset.

The following has been executed on a 11gR2 on win64 using these database and the national characterset storage options.

SQL> select * from NLS_database_PARAMETERS where parameter like '%CHARACTERSET%';
PARAMETER                VALUE
---------                -----
NLS_CHARACTERSET         AL32UTF8
NLS_NCHAR_CHARACTERSET   AL16UTF16

Necessary test data only needs to comprise two rows, featuring two characters, one below and one above the 7 bit ascii code point border of (dec) 127 and (hex) 007F. That is, “LATIN SMALL LETTER A WITH DIAERESIS” and just “LATIN SMALL LETTER A” like this. Columns dec_val and hex_val become populated with the dec (number) and the hex (char) reprasentation of the actual character.

create table pewe_cs (
  val varchar2(4000),
  dec_val number,
  hex_val varchar2(4000)
);

insert into pewe_cs (val, dec_val, hex_val) values ('ä', ascii('ä'), asciistr('ä'));
insert into pewe_cs (val, dec_val, hex_val) values ('a', ascii('a'), asciistr('a'));
commit;

Upon an initial select, one may already go stuck for astonishment … in fact, one may ask: why is there a different hex notation for the two characters and for what reason does a hex value of 00E4 evaluate to a dec value of 50084??? First answer: well, asciistr just returns the character in question for any ascii like item of (dec) <= 127 and the hex reprasentation otherwise, puh! Second answer: well(2), ascii returns a dec value regarding utf8 unicode storage whereas asciistr delivers the hex regarding the overall utf16-based unicode standard, puh(2)!.

SQL> column val format a10
SQL> column dec_val format 9999999999
SQL> column hex_val format a10
SQL> select val, dec_val, hex_val from pewe_cs order by val;

VAL            DEC_VAL HEX_VAL
---------- ----------- ----------
a                   97 a
ä                50084 \00E4

Want more quirks? There is another sql function, namely unistr that promises a reconversion against asciistr (at least imho). The next spell is, however, that unistr returns the character being encoded in the national and no longer in the original database characterset. Sorrounding unistr with ascii (or dump, see below) is one means to show the difference:

SQL> select val, dec_val, hex_val, unistr(hex_val) as uni,
  ascii(unistr(hex_val)) as dec_uni
from pewe_cs order by val;

VAL            DEC_VAL HEX_VAL    UNI            DEC_UNI
---------- ----------- ---------- ---------- -----------
a                   97 a          a                   97
ä                50084 \00E4      ä                  228

It has to be noted though that the hex to char and vice versa conversation using asciistr and unistr does produce equivalent, i.e. comparable values, despite the different character sets. The same aplies to ascii, chr and nchr, respectively.

SQL> select unistr(asciistr(val)) as uni,
  decode(unistr(asciistr(val)), val, 'ok', 'fishy') as cmp_uni
from pewe_cs order by val;

UNI        CMP_UNI
---------- ----------
a          ok
ä          ok

So what? There is a very comprehensive introduction to unicode from Joel Spolsky as The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Read it (down to the end)! One technically fool-fledged option is to follow the excellent approach introduced by Joseph Fuda and being published as Unicode Code Points in SQL or PL/SQL. Some more insight to unicode trade has also been presented Sanjay Jha in thread item Look at DOC Id 69518.1 in Metalink for a detailed answer to this query. Another common, google pushed source is Flavio Casetta’s article ASCII, ASCIISTR, CHR, UNISTR: many ways of converting characters, that on the contrary only scapes the surface of the arising questions and does not tackle the problem of the underlying character encoding.

Another (poor man’s) approach is to convert the unistr result to the database characterset explicitely iff any further opration of ascii and probably chr is nessecary. Beware however, that this employs an a priori knowledge of the database characterset in the call to convert and may not be portable (you may use column names with the second parameter of convert nevertheless).

SQL> select (case when(ascii(unistr(hex_val)) > 127)
  then chr(ascii(convert(unistr(hex_val), 'AL32UTF8')))
  else chr(ascii(unistr(hex_val))) end) as chr_dec_conv_uni
from pewe_cs order by val;

CHR_DEC_CO
----------
a
ä

Last but not least, there is a sql function called dump that shows the characterset of some char value along with the hex (1016) or dec (1010) reprasentation (an extension to utl_i18n.string_to_raw, that leaves the characterset out). This also reveals that applying chr and ascii to a non-converted unistr result as in dump(chr(ascii(unistr(hex_val))), 1016) produces unicode garbage, showing utf16 code points and claiming utf8 encoding in column DUMP_CHR_DEC_UNI.

SQL> select val, dump(val, 1016) as dump_val, dump(unistr(hex_val), 1016) as dump_uni, dump(chr(ascii(unistr(hex_val))), 1016) as dump_chr_d
ec_uni from pewe_cs order by val;

VAL        DUMP_VAL               DUMP_UNI               DUMP_CHR_DEC_UNI
---------- ---------------------- ---------------------- ----------------------
a          Typ=1 Len=1 CharacterS Typ=1 Len=2 CharacterS Typ=1 Len=1 CharacterS
           et=AL32UTF8: 61        et=AL16UTF16: 0,61     et=AL32UTF8: 61

ä          Typ=1 Len=2 CharacterS Typ=1 Len=2 CharacterS Typ=1 Len=1 CharacterS
           et=AL32UTF8: c3,a4     et=AL16UTF16: 0,e4     et=AL32UTF8: e4

For the curious, removing the dump call from column DUMP_CHR_DEC_UNI dedicatedly produces the famous ORA-29275 exception.

SQL> select val, chr(ascii(unistr(hex_val))) as chr_dec_uni from pewe_cs order by val;
ERROR: ORA-29275: Unvollständiges Mehr-Byte-Zeichen

Phew, you remember, that clob-columns always use a ucs2 aka utf16 encoding (Storing Varying-Width Character Data in LOBs) and all lineup the above may change, do you, heehh?

have fun p

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s