Oracle character set conversion downgrade checkup with utl_i18n


Doing an export/import or a ctas or an oci/jdbc client action into a database that has smaller sized character set, multi- to single-byte for example, will raise the problem of information loss in terms of character data. The information loss, however, is not limited to loosing one or the other character. Some characters may also become replaced by a default or a best guess replacement character in the target character set automatically.
Profound analysis of the outcome of the character set conversion might also cause difficulties when only the source database is already available and just an estimation of the information loss is what you need. You may come up with good old convert() to execute an input/output compare but aside from using convert() is discouraged with the latest releases of Oracle, convert() will also fracture the strings on input as soon as the first multibyte character appears. convert() will alert you of some problem but it will not tell you why or even better: what character is (first) the stumbling block.

Borrowing some functionality from the xml/html realm of Oracle may help us out here though. Namely utl_i18n.escape_reference() serves to generate (xml/html) documents in an encoding for future external use. Just the scenario we have while not generating (xml/html) documents actually but target database character set string column values. The following code and results shows anything to grasp the idea of populating an analysis baseline data set using utl_i18n.escape_reference(). Not only does utl_i18n.escape_reference() produce an input/output diff, it will also name the characters in question, up the the end of the string. Note, however, that utl_i18n.escape_reference(), originating from the xml/html realm, takes another preprocessing step of escaping the xml/html big5 characters. Since the big5 reside below position 127 in charcter sets, like US7ASCII, this should not be much of a problem anyway.

select str, str_t, utl_i18n.escape_reference(str_t, 'AL32UTF8') as str_utf_8
  , utl_i18n.escape_reference(str_t, 'WE8MSWIN1252') as str_we8_1252
  , utl_i18n.escape_reference(str_t, 'US7ASCII') as str_us7_asc
  , utl_i18n.escape_reference(str_t, 'WE8ISO8859P1') as str_we8_8859_1
  , utl_i18n.escape_reference(str_t, 'WE8ISO8859P15') as str_we8_8859_15
from (select str, translate(str, '<>&"''', '?????') as str_t
  from (select 'ä W ß € ğ á ā ç ã ø æ ’ <>&"''' as str from dual));

.

Col Value Comment
str ä W ß € ğ á a ç ã ø æ ’ <>&”‘ just a copy
str_t ä W ß € ğ á a ç ã ø æ ’ ????? the big5 being translated/invalidated
str_utf_8 ä W ß € ğ á a ç ã ø æ ’ ????? just checking what happens when converting from/to AL32UTF8
str_we8_1252 ä W ß € g á a ç ã ø æ ’ ????? “ğ” as an eastern europe entity has been silently converted by a defined replacement character of the WE8MSWIN1252 character set – not obvious at a first glance
str_us7_asc a W &#xdf; &#x20ac; g a a c &#xe3; &#xf8; &#xe6; &#x2019; ????? uuhmp – well, nothing more or less to expect, 7 bits only …
str_we8_8859_1 ä W ß &#x20ac; g á a ç ã ø æ &#x2019; ????? eastern europe “ğ” again like in the WE8MSWIN1252 case, also “€” comes in for it was invented after the introduction of WE8ISO8859P1, finally “’” as the windows single quote sign falls short
str_we8_8859_15 ä W ß € g á a ç ã ø æ &#x2019; ????? with WE8ISO8859P15, “€” is known now

have fun, peter

Advertisements

One comment

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s