Doing an export/import or a ctas or an oci/jdbc client action into a database that has smaller sized character set, multi- to single-byte for example, will raise the problem of information loss in terms of character data. The information loss, however, is not limited to loosing one or the other character. Some characters may also become replaced by a default or a best guess replacement character in the target character set automatically.
Profound analysis of the outcome of the character set conversion might also cause difficulties when only the source database is already available and just an estimation of the information loss is what you need. You may come up with good old
convert() to execute an input/output compare but aside from using
convert() is discouraged with the latest releases of Oracle,
convert() will also fracture the strings on input as soon as the first multibyte character appears.
convert() will alert you of some problem but it will not tell you why or even better: what character is (first) the stumbling block.
Borrowing some functionality from the xml/html realm of Oracle may help us out here though. Namely
utl_i18n.escape_reference() serves to generate (xml/html) documents in an encoding for future external use. Just the scenario we have while not generating (xml/html) documents actually but target database character set string column values. The following code and results shows anything to grasp the idea of populating an analysis baseline data set using
utl_i18n.escape_reference(). Not only does
utl_i18n.escape_reference() produce an input/output diff, it will also name the characters in question, up the the end of the string. Note, however, that
utl_i18n.escape_reference(), originating from the xml/html realm, takes another preprocessing step of escaping the xml/html big5 characters. Since the big5 reside below position 127 in charcter sets, like US7ASCII, this should not be much of a problem anyway.
select str, str_t, utl_i18n.escape_reference(str_t, 'AL32UTF8') as str_utf_8 , utl_i18n.escape_reference(str_t, 'WE8MSWIN1252') as str_we8_1252 , utl_i18n.escape_reference(str_t, 'US7ASCII') as str_us7_asc , utl_i18n.escape_reference(str_t, 'WE8ISO8859P1') as str_we8_8859_1 , utl_i18n.escape_reference(str_t, 'WE8ISO8859P15') as str_we8_8859_15 from (select str, translate(str, '<>&"''', '?????') as str_t from (select 'ä W ß € ğ á ā ç ã ø æ ’ <>&"''' as str from dual));
|str||ä W ß € ğ á a ç ã ø æ ’ <>&”‘||just a copy|
|str_t||ä W ß € ğ á a ç ã ø æ ’ ?????||the big5 being translated/invalidated|
|str_utf_8||ä W ß € ğ á a ç ã ø æ ’ ?????||just checking what happens when converting from/to AL32UTF8|
|str_we8_1252||ä W ß € g á a ç ã ø æ ’ ?????||“ğ” as an eastern europe entity has been silently converted by a defined replacement character of the WE8MSWIN1252 character set – not obvious at a first glance|
|str_us7_asc||a W ß € g a a c ã ø æ ’ ?????||uuhmp – well, nothing more or less to expect, 7 bits only …|
|str_we8_8859_1||ä W ß € g á a ç ã ø æ ’ ?????||eastern europe “ğ” again like in the WE8MSWIN1252 case, also “€” comes in for it was invented after the introduction of WE8ISO8859P1, finally “’” as the windows single quote sign falls short|
|str_we8_8859_15||ä W ß € g á a ç ã ø æ ’ ?????||with WE8ISO8859P15, “€” is known now|
have fun, peter