Reading an xml file as an oracle external table


There is a couple of posts around that imply or claim an oracle external table to be able to read xml files by design. This is not true, really. It may work for you or actually may fit your current xml file structure but you should generally not approach to read xml by external tables. Here is why.

What external tables or sql loader for better can do actually is reading line oriented, file based data in bulk and quite fast. It offers a lot of settings to parametrize the loading process, the reader, according to your input format. You may for example have a look at the excellent series of posts discussing this options by Jiri starting here Oracle External Tables by Examples part 1 – TAB delimited fields.

Well, this functionality, may also be (mis)used to read xml instances from files. Given a file as the following one may be interested in extracting the contents of the token tags.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<FILE_INFO>
	<PERSON_BILD>
		<token_row><token>Fritz &amp; Fischer</token></token_row>
		<token_row><token>Boris Borsberg</token></token_row>
	</PERSON_BILD>
</FILE_INFO>

Some external table definition may then look like this:

create table test_ext (
 token char(256))
organization external (
  type oracle_loader
  default directory c_temp
  access parameters (
    records delimited by "</token_row>"
    badfile c_temp:'test.bad'
    logfile c_temp:'test.log'
    fields (
      delim char(2000) terminated by "<token_row>",
      token char(256) enclosed by "<token>" and "</token>" )
  )
  location ('test.xml')
)
reject limit unlimited
nomonitoring
/

This table or loader definition reads two fields from the file, delim and token. delim actually implements a /dev/null logic and will not be mentioned as a column of the table. That is, one is willing to discard any content the fits the definition of the delim field. The token field, on the other hand, explicitely cuts the eventual content from the token tag.

Iff you now go and execute a select on the external table you may realize the first pitfall of that approach. There is no xml decoding being done for you automatically, as any xml parser does.

select token from test_ext;
TOKEN
-------------------
Fritz &amp; Fischer
Boris Borsberg

You eventually have to enclose your column with some xml content decoding logic to achive your objective, probably like this or that, integrating xmltype or xmlparse respectively:

select extractvalue(xmltype('<tag>' || token || '</tag>'), '/tag/text()') as token from test_ext;
select extractvalue(xmlparse(CONTENT '<tag>' || token || '</tag>'), '/tag/text()') as token from test_ext;
TOKEN
---------------
Fritz & Fischer
Boris Borsberg

So far, anythings fine? Maybe. Iff you, however, really want to know what’s going on with the sql loader, just do name the delim field as a column of the external table and take a look what has been read in there.

drop table test_ext
/
create table test_ext (
 delim char(2000),
 token char(256))
organization external (
  type oracle_loader
  default directory c_temp
  access parameters (
    records delimited by "</token_row>"
    badfile c_temp:'test.bad'
    logfile c_temp:'test.log'
    fields (
      delim char(2000) terminated by "<token_row>",
      token char(256) enclosed by "<token>" and "</token>" )
  )
  location ('test.xml')
)
reject limit unlimited
nomonitoring
/

Now, on a select on both delim and token we get (control chars being made visible):

select trim(replace(replace(delim, chr(10), '\n'), chr(9), '\t')), token from test_ext;
DELIM                                                                                        TOKEN
-------------------------------------------------------------------------------------------  -------------------
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<FILE_INFO>\n\t<PERSON_BILD>\n\t\t  Fritz &amp; Fischer
\n\t\t                                                                                       Boris Borsberg

You see, delim eats up any uninteresting chars, including line breaks and tabs, out of the interesting section of the token tags. Inspecting the file test.bad on the server also shows that the tail of the xml file neither does match any definition of the external table statement and will be discarded as well.

	</PERSON_BILD>
</FILE_INFO>
</token_row>

So, what’s up now? A simple conclusion from the lessons learned is for example that you always need a row tag to read xml by an ext table. Something like the following will not serve a appropriate base format:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<FILE_INFO>
	<PERSON_BILD>
		<token>Fritz &amp; Fischer</token>
		<token>Boris Borsberg</token>
	</PERSON_BILD>
</FILE_INFO>

Also, iff the is a lot of content to be eaten up into the delim field you may get an overflow exception since the space of the external table fields / column is limited by sql.

Have fun.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s