Using the DataImportHandler XPathEntityProcessor on a Database Resultset Column


The Solr documentation for XPathEntityProcessor introduces a spezialization subtype of EntityProcessor that is primarily depicted to process data (to be) imported from xml/http-datasources (for example, Usage with XML/HTTP Datasource). However, using XPathEntityProcessor on a FieldReaderDataSource instead on the original URLDataSource or !HttpDataSource (search for FieldReaderDataSource in Uploading Structured Data Store Data with the Data Import Handler) enables reading xml instances contained in columns delivered from database requests through SqlEntityProcessor.
Bewildered out of words and meanings…? Don’t worry, the following will give you a living example of how to craft the xml from an Oracle database easily and what to do on the Solr side to map the information datums into indexing fields. To me, this is really a nice example of how to employ xml in a true sense of a defined (well-forming, encoding) data exchange layer, hiding most if not all of the implementation details of xml processing on the database and on the search-engine. Note however, that this great time-to-market, through xml processing technically, always comes at a certain extra cost such that the xml-instances shall not become to large for this solution pattern. I will also use xml attributes for small size values instead of tags in the xml generation as one step of optimization.

The statement below provides a general pattern concerning lightweight and easy xml generation for our purposes. A number of rows (5) will be populated withing a simple (root) rowset (4), xmlserialize (2) then dumps the memory structure into a clob. As just one optimization, mentioned above, small size values go as row attributes, which saves two dom node reads per value on (re-)parsing, and any larger size values will be flattened as regular tags. Great to have xmlattributes and xmlforest accept one to multiple values (columns) such that the statements looks very sql-like and easy to read. Of course, you may go into detail further, specifying non-generated attribute names and the like, but that’s not part of the discussion here.

with tmp as (select * from user_objects where rownum < 4)
select xmlserialize(
  content
    xmlelement("rowset",
      xmlagg(xmlelement("row",
        xmlattributes(object_id, object_name, 'abc&"' as my_long_text),
        xmlforest(subobject_name, created, 'abc&"' as my_long_text))))
  as clob) from tmp t1;

The dump of the xml instance just for a review.

<rowset>
  <row OBJECT_ID="101699" OBJECT_NAME="QUERY" MY_LONG_TEXT="abc&amp;&quot;">
    <CREATED>2016-07-21</CREATED>
    <MY_LONG_TEXT>abc&amp;&quot;</MY_LONG_TEXT>
  </row>
  <row OBJECT_ID="101701" OBJECT_NAME="SYS_IL0000101699C00005$$"
      MY_LONG_TEXT="abc&amp;&quot;">
    <CREATED>2016-07-21</CREATED>
    <MY_LONG_TEXT>abc&amp;&quot;</MY_LONG_TEXT>
  </row>
  <row OBJECT_ID="101700" OBJECT_NAME="SYS_LOB0000101699C00005$$"
      MY_LONG_TEXT="abc&amp;&quot;">
    <CREATED>2016-07-21</CREATED>
    <MY_LONG_TEXT>abc&amp;&quot;</MY_LONG_TEXT>
  </row>
</rowset>

On the Solr side then, in dataimport-config.xml, we need another datasource of type FieldReaderDataSource (7) along with the original jdbc one (3). In (11), see xml_col, the xml will be fetched from the database along with other root entity columns. The entity starting on (14) actually does the xml processing as well as mapping the information datums into indexing fields, placing a couple of xpath expressions for the node loop, see forEach = (16), and the distinct node access, see xpath = (17).

<?xml version = "1.0" encoding = "UTF-8" ?>
<dataConfig>
  <dataSource name = "rootDS" type = "JdbcDataSource"
    driver = "oracle.jdbc.OracleDriver" url = "***" user = "***"
    password = "***" readonly = "true" batchSize = "100"
    defaultLobPrefetchSize = "8192" />
  <dataSource name = "depsXML" type = "FieldReaderDataSource" />
  <document name = "DOC">
    <entity name = "ROOT" pk = "DOCUMENT_ID"
        transformer = "ClobTransformer" dataSource = "rootDS"
        query = "select document_id, xml_col from somewhere" >
      <field name = "DOCUMENT_ID" column = "DOCUMENT_ID" />
      ...
      <entity name = "DEPS" dataSource = "depsXML"
          processor = "XPathEntityProcessor"
          dataField="ROOT.XML_COL" forEach = "/rowset/row" >
        <field column = "OBJECT_ID" xpath = "/rowset/row/@OBJECT_ID" />
        ...
      </entity>
    </entity>
  </document>
</dataConfig>

That’s it. I hope the pattern may help you out some day.

Enjoy, Peter.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s