The Solr documentation for XPathEntityProcessor introduces a spezialization subtype of
EntityProcessor that is primarily depicted to process data (to be) imported from xml/http-datasources (for example, Usage with XML/HTTP Datasource). However, using
XPathEntityProcessor on a FieldReaderDataSource instead on the original
!HttpDataSource (search for
FieldReaderDataSource in Uploading Structured Data Store Data with the Data Import Handler) enables reading xml instances contained in columns delivered from database requests through
Bewildered out of words and meanings…? Don’t worry, the following will give you a living example of how to craft the xml from an Oracle database easily and what to do on the Solr side to map the information datums into indexing fields. To me, this is really a nice example of how to employ xml in a true sense of a defined (well-forming, encoding) data exchange layer, hiding most if not all of the implementation details of xml processing on the database and on the search-engine. Note however, that this great time-to-market, through xml processing technically, always comes at a certain extra cost such that the xml-instances shall not become to large for this solution pattern. I will also use xml attributes for small size values instead of tags in the xml generation as one step of optimization.
The statement below provides a general pattern concerning lightweight and easy xml generation for our purposes. A number of
rows (5) will be populated withing a simple (root)
xmlserialize (2) then dumps the memory structure into a
clob. As just one optimization, mentioned above, small size values go as
row attributes, which saves two dom node reads per value on (re-)parsing, and any larger size values will be flattened as regular tags. Great to have
xmlforest accept one to multiple values (columns) such that the statements looks very sql-like and easy to read. Of course, you may go into detail further, specifying non-generated attribute names and the like, but that’s not part of the discussion here.
with tmp as (select * from user_objects where rownum < 4) select xmlserialize( content xmlelement("rowset", xmlagg(xmlelement("row", xmlattributes(object_id, object_name, 'abc&"' as my_long_text), xmlforest(subobject_name, created, 'abc&"' as my_long_text)))) as clob) from tmp t1;
The dump of the xml instance just for a review.
<rowset> <row OBJECT_ID="101699" OBJECT_NAME="QUERY" MY_LONG_TEXT="abc&""> <CREATED>2016-07-21</CREATED> <MY_LONG_TEXT>abc&"</MY_LONG_TEXT> </row> <row OBJECT_ID="101701" OBJECT_NAME="SYS_IL0000101699C00005$$" MY_LONG_TEXT="abc&""> <CREATED>2016-07-21</CREATED> <MY_LONG_TEXT>abc&"</MY_LONG_TEXT> </row> <row OBJECT_ID="101700" OBJECT_NAME="SYS_LOB0000101699C00005$$" MY_LONG_TEXT="abc&""> <CREATED>2016-07-21</CREATED> <MY_LONG_TEXT>abc&"</MY_LONG_TEXT> </row> </rowset>
On the Solr side then, in
dataimport-config.xml, we need another datasource of type
FieldReaderDataSource (7) along with the original jdbc one (3). In (11), see
xml_col, the xml will be fetched from the database along with other root entity columns. The entity starting on (14) actually does the xml processing as well as mapping the information datums into indexing fields, placing a couple of xpath expressions for the node loop, see
forEach = (16), and the distinct node access, see
xpath = (17).
<?xml version = "1.0" encoding = "UTF-8" ?> <dataConfig> <dataSource name = "rootDS" type = "JdbcDataSource" driver = "oracle.jdbc.OracleDriver" url = "***" user = "***" password = "***" readonly = "true" batchSize = "100" defaultLobPrefetchSize = "8192" /> <dataSource name = "depsXML" type = "FieldReaderDataSource" /> <document name = "DOC"> <entity name = "ROOT" pk = "DOCUMENT_ID" transformer = "ClobTransformer" dataSource = "rootDS" query = "select document_id, xml_col from somewhere" > <field name = "DOCUMENT_ID" column = "DOCUMENT_ID" /> ... <entity name = "DEPS" dataSource = "depsXML" processor = "XPathEntityProcessor" dataField="ROOT.XML_COL" forEach = "/rowset/row" > <field column = "OBJECT_ID" xpath = "/rowset/row/@OBJECT_ID" /> ... </entity> </entity> </document> </dataConfig>
That’s it. I hope the pattern may help you out some day.