Do not filter for dedicated access in (oracle) sparql


It took me a resort to good old autotrace lately when attempting to optimize a sparql query in oracle’s semantic technology stack. The point was actually, that the still growing graph of data rendered some queries, that were fast once, into lame old rusty cars (another good old story as well).

So, what sort of query do I talk about? Not much of a monster, being stripped down for simplicity just imagine a graph of newspaper articles that relate to some named entities by nodes representing a named entity recognition match and offering a hook to hang up additional match information, say the weight of the match and so on. The query features sort of an entry point (that it was’nt actually!), the uri of the article here. That entry point is just applied as a filter expression to this pattern nodes (?art_1 = article, ?sim_1 = sem item match, ?sitm = sem item). Executing the following simplified stuff originally took around 8 secs run time, cough, cough.

select art_1, sim_1, sitm
from table(
  sem_match('{
    ?art_1 tpcs:hasSemItemMatch ?sim_1 . ?sim_1 tpcs:hasSemItem ?sitm .
    filter ( (?art_1 = <http://www.topic-s.de/topics-facts/id/article/926791705>) ) }',
  sem_models('topics'), null,
  sem_aliases(sem_alias('tpcs','http://www.topic-s.de/topics-schema#')), null));

Taking a look at the execution plan produced by set autotrace traceonly revealed where the performance gets lost.

------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                          | 43212 |    11M|       | 26645   (2)| 00:05:20 |       |       |
|*  1 |  HASH JOIN                |                          | 43212 |    11M|     9M| 26645   (2)| 00:05:20 |       |       |
|*  2 |   HASH JOIN               |                          | 43212 |  9579K|  2472K| 20519   (2)| 00:04:07 |       |       |
|*  3 |    HASH JOIN              |                          | 12431 |  2318K|  1888K| 10896   (3)| 00:02:11 |       |       |
|*  4 |     HASH JOIN             |                          | 12431 |  1735K|  1176K|  5168   (6)| 00:01:03 |       |       |
|*  5 |      TABLE ACCESS FULL    | RDF_VALUE$               | 10097 |  1055K|       |  2990   (9)| 00:00:36 |       |       |
|   6 |      PARTITION LIST SINGLE|                          |   265K|  9339K|       |  1513   (1)| 00:00:19 |     2 |     2 |
|*  7 |       INDEX RANGE SCAN    | RDF_LNK_PVIDCENSNMID_IDX |   265K|  9339K|       |  1513   (1)| 00:00:19 |     2 |     2 |
|   8 |     TABLE ACCESS FULL     | RDF_VALUE$               |  1009K|    46M|       |  2756   (1)| 00:00:34 |       |       |
|   9 |    PARTITION LIST SINGLE  |                          |  1840K|    63M|       |  5300   (2)| 00:01:04 |     2 |     2 |
|* 10 |     INDEX FAST FULL SCAN  | RDF_LNK_PVIDCENSNMID_IDX |  1840K|    63M|       |  5300   (2)| 00:01:04 |     2 |     2 |
|  11 |   TABLE ACCESS FULL       | RDF_VALUE$               |  1009K|    46M|       |  2756   (1)| 00:00:34 |       |       |
------------------------------------------------------------------------------------------------

5 - filter("SEM_APIS"."GETV$RDFTVAL"("V5"."VALUE_TYPE","V5"."LITERAL_TYPE","V5"."LANGUAGE_TYPE","V5"."VNAME_PREFIX"
           ,"V5"."VNAME_SUFFIX")='<'||'http://www.topic-s.de/topics-facts/id/article/926791705'||'>')

We namely have three full table scans against rdf_value$ one fat fast full index scan against RDF_LNK_PVIDCENSNMID_IDXand the resulting, either fat hash joins. Look at the number of rows being touched! Step 5 incorporates a filter on a full table scan.

Transfering the filter expression into the dedicated slot of the sem_match table function does not change the behaviour, really.

select art_1, sim_1, sitm
from table(
  sem_match('{
    ?art_1 tpcs:hasSemItemMatch ?sim_1 . ?sim_1 tpcs:hasSemItem ?sitm }',
  sem_models('topics'), null,
  sem_aliases(sem_alias('tpcs','http://www.topic-s.de/topics-schema#')),
  'art_1 = ''http://www.topic-s.de/topics-facts/id/article/926791705'''));

The execution plan changes literally but even states a higher cost.

------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                          |   907K|   187M|       | 37963   (1)| 00:07:36 |       |       |
|*  1 |  HASH JOIN                |                          |   907K|   187M|    58M| 37963   (1)| 00:07:36 |       |       |
|   2 |   TABLE ACCESS FULL       | RDF_VALUE$               |  1020K|    46M|       |  2756   (1)| 00:00:34 |       |       |
|*  3 |   HASH JOIN               |                          |   907K|   145M|    36M| 24558   (1)| 00:04:55 |       |       |
|*  4 |    HASH JOIN              |                          |   265K|    33M|    24M| 13283   (1)| 00:02:40 |       |       |
|*  5 |     HASH JOIN             |                          |   265K|    21M|    12M|  7759   (1)| 00:01:34 |       |       |
|   6 |      PARTITION LIST SINGLE|                          |   265K|  9338K|       |  1485   (1)| 00:00:18 |     2 |     2 |
|*  7 |       INDEX RANGE SCAN    | RDF_LNK_PVIDCENSNMID_IDX |   265K|  9338K|       |  1485   (1)| 00:00:18 |     2 |     2 |
|   8 |      TABLE ACCESS FULL    | RDF_VALUE$               |  1020K|    46M|       |  2756   (1)| 00:00:34 |       |       |
|*  9 |     TABLE ACCESS FULL     | RDF_VALUE$               |   542K|    24M|       |  2767   (2)| 00:00:34 |       |       |
|  10 |    PARTITION LIST SINGLE  |                          |  1835K|    63M|       |  5270   (2)| 00:01:04 |     2 |     2 |
|* 11 |     INDEX FAST FULL SCAN  | RDF_LNK_PVIDCENSNMID_IDX |  1835K|    63M|       |  5270   (2)| 00:01:04 |     2 |     2 |
------------------------------------------------------------------------------------------------

9 - filter("V0"."VALUE_NAME"='http://www.topic-s.de/topics-facts/id/article/926791705')

Again, there is this filter on a full table scan in step 9.

So, what’s the lesson so far? To me, do not expect oracle to push filter expressions into the graph-based sparql execution plan by selectivity as we know it from plain sql (or even query hints in oracle text). It seems that selectivity, or a query entry point as being called above, needs to be introduced as an integral part of the graph pattern expression to show up the expected effect. Let’s go and check out for that.

select 'tpcs_f:article/926791705' as art_1, sim_1, sitm
from table(
  sem_match('{
    tpcs_f:article/926791705 tpcs:hasSemItemMatch ?sim_1 . ?sim_1 tpcs:hasSemItem ?sitm . }',
  sem_models('topics'), null,
  sem_aliases(sem_alias('tpcs','http://www.topic-s.de/topics-schema#')
    sem_alias('tpcs_f','http://www.topic-s.de/topics-facts/id/')), null));

Executes in a flash, what about the execution plan?

------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name                    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                         |     2 |   336 |    11   (0)| 00:00:01 |       |       |
|   1 |  NESTED LOOPS                   |                         |       |       |            |       |          |       |
|   2 |   NESTED LOOPS                  |                         |     2 |   336 |    11   (0)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                 |                         |     2 |   240 |     7   (0)| 00:00:01 |       |       |
|   4 |     NESTED LOOPS                |                         |     1 |    84 |     5   (0)| 00:00:01 |       |       |
|   5 |      PARTITION LIST SINGLE      |                         |     1 |    36 |     3   (0)| 00:00:01 |     2 |     2 |
|*  6 |       INDEX RANGE SCAN          | RDF_LNK_PVIDSNCENEN_IDX |     1 |    36 |     3   (0)| 00:00:01 |     2 |     2 |
|   7 |      TABLE ACCESS BY INDEX ROWID| RDF_VALUE$              |     1 |    48 |     2   (0)| 00:00:01 |       |       |
|*  8 |       INDEX UNIQUE SCAN         | C_PK_VID                |     1 |       |     1   (0)| 00:00:01 |       |       |
|   9 |     PARTITION LIST SINGLE       |                         |     3 |   108 |     2   (0)| 00:00:01 |     2 |     2 |
|* 10 |      INDEX RANGE SCAN           | RDF_LNK_PVIDSNCENEN_IDX |     3 |   108 |     2   (0)| 00:00:01 |     2 |     2 |
|* 11 |    INDEX UNIQUE SCAN            | C_PK_VID                |     1 |       |     1   (0)| 00:00:01 |       |       |
|  12 |   TABLE ACCESS BY INDEX ROWID   | RDF_VALUE$              |     1 |    48 |     2   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------

   6 - access("P_VALUE_ID"=3450613457090373392 AND "START_NODE_ID"=6841705143039496909 AND "CANON_END_NODE_ID">0 AND "CANON_END_NODE_ID" IS NOT NULL)
   8 - access("CANON_END_NODE_ID"="V0"."VALUE_ID")
  10 - access("P_VALUE_ID"=3042218884911711908 AND "CANON_END_NODE_ID"="START_NODE_ID" AND "CANON_END_NODE_ID">0 AND "CANON_END_NODE_ID" IS NOT NULL) 
       filter("START_NODE_ID">0)
  11 - access("CANON_END_NODE_ID"="V1"."VALUE_ID")

No more full table scans or other full access stuff, no longer any post filter steps, not even a handfull of predicates, a %cpu cost of 11, 374 consistent gets, 0 physical reads! This is what I want to see providing a dedicated node of the graph to a sparql query!

However, I’m still curious about the detailled access path being printed. The 11gR2 oracle documentation at 1.8 Using Semantic Network Indexes states that there are two pre-installed indexes on the semantic net, that is on the table rdf_link$, namely rdf_lnk_pvidcensnmid_idx and rdf_lnk_pvidsncenen_idx, which implement a PCSM (property, object, canonical subject, model) and a PSCF (property, subject, canonical object, canonical object / object) access pattern, respectively. Both indexes feature a leading property column followed by a canonical object or subject.

So what about providing the entry point of the query with the (canonical) object instead of the subject slot? This can be checked out easily, since any newspaper article holds its unique identifier as a purl.org datum as follows.

select art_1, sim_1, sitm
from table(
  sem_match('{
    ?art_1 tpcs:hasSemItemMatch ?sim_1 . ?sim_1 tpcs:hasSemItem ?sitm .
    ?art_1 purl:identifier "926791705"^^xsd:integer }',
  sem_models('topics'), null,
  sem_aliases(sem_alias('tpcs','http://www.topic-s.de/topics-schema#'),
    sem_alias('purl','http://purl.org/dc/elements/1.1/')), null));

The plan does not look bad either, has some more overhead due to the additional node with the graph pattern but most importantly, now names the rdf_lnk_pvidcensnmid_idx index (PCSM) as the one of the deepest index access steps.

------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                          |     1 |   252 |    13   (0)| 00:00:01 |       |       |
|   1 |  NESTED LOOPS                  |                          |     1 |   252 |    13   (0)| 00:00:01 |       |       |
|   2 |   NESTED LOOPS                 |                          |     1 |   204 |    11   (0)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                |                          |     1 |   156 |     9   (0)| 00:00:01 |       |       |
|   4 |     NESTED LOOPS               |                          |     1 |   108 |     7   (0)| 00:00:01 |       |       |
|   5 |      NESTED LOOPS              |                          |     1 |    72 |     5   (0)| 00:00:01 |       |       |
|   6 |       PARTITION LIST SINGLE    |                          |     1 |    36 |     3   (0)| 00:00:01 |     2 |     2 |
|*  7 |        INDEX RANGE SCAN        | RDF_LNK_PVIDCENSNMID_IDX |     1 |    36 |     3   (0)| 00:00:01 |     2 |     2 |
|   8 |       PARTITION LIST SINGLE    |                          |     1 |    36 |     2   (0)| 00:00:01 |     2 |     2 |
|*  9 |        INDEX RANGE SCAN        | RDF_LNK_PVIDSNCENEN_IDX  |     1 |    36 |     2   (0)| 00:00:01 |     2 |     2 |
|  10 |      PARTITION LIST SINGLE     |                          |     3 |   108 |     2   (0)| 00:00:01 |     2 |     2 |
|* 11 |       INDEX RANGE SCAN         | RDF_LNK_PVIDSNCENEN_IDX  |     3 |   108 |     2   (0)| 00:00:01 |     2 |     2 |
|  12 |     TABLE ACCESS BY INDEX ROWID| RDF_VALUE$               |     1 |    48 |     2   (0)| 00:00:01 |       |       |
|* 13 |      INDEX UNIQUE SCAN         | C_PK_VID                 |     1 |       |     1   (0)| 00:00:01 |       |       |
|  14 |    TABLE ACCESS BY INDEX ROWID | RDF_VALUE$               |     1 |    48 |     2   (0)| 00:00:01 |       |       |
|* 15 |     INDEX UNIQUE SCAN          | C_PK_VID                 |     1 |       |     1   (0)| 00:00:01 |       |       |
|  16 |   TABLE ACCESS BY INDEX ROWID  | RDF_VALUE$               |     1 |    48 |     2   (0)| 00:00:01 |       |       |
|* 17 |    INDEX UNIQUE SCAN           | C_PK_VID                 |     1 |       |     1   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------

   7 - access("P_VALUE_ID"=8077892321846961422 AND "CANON_END_NODE_ID"=1170039012141254626 AND "START_NODE_ID">0 AND "START_NODE_ID" IS NOT NULL)
   9 - access("P_VALUE_ID"=3450613457090373392 AND "START_NODE_ID"="START_NODE_ID" AND "CANON_END_NODE_ID">0 AND "CANON_END_NODE_ID" IS NOT NULL)
       filter("START_NODE_ID">0)
  11 - access("P_VALUE_ID"=3042218884911711908 AND "CANON_END_NODE_ID"="START_NODE_ID" AND "CANON_END_NODE_ID">0 AND "CANON_END_NODE_ID" IS NOT NULL)
       filter("START_NODE_ID">0)
  13 - access("CANON_END_NODE_ID"="V1"."VALUE_ID")
  15 - access("START_NODE_ID"="V0"."VALUE_ID")
  17 - access("CANON_END_NODE_ID"="V2"."VALUE_ID")

One should keep in mind that providing a correct datatype for the selective object as above (?art_1 purl:identifier "926791705"^^xsd:integer) avoids any datatype conversion on the execution stack. For example, only writing a string ?art_1 purl:identifier "926791705" incurs an extra hashed index full scan with rownum at the deepest level of the execution plan.

|   7 |        VIEW                     | VW_NSO_1                 |     1 |     2 |     1 (100)| 00:00:01 |       |       |
|   8 |         HASH UNIQUE             |                          |     1 |       |     1 (100)| 00:00:01 |       |       |
|*  9 |          COUNT STOPKEY          |                          |       |       |            |       |          |       |
|  10 |           INDEX FULL SCAN       | SYS_C0042382             |     1 |       |     0   (0)| 00:00:01 |       |       |

   9 - filter(ROWNUM<1)

This is, by the way, also true for filters, that is, handing in filter (?artid_1 = "926791705"^^xsd:integer) or filter (?artid_1 = "926791705") calls different post filter methods of the sem_apis package.

   7 - filter("SEM_APIS"."GETV$NUMERICVAL"("V0"."VALUE_TYPE","V0"."LITERAL_TYPE","V0"."VNAME_PREFIX")=926791705)
--
   10 - filter("SEM_APIS"."GETV$STRINGVAL"("V0"."VALUE_TYPE","V0"."LITERAL_TYPE","V0"."LANGUAGE_TYPE","V0"."VNAME_PREFIX"          ,"V0"."VNAME_SUFFIX")='"'||'926791705'||'"')

Have fun!

Advertisements

One comment

  1. Do note that using the typed syntax for object values, e.g. "926808422"^^xsd:integer, requires a “.” (dot) as a closing marker of the pattern term. Otherwise, the query will return no results, at least at 11gR2.0.1, for whatever reason.

    P

    Like

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s