Resource allocation when importing a YAGO2 ontology into an Oracle semantic database


This is a short review of the resources necessary when importing the full YAGO2 ontology, which is about 200 million triples, into an Oracle semantic database. Some information and snippets about the way to execute the import are given, but this is not the main focus of the article. I'm on a fresh, not otherwise loaded 11gR2 (11.2.0.1) database.

The import mainly followed the instructions given with the RDF demos: use sqlldr with bulkload.ctl to populate a staging table, as proposed in ORACLE_HOME\md\demo\network\rdf_demos, and afterwards employ sem_apis.bulk_load_from_staging_table() to actually load the data into the semantic net. bulkload.ctl has in fact not been changed at all, since the yago2 data is supplied in N-Triples format like this:

<Embeth_Davidtz> <http://yago-knowledge.org/resource/actedIn> <Army_of_Darkness> .

and the staging table:

create table yago2_stage (
  RDF$STC_sub varchar2(4000) not null,
  RDF$STC_pred varchar2(4000) not null,
  RDF$STC_obj varchar2(4000) not null,
  RDF$STC_sub_ext varchar2(64),
  RDF$STC_pred_ext varchar2(64),
  RDF$STC_obj_ext varchar2(64),
  RDF$STC_canon_ext varchar2(64)
) compress;

and the SQL*Loader call:

sqlldr userid=lucene/*** control=yago2.ctl data=yago2_1.nt direct=true skip=0 load=95000000 discardmax=10000 bad=yago2.bad discard=yago2.dis log=yago2.log errors=0
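
Before kicking off the actual bulk load it does not hurt to take a quick look at what landed in the staging table. A trivial sketch, using the table from above:

select count(*) from yago2_stage;
select * from yago2_stage where rownum = 1;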

and finally the load from the staging table into the semantic net:

-- grants so that MDSYS can write the model data and read/update the staging table
grant insert on yago2_data to mdsys;
grant select, update(rdf$stc_sub_ext, rdf$stc_pred_ext, rdf$stc_obj_ext, rdf$stc_canon_ext) on yago2_stage to mdsys;
ALTER SESSION SET NLS_DATE_LANGUAGE = American;
ALTER SESSION SET NLS_NUMERIC_CHARACTERS = '.,' ;
begin sem_apis.bulk_load_from_staging_table('yago2', 'lucene', 'yago2_stage'); end;
/
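
If load time matters more than resource consumption, the flags parameter of bulk_load_from_staging_table is worth a look. The keywords below (PARSE, PARALLEL_CREATE_INDEX, PARALLEL=n) are taken from the 11.2 SEM_APIS documentation as far as I recall it, so treat this as a sketch and verify them against the docs for your release:

begin
  -- hypothetical variant: syntax-check the triples and parallelize parts of the load
  sem_apis.bulk_load_from_staging_table(
    'yago2', 'lucene', 'yago2_stage',
    flags => 'PARSE PARALLEL_CREATE_INDEX PARALLEL=4');
end;
/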

Archive log

Both reading the file-based yago2 data (two files of around 10 GB) and bulk loading the data into the net produced around 130 GB of archive log. While reading the data took around 40 minutes and loading it took around 6 hours, there were peaks of 10 archive log files of 100 MB per minute. This may not be much in terms of production systems, but it is a considerable amount if you plan to run an import in parallel to other processes on a test database.
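
If you want to keep an eye on this while a load is running, the archived redo per hour can be pulled from the standard v$archived_log view, for example:

select trunc(first_time, 'HH24') as log_hour,
       round(sum(blocks * block_size) / 1024 / 1024) as mb_archived
from   v$archived_log
group  by trunc(first_time, 'HH24')
order  by 1;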

Undo

The maximum undo allocation, obviously reached while establishing the link structures of the semantic net, was around 40 GB.
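
If you want to watch undo allocation yourself during such a load, summing up the active and unexpired undo extents gives a rough picture (a sketch against the standard dictionary view):

select round(sum(bytes) / 1024 / 1024 / 1024, 1) as undo_gb
from   dba_undo_extents
where  status in ('ACTIVE', 'UNEXPIRED');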

Temp

Temp segments went up to 27 GB for hashing and sorting.
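
Temp consumption can be tracked in a similar way; note that the block size used below is an assumption about the temp tablespace and has to be adjusted to your setup:

-- 8192 = assumed temp tablespace block size; adjust to yours
select round(sum(blocks) * 8192 / 1024 / 1024 / 1024, 1) as temp_gb
from   v$tempseg_usage;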

MDSYS

The segments owned by MDSYS, where the semantic net lives, finally reached around 50 GB.
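
That number is simply the total size of all segments owned by MDSYS, e.g.:

select round(sum(bytes) / 1024 / 1024 / 1024, 1) as mdsys_gb
from   dba_segments
where  owner = 'MDSYS';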

That’s it, nothing more, nothing less.

Have fun.

Update

Want more? Entailment generation for the yago2 classes topped the earlier peaks once more, pushing temp space usage up to 80 GB:

SQL> set timing on
SQL> BEGIN
  2    SEM_APIS.CREATE_ENTAILMENT(
  3      'rdfs_yago2',
  4      SEM_Models('yago2'),
  5      SEM_Rulebases('RDFS'));
  6  END;
  7  /
PL/SQL procedure successfully completed.
Elapsed: 09:30:05.17
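
Once the entailment exists it can be used in queries by handing the rulebase to SEM_MATCH. A minimal sketch, reusing the predicate from the sample triple above:

select s, o
from   table(sem_match(
         '(?s <http://yago-knowledge.org/resource/actedIn> ?o)',
         sem_models('yago2'),
         sem_rulebases('RDFS'),
         null, null))
where  rownum <= 10;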
