This is a short review of the resources necessary when importing a full YAGO2 ontology, which is about 200 million triples, into an Oracle semantic database. Some information and snippets about the way to execute the import are given, but that is not the main focus of the article. I'm still on a fresh, not otherwise loaded 11gR2.0.1 database.
The way to execute the import mainly followed the instructions given with the RDF demos, that is: use
bulkload.ctl to populate a staging table as proposed in
ORACLE_HOME\md\demo\network\rdf_demos, and afterwards employ
sem_apis.bulk_load_from_staging_table() to actually load the data into the semantic net.
bulkload.ctl has in fact not been changed at all, the YAGO2 data being supplied in
N-Triples format like this:
<Embeth_Davidtz> <http://yago-knowledge.org/resource/actedIn> <Army_of_Darkness> .
and the staging table:
create table yago2_stage (
  RDF$STC_sub       varchar2(4000) not null,
  RDF$STC_pred      varchar2(4000) not null,
  RDF$STC_obj       varchar2(4000) not null,
  RDF$STC_sub_ext   varchar2(64),
  RDF$STC_pred_ext  varchar2(64),
  RDF$STC_obj_ext   varchar2(64),
  RDF$STC_canon_ext varchar2(64)
) compress;
and the SQL*Loader call:
sqlldr userid=lucene/*** control=yago2.ctl data=yago2_1.nt direct=true skip=0 load=95000000 discardmax=10000 bad=yago2.bad discard=yago2.dis log=yago2.log errors=0
and finally the load from the staging table into the semantic net:
grant insert on yago2_data to mdsys;
grant select, update(rdf$stc_sub_ext, rdf$stc_pred_ext, rdf$stc_obj_ext, rdf$stc_canon_ext)
  on yago2_stage to mdsys;

ALTER SESSION SET NLS_DATE_LANGUAGE = American;
ALTER SESSION SET NLS_NUMERIC_CHARACTERS = '.,';

begin
  sem_apis.bulk_load_from_staging_table('yago2', 'lucene', 'yago2_stage');
end;
/
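After the bulk load, a quick sanity check of the triple count can be run against the model itself. A minimal sketch, assuming the model name from above (SEM_MATCH is the standard query interface of the Oracle semantic net; a full scan over 200 million triples will of course take a while):

```sql
-- Count all triples loaded into the 'yago2' model.
SELECT COUNT(*)
FROM TABLE(SEM_MATCH(
       '(?s ?p ?o)',
       SEM_Models('yago2'),
       null, null, null));
```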
Both reading the file-based YAGO2 data (two files of around 10 GB) and bulk loading the data into the net produced around 130 GB of archive log. While reading the data took around 40 minutes and loading it took around 6 hours, there were peaks of 10 archive log files of 100 MB per minute. This may not be much in terms of production systems, but it is a considerable amount if you plan to run an import in parallel to other processes on a test database.
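If the 6-hour load time is a concern, sem_apis.bulk_load_from_staging_table accepts a flags parameter. The following is a sketch only, based on the flags documented for 11gR2 and not on what was actually run here; the degree of parallelism is an arbitrary example:

```sql
-- Assumed parallel variant of the bulk load above (not tested in this run).
begin
  sem_apis.bulk_load_from_staging_table(
    'yago2', 'lucene', 'yago2_stage',
    flags => 'PARALLEL=4 PARALLEL_CREATE_INDEX');
end;
/
```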
The maximum undo allocation, apparently while establishing the link structures of the semantic net, was around 40 GB.
Temp segments went up to 27 GB for hashing, sorting and the like.
The segments of mdsys, where the semantic net lives, finally reached around 50 GB.
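The segment sizes quoted here can be watched during the load with a standard dictionary query. A minimal sketch (requires access to dba_segments):

```sql
-- Space consumed by the semantic net schema, in GB.
SELECT owner,
       ROUND(SUM(bytes) / 1024 / 1024 / 1024, 1) AS gb
FROM   dba_segments
WHERE  owner = 'MDSYS'
GROUP  BY owner;
```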
That’s it, nothing more nor less.
Want more? Update
Entailment generation for the
yago2 classes topped out the peak one more time and pushed the temp space up to 80 GB, as follows:
SQL> set timing on
SQL> BEGIN
  2    SEM_APIS.CREATE_ENTAILMENT(
  3      'rdfs_yago2',
  4      SEM_Models('yago2'),
  5      SEM_Rulebases('RDFS'));
  6  END;
  7  /

PL/SQL procedure successfully completed.

Elapsed: 09:30:05.17
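Once the entailment exists, it can be used in queries by naming the rulebase. A minimal sketch, assuming the built-in rdf: prefix of SEM_MATCH and the entailment created above:

```sql
-- Count rdf:type assertions, including those inferred by the RDFS entailment.
SELECT COUNT(*)
FROM TABLE(SEM_MATCH(
       '(?s rdf:type ?c)',
       SEM_Models('yago2'),
       SEM_Rulebases('RDFS'),
       null, null));
```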