Configuring the document repository: COREDB implementation over Oracle RDBMS


This implementation of the document repository is intended to satisfy the information needs of the end user. It combines the results of text analysis with the advantages of statistical data mining and full-text search. These features are complemented by a search engine that is able to operate at large scale.

Prerequisites

By design, COREDB uses a relational database as its document storage. The current version has been developed and extensively tested with the 10g2 release of the Oracle RDBMS, and it is expected to work properly with the earlier 10g1 release as well. You can also use the Oracle Express database, which is free to download and use, but some of the indices will fail to be created, which reduces overall performance.

When you set up KIM to use the COREDB document repository type, the CORE interface is enabled in the web UI and the CORE-specific methods are enabled in the API. You must also configure the database connection properties. Before doing so, create a dedicated tablespace and a database user. The tablespace must be the default one for the DB user, and the DB user must have sufficient privileges to create and use the COREDB-specific schema objects. Granting the DBA role to the DB user is the easiest way to achieve that. Consult your DBA for details.
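As a rough illustration of the database prerequisites, the sketch below renders the kind of Oracle statements a DBA might review before preparing a dedicated tablespace and user. It only builds the statements as text and does not execute anything. The helper name, the tablespace and user names, the datafile path, the size, and the granted roles are all illustrative assumptions. This page does not state which privileges COREDB actually needs beyond noting that the DBA role is the easiest option.

```python
import re

# Hypothetical helper: renders (does not execute) Oracle statements for a
# dedicated tablespace and a user whose default tablespace it is.
_IDENT = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,29}$")  # simple unquoted identifier


def coredb_setup_sql(tablespace, user, password, datafile, size_mb=512):
    """Return a list of SQL statements for review by a DBA."""
    for name in (tablespace, user):
        if not _IDENT.match(name):
            raise ValueError("not a simple Oracle identifier: %r" % (name,))
    if "'" in datafile or '"' in password:
        raise ValueError("quotes are not allowed in datafile or password")
    return [
        # Dedicated tablespace for the COREDB data (size is an assumption).
        "CREATE TABLESPACE %s DATAFILE '%s' SIZE %dM AUTOEXTEND ON"
        % (tablespace, datafile, size_mb),
        # The tablespace must be the default one for the DB user.
        'CREATE USER %s IDENTIFIED BY "%s" DEFAULT TABLESPACE %s QUOTA UNLIMITED ON %s'
        % (user, password, tablespace, tablespace),
        # Assumed minimal roles; whether they suffice for COREDB is unverified.
        "GRANT CONNECT, RESOURCE TO %s" % user,
    ]


if __name__ == "__main__":
    for stmt in coredb_setup_sql("KIM_DATA", "kim", "change_me",
                                 "/u01/oradata/kim_data01.dbf"):
        print(stmt + ";")
```

If schema creation later fails with privilege errors, the assumed roles are probably too narrow for your installation, and your DBA should decide which privileges to grant.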

Basic configuration

After the DB requirements are met, review and set up the following configuration parameters, located in KIM/config/document.repository.properties:

  • com.ontotext.kim.KIMConstants.DOCUMENT_REPOSITORY_TYPE = coredb

Selects COREDB as the document repository implementation instead of the Lucene-based one. Changing the document repository will NOT move any documents from the old repository to the new one, so all previously populated documents will be inaccessible while the new repository is selected. They are not deleted, however, so you can return to your old document repository at any time.

  • com.ontotext.kim.KIMConstants.COREDB_CONNECTION_STRING = jdbc:oracle:thin:@//<server-address>:1521/<oracle-sid>

The connection string is used to connect to a database running on a given server, accessible at a given port, under the given service name. In this notation the last component of the string is resolved as a database service name. Since only Oracle databases are currently supported, the above connection string notation is recommended. On the server side, a database must have been created, and a database service and a listener must be running so that the database can be accessed.
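To make the structure of the connection string concrete, here is a minimal sketch that assembles and takes apart a string in the notation shown above. The function names and the example host and service name are hypothetical. KIM itself only reads the finished string from the properties file.

```python
import re

# Matches the notation used on this page: jdbc:oracle:thin:@//host:port/service
_URL = re.compile(r"^jdbc:oracle:thin:@//([^:/]+):(\d+)/([^/]+)$")


def coredb_connection_string(host, service_name, port=1521):
    """Build a connection string in the //host:port/service notation."""
    if not host or not service_name:
        raise ValueError("host and service_name are required")
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port out of range: %r" % (port,))
    return "jdbc:oracle:thin:@//%s:%d/%s" % (host, port, service_name)


def parse_coredb_connection_string(url):
    """Split a connection string into (host, port, service_name)."""
    match = _URL.match(url.strip())
    if match is None:
        raise ValueError("unexpected connection string format: %r" % (url,))
    host, port, service_name = match.groups()
    return host, int(port), service_name


if __name__ == "__main__":
    url = coredb_connection_string("dbhost.example.com", "kimdb")
    print(url)
    print(parse_coredb_connection_string(url))
```

Parsing the value back before starting KIM is a cheap way to catch a missing port or a leftover placeholder such as `<server-address>`.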

  • com.ontotext.kim.KIMConstants.COREDB_USER
  • com.ontotext.kim.KIMConstants.COREDB_PASS

These parameters set the user and the password to be used when connecting. The default tablespace assigned to the user will be used for data storage.
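Before running KIM against a new database, it can help to check that the basic COREDB settings are all present. The sketch below is a hypothetical pre-flight check, not part of KIM. It parses only simple `key = value` lines, not the full Java properties syntax with continuations and escapes, and it assumes that all four parameters listed above must be non-empty.

```python
PREFIX = "com.ontotext.kim.KIMConstants."
REQUIRED = (
    "DOCUMENT_REPOSITORY_TYPE",
    "COREDB_CONNECTION_STRING",
    "COREDB_USER",
    "COREDB_PASS",
)


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and #/! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")  # split on the first '=' only
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_coredb_settings(props):
    """Return a list of problems; an empty list means nothing obvious is missing."""
    problems = []
    for name in REQUIRED:
        if not props.get(PREFIX + name):
            problems.append("missing " + PREFIX + name)
    repo_type = props.get(PREFIX + "DOCUMENT_REPOSITORY_TYPE")
    if repo_type and repo_type != "coredb":
        problems.append("DOCUMENT_REPOSITORY_TYPE is %r, not 'coredb'" % repo_type)
    return problems


if __name__ == "__main__":
    # The path is relative to the installation root, as named on this page.
    with open("KIM/config/document.repository.properties") as handle:
        for problem in check_coredb_settings(parse_properties(handle.read())):
            print(problem)
```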

After setting up KIM, stop any running instances and then run startKIM_RebuildIndex.bat (.sh) from a console opened in the bin folder of the installation. This creates the database schema and preloads the entities. The process may take up to 30 minutes. Be careful: executing startKIM_RebuildIndex.bat (.sh) against a tablespace with existing COREDB data will delete all of that data. After KIM initializes, the COREDB document repository will be ready for use. Populate some documents and try the CORE interface.

Note that if Oracle Express is used, KIM will log some exceptions during initialization, because some advanced index types are not supported by it. Do not be alarmed if you see a stack trace in the KIM output.

Performance configuration

  • com.ontotext.kim.KIMConstants.DOCUMENT_REPOSITORY_SYNCHRONIZE_COUNT=100
  • com.ontotext.kim.KIMConstants.DOCUMENT_REPOSITORY_OPTIMIZE_COUNT=100000

These parameters determine the number of documents after which an automatic sync or optimize of the FTS (full-text search) indices is triggered. The CoreDbAPI provides methods that take these parameters into account, ignore them, or explicitly force a sync or optimize operation. The important difference between the two operations is that sync simply updates the index so that the content of new entries is included, whereas optimize rewrites the index from scratch, which may take a long time.
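The effect of the two thresholds can be illustrated with a small model. The class below is a hypothetical sketch and not the CoreDbAPI. It assumes that the two counters are tracked independently and that each operation resets only its own counter, which this page does not confirm. It records which operation would fire instead of touching any index.

```python
class FtsMaintenanceCounter:
    """Toy model of count-based sync/optimize triggering (assumed behaviour)."""

    def __init__(self, sync_count=100, optimize_count=100000):
        self.sync_count = sync_count
        self.optimize_count = optimize_count
        self.since_sync = 0
        self.since_optimize = 0
        self.log = []  # names of operations in the order they fired

    def document_added(self):
        """Count one stored document and fire whatever thresholds are reached."""
        self.since_sync += 1
        self.since_optimize += 1
        if self.since_sync >= self.sync_count:
            self.sync()
        if self.since_optimize >= self.optimize_count:
            self.optimize()

    def sync(self):
        """Cheap operation: make newly added content searchable."""
        self.log.append("sync")
        self.since_sync = 0

    def optimize(self):
        """Expensive operation: rewrite the index from scratch."""
        self.log.append("optimize")
        self.since_optimize = 0


if __name__ == "__main__":
    counter = FtsMaintenanceCounter(sync_count=3, optimize_count=6)
    for _ in range(7):
        counter.document_added()
    print(counter.log)
```

With the values shown above (100 and 100000), this model would sync after every hundredth document and optimize only once per hundred thousand documents, which keeps the expensive rewrite rare.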

The other options in document.repository.properties control the Apache Lucene-based document repository.

Page last modified on July 07, 2008, at 05:42 PM