Ontotext

GATE Components and Applications

GATE (General Architecture for Text Engineering) is the world's most popular software platform for language engineering, developed by the NLP group of the University of Sheffield.

Japec

Japec is a JAPE-to-Java compiler, packaged as a processing resource for GATE 3.1+. The processing resource (Ontotext Japec Transducer) is designed as a replacement for the standard JAPE Transducer, optimized for performance. The implementation is based on finite state machines, and the compiler applies the standard determinization and minimization algorithms to achieve better performance. The compiler itself is a standalone executable written in Haskell, because it involves complicated algorithms over dynamic data structures. The processing resource wraps the compiler and translates the grammar under the hood.
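To illustrate the determinization step mentioned above, here is a minimal sketch of the textbook subset construction, which turns a nondeterministic finite state machine into a deterministic one. It is written in Java for readability and is not taken from Japec (whose compiler is written in Haskell). The class and method names (`SubsetConstructionSketch`, `Nfa`, `Dfa`, `determinize`) are hypothetical, epsilon transitions are left out for brevity, and minimization is not shown.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of subset construction; not Japec's actual code. */
class SubsetConstructionSketch {

    /** NFA without epsilon moves: state -> symbol -> set of next states. */
    static final class Nfa {
        final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
        final Set<Integer> accepting = new HashSet<>();
        int start = 0;

        void addTransition(int from, char symbol, int to) {
            delta.computeIfAbsent(from, k -> new HashMap<>())
                 .computeIfAbsent(symbol, k -> new TreeSet<>())
                 .add(to);
        }
    }

    /** DFA whose states are sets of NFA states. */
    static final class Dfa {
        final Map<Set<Integer>, Map<Character, Set<Integer>>> delta = new HashMap<>();
        final Set<Set<Integer>> accepting = new HashSet<>();
        Set<Integer> start;

        boolean accepts(String input) {
            Set<Integer> state = start;
            for (char c : input.toCharArray()) {
                Map<Character, Set<Integer>> row = delta.get(state);
                state = (row == null) ? null : row.get(c);
                if (state == null) {
                    return false; // no transition defined: reject
                }
            }
            return accepting.contains(state);
        }
    }

    /** Builds an equivalent DFA by exploring reachable sets of NFA states. */
    static Dfa determinize(Nfa nfa) {
        Dfa dfa = new Dfa();
        dfa.start = new TreeSet<>(Collections.singleton(nfa.start));
        dfa.delta.put(dfa.start, new HashMap<>());
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.add(dfa.start);

        while (!work.isEmpty()) {
            Set<Integer> current = work.poll();

            // A DFA state accepts if any NFA state inside it accepts.
            for (int s : current) {
                if (nfa.accepting.contains(s)) {
                    dfa.accepting.add(current);
                    break;
                }
            }

            // Merge the targets of all member states, symbol by symbol.
            Map<Character, Set<Integer>> moves = new HashMap<>();
            for (int s : current) {
                Map<Character, Set<Integer>> row = nfa.delta.get(s);
                if (row == null) {
                    continue;
                }
                row.forEach((symbol, targets) ->
                    moves.computeIfAbsent(symbol, k -> new TreeSet<>()).addAll(targets));
            }

            // Record the transitions and queue any newly discovered DFA state.
            for (Map.Entry<Character, Set<Integer>> e : moves.entrySet()) {
                dfa.delta.get(current).put(e.getKey(), e.getValue());
                if (!dfa.delta.containsKey(e.getValue())) {
                    dfa.delta.put(e.getValue(), new HashMap<>());
                    work.add(e.getValue());
                }
            }
        }
        return dfa;
    }
}
```

The benefit at match time is that a deterministic machine follows exactly one transition per input symbol instead of tracking a set of active states, which is one plausible reason a compiled transducer would run faster.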

Japec has proven to be 2 to 5 times faster than the standard JAPE Transducer. The Japec code has been donated to the GATE project and community under the LGPL license. The GATE team has adopted Japec internally and is now working with Ontotext to port Japec to Java for release 4 of the system.

Contributions to GATE 2.0

Ontotext has been a core participant in GATE development since release 2.0, with a number of tasks, among which:

  • GATE's Oracle support - work on the production version of GATE's support for Oracle data stores. The work included analysis and redesign of the relational schema, implementation and optimization of the supporting code, installation, and testing. The scalability requirements related to hosting huge corpora such as the BNC turned this task into a challenging VLDB problem.
  • Optimizations - performance analysis and optimization of GATE, concentrating on the IE pipeline. The result of a couple of man-months of work was a speed-up of more than two times and a reduction in memory use (not counting the Hash Gazetteer, see below).
  • WordNet API - integration of the world's most popular lexical database. WordNet is currently available within GATE both via a Java API (similar to the original one for C) and through a GUI that supports the standard queries.
  • Protege-2000 integration - the most popular ontology editor can now be run within the GATE development environment.
  • Ontology access, editing, and markup - support for editing, viewing, and using ontologies. Basic features: DAML+OIL storage, access via API, and ontology-based feature-map matching, which lets grammar rules benefit from hierarchical annotation and look-up types.
  • Information Retrieval - a fully featured Information Retrieval (IR) subsystem that allows full-text queries to be performed against GATE corpora. The current implementation is based on one of the most popular open-source full-text search engines, Lucene.

Most of the above-mentioned enhancements are part of the GATE distribution and come free. There are also a number of Ontotext proprietary tools and applications that are not free for commercial use. However, those are still free for research or education purposes.

Gazetteers

A gazetteer is a Java-implemented lookup tool that finds occurrences of strings from predefined lists in texts. The critical issues for such routines are speed and memory usage. A classical implementation approach uses a finite state machine (FSM) recognizer or a kind of suffix-tree. Ontotext has developed several gazetteers for GATE.

  • Hash Gazetteer: based on hash tables instead of FSM. On average it takes four times less memory and works three times faster than an optimized FSM implementation. It is available as a CREOLE component in the GATE distribution.
  • Stand-Alone Gazetteer: a version of the Hash Gazetteer that can be used without GATE. In this form it is a Java library that can be used with minimal effort in any application that needs to look up huge lists of strings in text in a time- and memory-efficient manner.
  • Large Knowledge Base (LKB) Gazetteer: new-generation gazetteer.
  • Linked Data Gazetteer: an experimental gazetteer that uses Linked Open Data for lookups. Please contact us if you would like to experiment with it.
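The general idea of a hash-based gazetteer can be sketched as follows. This is a minimal illustration, not the Hash Gazetteer's actual implementation or API: the class `HashGazetteerSketch`, its methods, and the `Lookup` type are all hypothetical, tokenization is plain whitespace splitting, and matching is case-insensitive by assumption. Each list entry is stored as a normalized phrase in a hash table, and every token window up to the longest entry length is probed against it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a hash-table gazetteer; not Ontotext's code. */
class HashGazetteerSketch {

    /** A match covering tokens [startToken, endToken) with the entry's list type. */
    static final class Lookup {
        final int startToken;
        final int endToken;
        final String type;

        Lookup(int startToken, int endToken, String type) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.type = type;
        }
    }

    // Normalized phrase -> list type (for example "location").
    private final Map<String, String> entries = new HashMap<>();
    // Length in tokens of the longest entry; bounds the probe window.
    private int maxTokens = 0;

    private static String[] tokenize(String text) {
        // Simplification: lower-case and split on whitespace only.
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    void addEntry(String phrase, String type) {
        String[] tokens = tokenize(phrase);
        entries.put(String.join(" ", tokens), type);
        maxTokens = Math.max(maxTokens, tokens.length);
    }

    /** Returns every entry occurrence in the text, including overlapping ones. */
    List<Lookup> annotate(String text) {
        String[] tokens = tokenize(text);
        List<Lookup> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder candidate = new StringBuilder();
            int limit = Math.min(tokens.length, start + maxTokens);
            for (int end = start; end < limit; end++) {
                if (end > start) {
                    candidate.append(' ');
                }
                candidate.append(tokens[end]);
                String type = entries.get(candidate.toString());
                if (type != null) {
                    result.add(new Lookup(start, end + 1, type));
                }
            }
        }
        return result;
    }
}
```

For example, after adding the entries "New York" and "Sheffield" with type "location", calling `annotate` on "She moved from New York to Sheffield" would report one lookup spanning tokens 3 to 5 and one spanning tokens 6 to 7. A production gazetteer would also need proper tokenization, character offsets, and a more compact storage scheme.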

Hidden Markov Model Learner

A stochastic module capable of filtering annotations, disambiguation, and other "soft" tasks based on confidence measures. It is based on Ontotext's proprietary HMM implementation, tuned for Information Extraction applications - see [Scheffer et al 2002] for the most general ideas. Contact us at info-at-ontotext.com for more information and applications.