It would be very interesting to use Semantic Web (SW) tools and languages such as RDF and OWL for tasks traditionally performed with XML markup. Among these is the textual encoding of literary documents and manuscripts [1], which faces the problem of overlapping markup, a problem that XML does not solve naturally. As practical and scientific interest in machine-aided literary analysis and processing rises, pure XML textual encoding formats prove limiting in several respects. A number of successive enhancement proposals have shown a clear trend: given the strictness of the XML model, advanced encoding often requires abandoning XML compliance, and in any case standard XML processing tools (e.g., query languages) prove inadequate. In this project we will investigate the feasibility of a new methodology based on the Semantic Web markup languages RDF [2] and OWL [3], with special interest in how to solve the "overlapping markup" problem of classic XML markup. To this end we will implement a framework that allows interested people to test the idea.
First, we observe that, at least in theory, RDF is suitable for all the tasks that have traditionally been done in XML. The evaluation is therefore not about which aspects can or cannot be encoded (as basically all of them can), but rather about how easy it is to cover the useful use cases, given the standard tools already available for each markup language. Also, to facilitate broad and multidisciplinary use of machine-encoded material, it is important to evaluate to what extent the tools we are using are in fact meant to serve that purpose, i.e., are backed by solid and well-understood semantics, rather than serving as pure placeholders that mean nothing without a complex, idiosyncratic set of explanations and rules. In this respect, the use of SW tools (RDF, OWL) is simply the next logical step. The framework to be implemented in this project will allow markup based on these principles.
RDF forms the basis of the Semantic Web. This language defines a method to connect resources and data values, creating semantic networks [2]. RDF's main strength is simplicity: it is a network of nodes connected by directed and labelled arcs. Arcs are used to express properties of resources.
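To make the "nodes and labelled arcs" picture concrete, here is a minimal sketch in plain Python rather than in any RDF library. The namespace `http://example.org/encoding#`, the property names `partOf` and `hasText`, and the helper `properties_of` are all assumptions made for illustration; none of them comes from the encoding ontology in [1].

```python
# Hypothetical namespace and property names, used only for illustration.
EX = "http://example.org/encoding#"

# One triple = one directed, labelled arc: (subject node, arc label, object).
graph = {
    (EX + "line1", EX + "partOf", EX + "stanza1"),
    (EX + "line1", EX + "hasText", "first line of the poem"),
    (EX + "stanza1", EX + "partOf", EX + "poem1"),
}


def properties_of(graph, subject):
    """Collect the outgoing arcs of one node as {arc label: [targets]}."""
    result = {}
    for s, p, o in graph:
        if s == subject:
            result.setdefault(p, []).append(o)
    return result


line1 = properties_of(graph, EX + "line1")
print(line1[EX + "partOf"])
```

A real implementation would presumably delegate storage and lookup to an RDF toolkit; the point here is only that the data model itself is a set of three-part statements.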
RDF is made for data modeling, and is a well-understood conceptual model. It offers good maintainability and readability, and can work in a distributed way. The distributed and resource-centric nature of RDF also enables a novel cooperative annotation scenario, where different encoding "facets" of the same text can be naturally merged. There are many generic Semantic Web tools to help build RDF textual encoding software, such as Semantic Web databases, ontology reasoners, rule systems, query languages, visualizers, etc.
Extending the model, we can add ontologies, using the standard OWL language [3], and thus gain powerful capabilities for cross-hierarchy relations, definition and validation. It is possible to work with multiple hierarchies, so specific views and hierarchical subsets can be easily extracted.
However, with this model, RDF textual encoding cannot be edited directly by hand, and a standardization effort is not realistic at the moment, given the huge amount of previous work and legacy standards. At the very least, a query interface and import/export filters for existing encoding formats are therefore necessary.
Thus, the aim of this project is to deliver a lightweight API and a simple GUI that allow researchers and interested individuals to experience the idea, features and possibilities of RDF textual encoding. We would like to show that the added value in expressive power, coherence, the simplicity with which powerful tools can be built, and the cooperation features enabled by Semantic Web technologies outperforms the alternatives. This software framework is meant to be the testbed and demonstrator for this claim, and is the main deliverable of this project.
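The cooperative scenario can be sketched as follows, assuming two annotators who describe the same resource independently. The resource and property names (`word3`, `partOf`, `type`, and so on) are hypothetical, and short strings stand in for full URIs.

```python
# Facet produced by a (hypothetical) syntactic annotator.
syntactic_facet = {
    ("word3", "partOf", "nounPhrase1"),
    ("nounPhrase1", "type", "NounPhrase"),
}

# Facet produced independently by a (hypothetical) manuscript transcriber.
physical_facet = {
    ("word3", "partOf", "line2"),
    ("line2", "type", "ManuscriptLine"),
}

# Both facets name the same resource "word3", so merging is a plain set union.
merged = syntactic_facet | physical_facet

# After the merge, "word3" belongs to a unit in each hierarchy.
containers = sorted(o for s, p, o in merged if s == "word3" and p == "partOf")
print(containers)
```

The merge needs no coordination between the two annotators beyond agreeing on how the shared resource is identified.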
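As an illustration of multiple hierarchies over one text, the sketch below stores each unit as a span of token offsets and extracts one hierarchy as a view. The sample tokens, the hierarchy names and the helpers `view` and `partially_overlap` are invented for this example; an actual encoding ontology may well represent spans differently.

```python
tokens = ["Amen.", "The", "king", "spoke", "softly"]  # invented sample text

# (hierarchy, unit, start, end): token offsets, end exclusive.
# All hierarchy and unit names are hypothetical.
spans = [
    ("physical", "line1", 0, 3),
    ("physical", "line2", 3, 5),
    ("syntactic", "sentence1", 1, 5),
]


def view(hierarchy):
    """Extract one hierarchy as {unit: tokens covered by that unit}."""
    return {unit: tokens[start:end]
            for h, unit, start, end in spans if h == hierarchy}


def partially_overlap(a, b):
    """True when two spans intersect but neither contains the other."""
    (_, _, a0, a1), (_, _, b0, b1) = a, b
    intersect = a0 < b1 and b0 < a1
    a_in_b = b0 <= a0 and a1 <= b1
    b_in_a = a0 <= b0 and b1 <= a1
    return intersect and not (a_in_b or b_in_a)


physical_view = view("physical")
crossing = partially_overlap(spans[0], spans[2])  # line1 vs sentence1
print(physical_view, crossing)
```

Here `sentence1` starts inside `line1` and ends in `line2`, which is exactly the kind of crossing that cannot be expressed as nested XML elements but causes no difficulty when each unit is an independent resource.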
Other deliverables are:
The first steps involve the definition of a number of interesting use cases, and of a lightweight but sufficiently powerful "encoding ontology" to cover them. Initially, these steps are planned for the first two to three weeks after the project begins. Naturally, they are subject to change as the project develops.
The main use case will be the overlapping markup problem faced by teams building annotated corpora, such as the one behind the "Tycho Brahe Project" [4], which maintains the "Tycho Brahe Parsed Corpus of Historical Portuguese". In cases like this, both the syntactic structure trees and the format of the original manuscripts usually have to be marked up in the corpus. In addition, original manuscripts, especially those of the Tycho Brahe Project, which were written in several different past centuries, contain orthography and words that are no longer used in the language. This makes tasks such as parsing and part-of-speech tagging somewhat more difficult, since texts from different centuries show different language features. In these cases it is not desirable to replace the old forms with modern forms and synonyms, since they are important for linguistic studies. This is therefore also a markup problem, and it can be a very interesting use case for our project (we can have two text hierarchies, one with the original orthography and the other with the modernized one).
For the encoding ontology, in cases like the above, we will start from the ontology presented in [1] and then make any necessary adjustments. We plan to use Protege [5] for designing the ontologies.
To query a model like the one we are proposing, there are many options. We can use existing SW query languages, such as SeRQL [6], RQL [7] and the forthcoming SPARQL [8], or we can query a model programmatically, using existing manipulation tools such as Jena [9] and Sesame [10]. In the latter case, we can use an imperative language and a graph exploration API. This is certainly not the ideal combination, albeit a very popular one. However, once in the Semantic Web domain, other alternatives are available. One of the most powerful is a Semantic Web aware Prolog interpreter such as SWI-Prolog [11]. With such a language, it is possible to craft very powerful constructs for later reuse. We will therefore make use of SWI-Prolog in this project.
As far as querying is concerned, the use of ontologies gives important capabilities. As a basic example, given an ontology for "manuscripts" that includes the original and the new orthography of the Tycho Brahe Project [4], we would be able to perform queries that cover all the sub-cases, such as "what is the most frequent word used in texts from the XVIII century that had a different orthography in texts from the XVI century?".
The development of the query module will run from the second week after the project begins through the first two weeks of August.
For other tasks, such as the import/export filters and the user interface, we will use the Java language. We will probably use the Eclipse platform, for quicker development in Java. And since the project will be hosted on a repository, we will use CVS.
This step will begin together with the development of the query module, mainly because we will try to use other existing encoding formats to keep testing the framework. The user interface will be left to mid-August, when the interface will be better defined.
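The idea of a reusable query construct can be sketched in Python as a stand-in for what would be a short recursive rule in Prolog. This is not SWI-Prolog code and not the API of Jena or Sesame; `match`, `ancestors`, the property name `partOf` and the sample graph are all assumptions made for illustration.

```python
# Hypothetical sample graph; short strings stand in for full URIs.
graph = {
    ("word1", "partOf", "line1"),
    ("line1", "partOf", "page1"),
    ("word1", "partOf", "nounPhrase1"),
    ("nounPhrase1", "partOf", "sentence1"),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in graph
            if s in (None, ts) and p in (None, tp) and o in (None, to)]


def ancestors(graph, node, prop="partOf"):
    """Follow `prop` arcs transitively, across every hierarchy at once."""
    seen = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(graph, s=current, p=prop):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


print(sorted(ancestors(graph, "word1")))
```

Once defined, a construct like `ancestors` can be reused in larger queries, for example to ask which manuscript page and which sentence contain a given word.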
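A rough sketch of how the example query could be answered, assuming each word occurrence is recorded with its century, its original spelling and its modernized spelling. The records below are invented for illustration and are not taken from the Tycho Brahe corpus; the tuple layout is likewise an assumption, not the project's data model.

```python
from collections import Counter, defaultdict

# (century, original spelling, modernized spelling); invented sample records.
occurrences = [
    ("XVI", "hum", "um"),
    ("XVI", "he", "é"),
    ("XVI", "casa", "casa"),
    ("XVIII", "um", "um"),
    ("XVIII", "um", "um"),
    ("XVIII", "é", "é"),
    ("XVIII", "casa", "casa"),
    ("XVIII", "casa", "casa"),
    ("XVIII", "casa", "casa"),
]

# For every modernized word, collect the spellings attested in each century.
spellings = defaultdict(lambda: defaultdict(set))
for century, original, modern in occurrences:
    spellings[modern][century].add(original)

# Words attested in both centuries whose sets of spellings differ.
changed = {
    modern for modern, by_century in spellings.items()
    if by_century.get("XVI") and by_century.get("XVIII")
    and by_century["XVI"] != by_century["XVIII"]
}

# Count XVIII-century occurrences, restricted to the changed words.
counts = Counter(modern for century, _, modern in occurrences
                 if century == "XVIII" and modern in changed)
answer = counts.most_common(1)[0]
print(answer)
```

In this sample, "casa" is the most frequent XVIII-century word overall, but it is excluded because its spelling is the same in both centuries, so the query returns "um" with two occurrences.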
The remaining weeks will be kept for additional tests, measurements and adjustments. Work can begin as soon as the project is approved.
Assuming the project is approved, begins on June 24th and lasts until September 1st, its schedule is summarised below.
The time periods (Dates) are:
A: from project beginning to July 10th;
B: from July 10th to August 10th;
C: from August 10th to September;
D: after September (or as soon as possible).
| Task \ Date | A | B | C | D |
|---|---|---|---|---|
| Definition of use cases | X | | | |
| Definition of encoding ontology | X | | | |
| Implementation of query module | | X | | |
| Implementation of I/E filters | | X | | |
| Implementation of GUI | | | X | |
| Tests and adjustments | | X | X | |
| Writing of an article | | | | X |
| Presentation | | | | X |
We believe this overview of the project shows how the soundness of the semantic theory and the power of the available tools fully justify further study and analysis in this direction, investigating a new encoding standard and how existing ones can be ported, made compatible with, or enhanced for the Semantic Web model.
After the development of this project, seamless integration of textual encoding with other domains would be possible. This will advance the state of the art in the area, and provide a base for further advancements.
My name is Fabio Natanael Kepler, and I am a Ph.D. student in Computer Science at the University of Sao Paulo, Brazil. I am interested in Computational Linguistics, and I am currently working with probabilistic parsers for Portuguese. I have great interest in research, and my near-future work includes semantic analysis theory and applications for Portuguese, and semantic web tools in general. I worked with part-of-speech tagging during my M.Sc. program, and in my final dissertation [12] I reported one of the best results in PoS tagging of Portuguese, with a state-of-the-art accuracy of 95.7%, in the best published execution time: less than three minutes for training the model on more than one million words and tags, and testing it on 200,000 words. The corpus used was the Tycho Brahe Corpus [4]. I have good access to its maintaining team (so they will gladly help in this project by supplying their corpora, since they have been facing the overlapping markup problem). During my academic career (six years) I have published about a dozen articles and papers. I am familiar with languages such as Prolog, Java and C/C++, and have been programming for more than six years now. I have also worked with tools like Protege (designing ontologies and accessing them with Java), the Eclipse platform, and CVS. I know the basics of the RDF and OWL languages, and will have no problem working with them. Among my strengths are my ease of learning and my great enthusiasm for challenges and problems. On weekends I like to play soccer, and sometimes the piano. :)