After a longer-than-anticipated gestation, my Transfer Report has left my hands and is working its way through the administrative system to be externally examined. Fingers crossed, this is one of my last posts as an MPhil student and I will soon (post-viva) be a PhD student proper.
The Transfer Report included a condensed form of the literature review and a detailed report on the Pilot Study. The Pilot Study was designed to lay sound foundations for the PhD research and involved implementing a system using geosemantic technologies, primarily to investigate ways in which semantic and geospatial data can work together, but also to help me get to grips with the subject area and the technologies available.
The full report will be made available in due course, once it has been examined (viva scheduled for end of November) and any corrections completed, but for now here is an update on some of the key findings of the Pilot Study and conclusions drawn.
Conclusion one: Oracle is really complicated
I started off thinking that using Oracle for the research would be a really good idea. It is available for research purposes under the OTN Developer License and is the don of the database world. Furthermore, it does everything I required, all in one platform; no need, I thought, to string together bits of open source software with their undocumented ‘features’ and sparse documentation. After all, Oracle is a commercial, enterprise-level system which supports all the relevant standards for geospatial, semantic and geosemantic data. It can function as both a triple store and a geospatial database, and it integrates with the Jena framework by means of a dedicated connector. The latest version (12c) has also been significantly redesigned and improved with respect to the Spatial and Graph components.
All of this is true, but being an enterprise-level application, it also comes with considerable baggage. Notably, it is really, really complicated, and much of this complexity is totally unnecessary for the likes of me undertaking a research project.
Now I don’t want to be unnecessarily critical of the platform, but there are some real issues with using it for a research project such as GSTAR. Installation and configuration, for starters, are necessarily complex, as the platform supports some seriously powerful features: security, distributed/pluggable databases, user/group roles and permissions, not to mention Extended Data Types (essential for handling large values such as WKT geometries) and the indexing thereof. For a research project, components such as the enterprise-level security are quite simply a hindrance rather than a help, and the same goes for the indexing. More critically, I found working with the Jena Connector and GeoSPARQL to be fraught, with the (copious) documentation for the new version being a bit lacking; forums and blogs were of enormous help in fixing problems where the documentation wasn’t quite as helpful as it might have been. No doubt this will bed down given time, but the bleeding edge of such technology was not an ideal place to be.
Given that I’m no longer using the Spatial and Graph components, there is little point in keeping Oracle as the spatial database. Indeed, I won’t be using a spatial database as such, with all data being prepared as Linked Geospatial Data within the triple store.
So, it’s been an experience but goodbye Oracle. Thanks for all the fish.
Conclusion two: Open Source software can be really good
Still smarting from my Oracle experiences and quite a long way down the road with less than I had hoped to show for my troubles, I returned to my initial review of triple stores, looking for a suitable alternative. My requirements are quite specific: the platform needs to support big data, be responsive, support inferencing/reasoning and, crucially, provide good support for GeoSPARQL. I recalled various papers from my literature review extolling the virtues of Parliament, other folk having used it on similar research projects. It also has a thoroughbred pedigree, originating from research initially undertaken by DARPA through the DARPA Agent Markup Language (DAML) Program, and is now used as the base for applications in a range of tough, testing environments by Raytheon BBN Technologies.
My concern regarding documentation, having worked with various Free and Open Source Software (FOSS) platforms over the years, was still niggling, but it had to be worth some testing. After all, quantity does not necessarily equate with quality, as the Oracle experience demonstrated. And there certainly isn’t quantity: the manual (a single, rather short document) is smaller than the document for the Oracle Extended Data Types functionality! The key difference is that Parliament is a one-trick pony, and it does that trick very well; it does not try to be all things to all users. Installation and configuration were as simple as pie, with the user guide providing all the key information without excess baggage. True, some of the later sections of the user guide are yet to be written (almost a prerequisite for a FOSS application, a bit like web 2.0 apps, where permanent beta status is a badge of honour), but these cover highly specific aspects of deployment that are irrelevant to my research.
So within a day, I had gone from reviewing my systems review to a working system.
Conclusion three: geosemantic applications using GeoSPARQL can really fly!
One aspect of the Pilot Study was an investigation into different ways of integrating semantic and geospatial data. Without going into too much detail (I’ll post a version of the Transfer Report once it’s been examined, I’ve had my viva and everything is finalised), I had a suspicion that working with geospatial data using semantic tools and verbose, text-based formats such as GML and WKT would be lacking in the performance department, especially given some of the criticisms levelled at the performance of some SPARQL implementations and the comparison with the highly tuned geospatial tools found in GIS and dedicated geospatial platforms. So I wrote a Java application to test this hypothesis, comparing a ‘hybrid’ SPARQL+WFS system with a pure geosemantic system based around GeoSPARQL. The results showed very little difference in performance between the two approaches, possibly because any benefits of the optimised geospatial components were outweighed by the overheads of the additional middleware needed to process geosemantic queries for the GIS and then convert the WFS outputs to RDF. Given this lack of any significant benefit, combined with the need for a more complex systems architecture, I have opted for a ‘pure’ geosemantic basis for my next stages, based around Parliament, Jena, Joseki/Fuseki and GeoSPARQL, cutting out the need for any RDBMS, GIS and associated web servers.
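To give a flavour of what a query looks like on the ‘pure’ geosemantic side, here is a minimal sketch that builds a GeoSPARQL query for all features falling within a bounding box. It is not the test application from the Pilot Study: the class and method names and the bounding box coordinates are made up for illustration, and only the standard GeoSPARQL vocabulary (geo:hasGeometry, geo:asWKT, geof:sfWithin) is assumed. The sketch just assembles the query text; sending it to an endpoint is left out.

```java
import java.util.Locale;

// Illustrative sketch only: builds the text of a GeoSPARQL query.
// Class and method names are hypothetical, not from the Pilot Study code.
final class GeoSparqlQuerySketch {

    // Standard GeoSPARQL ontology and function namespaces.
    static final String PREFIXES =
        "PREFIX geo: <http://www.opengis.net/ont/geosparql#>\n" +
        "PREFIX geof: <http://www.opengis.net/def/function/geosparql/>\n";

    // Turns a bounding box into a closed WKT polygon ring (x = longitude, y = latitude).
    static String bboxToWkt(double minX, double minY, double maxX, double maxY) {
        return String.format(Locale.ROOT,
            "POLYGON((%1$s %2$s, %3$s %2$s, %3$s %4$s, %1$s %4$s, %1$s %2$s))",
            minX, minY, maxX, maxY);
    }

    // Selects every feature whose geometry lies within the given WKT polygon.
    static String featuresWithin(String wktPolygon) {
        return PREFIXES +
            "SELECT ?feature ?wkt WHERE {\n" +
            "  ?feature geo:hasGeometry ?geom .\n" +
            "  ?geom geo:asWKT ?wkt .\n" +
            "  FILTER(geof:sfWithin(?wkt, \"" + wktPolygon + "\"^^geo:wktLiteral))\n" +
            "}";
    }

    public static void main(String[] args) {
        // Hypothetical bounding box, roughly in the south of England.
        System.out.println(featuresWithin(bboxToWkt(-2.2, 51.0, -1.6, 51.6)));
    }
}
```

The point of the sketch is that the spatial filter lives inside the SPARQL query itself, so no separate WFS request or RDF conversion step is needed.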
So, a big chunk of my research project is now complete and, all being well, I should have my transfer from MPhil to PhD all signed and sealed in the near future. The Pilot Study has provided the groundwork for the next phases of work as detailed above, and work on the next Case Studies is already well underway. I have the first tranche of data from the Wiltshire Historic Environment Record in hand, which is currently being processed to produce a geosemantic resource in Parliament; other data from archaeological units and museums is being sourced with the aim of completing this integration and preparation phase by Christmas.
Historic Environment Record data
The HER data is being prepared using the CIDOC CRM ontology with CRM EH extensions supported by a lightweight GeoSPARQL integration to provide the necessary geosemantic framework; more on the CRMEH-GeoSPARQL integration here. The production of Linked Data to feed Parliament is once again being accomplished using the workflow developed through the Pilot Study, based around the STELLAR toolkit and the StringTemplate engine.
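As a rough illustration of template-driven Linked Data production, the sketch below fills a Turtle template with a record’s identifier, label and point location. It does not use STELLAR or StringTemplate (plain String.format stands in for the template engine), and the class name, the example.org URI, the record label and the coordinates are all hypothetical. Only the GeoSPARQL and RDFS terms are real vocabulary; the CIDOC CRM and CRM EH classes used in the actual workflow are left out.

```java
import java.util.Locale;

// Illustrative sketch only: String.format stands in for a real template engine.
final class TurtleTemplateSketch {

    // Prefix declarations to emit once at the top of a Turtle file.
    static final String HEADER =
        "@prefix geo: <http://www.opengis.net/ont/geosparql#> .\n" +
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n";

    // %1$s = record URI, %2$s = escaped label, %3$s = longitude, %4$s = latitude.
    static final String TEMPLATE =
        "<%1$s> a geo:Feature ;\n" +
        "    rdfs:label \"%2$s\" ;\n" +
        "    geo:hasGeometry <%1$s/geometry> .\n" +
        "<%1$s/geometry> geo:asWKT \"POINT(%3$s %4$s)\"^^geo:wktLiteral .\n";

    // Escapes backslashes, double quotes and newlines for a Turtle string literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // Renders one record as Turtle triples (without the prefix header).
    static String render(String uri, String label, double lon, double lat) {
        return String.format(Locale.ROOT, TEMPLATE, uri, escape(label), lon, lat);
    }

    public static void main(String[] args) {
        // Hypothetical record; the identifier and coordinates are invented.
        System.out.print(HEADER);
        System.out.print(render("http://example.org/her/MWI1", "Bowl barrow", -1.8, 51.2));
    }
}
```

Keeping the template separate from the data means the same source records can be re-rendered whenever the target ontology mapping changes.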
Case Studies and further investigations
The Case Studies will then look at the integration of these datasets using inferencing/reasoning on the spatial and other facets, moving from fieldwork data up to heritage asset inventories and across to museum collections. Specifically, they will examine how such linked resources can be used to undertake archaeological research based on current archaeological research questions, including the use of RDF mapping libraries and query mediation using (spatial) ontologies.
I now have a draft chapter outline agreed for my thesis and already have tens of thousands of words to edit into it pertaining to the Literature Review, Pilot Study, introductory and methodology chapters. In other words, full steam ahead!