Demonstrator for data integration case study
The demonstrator is a SPARQL 1.1 query builder, which forms part of an exploratory investigation of the semantic integration of extracts from archaeological datasets with information extracted via Natural Language Processing (NLP) across different languages. The case study is based on a loose theme of archaeological interest in wooden objects and their dating via dendrochronological techniques. The work was undertaken by the University of South Wales on the technical side, in collaboration with Data Archiving and Networked Services (DANS) and the Swedish National Data Service (SND) as regards Dutch and Swedish archaeological datasets, reports and vocabularies. The work formed part of the EU FP7 Infrastructures project, ARIADNE (Advanced Research Infrastructure for Archaeological Dataset Networking in Europe).
The case study investigated the feasibility of semantic interoperability between archaeological datasets and data derived from applying NLP information extraction techniques to grey literature reports in different languages. It has a broad theme relating to wooden material, including shipwrecks, with a focus on indications of types of wooden material, samples taken, and wooden objects dated by dendrochronological analysis. The resources comprise extracts from five English and Dutch language datasets together with grey literature archaeological reports from the Netherlands, Sweden and the UK. Archaeology Data Service (ADS) datasets include two shipwreck datasets, the Newport Medieval Ship and the Mystery Wreck Project (Flower of Ugie), together with the Vernacular Architecture Group dendrochronology and cruck databases. DANS facilitated an extract from the database of the international Digital Collaboratory for Cultural Dendrochronology (DCCD). The data are extracts from these databases provided for demonstration purposes and should not be regarded as current or complete.
The semantic framework combines the CIDOC Conceptual Reference Model (CRM) with the Getty Art & Architecture Thesaurus (AAT). The demonstrator is a Web application that hides the complexity of the underlying semantic framework; it seeks to show that alternative user interfaces are possible for RDF applications. As the user selects from the interface, an underlying SPARQL query is automatically constructed in terms of the corresponding ontological entities. It is possible to search across all datasets (the default) or to select a single dataset to search individually. A set of interactive controls offers search and browsing of the extracted archaeological data. The controls are designed to be browser agnostic and the demonstrator will run in most modern web browsers.
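As a rough illustration of how interface selections might be turned into a SPARQL query, the following minimal Python sketch assembles a query string from an optional object type, material and dataset. It is not the demonstrator's actual code: the function names, the use of crm:P2_has_type and crm:P45_consists_of to link items to AAT concepts, the use of a named graph per dataset, and the example.org IRIs are all assumptions made for illustration.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"


def iri(value):
    """Wrap an IRI in angle brackets, rejecting characters that could break out of it."""
    if any(ch in value for ch in '<>"{}|^`\\') or any(ch.isspace() for ch in value):
        raise ValueError(f"not a safe IRI: {value!r}")
    return f"<{value}>"


def build_query(object_type=None, material=None, dataset=None, limit=100):
    """Build a SELECT query from whatever the user has chosen (all arguments optional)."""
    patterns = []
    if object_type:
        # Assumption: items are typed with an AAT concept via P2_has_type.
        patterns.append(f"?item crm:P2_has_type {iri(object_type)} .")
    if material:
        # Assumption: materials are linked via P45_consists_of.
        patterns.append(f"?item crm:P45_consists_of {iri(material)} .")
    if not patterns:
        # Nothing selected: match any typed item.
        patterns.append("?item crm:P2_has_type ?type .")
    body = "\n    ".join(patterns)
    if dataset:
        # Assumption: each dataset is stored in its own named graph.
        body = f"GRAPH {iri(dataset)} {{\n    {body}\n  }}"
    return (
        f"PREFIX crm: <{CRM}>\n"
        f"SELECT DISTINCT ?item WHERE {{\n  {body}\n}}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Hypothetical concept IRIs; real AAT identifiers would be used in practice.
    print(build_query(
        object_type="http://example.org/type/keels",
        material="http://example.org/material/fagus",
    ))
```

Leaving the dataset argument empty corresponds to the default search across all datasets; supplying it restricts the pattern to one graph.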
The demonstrator can perform semantically structured queries, free-text queries, or a combination of both. Drop-down lists of all datasets, AAT materials and AAT object types used in the data are populated at startup, and a dual slider control is initialized to represent the minimum and maximum years for any object production dates present in the data. Hierarchical expansion has been implemented over the semantic structure of the Getty AAT and results from narrower concepts are included when available.
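The hierarchical expansion described above can be sketched as a transitive walk over broader/narrower links. The Python example below uses a toy, made-up hierarchy with placeholder identifiers (ex:a, ex:b and so on) rather than real AAT concepts, and the function names are hypothetical. It expands a selected concept to include every narrower concept and renders the result as a SPARQL VALUES clause. A SPARQL 1.1 property path (for example skos:broader*, assuming the hierarchy is exposed through such a property) would be an alternative way to achieve the same effect on the server side.

```python
from collections import defaultdict, deque


def narrower_closure(concept, broader_of):
    """Return the concept plus every concept that is transitively narrower.

    broader_of maps each concept to the list of its direct broader concepts.
    """
    # Invert the child -> parents map into parent -> children.
    narrower = defaultdict(set)
    for child, parents in broader_of.items():
        for parent in parents:
            narrower[parent].add(child)

    # Breadth-first walk down the hierarchy from the selected concept.
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in narrower[current] - seen:
            seen.add(child)
            queue.append(child)
    return seen


def values_clause(variable, concepts):
    """Render a set of concept IRIs as a SPARQL VALUES clause (sorted for stable output)."""
    iris = " ".join(f"<{c}>" for c in sorted(concepts))
    return f"VALUES ?{variable} {{ {iris} }}"


if __name__ == "__main__":
    # Toy hierarchy: ex:c is narrower than ex:b, which is narrower than ex:a.
    toy = {"ex:b": ["ex:a"], "ex:c": ["ex:b"], "ex:d": ["ex:x"]}
    print(values_clause("type", narrower_closure("ex:a", toy)))
```

With an expansion of this kind, a query on a broad concept also matches records annotated only with one of its narrower concepts.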
The NLP focus is on concepts relevant to the theme, such as samples, materials, objects and temporal information. The NLP output from the English, Dutch and Swedish reports was transformed to the same RDF format as the instance data extracted from the datasets and mapped to the CRM/AAT. NLP-derived RDF statements do not necessarily carry the same degree of reliability as those derived from the datasets: the Dutch and Swedish NLP pipelines are at a prototype stage, and the NLP outcomes include some false positives.
The demonstrator is available at http://ariadne-lod.isti.cnr.it/demonstrator.html
The Demonstrator source code is available (open source). For more information on the case study and the techniques involved, see ARIADNE D15.3 (Report on Semantic Annotation and Linking) and for a discussion of the NLP techniques see ARIADNE D16.4 (Second Report on Natural Language Processing). Both Deliverables are available at http://www.ariadne-infrastructure.eu/Resources
Example Queries
Query on records referring to AAT concept "Salix (genus)" with multilingual results from databases and reports. Some results refer to "willow" (the wood from trees of genus Salix), leveraging query expansion across relevant AAT semantic relationships.
Query on objects of type "roofs" with a production date in the range 1500 to 1600 AD. Results derive from reports via NLP, including some instances where the NLP process has associated an object with a date (or a date range).
Query on samples of objects of type "keels" made of material "Fagus (genus)".
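For the date-range example above (roofs produced between 1500 and 1600 AD), a record may carry either a single year or a range of years, so some rule is needed for deciding when it matches the slider range. The source does not state which rule the demonstrator applies; the Python sketch below assumes a simple interval-overlap test and shows the equivalent SPARQL FILTER expression. The function names and the variable names ?start and ?end are hypothetical.

```python
def overlaps(start, end, lo, hi):
    """True if the record's date range [start, end] overlaps the slider range [lo, hi]."""
    return start <= hi and end >= lo


def date_filter(lo, hi, start_var="start", end_var="end"):
    """Render the same overlap test as a SPARQL FILTER over two year variables."""
    if lo > hi:
        raise ValueError("minimum year must not exceed maximum year")
    return f"FILTER(?{start_var} <= {int(hi)} && ?{end_var} >= {int(lo)})"


if __name__ == "__main__":
    # A single year is treated as a range whose start and end coincide.
    print(overlaps(1550, 1550, 1500, 1600))
    print(date_filter(1500, 1600))
```

An overlap test is the more inclusive choice: a stricter alternative would require the record's whole range to fall inside the slider range, which would exclude loosely dated objects.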