A Reference Model for Data Mapping tools
Draft Update August 2013
1. Introduction
This is a draft of a reference model for data mapping tools. The intention is to address comprehensive and sustainable functionality in a scenario of information providers and aggregators, including the long-term maintenance of resources. We assume a distribution of responsibilities in which the information provider curates his/her resources and provides updates at regular intervals, whereas the aggregator is responsible for homogeneous access to the integrated data and for the resolution of co-references (multiple URIs for the same thing) across all provided data. In the course of transforming resources to the target system, some kinds of quality control can be done which the provider has no means to do (see also the services provided by OCLC). Therefore the information provider receives and benefits from data cleaning information produced by the aggregator.
The challenge is to define a modular architecture of as many components as possible that can be developed and optimized independently, with minimal interfaces between them, without hindering integrated UI development for the different user roles involved.
The first part of this model is a sort of requirements specification, which breaks down in the usual way into a definition of the involved user roles, the primary kinds of data the users aim at handling, and the complete definition of the processes users carry out to manage these data.
2. Process Model
In the following, we specify user roles and data objects in order to have a vocabulary to define user processes. Of course, these things are interdependent, so a certain redundancy is inevitable.
We do not regard IT processes as self-contained and opposed to user processes; rather, IT processes are part of the user processes, replacing manual work. Once such a model has been defined, it allows for a dynamic definition of which user processes are replaced or assisted by IT processes, and it justifies IT processes in terms of their utility for the functions users are ultimately interested in.
2.1 User Roles
Primary user roles are:
Provider: We call providers in this model the maintainers of local information systems. In this model, we may also call them simply source systems. Following CIDOC CRM v5.0, “These are either collection management systems or content management systems that constitute institutional memories and are maintained by an institution. They are used for primary data entry, i.e. a relevant part of the information, be it data or metadata, is primary information in digital form that fulfils institutional needs”. In practice, and even more generally, these are individual museums, archives, libraries, sites and monument records, academic institutes, private research societies etc., represented by their curators, IT referents or researchers. Providers ultimately have the knowledge about the meaning of their data in the real world (if anybody has it), or know who knows, or know how to verify it.
Aggregator: We call aggregators in this model the maintainers of integrated access systems. In this model, we may also call them simply target systems. Following CIDOC CRM v5.0, “These provide an homogeneous access layer to multiple local systems. The information they manage resides primarily on local systems.” Aggregators will maintain a form of business agreement with providers to send data, primarily metadata, from local systems to the aggregators’ systems. In this model, we are not interested in aggregators that harvest data without any business interaction with the provider; the model will be useful for such scenarios in a trivial way that we do not explicitly describe. Aggregators have no direct knowledge about the meaning of the data they aggregate.
Secondary user roles are experts whose knowledge or services contribute to the mapping:
Source Schema Expert: The curator, researcher or data manager of the local system who is responsible for the semantically correct data entry into the local system, i.e., the one who knows how fields, tables or elements in the schema correspond to the reality described by them following local use and practice.
Target Schema Experts: The expert(s) for the semantics of the schema employed by the aggregator (“integration model”). It is very likely that the aggregator uses a more widely known standard schema. Typically, but not exclusively, we talk in this model about the CIDOC CRM and extensions of it. Therefore the target schema experts may not need intimate knowledge of the aggregator’s context. Moreover, an established practice of a curator role on the aggregator side does not yet exist. Nevertheless, the requirement for semantic consistency in practice forces such a role to exist de facto, often as a user team including some provider representatives. Therefore the target schema experts should include, or be in close contact with, a sort of curators of the integrated access system in order to fulfill this role.
URI Expert: The expert of the aggregator, normally an IT specialist, who is responsible for maintaining the referential integrity of the (meta)data in the integrated access system and who knows how to generate from provider data valid URIs for the integrated access system.
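As an illustration of the URI Expert’s task, the following sketch shows one plausible way to generate valid, stable URIs from provider data. The namespace, the id scheme and the hashing policy are assumptions for illustration only; the reference model does not prescribe any particular URI policy.

```python
import hashlib

# Hypothetical aggregator namespace; an assumption, not part of the model.
AGGREGATOR_NS = "http://aggregator.example.org/id/"

def mint_uri(provider_id: str, record_type: str, local_id: str) -> str:
    """Derive a stable, opaque URI from provider data.

    The same (provider, type, local id) triple always yields the same
    URI, which helps preserve referential integrity across resubmissions.
    """
    key = f"{provider_id}|{record_type}|{local_id}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
    return f"{AGGREGATOR_NS}{record_type}/{digest}"

# Re-minting for the same source record is deterministic:
u1 = mint_uri("museumA", "object", "INV-1893-042")
u2 = mint_uri("museumA", "object", "INV-1893-042")
assert u1 == u2
```

The design choice here is that URIs are derived rather than assigned: a resubmitted record maps to the same URI without a lookup table, while records from different providers cannot collide.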
Source Terminology Expert: The curator, maintainer or other expert of one of the terminologies which the provider uses as reference in the local system. If the terminology is provided by a third party, such as the Getty Research Institute, there may exist independent experts trained in this terminology. If it is local or even uncontrolled, it is typically the curators or other local data managers who know the meaning of the local terms.
Target Terminology Expert: The curator, maintainer or other expert of one of the terminologies which the aggregator uses as reference in the integrated access system.
Aggregators normally want to avoid engaging in the terminology maintenance business. They will rather use and refer to third-party terminology, or take over provider term lists.
Mapping manager: The actor responsible for the maintenance of the data transformation process from the provider format to the aggregator format. This role may split into a semantic and a technical part, and may be regarded as either an aggregator task, a provider task or a user consortium’s task. In our opinion, the mapping technology this model aims at should enable scalable management of the data transformation process by the aggregator.
2.2 Data objects
We distinguish the following primary data objects:
Source systems: I.e., local information systems in the sense of the CIDOC CRM v5.0 (“These are either collection management systems or content management systems that constitute institutional memories and are maintained by an institution”) from which (meta)data are sent on a regular base or in a single action to some aggregator. We are interested here in their typical role relative to the processes we describe, regardless if they also may play some target system role to a third party.
Target system: i.e., integrated access systems in the sense of the CIDOC CRM v5.0 (“These provide an homogeneous access layer to multiple local systems. The information they manage resides primarily on local systems”), to which (meta)data are sent on a regular base or in a single action by several providers. We are interested here in their typical role relative to the processes we describe, regardless if they also may play some source system role to a third party.
Terminologies: Controlled vocabularies of terms that appear as individual data values in the source or target systems. We never use the term “vocabulary” for metadata schemata. Terminologies may be flat lists of words, or be described and organized in more elaborate structures such as so-called “thesauri” or “knowledge organization systems”, the most popular format now being SKOS. Here we do not use the term “ontology”, even if the terminology may qualify as such, as long as its use in this context is to provide data values.
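A terminology mapping between source and target systems can be recorded in the spirit of the SKOS mapping properties (skos:exactMatch, skos:broadMatch). The sketch below uses a plain lookup table; all term and concept identifiers in it are invented for illustration.

```python
# Hypothetical mapping from local terms to target-terminology concepts.
# The match kind records how faithful the correspondence is, following
# the distinction made by SKOS mapping properties.
term_mapping = {
    "local:pottery": ("skos:exactMatch", "target:300000001"),
    "local:cup":     ("skos:broadMatch", "target:300000002"),
}

def map_term(local_term: str):
    """Return the (match kind, target concept) pair, or None if unmapped."""
    return term_mapping.get(local_term)

print(map_term("local:pottery"))  # ('skos:exactMatch', 'target:300000001')
print(map_term("local:unknown"))  # None
```

Unmapped terms returning None is deliberate: during transformation they can be routed to the source and target terminology experts instead of being silently dropped.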
Content objects: Individual files or information units with an internal structure that is not described in terms of schema elements of the source or target systems. These are typically images or text documents, and are searched by content retrieval indices such as keyword search rather than by associative queries. They are described as objects by metadata records, which are searched by associative queries. Important in this context is not the actual structure of an information unit that qualifies as a content object, but the way it is treated in the information system (such objects are sometimes also called “blobs”). Many aggregators do not collect content objects but only link back to them in the provider system.
Metadata records: Information units with an internal structure that is described in terms of schema elements of the source or target systems. In our context, these are often data records describing content objects (hence the term “metadata”), but by a loose analogy the term has also come into use for data describing physical objects and other historical contexts. Therefore we define it here by the way it is treated in the information system, and not as “data about data”. Metadata records are the common target of submission to aggregators and therefore of transformation from the source to the target schema.
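The transformation of a metadata record from the source to the target schema can be sketched as a field-by-field mapping. The field names on both sides are assumptions made up for this example, not part of the reference model.

```python
# Hypothetical field mapping from a provider's source schema to the
# aggregator's target schema.
FIELD_MAP = {
    "titel":     "title",
    "datierung": "date",
    "material":  "medium",
}

def transform_record(source: dict) -> dict:
    """Transform one source-schema record into the target schema.

    Unmapped fields raise an error so that irregular input data is
    routed to exception processing rather than silently dropped.
    """
    target = {}
    for src_field, value in source.items():
        tgt_field = FIELD_MAP.get(src_field)
        if tgt_field is None:
            raise KeyError(f"unmapped source field: {src_field}")
        target[tgt_field] = value
    return target

print(transform_record({"titel": "Amphora", "material": "clay"}))
```

Raising on unmapped fields reflects the exception processing between provider and aggregator described in the process overview: a complete set of records can be transformed automatically, but irregularities surface explicitly.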
Secondary data objects are those that support the mapping processes:
Source schema definitions: data dictionaries, XML schemata, RDFS/OWL files etc. describing the data structures that are managed and can be searched by associative queries in the source systems.
Integration Model: The definition of the schema of the target system, now mostly an RDFS/OWL knowledge representation model (“ontology”).
Other kinds of data objects which are part of this reference model in the sense of products or interface definitions of the components it foresees, such as schema matching definitions etc. (see below).
2.3 Mapping processes
2.3.1 Overview
This reference model aims at identifying, supporting or managing the processes that need to be executed or maintained when a provider and an aggregator agree (1) to transfer data from the provider to the aggregator, (2) to transform their format to the (homogeneous) format of the aggregator, (3) to curate the semantic consistency of source and target data and the global referential integrity, and (4) to keep the transferred data up-to-date with whatever relevant changes occur in the source and target systems and the employed terminologies.
Note that experienced aggregators keep the original data from the provider, so that they can re-execute the data transformation process without asking for resubmission.
Figure 1: Mapping Processes diagram
At a first level, this breaks down into the following independent processes:
(a) Management of which data will be delivered and processed at what time, including updates.
(b) The mapping definition, i.e., specification of the parameters for the data transformation process, such that complete sets of data records can automatically be transformed, manual exception processing notwithstanding. This includes harmonization between multiple providers.
(c) The actual transfer of data until a first consistent state is achieved. This includes transformation of sets of data records submitted to the aggregator, the necessary exception processing of irregular input data between provider and aggregator, ingestion of the transformed records into the target system, and initial referential integrity processing, possibly on both sides.
(d) Referential integrity processing at the aggregator side out of the context of a particular data submission, which is not our concern in this model.
(e) Change detection and update processing to restore the ability of data transformation and semantic consistency, which comprises changes in the source or target records, in the source or target terminologies, in the source or target schemata, in the target URI policy, and in the good practice of interpretation of source and target schema in the mapping definition.
Only if these processes are sustained can an aggregator provide valid and consistently integrated data in the long term, and thereby deliver the full added value of an aggregation service as a resource for professional and private research, which ultimately justifies its existence. We observe that none of the dozens or hundreds of mapping tools and frameworks created in numerous projects has ever systematically addressed this comprehensive scenario.
2.3.2 Analytical Representation
2.3.2.1 Data Delivery Management
Data delivery management deals with which data will be delivered and processed at what time, including updates. A Mapping Manager may be responsible for this task, possibly with custodial participation or supervision on the provider and aggregator side.
Here we do not further analyze the subprocesses that do not affect the mapping itself. In general, IT support will be given by log files of delivered content, facilities to query (last) changes of source system records or other data requiring resubmission, and queries to support selection criteria in the source systems. Also, queries in the target system may be used to reveal semantic needs in the composition of the aggregation and to derive requests for certain materials from providers.
There is however a set of characteristic changes in the provider–aggregator environment that affect the mapping and may require:
- redefinition of the mapping,
- re-execution of the transformation of records already submitted to the aggregator and updating of the transformed records in the target system,
- resubmission of records from the source system.
The mapping manager must monitor such changes and initiate respective actions.
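One simple form of IT support for this monitoring is to compare two deliveries by record fingerprints, so the mapping manager can see which records were added, removed or changed since the last submission. This is a minimal sketch; the record identifiers and fields are invented for illustration.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Content hash of a record, independent of field order."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()

def detect_changes(previous: dict, current: dict):
    """Compare two deliveries keyed by record id.

    Returns (added, removed, changed) record-id lists, which the
    mapping manager can use to trigger re-transformation or updates.
    """
    prev_fp = {rid: fingerprint(r) for rid, r in previous.items()}
    curr_fp = {rid: fingerprint(r) for rid, r in current.items()}
    added = sorted(curr_fp.keys() - prev_fp.keys())
    removed = sorted(prev_fp.keys() - curr_fp.keys())
    changed = sorted(rid for rid in prev_fp.keys() & curr_fp.keys()
                     if prev_fp[rid] != curr_fp[rid])
    return added, removed, changed

# Example: record r1 was withdrawn, r2 edited, r3 newly delivered.
delta = detect_changes(
    {"r1": {"title": "a"}, "r2": {"title": "b"}},
    {"r2": {"title": "B"}, "r3": {"title": "c"}},
)
print(delta)  # (['r3'], ['r1'], ['r2'])
```

Fingerprinting the stored original data (see the note on keeping provider originals) means change detection needs no cooperation from the source system beyond the delivery itself.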
2.3.2.2 Mapping Definition
Mapping definition breaks down into:
- syntax normalization,
- schema matching,
- URI generation specification,
- terminology mapping.
The Mapping Manager may be responsible for issuing and coordinating these tasks.
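The four parts of a mapping definition can be pictured as a pipeline applied to each record. Everything below is a deliberately naive sketch; the field names, the term table and the URI rule are assumptions, and a real mapping definition would be declarative data rather than code.

```python
# 1. Syntax normalization: clean up field names and values.
def normalize(record: dict) -> dict:
    return {k.strip().lower(): v.strip() for k, v in record.items()}

# 2. Schema matching: source fields -> target fields (illustrative).
SCHEMA_MATCH = {"titel": "title"}

# 3. Terminology mapping: source terms -> target terms (illustrative).
TERM_MAP = {"ton": "clay"}

# 4. URI generation: a naive rule for demonstration only.
def generate_uri(record: dict) -> str:
    return "http://aggregator.example.org/id/" + record["title"].lower()

def apply_mapping(source_record: dict) -> dict:
    """Run one record through all four parts of the mapping definition."""
    rec = normalize(source_record)
    rec = {SCHEMA_MATCH.get(k, k): v for k, v in rec.items()}
    rec = {k: TERM_MAP.get(v, v) for k, v in rec.items()}
    rec["@id"] = generate_uri(rec)
    return rec

print(apply_mapping({" Titel ": "Amphora"}))
```

Keeping the four parts separate matches the modular architecture called for in the introduction: each can be developed, reviewed and re-run independently, with the record dictionary as the minimal interface between them.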