Home
prasoon edited this page 3 years ago

Standards and guide to reading this wiki

Must(Not), Should(Not), May(Not)

The terms must, must not, should, should not, may and may not are used as defined in RFC 2119.

WIP

This wiki is a work in progress. At a later stage it will be distilled into a specification document for the set of protocols for metadata exchange between participating institutions and the downstream systems.

Introduction

niosX is an adaptable backend capable of serving any application that manages highly connected data modelled using the W3C Web Annotation Data Model. We overlay a semantic layer of NiosxConcepts on the annotation data model to capture the structure of an application in a given context.

The annotations capture the different ways in which users can interact with the base object, and the NiosxConcepts impose a context-specific structure on top. Work on this project started as a discovery and interpretation tool for Milli, a consortium of archives, so we use Milli as the running example to illustrate its capabilities.

The base archival object is modelled as a MilliEntity, which extends NiosxEntity and adds the fields mandated by the ISAD(G) guidelines. This base object is enriched by annotations added by the users of the platform. niosX is conceptualised as a data graph with the MilliEntity instances (archival objects) and the annotations as nodes. The edges of this graph are NiosxConcepts, which add a semantic layer to the entities. In this framework, a comment is a simple annotation with a TextualBody, connected to the MilliEntity by an edge of type Comment (a type of NiosxConcept). Similarly, ScopeContent, Tagging and Copyright are all types of NiosxConcept that form edges connecting the object to an annotation. Users can also create a UserDefinedConcept to extend the available semantic vocabulary. A GenericConcept is a plain annotation for cases where a new concept is not needed and the available concepts are too specific.
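The data graph described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the niosX implementation: the class layout, the field names (`title`, `reference_code`, `body`) and the `DataGraph` helper are assumptions made for this example, and concepts are reduced to plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NiosxEntity:
    """Base object; the fields shown here are illustrative only."""
    id: str
    title: str


@dataclass
class MilliEntity(NiosxEntity):
    """Archival object; the real type adds the ISAD(G) fields."""
    reference_code: str = ""


@dataclass
class Annotation:
    """A node holding user-contributed content, e.g. a TextualBody."""
    id: str
    body: str


@dataclass
class ConceptEdge:
    """An edge typed by a NiosxConcept such as Comment or Tagging."""
    concept: str
    entity_id: str
    annotation_id: str


@dataclass
class DataGraph:
    """Hypothetical container: entities and annotations are nodes."""
    entities: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)
    edges: List[ConceptEdge] = field(default_factory=list)

    def annotate(self, entity_id, annotation, concept):
        """Attach an annotation to an entity through a concept edge."""
        if entity_id not in self.entities:
            raise KeyError(f"unknown entity: {entity_id}")
        self.annotations[annotation.id] = annotation
        self.edges.append(ConceptEdge(concept, entity_id, annotation.id))

    def annotations_for(self, entity_id, concept):
        """Return the annotations linked to an entity by one concept."""
        return [
            self.annotations[edge.annotation_id]
            for edge in self.edges
            if edge.entity_id == entity_id and edge.concept == concept
        ]


graph = DataGraph()
graph.entities["e1"] = MilliEntity(id="e1", title="Letter", reference_code="REF-1")
graph.annotate("e1", Annotation(id="a1", body="Needs a date."), "Comment")
graph.annotate("e1", Annotation(id="a2", body="correspondence"), "Tagging")
comments = graph.annotations_for("e1", "Comment")
```

Keeping the concept on the edge rather than on the annotation means the same annotation shape can serve as a comment, a tag or a copyright note, depending only on how it is connected.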

APIs for other domains can be constructed by providing contracts for a base object (a sub-type of NiosxEntity) and a semantic layer of NiosxConcept(s).

milli-discovery is an API provider that sits at the edge of the Milli ecosystem. It is responsible for extracting metadata from the participating institutions and exposing it through a GraphQL API to the downstream components that consume it. Users of the application can annotate these objects in different ways.

niosX Object Model

Figure 1 niosX Object Model

Figure 1 above shows some of the building blocks of niosX. We begin with a metadata object (EAD3, RSS, etc.) and convert it to a NiosxEntity instance. The ingestion process may also extract other information from the metadata object, create annotations from it and connect them to the object using the appropriate NiosxConcept(s). After ingestion, users interact with the application and add more annotations to the object.

Design concerns

Registering participating institutions

A list of participating institutions needs to be maintained. Users with appropriate authorization must be able to add institutions to this list and remove them from it.
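As a rough sketch of this requirement, the registry below only lets authorized users change the list. Everything in it is an assumption for illustration: the `InstitutionRegistry` class, the idea of storing an endpoint per institution, and the example URLs. A real deployment would persist the list and delegate authorization to the platform's own mechanism.

```python
class InstitutionRegistry:
    """In-memory list of participating institutions (illustrative only)."""

    def __init__(self, authorized_users):
        self._authorized = set(authorized_users)
        self._institutions = {}

    def _require_authorization(self, user):
        # Reject any change requested by a user outside the authorized set.
        if user not in self._authorized:
            raise PermissionError(f"{user} may not modify the registry")

    def add(self, user, institution_id, endpoint):
        self._require_authorization(user)
        self._institutions[institution_id] = endpoint

    def remove(self, user, institution_id):
        self._require_authorization(user)
        self._institutions.pop(institution_id, None)

    def institution_ids(self):
        return sorted(self._institutions)


registry = InstitutionRegistry(authorized_users=["admin"])
registry.add("admin", "archive-a", "https://archive-a.example/metadata")
registry.add("admin", "archive-b", "https://archive-b.example/metadata")
registry.remove("admin", "archive-b")

try:
    registry.add("guest", "archive-c", "https://archive-c.example/metadata")
    rejected = False
except PermissionError:
    rejected = True
```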

Harvesting metadata

Harvesting the metadata from the participating institutions needs to be handled by milli-discovery. The plausible strategies for harvesting are:

  • Data-sync performed at designated intervals.
  • On-demand sync performed as a batch job for a selection of the participating institutions.
  • On-demand sync performed per source.

Community contributed crosswalks for metadata exchange format

The design should incorporate a mechanism for the community to contribute plugin-like components that provide crosswalks from other standards to the milli-discovery specification. The main challenge for such an implementation is data-type conversion, which could be handled by limiting the data types to a small fixed set and defining type-conversion macros for them. With these constraints in place, a community-contributed crosswalk would be a simple mapping from source keys to target keys, provided in a text-based data exchange format such as JSON.
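To make the idea concrete, here is a minimal sketch of what such a crosswalk and its interpreter might look like. The JSON layout, the target keys (`title`, `year`, `tags`) and the three type names are assumptions invented for this example; the wiki does not define a crosswalk format yet.

```python
import json

# Hypothetical community-contributed crosswalk: each source key names its
# target key and one type from a small fixed set.
CROSSWALK_JSON = """
{
  "source": "dublin-core",
  "fields": {
    "dc:title":   {"target": "title", "type": "string"},
    "dc:date":    {"target": "year",  "type": "integer"},
    "dc:subject": {"target": "tags",  "type": "string_list"}
  }
}
"""

# The fixed set of type-conversion "macros" (assumed names).
CONVERTERS = {
    "string": str,
    "integer": int,
    "string_list": lambda value: [part.strip() for part in value.split(";")],
}


def apply_crosswalk(crosswalk, record):
    """Map a source record to target keys, converting each value."""
    target = {}
    for source_key, rule in crosswalk["fields"].items():
        if source_key in record:
            convert = CONVERTERS[rule["type"]]
            target[rule["target"]] = convert(record[source_key])
    return target


crosswalk = json.loads(CROSSWALK_JSON)
result = apply_crosswalk(
    crosswalk,
    {"dc:title": "Letter", "dc:date": "1947", "dc:subject": "partition; letters"},
)
```

Because the crosswalk is pure data and the converters are a closed set, a contribution can be reviewed and validated without running any contributed code.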

Data validation

A data validation utility should give the user enough information to correct any validation errors. A utility that fixes superficial validation errors, such as incorrect source-to-target mapping of keys, may also be provided.
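One possible shape for such a utility is sketched below. The set of expected keys is hypothetical, and the suggestion of a likely key for a misspelled one (using the standard library's `difflib`) is just one way to produce messages a user can act on.

```python
import difflib

# Hypothetical target keys; the real specification is not defined here.
EXPECTED_KEYS = {"identifier", "title", "date"}


def validate(record):
    """Return human-readable messages describing key-level problems."""
    messages = []
    for key in sorted(EXPECTED_KEYS - set(record)):
        messages.append(f"missing required key '{key}'")
    for key in sorted(set(record) - EXPECTED_KEYS):
        # Suggest the closest expected key, if any is similar enough.
        guess = difflib.get_close_matches(key, sorted(EXPECTED_KEYS), n=1)
        hint = f"; did you mean '{guess[0]}'?" if guess else ""
        messages.append(f"unknown key '{key}'{hint}")
    return messages


messages = validate({"identifier": "REF-1", "titel": "Letter"})
```

The same suggestions could drive the optional fixing utility: an unknown key with a single close match is a candidate for automatic renaming, subject to the user's confirmation.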

Metadata specification for data exchange

milli-discovery expects the participating institutions to provide metadata in a format that the application understands; this section describes that format. At a later stage, tools may be provided to enable crosswalks between this specification and other standards such as Dublin Core, DPLA MAP, etc.

Specification design concerns

Future proof

The specification must be structured as an aggregation of classes (object types). This makes the specification future-proof, as adding or removing fields is less likely to break it.

Abstract Data Model

The specification is defined as an abstract data model. However, it borrows from the design principles of the GraphQL specification (GraphQL is itself an abstract query language) and can be implemented with any data exchange format, such as JSON.

Harvesting metadata

Harvesting metadata, which is essentially a POST to the API endpoint, is modelled as a mutation. However, the metadata may not always flow into the system as well-formed JSON conforming to the GraphQL schema. For such cases, a utility must be provided that ingests CSV (with the header providing key-value mapping information, for instance), XML or other formats and calls the mutation endpoint internally.
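The CSV case could look roughly like the sketch below, which turns each row into the JSON body of a GraphQL request. The mutation name `harvestEntity`, the `EntityInput` type and the column names are placeholders, since the schema is not defined in this wiki. The sketch stops at building the payloads and does not send them.

```python
import csv
import io
import json

# Hypothetical mutation; the real schema is not defined in this wiki.
MUTATION = """
mutation Harvest($input: EntityInput!) {
  harvestEntity(input: $input) { id }
}
"""


def csv_to_payloads(csv_text):
    """Build one GraphQL request body per CSV row, keyed by the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    payloads = []
    for row in reader:
        payloads.append({"query": MUTATION, "variables": {"input": dict(row)}})
    return payloads


sample = "identifier,title\nREF-1,Letter\nREF-2,Photograph\n"
payloads = csv_to_payloads(sample)
body = json.dumps(payloads[0])  # what would be POSTed to the endpoint
```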

Consuming metadata

Metadata is consumed through the query endpoint of the schema.

This guide provides the following information:

  1. Abstract data model that defines various objects and their relationships with each other.
  2. Example XML structure of the expected metadata file that would be consumed by the application.

Data transformations

Three types of data transformation cover the vast majority of source-to-target transformations needed while ingesting data:

  1. Source field to target field mapping.
  2. Source data type to a different target data type transformation (example: Date type to its Long form).
  3. A field in the source type mapped to an object in the target or vice-versa (example: FullName mapped to name : {firstName, lastName}).

Most transformations needed to reconcile source data with the format expected for ingestion can be handled as one of these or as a combination of them.
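The three transformation types listed above can be sketched together on a single record. The source and target field names are invented for this example. The "Long form" of a date is interpreted here as milliseconds since the Unix epoch in UTC, which is an assumption about what the list item means.

```python
from datetime import datetime, timezone


def transform(source):
    """Apply the three transformation types to one source record."""
    # Type 3: split one source field into a nested target object.
    first, _, last = source["FullName"].rpartition(" ")
    # Type 2: convert an ISO date string to epoch milliseconds (assumed
    # meaning of "Long form"), treating the date as UTC.
    created = datetime.fromisoformat(source["Created"]).replace(tzinfo=timezone.utc)
    return {
        # Type 1: plain source-field to target-field mapping.
        "title": source["Title"],
        "createdAt": int(created.timestamp() * 1000),
        "name": {"firstName": first, "lastName": last},
    }


record = transform(
    {"Title": "Letter", "Created": "2000-01-01", "FullName": "Ada Lovelace"}
)
```

Splitting on the last space is a deliberate simplification; names with more than two parts would need a rule agreed in the crosswalk.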