Making data available on the web is a useful and noble act. It greatly helps data usage, but it is not that easy for the data supplier. While working on SPIDER we have encountered several problems and challenges, which we tried to solve as well as possible. This has led to the recommendations we would like to share below. The recommendations are grouped into the different stages that one usually goes through when publishing data as Linked Data. With the recommendations come explanations of some of the choices made when publishing data for SPIDER.
Of course the web is a good place to find documents that give advice on how to publish data on the web:
According to the Linked Data paradigm, data on the web are identified and accessed using HTTP(S) URIs. Determining the URIs to use for the elements in your dataset is therefore an important step. It is also a step that should be taken with some care and foresight. When data are made available on the web, there is much you can change afterwards, but changing URIs, once published, is best avoided. Once URIs are added to the web, anyone can use them as references in other web publications, so it is bad form to stop URIs from working. This also means that once a URI identifies something, it should keep identifying that same thing. You can change the data that describe a thing (resource) on the web, but its URI should be stable and persistent.
Making (minting) good URIs often starts with choosing the first bit, the host name part. What the host name looks like does not matter much, because URIs or parts of URIs do not need to be interpreted. What matters is making sure the host name can remain stable, to avoid having to change URIs once they are set free on the web. In the case of SPIDER, the domain spider-ld.org was bought by Geodan and it was decided to use http://data.spider-ld.org as the basis of all data URIs. We also thought that we were likely to publish multiple datasets, so a dataset identifier should also be part of the URI. This leads to http://data.spider-ld.org/kerkennl, in which kerkennl names a dataset. Because it could be possible that we would like to publish other things about this dataset next to its raw data (for instance nicely readable documentation), we expanded the URI path to http://data.spider-ld.org/kerkennl/data. This enables publishing documentation under http://data.spider-ld.org/kerkennl/doc, for instance.
In order to have a URI for each individual resource in the dataset we were fortunate to be able to use existing unique identifiers in the source dataset. Such unique keys identify only one thing, and always the same thing, regardless of the changes a dataset may undergo. This makes them very useful as part of a URI. In the URI http://data.spider-ld.org/kerkennl/data/kerk17, which leads to data about a particular church building, the number 17 is a primary key value in a table in the source dataset (an Access file in this case).
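The URI pattern described above can be sketched as a small function: a stable base, a dataset name, a 'data' path segment, and the record's primary key. The function name and the 'kerk' resource prefix below simply mirror the example URI; they are not part of any official SPIDER tooling.

```python
# Sketch of the SPIDER URI pattern: base / dataset / 'data' / type + key.
BASE = "http://data.spider-ld.org"

def mint_uri(dataset, resource_type, key):
    """Build a stable resource URI from a dataset name and a primary key."""
    return f"{BASE}/{dataset}/data/{resource_type}{key}"

print(mint_uri("kerkennl", "kerk", 17))
# → http://data.spider-ld.org/kerkennl/data/kerk17
```

Because the key is stable in the source dataset, the minted URI stays stable even as the described data change.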
For more advice on how to mint URIs see the chapter 'The Role of "Good URIs" for Linked Data' in Best Practices for Publishing Linked Data.
Metadata are data about data and are very useful to have. Good and extensive metadata help ensure your dataset is easy to find, easy to understand and easy to use, and they lessen the chance of misuse of the data. So describing as many characteristics of the data as possible is highly recommended.
Metadata are a type of data, so they can be made available as Linked Data, using HTTP(S) URIs. Using commonly used vocabularies (collections of definitions) for metadata will help. Datasets can be about many different things, but metadata describing datasets usually cover the same topics, like
Some useful vocabularies for describing metadata can be found on the web. They nicely supplement each other:
An example of metadata describing a dataset are the metadata of the dataset "Church buildings in the Netherlands 1800-1970", which can be accessed using the URI of the dataset: http://data.spider-ld.org/kerkennl/data. In SPIDER, metadata are managed by simply maintaining a text file in Turtle format (Turtle is a notation for RDF that is relatively easy for humans to read and write). The file can be edited with any text editor. Its contents can be validated using an online validator for Turtle (search for 'RDF turtle validator'), or on the command line, for example using 'riot', part of the open source project Apache Jena. In SPIDER, we import the metadata file into our data publication platform of choice (more about that in Serve data on the web) after each change.
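To give an impression of what such a Turtle metadata file looks like, here is a minimal, hypothetical excerpt using the widely used DCAT and Dublin Core vocabularies; the description text and publisher URI are made up for illustration and do not reproduce SPIDER's actual metadata file.

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .

<http://data.spider-ld.org/kerkennl/data>
    a dcat:Dataset ;
    dct:title "Church buildings in the Netherlands 1800-1970"@en ;
    dct:description "Hypothetical description of the dataset."@en ;
    dct:publisher <http://example.org/organisation/publisher> .
```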
An important step in sharing data on the web is making known the meaning of the various elements in a dataset. On the web of data, the preferred method is using existing common semantics. This achieves an important goal: semantic interoperability. Semantic interoperability facilitates finding and combining data from different sources and reduces the risk of misinterpretation of data.
On the web many vocabularies can be found that contain definitions of many different concepts. There are different kinds of vocabularies. Some contain only general terms, like the various upper ontologies that have been developed, or schema.org, which is made available by the makers of search engines to make data and web pages easier to index. There are also specialized vocabularies that only contain definitions of concepts in a certain domain. For example, for the publication of the dataset "Church buildings in the Netherlands 1800-1970" SPIDER has made use of vocabularies about cultural heritage:
It can be difficult to find all applicable semantics for your data on the semantic web. A useful resource for finding applicable vocabularies is Linked Open Vocabularies.
Should it really be impossible to find suitable existing semantics, then it is always possible to make something yourself and contribute that to the web. Several standards allow doing that. The Simple Knowledge Organization System (SKOS) is relatively simple, but lacks advanced options. More is possible when using RDFS and even more when using the Web Ontology Language (OWL). But using the latter properly does require quite some learning.
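To illustrate how little is needed to coin a term yourself, here is a minimal, hypothetical SKOS definition in Turtle; the example.org URIs and labels are invented for illustration only.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://example.org/def/churchtower>
    a skos:Concept ;
    skos:prefLabel "church tower"@en ;
    skos:definition "A tower forming part of a church building."@en ;
    skos:broader <http://example.org/def/buildingpart> .
```

Publishing such definitions at stable URIs lets others reuse them, which is how common semantics grow on the web.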
Next to finding appropriate semantics it can be challenging to apply semantics well, and doing so in a coherent manner. In other words, the structure and logic of a dataset should be linked to web semantics as well as possible. In SPIDER too this was difficult in some areas. A few problems and the solutions we found are described below:
Publishing data as Linked Data means the data should use the Resource Description Framework (RDF). In RDF, data are expressed as triples of subject, predicate and object. Those triples can be linked to form graphs. For several reasons, this is a handy way to express data. But it is very different from the way most data are initially available: in the shape of tables. So usually it is necessary to transform research data from a table structure to an RDF structure.
From the previous step, finding (common) web semantics to describe data, the target RDF structure for publishing data on the web should have followed. Not only will general terms have been found (or coined), but also the way those terms relate to each other. For example, if semantics have been found for the concepts of 'book' and 'author', a term like 'was written by' could be needed to relate those terms. So let's assume we know what the data look like and what they should look like. The question then is: how can the data be transformed from the starting shape to the target shape?
The need to transform tabular data to RDF is a common one and multiple solutions are available. To name a few:
SPIDER has chosen a pragmatic solution, based on the assumption that knowledge of SQL, the query language for tabular data, is almost always required for transformation to RDF. We chose a generally applicable method that only requires knowing SQL. The method does require the data to be accessible via a database that supports SQL. A good open source relational database is PostgreSQL. It has many ways of manipulating data, which is useful for transformation to RDF.
In the first SPIDER use case, original data are collected in an Access file. By means of a Foreign Data Wrapper, a PostgreSQL module for working with data external to the PostgreSQL database, it becomes possible to query the data through PostgreSQL, using SQL. This enables generating RDF triples. For example, the following SQL query
select '<http://data.spider-ld.org/kerkennl/data/kerk' || id || '> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Church> .' from churches;

produces triples as below, as many as there are rows in the table 'churches'.
<http://data.spider-ld.org/kerkennl/data/kerk1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Church> .
<http://data.spider-ld.org/kerkennl/data/kerk2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Church> .
<http://data.spider-ld.org/kerkennl/data/kerk3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Church> .
The triples are written in N-Triples format. Like other RDF formats, N-Triples can be converted to other RDF formats, or be imported into an RDF storage medium.
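In practice such conversions are done with an RDF tool (for example 'riot' from Apache Jena, mentioned earlier). Purely to illustrate what the conversion involves, here is a minimal Python sketch that handles only the trivial case shown above, with URI-only triples and no literals; the prefix choices are our own.

```python
# Minimal sketch: convert simple URI-only N-Triples to Turtle by grouping
# prefixes. Real conversions should use a proper RDF library or tool.
def ntriples_to_turtle(nt_text, prefixes):
    """Abbreviate URIs with the given prefixes and emit Turtle lines."""
    def shorten(uri):
        for pfx, base in prefixes.items():
            if uri.startswith(base):
                return f"{pfx}:{uri[len(base):]}"
        return f"<{uri}>"

    out = [f"@prefix {p}: <{b}> ." for p, b in prefixes.items()]
    for line in nt_text.strip().splitlines():
        # naive parse: three whitespace-separated URI terms, trailing ' .'
        s, p, o = (t.strip("<>") for t in line.rstrip(" .").split())
        out.append(f"{shorten(s)} {shorten(p)} {shorten(o)} .")
    return "\n".join(out)

nt = "<http://data.spider-ld.org/kerkennl/data/kerk1> " \
     "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> " \
     "<http://dbpedia.org/ontology/Church> ."
print(ntriples_to_turtle(nt, {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dbo": "http://dbpedia.org/ontology/",
    "kerk": "http://data.spider-ld.org/kerkennl/data/",
}))
```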
A query like above can be executed on the command line, using psql:
psql -U postgres -A -t -d spiderdb1 -c "select '<http://data.spider-ld.org/kerkennl/data/kerk' || id || '> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Church> .' from churches;" > churches.nt
Executing this line will cause the generated triples to be written to the file 'churches.nt'. Several of these lines can be put together in an executable script. Executing such a script can transform an entire dataset to RDF.
The example SQL query above is relatively simple. The nice thing about using an advanced database like PostgreSQL is that more complex queries can be made to get the data in the desired shape. In the example query below, geographic point coordinates are expressed in WKT format and rounded off at the same time, using PostgreSQL's PostGIS extension:
select '<http://data.spider-ld.org/kerkennl_extra/data/geom' || id || '_point_rd> <http://www.opengis.net/ont/geosparql#asWKT> "<http://www.opengis.net/def/crs/EPSG/0/28992> ' || ST_AsText(ST_GeomFromGeoJSON(ST_AsGeoJSON(puntgeometrie_28992,0,0))) || '"^^' || '<http://www.opengis.net/ont/geosparql#wktLiteral> .' from churches_location;
If you have managed to obtain RDF data in a file, you could choose to use a web server to publish the file (and thereby the data) on the web. For the convenience of data consumers it is good to offer the data in different file formats. Two recommended formats are HTML and JSON-LD. Formatting data as HTML pages is useful because web browsers know how to render HTML by default. That makes HTML a good format to provide direct insight into data for human users. Also, HTML can provide clickable hyperlinks that can be used to discover more data.
JSON-LD is useful because it is based on JSON and therefore easy to process in web applications. JSON-LD can be regarded as a format for RDF, just like Turtle or N-Triples. That means JSON-LD allows retrieving data as a graph.
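As an impression of the format, here is a minimal, hypothetical JSON-LD document for one church resource, built with Python's standard json module. The "@context" maps short keys to full vocabulary URIs, so the same document can be read as plain JSON by a web application or as RDF by a Linked Data client; the label value is fictitious.

```python
import json

# Hypothetical JSON-LD description of one resource from the dataset.
doc = {
    "@context": {
        "dbo": "http://dbpedia.org/ontology/",
        "label": "http://www.w3.org/2000/01/rdf-schema#label",
    },
    "@id": "http://data.spider-ld.org/kerkennl/data/kerk17",
    "@type": "dbo:Church",
    "label": "Example church",  # fictitious value
}
print(json.dumps(doc, indent=2))
```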
When publishing data as files it is recommended to configure the web server to support content negotiation. This allows consumers to easily request data in the required format.
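In practice content negotiation is usually handled by the web server itself (Apache, nginx) or by the publishing platform. To show the idea, here is a deliberately naive Python sketch that picks a response format from the client's Accept header; it ignores quality (q) values and is not a substitute for a real implementation.

```python
# Naive content negotiation sketch: return the first supported media type
# the client lists, with HTML as the fallback. Quality values are ignored.
SUPPORTED = ["text/html", "application/ld+json", "text/turtle"]

def negotiate(accept_header):
    """Pick a supported media type from an HTTP Accept header."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()  # drop ';q=...' parameters
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return SUPPORTED[0]
    return SUPPORTED[0]

print(negotiate("text/turtle, */*;q=0.8"))  # → text/turtle
```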
For small and simple datasets it may be sufficient to serve data as files. But using specialised software to publish Linked Data can greatly increase user friendliness, especially for large datasets. Platforms for publishing and managing RDF data can also offer advanced ways of analyzing, selecting and retrieving data. Different APIs could be made available, for instance SPARQL, a standard query language for RDF data.
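To show what consuming such an API looks like, here is a sketch of querying a SPARQL endpoint over HTTP GET, as standardised by the SPARQL protocol ('query' parameter). The endpoint URL below is an assumption for illustration; substitute the endpoint your platform actually exposes.

```python
from urllib.parse import urlencode

ENDPOINT = "http://data.spider-ld.org/sparql"  # hypothetical endpoint URL

# Select up to ten resources typed as churches.
query = """SELECT ?church WHERE {
  ?church a <http://dbpedia.org/ontology/Church> .
} LIMIT 10"""

# Build the GET request URL; it can be fetched with any HTTP client.
url = ENDPOINT + "?" + urlencode({"query": query})
print(url)
```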
There are many platforms for publishing Linked Data to choose from. Sometimes such a system has its own storage capability, a triplestore or quadstore, optimised for storing and retrieving RDF data. A platform can also be linked to a relational database, with the platform taking care of conversion to RDF. Some platforms that can be used for free are:
And of course there are several platforms that have to be paid for.
SPIDER has chosen to use the open source project Marmotta because of its current functionality and its expected future support for GeoSPARQL, but above all because it is user friendly. Installing and activating Marmotta is not much work, making it a good choice for the humanities student wanting to share interesting data on the web.