
By Peter Sefton

A version of this post is also available at my website.

This is a presentation I gave at eResearch Australasia 2017-10-18 about the new Draft (v0.1) Data Crate Specification for data packaging I've just completed, with lots of help from others (credits at the end).

BACKGROUND

In 2013 Peter Sefton and Peter Bugeia presented at eResearch Australasia on a format for packaging research data(1), using standards based metadata, with one innovative feature – instead of including metadata in a machine readable format only, each data package came with an HTML file that contained both human and machine readable metadata, via RDFa, which allows semantic assertions to be embedded in a web page.

Variations of this technique have been included in various software products over the last few years, but there was no agreed standard on which vocabularies to use for metadata, or specification of how the files fitted together.

THE PRESENTATION

This presentation will describe work in progress on the DataCrate specification(2), illustrated with examples, including a tool to create DataCrates. We will also discuss other work in this area, including Research Object Bundles(3) and Data Conservancy(4) packaging.

We will be seeking feedback from the community on this work: should it continue? Is it useful? Who can help out?

The DataCrate spec:

  • Has both human and machine readable metadata at a package (data set/collection) level as well as at a file level

  • Allows for and encourages inclusion of contextual metadata such as descriptions of organisations, facilities, experiments and people linked to files with meaningful relationships (eg to say a file was created by a particular machine, as part of a particular experiment, at an organisation).

  • Is a BagIt profile(5). BagIt(6) is a simple packaging standard for file-based data.

  • Has a README.html tag file at the root with BagIt-style metadata about the distribution (contact details etc.), with a link to the CATALOG.html file

  • Has a CATALOG.html file in RDFa, using schema.org metadata, inside the payload (data) dir with detailed information about the files in the package, and a redundant CATALOG.json in JSON-LD format

  • Is easily extensible, as it is based on RDF.

REFERENCES

1. Sefton P, Bugeia P. Introducing next year’s model, the data-crate; applied standards for data-set packaging. In: eResearch Australasia 2013 [Internet]. Brisbane, Australia; 2013. Available from: http://eresearchau.files.wordpress.com/2013/08/eresau2013_submission_57.pdf

2. datacrate: Bagit-based data packaging specification for dissemination of research data with useful human and machine readable metadata: “Make Data Crate Again!” [Internet]. UTS-eResearch; 2017 [cited 2017 Jun 29]. Available from: https://github.com/UTS-eResearch/datacrate

3. Research Object Bundle [Internet]. [cited 2017 Jun 16]. Available from: https://researchobject.github.io/specifications/bundle/

4. Data Conservancy Packaging Specification Home [Internet]. [cited 2017 Jun 29]. Available from: http://dataconservancy.github.io/dc-packaging-spec/dc-packaging-spec-1.0.html

5. Ruest N. BagIt Profiles Specification [Internet]. 2017 Jun. Available from: https://github.com/ruebot/bagit-profiles

6. Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06



DataCrate: Formalising ways of packaging research data for re-use and dissemination

Peter Sefton, University of Technology Sydney




Peter Bugeia and I talked about this 4 years ago. This year I got around to leading the effort to standardise what we did back then.



This presentation is structured as a story.

Back in June Cameron Neylon was annoyed



"More concretely I specifically have data from a set of interviews. I have audio and I have notes/transcripts. I have the interview prompt. I have decided this set of around 40 files is a good package to combine into one dataset on Zenodo. So my next step is to search for some guidance on how to organise and document that data. Interviews, notes, must be a common form of data package right? So a quick search for a tutorial, or guidance or best practice?

Nope. Give it a go. You either get a deep dive into metadata schema (and remember I'm one of the 2% who even know what those words mean) or you get very high level generic advice about data management in general. Maybe you get a few pages giving (inconsistent) advice on what audio file formats to use."

When I saw this cry for help I contacted Cameron and offered to work with him.



"As a researcher trying to do a good job of data deposition, I want an example of my kind of data being done well, so I can copy it and get on with my research"

More from Cameron.



There were no examples

But actually, there are no simple examples of how to organise "long-tail" data sets for publication. Research data management books will tell you about various metadata standards, but how do you enter the metadata and associate it with your data?



So we made one

Fast forward to this week ...

Professor Cameron Neylon has published his dataset

https://doi.org/10.13039/501100000193

The dataset is available from Zenodo, an open data repository hosted by CERN.



It's a zipped-up BagIt bag



There's a catalog inside

This is a human-readable catalog that lists all the files in the data set.



With information about people, places, licenses and their relationships to the files in the DataCrate

And has information about their context and the relationships between them.



For example, it shows that Cameron is the creator of the dataset. Note that Cameron is identified by his ORCID ID: http://orcid.org/0000-0002-0068-716X. Using URLs to identify things such as people is one of the key principles of Linked Data.



With lots of useful info about relationships between the files

Like this one is a translation of this other one

Here's an example of a relationship between two of the files - one is a translation of another.



And it's not just nice tables either

<div
  resource="./data/.../WorkshopBookletParticipants.docx"
  property="http://schema.org/translationOf">
  ...
</div>

The HTML contains embedded RDFa metadata. RDFa is a standard way of embedding semantics in a web page.
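To give a feel for how a program could pick that metadata out of the page, here is a minimal sketch using only the Python standard library. It is not a full RDFa processor: it simply lists the `property` and `resource` attribute pairs it finds. The `PropertyCollector` class and the sample markup are illustrative assumptions modelled on the fragment above, not part of the DataCrate tooling.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, resource) attribute pairs from start tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "property" in attributes:
            # resource may be absent on some elements, in which case it is None.
            self.pairs.append((attributes["property"], attributes.get("resource")))


# Sample markup modelled on the slide; the "..." in the path is elided in the original.
SAMPLE_HTML = """
<div
  resource="./data/.../WorkshopBookletParticipants.docx"
  property="http://schema.org/translationOf">
</div>
"""

collector = PropertyCollector()
collector.feed(SAMPLE_HTML)
print(collector.pairs)
```

A real RDFa parser would also work out which subject each statement is about; this sketch deliberately leaves that out.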



That's standard semantic web metadata as used by search engines

RDFa, using the schema.org metadata vocabulary, is widely used by search engines.



Movie times, opening times, recipes - these are some of the things that search engines understand.



But that's not all.

There's programmer-friendly JSON metadata: easy to look up Contact

This package also has JSON metadata.





"@graph": [
  {
    "@id": "data",
    "@type": "Dataset",
    "Contact": {
      "@id": "http://orcid.org/0000-0002-0068-716X",
      "@type": "Person",
      "Email": "cn@cameronneylon.net",
      "ID": "http://orcid.org/0000-0002-0068-716X",
      "Name": "Cameron Neylon"
    },

The JSON is easily usable by programmers - getting the contact for this dataset for example is a simple operation.
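As a minimal sketch of that lookup, assuming the dataset entry sits in `@graph` as in the fragment above: the `find_contact` helper and the trimmed sample document are illustrative, not part of the spec or of Calcyte.

```python
import json

# Trimmed-down sample modelled on the fragment shown on the slide;
# a real CATALOG.json contains many more entries and an "@context".
SAMPLE = """
{
  "@graph": [
    {
      "@id": "data",
      "@type": "Dataset",
      "Contact": {
        "@id": "http://orcid.org/0000-0002-0068-716X",
        "@type": "Person",
        "Email": "cn@cameronneylon.net",
        "ID": "http://orcid.org/0000-0002-0068-716X",
        "Name": "Cameron Neylon"
      }
    }
  ]
}
"""


def find_contact(catalog):
    """Return the Contact of the first Dataset item in @graph, or None."""
    for item in catalog.get("@graph", []):
        if item.get("@type") == "Dataset":
            return item.get("Contact")
    return None


catalog = json.loads(SAMPLE)
contact = find_contact(catalog)
print(contact["Name"], contact["@id"])
```

No RDF tooling is needed for this kind of use; plain dictionary access is enough.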



And use the context to expand that to a full unambiguous URI

But if needed, the simple "Contact" can be turned into a URI, as per Linked Data principles.



"@context": {
  ...
  "Description": "schema:description",
  "License": "schema:license",
  "Title": "schema:name",
  "Name": "schema:name",
  "Creator": "schema:creator",
  ...
  "TranslationOf": "schema:translationOf",
  "Funder": "schema:Funder",
  "Person": "schema:Person",
  "Contact": "schema:accountablePerson",
  ...
  "schema": "http://schema.org/",

You can look up Contact in the DataCrate JSON-LD context and see that it maps to schema:accountablePerson



Contact -> schema:accountablePerson
schema:accountablePerson -> http://schema.org/accountablePerson

Then you can map schema:accountablePerson to http://schema.org/accountablePerson
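The two-step mapping can be sketched in a few lines. This is a simplification for illustration, not the full JSON-LD expansion algorithm (a JSON-LD library would handle the general case); `expand_term` is a hypothetical helper and `CONTEXT` is only an excerpt of the context shown above.

```python
# Excerpt of the context shown on the slide (entries elided there are omitted).
CONTEXT = {
    "Name": "schema:name",
    "Creator": "schema:creator",
    "Contact": "schema:accountablePerson",
    "schema": "http://schema.org/",
}


def expand_term(term, context):
    """Expand a short key in two steps: term -> prefix:name -> full URI."""
    compact = context.get(term, term)
    prefix, sep, local = compact.partition(":")
    # Only expand when the prefix is itself defined in the context,
    # and the value is not already an absolute URI such as http://...
    if sep and prefix in context and not local.startswith("//"):
        return context[prefix] + local
    return compact


print(expand_term("Contact", CONTEXT))
```

Keys that are not in the context are returned unchanged, so the function is safe to call on any key found in the JSON.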



And machine-readable BagIt checksums to check integrity

There are also checksums for all the data files.



There's a BagIt manifest file.



Which lists all the files and their checksums, so the validity of the bag can be checked.
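To illustrate how such a check works, here is a rough sketch that reads a BagIt manifest and recomputes each checksum. It assumes the usual manifest layout of one "checksum path" pair per line and a file name of the form manifest-<algorithm>.txt; `verify_manifest` is a hypothetical helper, it ignores details such as tag manifests and encoded file names, and it is no substitute for a proper BagIt validator.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bag_dir, manifest_name):
    """Return the payload paths whose checksum does not match the manifest.

    The algorithm is taken from the file name, e.g. manifest-sha256.txt.
    """
    bag = Path(bag_dir)
    algorithm = manifest_name[len("manifest-"):-len(".txt")]
    failures = []
    for line in (bag / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(None, 1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures


# Build a tiny throwaway bag to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    (bag / "data").mkdir()
    (bag / "data" / "notes.txt").write_bytes(b"interview notes")
    digest = hashlib.sha256(b"interview notes").hexdigest()
    (bag / "manifest-sha256.txt").write_text(
        digest + "  data/notes.txt\n", encoding="utf-8"
    )

    ok_result = verify_manifest(bag, "manifest-sha256.txt")
    # Change the payload without updating the manifest to simulate corruption.
    (bag / "data" / "notes.txt").write_bytes(b"tampered")
    bad_result = verify_manifest(bag, "manifest-sha256.txt")

print(ok_result, bad_result)
```

An empty list means every listed file is present and matches its checksum; any path in the list is missing or has changed since the bag was made.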



It's not so much a package as a gift

This package is like a gift from Cameron, to his collaborators, to other researchers and to his future self.



How did you do it?

.. to do this work ...



We used an experimental tool called Calcyte



I ran Calcyte on Cameron's Google Drive share to create CATALOG.xlsx files



Calcyte is experimental early-stage open source software written by my group (mainly me) at UTS.



Calcyte created spreadsheets which functioned as metadata forms that Cameron could fill out.



The spreadsheets are multi-sheet workbooks, giving us scope to describe not only data entities like files, but metadata entities such as people, licenses and organisations.



Cameron filled out the metadata

I ran Calcyte to create the human and machine readable metadata

Rinse, repeat (took a few goes)

We spent a couple of months working on this intermittently. It will be quicker next time, but this level of data description will always involve a fair bit of care and work: at least a few hours for a project of this scale. It's also important to proofread the result, just as with publishing articles.



So what's special about this packaging approach?

Human AND machine readable web-native linked-data metadata, not just string-values in XML

The advantage of this approach is that the package has human AND machine readable, web-native linked-data metadata, not just string values in XML.



This slide is a reminder of what the CATALOG.html file looks like, complete with its DataCite citation, which, when people start citing this, will add to Cameron's academic capital.



This work is based on previous efforts

  • Cr8it - now being looked after by Newcastle.edu.au (via Western Sydney and Intersect) https://github.com/digitalbridge/crateit/tree/develop

  • HIEv https://github.com/IntersectAustralia/dc21

  • Mike Lake's CAVE repository. https://suss.caves.org.au/cave/

Cr8it and HIEv are covered in our 2013 presentation at eResearch Australasia.

It builds on other standards:

  • BagIt: https://tools.ietf.org/html/draft-kunze-bagit-14

  • Schema.org: http://schema.org



The format used in this demo is described in a draft specification.



TODO (assuming people see the value in DataCrate)

  • Use at UTS for our data repository, and for export from various services

  • Lobby to get support integrated into Zenodo, Figshare et al

  • Improve capture/packaging tools (Cr8it, Cloudstor Collections, <your-system-here>)

  • Work with others on aligning this work with other standards, [here's a list someone else put together](https://docs.google.com/document/d/155lA2BcixTl-zwJHGfLkxsmg7WmQbBK00QWyP8QggkE/edit).

  • Work with RDA on their repository interchange format. https://www.rd-alliance.org/groups/research-data-repository-interoperability-wg.html



"Make data crate again" - Liz Stokes, 2017

I'll leave it with this slogan from our UTS data librarian and friend of eResearch, Liz Stokes.

Thanks to:

  • Cameron Neylon for being customer zero

  • Liz Stokes for working on metadata crosswalking/mapping

  • Mike Lake for coding and ideas

  • Conal Tuohy and Duncan Loxton for commenting on the draft spec

  • Amir Aryani for discussions about metadata

And the mainly Sydney-based metadata group who met in the lead-up to this work: Piyachat Ratana, Sharyn Wise, Michael Lynch, Craig Hamilton, Vicki Picasso, Gerry Devine, Katrin Trewin, Ingrid Mason, Peter Bugeia