The eResearch team at UTS, in collaboration with colleagues at QCIF and AARNet, applied for and received funding under the Australian Research Data Commons "Institutional role in a data commons" grant scheme.

We applied for $50,000 but were only awarded $49,999 :(.

The proposal is reproduced below. We promised to share what we're doing - so there will be more coming about what this Oxford Common File Layout (OCFL) thing is, why it matters and what we're testing / demoing.

In short, we want to explore how to store, preserve and publish large volumes of well-described research data using simple open technologies. We were already on this path and involved in international collaboration; the ARDC grant will help us get there quicker and test our ideas.

Title of the project: FAIR Simple Scalable Static Research Data Repository Demonstrator

Lead organisation/contractor: University of Technology Sydney

Project leader: Dr Peter Sefton – eResearch Support Manager

Project leader contact details: Email: peter.sefton@uts.edu.au

Phone: 0404 096932

Other organisations involved: AARNet (Adam Bell); Queensland Cyber Infrastructure Foundation (QCIF) (Andrew White)

Amount of funding requested (up to a maximum 50K): $50K

Proposal

Which area of fundable activity have you chosen:

  • Institutional data infrastructure, policies and procedures to support better research YES

  • Management of sensitive data YES

  • Integration of institutionally supported data infrastructure with national, discipline, and international infrastructure YES

Question(s) the project will address

In the interests of providing an improved procedure for resourcing, disposing and retaining data, can we demonstrate a FAIR research data repository architecture that uses static files laid out according to the Oxford Common File Layout (OCFL), an emerging international standards effort, and that can operate sustainably at multiple scales, from single-collection data sets to national collections? Can we also make this data Findable via a search portal using standards-based metadata, make both open and sensitive data Accessible to the right parties, and improve tracking of data outputs managed inside and outside the organisation? Can OCFL increase opportunities for interoperability? Can distributing this data using the DataCrate specification increase the supply (and findability) of Interoperable and Reusable research data objects with improved research data provenance? How might this architecture work with proprietary software such as Figshare and/or national services such as CloudStor? How viable are these approaches for a range of disciplines? What other developments (e.g. standards for licensing sensitive data, procedures for applying access permissions at the file level for computing facilities, and national group management systems) would be needed to adapt this approach to storing and indexing sensitive data?

Proposed approach

We propose to build a demonstrator / proof of concept system which tests OCFL for use as a general-purpose data repository for both open and sensitive data, together with a discovery portal. OCFL is chosen because it is based on established technology and can be used at computing facilities and on shared infrastructure without requiring server-based repository software or expensive and slow migration of large data collections via APIs.
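To make the "static files" idea concrete, here is a minimal sketch of writing one dataset to disk in a layout loosely modelled on OCFL: a versioned content directory plus a JSON inventory that maps checksums to files. This is an illustration under stated assumptions, not a conforming OCFL implementation. It handles a single version only and leaves out several things the specification requires, and the function name `write_ocfl_like_object` and the example identifier are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_ocfl_like_object(object_root, object_id, files):
    """Write one single-version object; `files` maps logical paths to bytes.

    Hypothetical helper: the layout is only loosely modelled on OCFL.
    """
    content_dir = Path(object_root) / "v1" / "content"
    content_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}  # digest -> paths of stored files, relative to the object root
    state = {}     # digest -> logical paths that make up version v1
    for logical_path, data in sorted(files.items()):
        digest = hashlib.sha512(data).hexdigest()
        target = content_dir / logical_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        manifest.setdefault(digest, []).append(f"v1/content/{logical_path}")
        state.setdefault(digest, []).append(logical_path)
    inventory = {
        "id": object_id,
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {"v1": {"state": state}},
    }
    inventory_path = Path(object_root) / "inventory.json"
    inventory_path.write_text(json.dumps(inventory, indent=2), encoding="utf-8")
    return inventory


# Usage: write a one-file dataset into a temporary directory and list the result.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset-001"
    write_ocfl_like_object(
        root,
        "https://example.org/dataset-001",  # made-up identifier
        {"data/readings.csv": b"site,depth_m\ncave-1,42\n"},
    )
    for path in sorted(root.rglob("*")):
        if path.is_file():
            print(path.relative_to(root).as_posix())
```

Because everything is plain files and JSON, a repository like this can be copied, backed up or indexed with ordinary file tools, which is the property the proposal relies on.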

The demonstrator will be populated with (1) specific datasets from the UTS data repository, drawn from a wide variety of disciplines including microbiology, history, computer science and speleology, and (2) collections of varying scale, from single collections to an entire university research data repository, building on the DataCrate specification for describing and packaging data. (3) We will test the scalability of our approach by automatically generating a large number of plausibly-linked simulated test datasets and contextual entities (people, organisations, equipment and software describing data provenance) with group-based access permissions. We will demonstrate how a search portal can ensure Findability and appropriate Access for the sensitive data, using an automated test suite to check the visibility of objects in the portal. (4) We will also demonstrate how individual data collections can be indexed in detail to produce collection-level discovery services, using two projects that were funded by the ANDS Major Open Data Collections program: Farms to Freeways and Dharmae (UTS).
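As a rough illustration of the simulated test data described in (3), the sketch below generates people, groups and datasets that reference each other, with a share of the datasets restricted to a group their creator belongs to. It is a toy under stated assumptions: the function `generate_test_corpus`, the 30% sensitive ratio and the `readGroups` field are all invented for this example and are not part of DataCrate or of the project's actual tooling.

```python
import random

# Disciplines named in the proposal, used here only as sample subject labels.
DISCIPLINES = ["microbiology", "history", "computer science", "speleology"]


def generate_test_corpus(n_datasets, n_people=20, n_groups=5, seed=0):
    """Generate linked, simulated people and datasets (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so test runs are repeatable
    groups = [f"group-{i}" for i in range(n_groups)]
    people = [
        {
            "@id": f"#person-{i}",
            "@type": "Person",
            # each simulated person belongs to one or two groups
            "memberOf": rng.sample(groups, k=rng.randint(1, min(2, n_groups))),
        }
        for i in range(n_people)
    ]
    datasets = []
    for i in range(n_datasets):
        creator = rng.choice(people)
        sensitive = rng.random() < 0.3  # assumed ratio, for illustration only
        datasets.append({
            "@id": f"#dataset-{i}",
            "@type": "Dataset",
            "about": rng.choice(DISCIPLINES),
            "creator": {"@id": creator["@id"]},
            # "readGroups" is an invented field name, not a DataCrate term
            "readGroups": [rng.choice(creator["memberOf"])] if sensitive else ["public"],
        })
    return {"groups": groups, "people": people, "datasets": datasets}


# Usage: build a corpus and count how many datasets ended up restricted.
corpus = generate_test_corpus(1000)
restricted = sum(1 for d in corpus["datasets"] if d["readGroups"] != ["public"])
print(f"{len(corpus['datasets'])} datasets, {restricted} restricted")
```

Linking each restricted dataset to one of its creator's groups is what makes the data "plausible" enough for a test suite to ask meaningful questions about who should see what.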

Who will be consulted/involved in the execution of the project and how will they be involved

This project will be led by Dr Peter Sefton at UTS, who will also write the final report. eResearch Analyst Michael Lynch at UTS and staff at QCIF will develop specifications and software, building on preliminary work on a research data portal (for findability) and extending it to ensure accessibility to the right users via a simple permissions system with static metadata that assigns group or individual access rights. The demonstrator repositories will be Interoperable with other software stacks using the same standard, particularly AARNet’s trial project “Adding Archival pathways to CloudStor” (investigating Archivematica as a preservation service) and via a project that is being proposed in Program 1 by the University of Melbourne. The project will consult with UTS stakeholders via the UTS eResearch Community of Practice, which meets quarterly, and via our regular eResearch outreach activities. We will consult with other Australian institutions via the ARDC network (we will offer to run webinars and keep the community up to date via mailing lists) and via our membership in Intersect. We will also consult with the international OCFL community.
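One way to picture a "simple permissions system with static metadata" is a small access record stored alongside each object, which a portal consults before showing anything. The sketch below is an assumption about how such a check could look, not the project's design: the record shape (`groups`, `users`), the `public` group name, the object names and the functions `can_read` and `visible_items` are all hypothetical.

```python
def can_read(access, user):
    """Return True if `user` may see an object with this static access record.

    `access` looks like {"groups": [...], "users": [...]}; `user` is None for an
    anonymous visitor, otherwise {"id": ..., "groups": [...]}.
    """
    if "public" in access.get("groups", []):
        return True
    if user is None:
        return False
    if user["id"] in access.get("users", []):
        return True  # access granted to this individual
    # otherwise access requires at least one shared group
    return bool(set(access.get("groups", [])) & set(user["groups"]))


def visible_items(catalogue, user):
    """List the ids of catalogue entries the user is allowed to see."""
    return [item_id for item_id, access in catalogue.items() if can_read(access, user)]


# Made-up catalogue: one open object, one group-restricted, one individual-only.
CATALOGUE = {
    "open-collection": {"groups": ["public"], "users": []},
    "sensitive-interviews": {"groups": ["health-study"], "users": []},
    "draft-survey": {"groups": [], "users": ["alice"]},
}

alice = {"id": "alice", "groups": []}
bob = {"id": "bob", "groups": ["health-study"]}

print(visible_items(CATALOGUE, None))
print(visible_items(CATALOGUE, alice))
print(visible_items(CATALOGUE, bob))
```

An automated test suite of the kind mentioned in the proposed approach could run checks like these for every simulated user and object, and fail if a restricted object ever appears in a listing it should not.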

Outputs/materials that will be shared with all of Australia as a result of the project

The project will result in open source code (both stand-alone and as part of ReDBox), open access documents, specifications and a report that answers the questions above in light of our findings. A project representative will deliver the report at the National Data Summit.

Evidence of ongoing commitment to outputs (if relevant)

UTS, along with QCIF, is one of the major contributors to the (originally) ANDS-funded ReDBox Research Data Management Platform, and has demonstrated commitment to the ARDC community of ReDBox user institutions by contributing substantially to ReDBox’s first major upgrade. UTS runs ReDBox in order to support our strategic commitment to Research Excellence and Research Integrity and to implement our Research Management Policy - and will consult with the UTS Research Integrity officer Louise Wheeler. UTS is committed to building an OCFL-based repository and discovery portal, but the work is not scheduled to be completed until 2020 - this funding will allow us to fast-track development of a demonstrator that can be presented to the ARDC and the research community as a proof of concept for building a data commons at dataset, organisational, discipline and national scale. ReDBox is sustained by a community of universities that pay a maintenance fee to the Queensland Cyber Infrastructure Foundation. The AARNet pilot project in this space will lead to sustainable investment should the demonstrator be successful.

Other information you wish to provide

Costing over 4 months is $50K:

  • 15 days of specification and design = $15K

  • 30 days of coding = $30K

  • 5 days of test-data development = $5K

  • Report (in kind): 5 days = $5K

UTS is collaborating with the University of Melbourne on a proposal under Program 1, looking at the PARADISEC collection, and with AARNet investigating Archivematica & CloudStor interoperating with OCFL & DataCrate packages, which complements this proposal. These projects together will aim to demonstrate the use of interoperable standards for building a data commons - testing a range of tools that operate over a standardised (OCFL) repository architecture and standardised (DataCrate) metadata.