EPrints for Research Data – Workshop at University of Leeds (15/10/13)

Though the workshop was focussed specifically on EPrints software, many of the issues apply equally to other repository platforms.

Also see the event blog post at http://blog.library.leeds.ac.uk/blog/roadmap/post/184

Now that we’ve got Open Access to peer-reviewed research output all sewn up (well, almost!), the next challenge is the associated research data. On Tuesday 15th October the University of Leeds hosted a workshop exploring EPrints in this context. There is a Storify of tweets from the event at https://storify.com/mrnick/eprints-and-research-data-collaboration-workshop

After an introduction from Bo Middleton there were short presentations from several institutions that have begun to explore the issue. Rachel Proudfoot began by describing Leeds repository requirements: why and how we chose EPrints (slides). The work originated in the Jisc Managing Research Data pilot project RoaDMap (http://library.leeds.ac.uk/roadmap-project). Rachel emphasised that there were no real exemplars out there; as a starting point the team considered the technical review of platform strengths and weaknesses from the KAPTUR project – http://www.research.ucreative.ac.uk/1239/ – and compared requirements against DataFlow – http://www.dcc.ac.uk/resources/external/dataflow – and CKAN – http://ckan.org/ – “the world’s leading open-source data portal platform”.

Why EPrints (at Leeds)? Reasonable match against criteria, expediency, lower risk (in house expertise), short timescales #eprintsrdm

— Nick Sheppard (@mrnick) October 15, 2013

Michael Whitton was up next, describing how Southampton are using an existing IR for data (slides). He demonstrated the customised EPrints interface, in which the deposit process for a dataset is separate from the standard workflow, along with a “minimalist” approach to metadata and the option to easily link a dataset to a journal article:

#eprintsrdm Like how Southampton have split the dataset metadata process – essential details seperated from additional details

— Valerie McCutcheon (@mccutchv) October 15, 2013

#eprintsrdm Southampton link to publications is working – must have a closer look at this

— Valerie McCutcheon (@mccutchv) October 15, 2013

Tom Ensom from the University of Essex described ReCollect: a research data plugin for EPrints (slides). The plugin provides an “expanded metadata profile for describing research data (based on DataCite, INSPIRE and DDI standards) and a redesigned data catalogue for presenting complex collections” and is available from the Bazaar – http://bazaar.eprints.org/280/

Valerie McCutcheon, not present in person, had pre-recorded a presentation on EPrints as a data registry (recorded presentation). She described how they are working to support researchers at the University of Glasgow in response to new requirements from funders (e.g. RCUK) on how underlying research materials (i.e. data, samples, models) can be accessed – see http://www.gla.ac.uk/services/datamanagement/rdm-at-gu/ for more information on policy and support at UoG.

Andrew Bell gave an overview of EPrints Services (slides) and how they might liaise with the community to prioritise development.

Balviar Notay of Jisc concluded the morning with a review of the repository landscape and service transition from the RepNet project, including the SHERPA services RoMEO and JULIET, RJ Broker, “OpenMirror”, OpenDOAR and IRUS-UK, and emphasised that Jisc are considering support for linking datasets and other outputs.

Questions that arose in the Q&A discussion included the workflow implications for big data sets (e.g. multi-terabyte), as the focus so far seems to be on traditional workflow/metadata input and minting DOIs to facilitate citation (à la figshare?).

A series of breakout groups proposed for the afternoon were introduced before lunch:

1. Access control requirements – due to confidentiality issues, commercial sensitivity etc., there is a requirement to provide some level of managed access to data sets.

2. Metadata requirements – exploration of metadata fields for a data registry (i.e. fields that could be applied to any data set).

3. EPrints gap analysis – brainstorm around RDM requirements with a view to informing an EPrints gap analysis.

4. Use cases – scenarios to inform more detailed requirements for EPrints e.g. data / user journeys during the research data lifecycle.

5. Discipline Specific Views onto Data held in EPrints (OR “Time-Signatures vs. Dynamic Viscosity”) – The multidisciplinary nature of research data at institutions poses particular challenges. In particular, how can we hope to store all the necessary discipline- or project-specific metadata that might be produced by research projects across large research-intensive organisations? There may be scope to build a customised layer on top of a data repository to optimise how data is presented and navigated. Is this feasible? Is it desirable?

________________________________________________________________

In the event, groups 3 and 4 were amalgamated. Discussions were captured in Google Docs, which are linked below with a short summary of each:

1. Access control requirements – Capture document: http://bit.ly/165rt9

In discussions on JISCMail, several institutions have expressed an interest in more granular control of access to EPrints content. Some access scenarios are supported ‘out of the box’ through the EPrints embargo and request button features. However, these may not be sufficient for all access scenarios: for example, time-limited access.
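To make the scenarios above concrete, here is a minimal sketch in Python of how an embargo, request-only access and a time-limited grant might combine into one decision. It is not EPrints code and does not use any EPrints API; the names `AccessPolicy`, `Grant` and `can_download` are hypothetical, and the rule that a current grant overrides an embargo is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AccessPolicy:
    """Hypothetical access settings for one deposited dataset."""
    embargo_until: Optional[date] = None  # closed to the public before this date
    request_only: bool = False            # access must be requested and granted


@dataclass
class Grant:
    """Hypothetical record of access granted to one user for a fixed period."""
    user: str
    valid_from: date
    valid_until: date


def can_download(policy: AccessPolicy, user: str, today: date,
                 grants: List[Grant]) -> bool:
    # Assumption: a grant that is current for this user overrides
    # both the embargo and the request-only setting.
    for grant in grants:
        if grant.user == user and grant.valid_from <= today <= grant.valid_until:
            return True
    # Without a grant, an embargo that has not yet expired blocks access.
    if policy.embargo_until is not None and today < policy.embargo_until:
        return False
    # Otherwise the data is open unless it is request-only.
    return not policy.request_only
```

In this sketch time-limited access is simply a grant with an end date, so it expires without anyone having to revoke it.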

There were differences of opinion about the pros and cons of offering ‘Registered access’ to data. Although we can encourage maximum openness as best practice (for data without commercial or ethical requirements for restriction), research data deposit is new in several subject disciplines and some level of control may be the price we pay to populate data repositories during a period of cultural change.

Licence and re-use conditions should also be considered. Some commentators questioned whether the CC0 licence is appropriate for data. Others highlighted that incompatible licences with different re-use conditions will make it difficult or impossible to combine data sets; where feasible, metadata and research data should be openly available with as few restrictions as possible to avoid licence clashes.

2. Metadata requirements – Capture document: http://bit.ly/GN4jcK

The capture document represents a ‘master’ spreadsheet; community ownership is encouraged, as is ongoing discussion of core fields and field names.

3. EPrints gap analysis – combined with group 4 below

4. Use cases – Capture document: http://bit.ly/19F2M4Q

Use case scenarios include: submit data, find data, pull data, enrich with additional metadata, export to other systems, provide data in alternative formats, visualise data, relationships between objects, provide details of reuse, usage statistics.

It was also noted that it is important to consider use cases that are out of scope. Such “anti” use cases might include large datasets, confidential data and “live” data that is continually changing.

What is missing from EPrints?

Grant codes and auto-completion of other metadata fields through interaction with other systems; systems to interoperate with include CRIS (PURE, Symplectic) and DMPonline – https://dmponline.dcc.ac.uk/
During data import there should be some way to flag up if any confidential or sensitive data is being imported
Support for pseudonymisation for researchers who need to be identifiable but who might need to keep their identity more private
Allow the user to modify access controls to data and metadata
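As an illustration of the second point in the list above, flagging possibly sensitive data at import, the following Python sketch scans a CSV file's column names and cell values. It is not an EPrints feature or plugin; the function `flag_sensitive`, the header list and the email pattern are all assumptions chosen for the example, and a real check would need to be far more thorough.

```python
import csv
import io
import re

# Hypothetical list of column names that suggest personal data.
SENSITIVE_HEADERS = {"name", "email", "dob", "date_of_birth",
                     "address", "postcode"}

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def flag_sensitive(csv_text):
    """Return a list of human-readable warnings for a CSV supplied as text."""
    warnings = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields no header and no warnings
    for column in header:
        if column.strip().lower() in SENSITIVE_HEADERS:
            warnings.append(f"column '{column.strip()}' may hold personal data")
    # Data rows start at line 2 because line 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        if any(EMAIL_RE.search(cell) for cell in row):
            warnings.append(f"row {row_number} appears to contain an email address")
    return warnings
```

A deposit workflow could show these warnings to the depositor for confirmation rather than blocking the import, since simple checks like these will produce false positives.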

Big Issues were identified as:

Security and confidentiality issues (access control, anonymisation/pseudonymisation)
Desire to have fewer systems or systems that better interoperate to reduce the input requirements
Development roadmap for EPrints (What is coming and when is it likely to be?)
Community collaboration/development process to get EPrints to do what we want it to do

5. Discipline Specific Views onto Data held in EPrints (OR “Time-Signatures vs. Dynamic Viscosity”) – Capture document: http://bit.ly/16IvDme

A discovery function is provided by the current metadata fields, but how do we provide more detailed discovery or even navigation within a dataset? How can EPrints be configured to provide the more disciplinary or subject-specific metadata needed for data reuse (reuse metadata), and should we do this?

There is a wide range of potential users: scientists, who may be storing their data on their own systems (why would they want to use the repository?), and arts researchers, who need somewhere to store their datasets. They have very different ways of documenting their data and searching for data.

If EPrints can’t provide this functionality, can we envisage a separate discovery layer in the architecture?

What next?

There was discussion of the potential to bring EPrints/plug-in developers and less technical repository and research practitioners together for some sort of “hack day” or mash-up event. The central message was to keep talking and collaborating across institutions and with EPrints Services:

Andy attended #eprintsrdm yesterday. We have already had some preliminary discussions in the team.

— EPrints Services (@EPrintsServices) October 16, 2013

Let us have any use cases for controlled access to data in #EPrints, the topic of work group 1 at #eprintsrdm.

— EPrints Services (@EPrintsServices) October 16, 2013