28 March 2012

Bazaar for Version Control

One of the MATSIQEL RDM requirements is that software must support multiple versions of the research data. Due to data protection and ethical constraints, only some project partners may see all the research data, others may see and update the data, whilst others are denied access to raw (partially anonymised) data but may see the processed data.
These requirements, especially version control, led us to consider the version control software used for software development. Such software (as used in Microsoft's SharePoint) also gives differential user access rights and permissions. However, SharePoint is a commercial product, so cannot be recommended for unfunded work (i.e. when there is no funding for IT software/hardware).
Since most of the project team are not computer scientists, software is needed that has the lowest possible barrier to entry, or it will not be used. That is, in addition to satisfying the technical requirements, data management software needs to be:
·         Easy to Use
·         Free
·         Multiplatform (Windows, Mac, Linux)
We examined several open source source-control management products to identify one to complement our case study.
Git is the most popular distributed version control system. Written by Linus Torvalds, it is used to manage development of the Linux kernel. It is (reputedly) fast, and allows free hosting on GitHub for projects that use it. Git, however, is designed for efficient software development and so saves versions of files as collections of incremental changes on a base file. That is, any particular version is assembled from pieces. This is counter to MATSIQEL requirements, where versions of research data arrive externally and are not necessarily increments.
The other alternative evaluated is called Bazaar. Bazaar is version control software 'for everyone'. Sponsored by Canonical and used to develop Ubuntu Linux, Bazaar claims:
·         "Version control for everyone
·         Work offline
·         Any workflow
·         Cross platform support
·         Rename tracking and smart merging
·         High storage efficiency and speed
·         Any workspace model
·         Plays well with others"

Bazaar is also a distributed version control system. This avoids reliance on a single central bottleneck and allows multiple workflow styles. In particular, it is straightforward to set up a web-based repository that end users can access freely and appropriately. Bazaar has several graphical clients that integrate well with Windows. Most ordinary users will be able to use Bazaar version control and access centrally stored data with minimum impact on their usual workflow.
Data under Bazaar version control is simply stored (invisibly) in subdirectories. Graphical version histories are readily available. Also, since Bazaar does not rely on a proprietary storage mechanism, a Bazaar repository may be zipped, archived (e.g. in SharePoint), and revived intact as needed.
Bazaar may be simply configured on cloud-based web servers, which may be set up with the kind of access controls needed, granting differential access rights as required by the MATSIQEL project. In summary, Bazaar is a multiplatform product that fulfils the requirements for research data management in our case study project MATSIQEL, since it supports versioned repositories to which differential access rights can be applied.
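The zip-and-revive idea can be sketched in a few lines of Python. This is only an illustration using the standard library, not part of Bazaar itself: the function names are hypothetical, and it assumes the working tree and its hidden `.bzr` metadata directory live together under one folder.

```python
import shutil
from pathlib import Path


def archive_branch(branch_dir: Path, archive_base: Path) -> Path:
    """Zip a whole working tree, hidden .bzr directory included.

    archive_base is the archive path without the ".zip" suffix. It should
    sit outside branch_dir so the archive does not try to contain itself.
    """
    # make_archive appends ".zip" and returns the path of the file it wrote.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=branch_dir))


def restore_branch(archive: Path, target_dir: Path) -> None:
    """Unpack an archive made by archive_branch into target_dir."""
    shutil.unpack_archive(str(archive), extract_dir=target_dir)
```

Because the version history travels inside the zipped folder, the restored copy should need no separate database or server to be usable again.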
Posted on behalf of Jeremy Ellman

EU data/records retention requirements

The MATSIQEL research project, which is the focus of our RDM project, is funded by the European Commission (Marie Curie Programme). During our project, queries were raised by us, our FoI Officer/Records Manager and the MATSIQEL team about the EC’s data/records retention requirements. In particular:

·         Does the EC have a policy/schedule for the retention of records related to EU-funded work, to be followed by the organisations that conducted the work? We hadn’t found a policy/schedule published on the EC website.
·         If there are no requirements should we be following our own records retention policy?

Answers to these questions are important not only for this project but more generally for all EU-funded projects.

A contact of mine at the EC put me in touch with someone from the Secretariat General - Document Management Policy (http://ec.europa.eu/transparency/archival_policy/index_en.htm) who was responsible for their own internal records retention schedule. After consulting with the DG Research they informed us that “there is no general [my emphasis] EU policy for the retention of documents produced by organisations funded by the EU, as the rules of the various EU Funding Programmes are very different.” This means that the requirements for preservation of documents by the beneficiary (i.e. the HEI in our case) change according to the EU programme which finances the project.

I was given some examples, which I’ve provided below along with their URLs. What is clear is that, as well as the data/document requirements varying according to the EU programme that funds the project/action, the focus seems to be mostly on FINANCIAL records rather than research records and that, for ESF and FP7 funded projects, it is the ORIGINALS which must be kept unless authenticated electronic copies are made.

Examples of EC retention requirements:

1.    FP7 FUNDED PROJECTS
Annex II to the Grant Agreement  (FP7 Grant Agreement - Annex II General Conditions V6 24/1/11 ftp://ftp.cordis.europa.eu/pub/fp7/docs/fp7-ga-annex2-v6_en.pdf) states the following rule:

"The beneficiaries shall keep the originals or, in exceptional cases, duly authenticated copies – including electronic copies - of all documents relating to the grant agreement for up to five years from the end of the project." (Article II.22. Financial audits and controls, p24)

Note: the obligations in Annex II are the same for all FP7 funded actions/projects. This is the only source of obligation for preserving the documents produced by a FP7 project [that we can use] when working on E-Domec rules.
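As a small worked example of the FP7 rule quoted above, the date five years from the end of a project can be computed as follows. This is a sketch only: the function names are hypothetical, and the handling of a 29 February end date (falling back to 28 February) is our assumption rather than anything stated in the Grant Agreement.

```python
from datetime import date

# "up to five years from the end of the project" (Annex II, Article II.22)
FP7_RETENTION_YEARS = 5


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is 29 February and the target year is not a leap year.
        return start.replace(year=start.year + years, day=28)


def fp7_retention_end(project_end: date) -> date:
    """Date five years from the project end date."""
    return add_years(project_end, FP7_RETENTION_YEARS)
```

For a project ending on 31 March 2012 (an invented date), `fp7_retention_end(date(2012, 3, 31))` gives 31 March 2017.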


2.    Projects financed by structural funds or agriculture policy
Their Basic Regulations establish the retention periods for documents in possession of beneficiaries. For example: Article 9 of Commission Regulation (EC) No 885/2006 of 21 June 2006 laying down detailed rules for the application of Council Regulation (EC) No 1290/2005 as regards the accreditation of paying agencies and other bodies and the clearance of the accounts of the EAGF and of the EAFRD (agriculture policy) (http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CONSLEG:2006R0885:20081029:EN:PDF)

"Article 9 Conservation of accounting information
1. The supporting documents regarding the expenditure financed and the assigned revenues to be collected by the EAGF shall be kept at the disposal of the Commission for at least three years following the year in which the Commission clears the accounts of the financial year concerned under Article 30 of Regulation (EC) No 1290/2005.
2. The supporting documents regarding the expenditure financed and the assigned revenues to be collected by the EAFRD shall be kept at the disposal of the Commission for at least three years following the year in which the final payment by the paying agency has taken place.
3. In the case of irregularities or negligence, the supporting documents referred to in paragraphs 1 and 2 shall be kept at the disposal of the Commission for at least three years following the year in which the sums are entirely recovered from the beneficiary and credited to the EAGF or the EAFRD or in which the financial consequences of non-recovery are determined under Article 32(5) or Article 33(8) of Regulation (EC) No 1290/2005.
4. In the case of a conformity clearance procedure provided for in Article 31 of Regulation (EC) No 1290/2005, the supporting documents referred to in paragraphs 1 and 2 of this Article shall be kept at the disposal of the Commission for at least one year following the year in which that procedure has been concluded or, if a conformity decision is the subject of legal proceedings before the Court of Justice of the European Communities, for at least one year following the year in which those proceedings are concluded."
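The periods in Article 9 are expressed as a number of years "following the year in which" some event took place. One way to turn that wording into a date is sketched below. It assumes the period is counted in whole calendar years after the event year, which is our reading and not something the Regulation spells out; the function name is hypothetical.

```python
from datetime import date


def keep_until_end_of_year(event_year: int, years_after: int) -> date:
    """Last day of the `years_after`-th calendar year after `event_year`.

    Assumption: "at least N years following the year in which X happened"
    is read as N full calendar years after the year of X.
    """
    return date(event_year + years_after, 12, 31)
```

On this reading, supporting documents for accounts cleared in 2010 would be kept until at least 31 December 2013 under paragraph 1 (three years). The same function covers the one-year period in paragraph 4 with `years_after=1`.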


3. European Social Fund Projects
Article 90 of Council Regulation (EC) No 1083/2006 of 11 July 2006 laying down general provisions on the European Regional Development Fund, the European Social Fund and the Cohesion Fund and repealing Regulation (EC) No 1260/1999 (http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2006:210:0025:0025:EN:PDF)


"Article 90 Availability of documents
1. Without prejudice to the rules governing State aid under Article 87 of the Treaty, the managing authority shall ensure that all the supporting documents regarding expenditure and audits on the operational programme concerned are kept available for the Commission and the Court of Auditors for:
(a) a period of three years following the closure of an operational programme as defined in Article 89(3);
(b) a period of three years following the year in which partial closure took place, in the case of documents regarding expenditure and audits on operations referred to in paragraph 2.
These periods shall be interrupted either in the case of legal proceedings or at the duly motivated request of the Commission.
2. The managing authority shall make available to the Commission on request a list of completed operations which have been subject to partial closure under Article 88.
3. The documents shall be kept either in the form of the originals or in versions certified to be in conformity with the originals on commonly accepted data carriers."

JISC Managing Research Data Programme workshop on Data Management Planning, 23 March 2012

Simon Hodson, Programme Manager, has made the presentations from this workshop available at http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/dmpworkshop.aspx

The presentations were about projects tasked to explore the challenges of designing and implementing data management plans for research projects or for departments in specific disciplines, and to customise and implement the DCC's DMPonline tool for specific uses.  DATUM in Action is one of the projects (see http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/planning.aspx and http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata/dmponline.aspx for all projects).

David Shotton’s analysis of various DMP templates was particularly interesting and DCC’s demo of Version 3 of their DMPOnline looks very promising. (See David’s blogposts on the Oxford DMPonline Project http://datamanagementplanning.wordpress.com/2012/03/27/dmp-questions-description-and-alignment/ and http://datamanagementplanning.wordpress.com/2012/03/27/dmp-questions-comparisons-and-conclusions/ and Kelly Miller’s blog post summarising suggestions for further developing the DCC DMPOnline http://www.dcc.ac.uk/news/future-plans-dmp-online)

For a list of current interim outputs from all the JISC-funded data management planning projects see https://docs.google.com/document/d/1khPrdQ2JNVWYYMtHTN40TdbueJ2agpxwWMpYfiZcN6M/edit   

A competition at the workshop gathered participants' opinions as to:
1) which project had produced the most reusable outputs;
2) which project had produced the most potentially significant outputs (even if they were not yet reusable); and
3) which project participants wanted to find out more about after the workshop.

The winners in each category were:
1) Richard Plant, University of Sheffield, for the Data Management Storage and Planning for Psychology http://www.sheffield.ac.uk/psychology/research/groups/dmsppsych ; http://dmsppsych.blogspot.co.uk/
2) Julie McLeod, University of Northumbria, for the Datum in Action Project http://www.northumbria.ac.uk/sd/academic/ceis/re/isrc/themes/rmarea/datum/action/ ; http://datumrdm.blogspot.co.uk/
3) David Shotton, for the Oxford DMPonline Project http://datamanagementplanning.wordpress.com/

22 March 2012

Comparison of IT infrastructures

The DATUM in Action project piloted two IT infrastructures: (i) use of existing standard office software and a secure shared network drive; (ii) an experimental prototype collaborative infrastructure environment – setting up a team site in a paid-for, cloud-based SharePoint service.

The requirements for the EU Team were:
  • Access / data sharing by researchers in different institutions / countries (all within the EU or with safe harbour agreements)
  • A filespace
    • Access rights to be set up at the folder/sub-folder
    • Automatic version control of files
    • Automatic application of retention periods
  • Email system
  • Project wiki, for researchers to collaboratively develop documents/presentations
  • A public-facing blog for dissemination
  • Public-facing website
We have compared the way the two infrastructures met these requirements. Note: we are still testing the SharePoint prototype.

(1) Access / data sharing by researchers in different institutions / countries (all within the EU or with safe harbour agreements)

Standard office IT facilities
  • The shared drive is accessible only to Northumbria University staff. The University is reluctant to give access to external people (understandably from a security viewpoint). Data sharing is by anonymisation of data and use of encrypted files through services such as Dropbox, and the use of encrypted laptops and data sticks.
SharePoint prototype
  • Being a cloud service, access for all project researchers is easily arranged by giving them ids & passwords. The number of people able to use the site is governed by the price paid. However, there are issues with data protection: the country of origin of the cloud service provider needs to be in the EU or have a safe harbour agreement. And how secure is the service? You would expect the cloud service provider to offer the same level of security as a University, but what access do the provider’s IT staff have to the data? And how are backups handled if the provider goes bust? It is recommended that when using a cloud service provider a service level agreement should be drawn up. If SharePoint were made available by the University then the same external access issues would occur as with the standard office IT facilities.

(2) A filespace
  • Standard office IT facilities & SharePoint prototype: exactly the same
(2a) Access rights to be set up at the folder/sub-folder

Standard office IT facilities
  • This can be done, but scope is limited
SharePoint prototype
  • Far more scope available
(2b) Automatic version control of files

Standard office IT facilities
  • This has to be done manually, by adding version numbers to file names
SharePoint prototype
  • Very flexible, detailed automatic versioning, with the ability for the site administrator to customise this
(2c) Automatic application of retention periods

Standard office IT facilities
  • This has to be done manually, via the use of sub folders containing files of a given category: the whole sub folder can be deleted when required.
SharePoint prototype
  • This is still under test
(3) Email system

Standard office IT facilities
  • Outlook
SharePoint prototype
  • Same email software; however, in the SharePoint prototype this is integrated, so that, for example, when files are updated other people can be alerted by email that a file has been altered. This enables workflow processes. An email facility is a higher price option.
(4) Project wiki, for researchers to collaboratively develop documents/presentations

Standard office IT facilities
  • A free service by a cloud provider would have to be used
SharePoint prototype
  • The wiki is integrated with the team site

(5) A public-facing blog for dissemination

Standard office IT facilities
  • A free service by a cloud provider would have to be used
SharePoint prototype
  • The blog is integrated with the team site. A public blog facility is a higher price option.
(6) Public-facing website

Standard office IT facilities
  • Pages set up on the University’s website
SharePoint prototype
  • A website is integrated with the team site. A web facility is a higher price option.
SharePoint offers much more functionality (including automatic versioning, for example) and integration of different facilities. However, this comes at a cost. Use of a cloud-based service would require funding, e.g. as an item within a proposal budget. Universities that have not already done so could set up a SharePoint implementation, either across the institution or for a specific activity such as research. Basic SharePoint comes with the academic site licence; however, staff resources would be needed to set up and run the implementation. There is also the barrier to entry, i.e. all researchers would need to learn how to use a new system and how to set up a team site. But this barrier is not high. It could be likened to the adoption of VLEs within universities: initially there was opposition from some academic staff, but now all staff use them as a standard system. The VLE is set up with standard module templates, and help guides and training are available. Similarly in SharePoint, a research project template for team sites could be made available, which could then be customised by researchers for specific projects. However, uncontrolled use of team sites could lead to SharePoint sprawl.

Standard office IT facilities may represent fit-for-purpose supporting infrastructure for managing the data of much research that is conducted in HEIs. HEIs / researchers should assess whether or not investing in a sophisticated system such as SharePoint is necessary. Are the benefits of adopting it for a research project great enough to outweigh the costs (financial, training, development) if the system is not already implemented?





DMP Template

The DATUM in Action project developed a customised data management plan (DMP). Initially we used the DATUM for Health DMP template, which was developed from the DCC’s template (DCC Checklist for a Data Management Plan Post-Consultation (v2.2: 6th January 2010) https://dmponline.dcc.ac.uk/documents). However, feedback from the EU researchers, after they tried to complete this, showed that we had to radically alter the design. In summary, the new approach was based on three main themes:

(1) to focus on the researcher and what is needed to help them conduct their research project on a day by day basis. The focus has moved away from data curation of shared data after project completion.

(2) to reduce bureaucracy, i.e. DMP lite. The front pages of the DMP enable the researcher to select the sections that are currently applicable to them and ignore the rest. Additionally, to reduce duplicating information held in other systems/documents the researcher is just asked to give the location of the relevant document (either a folder on a computer drive or a physical location for paper-based items).

(3) to embed decisions and actions so the DMP becomes a living document for the duration of the project.

The DMP template has been posted on the project website:

Guidance for completing the DMP is currently being written and will be posted on the project website. 


Information Security Guidance

This has been posted on the project website:

01 February 2012

Fileplan for the shared drive

The action taking phase of the DATUM in Action project includes the EU project staff implementing the DMP using in-house software and shared drives. The EU project staff have requested help in organising files on the MATSIQEL shared drive. We have drawn up a fileplan, based on previous experiences, and this has been set up and is currently being populated by EU project staff with existing and new files. In that process the fileplan may need some amendment. The outline fileplan is given below and a more detailed version is available on the Project website.

ProjectDevelopment
ProjectManagement
  • Administration
  • Agreements
  • Dissemination
  • Finance
  • FunderCommunication
  • Meetings
  • Personnel
  • Planning
  • Reporting
WorkPackage[name/number]           
  • Dissemination
  • Ethics+Governance
  • ProjectManagement
    • Administration
    • Meetings
    • Finance
    • Planning
    • Reporting
  • ResearchLiteratureReview
  • Research[name of activity]
    • Administration
    • Data
    • DataAnalysis
    • LiteratureReview
    • Outputs
    • Tools
The complexity in the above fileplan comes from the complexity of the MATSIQEL project. The project comprises several Work Packages, each with its own leaders and researchers, and under each Work Package a number of different research activities are undertaken. This causes the repetition of folders such as ProjectManagement or Finance, which would contain files applicable either to the whole project or only to the specific Work Package. As the fileplan is populated, decisions will be made about the appropriate level for files and whether this repetition is necessary.

A simplified version of the fileplan for a simpler project might comprise:

ProjectDevelopment
ProjectManagement
  • Administration
  • Agreements
  • Dissemination
  • Ethics+Governance
  • Finance
  • FunderCommunication
  • Meetings
  • Personnel
  • Planning
  • Reporting
  • ResearchLiteratureReview
  • Research[name of activity]
    • Administration
    • Data
    • DataAnalysis
    • Outputs
    • Tools
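For anyone wanting to set up this simplified fileplan on a shared drive, the folder skeleton could be created with a short script such as the following. It is a sketch only: the function names are hypothetical, the nesting simply mirrors the outline as listed above, and the activity names passed in stand for the "[name of activity]" placeholder.

```python
from pathlib import Path

# Folder names copied from the simplified fileplan outline.
PROJECT_MANAGEMENT_FOLDERS = [
    "Administration", "Agreements", "Dissemination", "Ethics+Governance",
    "Finance", "FunderCommunication", "Meetings", "Personnel", "Planning",
    "Reporting", "ResearchLiteratureReview",
]
RESEARCH_ACTIVITY_FOLDERS = ["Administration", "Data", "DataAnalysis", "Outputs", "Tools"]


def simplified_fileplan(activities: list[str]) -> dict:
    """Build the outline as nested dicts, one Research<activity> per name."""
    project_management = {name: {} for name in PROJECT_MANAGEMENT_FOLDERS}
    for activity in activities:
        project_management[f"Research{activity}"] = {
            name: {} for name in RESEARCH_ACTIVITY_FOLDERS
        }
    return {"ProjectDevelopment": {}, "ProjectManagement": project_management}


def create_folders(root: Path, plan: dict) -> None:
    """Create every folder in the plan under root; existing folders are kept."""
    for name, children in plan.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        create_folders(folder, children)
```

Calling `create_folders(Path("MATSIQEL"), simplified_fileplan(["Interviews"]))` would create, among others, `MATSIQEL/ProjectManagement/ResearchInterviews/Data`. The root folder name and the "Interviews" activity are invented for the example.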
The next stage is to apply access controls to the folders. Certain folders such as ProjectManagement>Personnel contain files that should only be seen by the PI. Other folders such as ProjectManagement>Dissemination contain files that need to be accessible and usable (read & write) by all members of the team.
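The two examples above can be written down as a simple rule table before they are applied on the shared drive. The sketch below is illustrative only: the role labels "PI" and "team" are our shorthand, giving the PI write access to Personnel is an assumption (the text only says who may see it), and denying access to any folder without a rule is a design choice made for the example.

```python
from enum import Flag, auto


class Access(Flag):
    NONE = 0
    READ = auto()
    WRITE = auto()


READ_WRITE = Access.READ | Access.WRITE

# folder -> role -> access; folders or roles not listed get no access.
RULES = {
    "ProjectManagement/Personnel": {"PI": READ_WRITE},
    "ProjectManagement/Dissemination": {"PI": READ_WRITE, "team": READ_WRITE},
}


def access_for(role: str, folder: str) -> Access:
    """Look up what a role may do in a folder (default: nothing)."""
    return RULES.get(folder, {}).get(role, Access.NONE)


def can_read(role: str, folder: str) -> bool:
    return Access.READ in access_for(role, folder)
```

Here `can_read("team", "ProjectManagement/Personnel")` is false, while `can_read("team", "ProjectManagement/Dissemination")` is true. Keeping the rules in one table makes it easier to review them with the PI before they are set on the real folders.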

The EU project staff also requested help with file naming and version control. Guidance on fileplans, file naming and version control is in production. This tailored guidance is drawing on, and referencing, existing published guidance.
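To give a flavour of what manual version control in file names involves, here is a small helper that works out the next versioned file name. The `_vNN` suffix is purely an assumed convention for illustration (the project guidance is still in production), and the function name is hypothetical.

```python
import re
from pathlib import PurePath

# Assumed convention: the file name ends in _v followed by digits, e.g. _v01.
_VERSION_SUFFIX = re.compile(r"_v(\d+)$")


def next_version_name(filename: str) -> str:
    """Return the file name with its version number increased by one.

    A name without a version suffix becomes version 01.
    """
    path = PurePath(filename)
    match = _VERSION_SUFFIX.search(path.stem)
    if match:
        width = len(match.group(1))  # keep the existing zero padding
        number = int(match.group(1)) + 1
        stem = path.stem[: match.start()]
    else:
        width, number, stem = 2, 1, path.stem
    return f"{stem}_v{number:0{width}d}{path.suffix}"
```

So `next_version_name("Interviews.xlsx")` gives `Interviews_v01.xlsx`, and `next_version_name("Interviews_v09.xlsx")` gives `Interviews_v10.xlsx`. Zero-padded numbers keep versions in the right order when a folder is sorted by name.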

Although a fileplan is not rocket science, it is an unfamiliar concept to many researchers. It is also not as simple to produce as you might expect, as all research projects vary in their nature, size and the demands placed upon them. We have used names for the folders/sub-folders, which means that folders appear in alphabetical (A-Z) rather than logical order. We felt that numbering folders to achieve logical ordering would not be acceptable to the researchers. The other problem is that of retention management. Folders are likely to contain items with different retention periods. Either we create additional sub-folders to reflect this, enabling folders to be deleted when required, or we would have to accept that everything is kept for the longest required period: deletion at the granular level of the file is too time consuming to be practical.