How LDbase Meets NIH Guidelines

Summary: As part of the National Institutes of Health’s (NIH) push towards supporting open science practices and data sharing, the NIH has put out guidance on how to select a data repository and the characteristics one should look for when choosing where to share data. This document describes how LDbase meets such standards, providing links to resources where necessary.

Desirable Characteristics for All Data Repositories
The characteristics in this section are relevant to all repositories that manage and share data resulting from Federally funded research:
1. Unique Persistent Identifiers: Assigns datasets a citable, unique persistent identifier (PID), such as a digital object identifier (DOI) or accession number, to support data discovery, reporting (e.g., of research progress), and research assessment (e.g., identifying the outputs of federally funded research). The unique PID points to a persistent landing page that remains accessible even if the dataset is de-accessioned or no longer available.
  LDbase provides the option to assign unique digital object identifiers (DOIs) to all projects and associated files (datasets, codebooks, code, etc.) free of charge to the user.
2. Long-Term Sustainability: Has a plan for long-term management of data, including maintaining integrity, authenticity, and availability of datasets; building on a stable technical infrastructure and funding plans; and having contingency plans to ensure data are available and maintained during and after unforeseen events.
  LDbase was built with sustainability in mind. Keeping ongoing costs low was a goal from the start, which is why, for example, LDbase does not inspect the contents of datasets stored on it. LDbase is a collaboration among a large team that includes FSU Libraries, which holds LDbase as part of its collection. As such, FSU Libraries has guaranteed long-term management of LDbase.
3. Metadata: Ensures datasets are accompanied by metadata to enable discovery, reuse, and citation of datasets, using schema that are appropriate to, and ideally widely used across, the community(ies) the repository serves. Domain-specific repositories would generally have more detailed metadata than generalist repositories.
  All datasets, as well as projects and other associated files (codebooks, code, stimuli, etc.), have a wide range of metadata available to increase the discoverability, reusability, and citability of those files (See requested metadata here). Further, projects and associated files are structured such that users can search for terms within any given file’s metadata, not just that of datasets. LDbase is domain-specific and provides users who are uploading data or files with extensive guidance, examples, and the ability to select from a pre-populated list of common terms within the field.
4. Curation and Quality Assurance: Provides, or has a mechanism for others to provide, expert curation and quality assurance to improve the accuracy and integrity of datasets and metadata.
  LDbase is in the process of establishing a paid service that will allow users to receive various levels of support in preparing, deidentifying, and uploading datasets. Services will additionally focus on increasing the discoverability and reuse of data, paying particular attention to metadata and the inclusion of additional supporting documents.
5. Free and Easy Access: Provides broad, equitable, and maximally open access to datasets and their metadata free of charge in a timely manner after submission, consistent with legal and ethical limits required to maintain privacy and confidentiality, Tribal sovereignty, and protection of other sensitive data.
  LDbase is an entirely free data repository for both uploading data and downloading or reusing it. All policies and procedures are in accordance with the relevant legal and ethical guidelines.
6. Broad and Measured Reuse: Makes datasets and their metadata available with broadest possible terms of reuse; and provides the ability to measure attribution, citation, and reuse of data (i.e., through assignment of adequate metadata and unique PIDs).
  All non-embargoed datasets and associated files can be easily downloaded and reused, and are assigned DOIs, which make it possible to measure the attribution, citation, and reuse of those data and files. A unique aspect of LDbase is its project-centered structure, which allows users to upload information and create unique identifiers for the project, datasets, codebooks, code, stimuli, measures, or any other files associated with the larger project. As such, LDbase allows for broad and measured reuse of multiple components of a project and encourages reuse and citation of more than just data. LDbase also tracks and publishes downloads and page views.
7. Clear Use Guidance: Provides accompanying documentation describing terms of dataset access and use (e.g., particular licenses, need for approval by a data use committee).
  LDbase provides extensive guidance, resources, and FAQs related to dataset uploading, access, terms of use, and licensing. Additionally, supporting documentation is available on IRB considerations, data management and deidentification, data reuse and combination, and general open science practices.
8. Security and Integrity: Has documented measures in place to meet generally accepted criteria for preventing unauthorized access to, modification of, or release of data, with levels of security that are appropriate to the sensitivity of data.
  LDbase has documented measures in place to provide the necessary security and privacy required for the stored data and information. Documentation is available on LDbase related to these measures, including a security/privacy focused FAQ, terms of service, and documentation related to the specific systems and processes being used to ensure security and integrity.
9. Confidentiality: Has documented capabilities for ensuring that administrative, technical, and physical safeguards are employed to comply with applicable confidentiality, risk management, and continuous monitoring requirements for sensitive data.
  LDbase has documented measures in place to ensure confidentiality. Again, documentation is available on LDbase related to these measures, including a security/privacy focused FAQ, terms of use, and documentation related to the specific systems and processes being used to ensure security and integrity.
  
  LDbase additionally uses the Amazon Web Services (AWS) S3 storage service to store files uploaded onto LDbase, which has its own internal procedures and safeguards to help ensure confidentiality. Files are backed up nightly and stored in AWS S3 Glacier, which is designed for the specific purpose of confidentially archiving data. AWS S3 Glacier has been approved for storing even the most sensitive data, including medical images and genomic data. The use of this system further allows for versioning of data to increase the accessibility and confidentiality of data.
10. Common Format: Allows datasets and metadata downloaded, accessed, or exported from the repository to be in widely used, preferably non-proprietary, formats consistent with those used in the community(ies) the repository serves.
  LDbase explicitly recommends non-proprietary formats. LDbase supports an extremely wide range of file formats (both for datasets and additional documentation), including but not limited to those that are commonly used within the field. Individuals can request support for file formats that are not currently supported. The goal is for file format to pose no barrier to using LDbase for data sharing.
11. Provenance: Has mechanisms in place to record the origin, chain of custody, and any modifications to submitted datasets and metadata.
  Each dataset and related document has a field where authors can be added. The authors are shown when viewing the dataset or document on LDbase, as well as in its recommended citation. Every dataset and related document that is uploaded to LDbase can be versioned, so that a user can identify which version of the data they have used. Any user can also request to be notified when a given dataset or document is updated.
12. Retention Policy: Provides documentation on policies for data retention within the repository.
  Policies related to data retention within LDbase are clearly outlined in the extensive resources and FAQs available on the website. Information specific to data retention is available in the Terms of Service and Data Security FAQ.
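  Taken together, items 1, 6, and 11 above mean that a reuser can cite exactly which version of a file they analyzed. As a minimal sketch of what that might look like on the reuser's side (this is not an LDbase feature or API), the following Python records a DOI, a version label, and a SHA-256 checksum of the downloaded bytes. The DOI `10.1234/example`, the version label `v2`, and the names `ReuseRecord` and `make_record` are hypothetical placeholders; only the `https://doi.org/` resolver prefix is standard.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseRecord:
    """Hypothetical record of which file version a reuser analyzed."""
    doi: str      # placeholder DOI, not a real LDbase identifier
    version: str  # version label as shown for the file in the repository
    sha256: str   # checksum of the exact bytes that were downloaded

    @property
    def doi_url(self) -> str:
        # DOIs resolve through the standard doi.org proxy.
        return f"https://doi.org/{self.doi}"


def make_record(doi: str, version: str, content: bytes) -> ReuseRecord:
    """Build a record from the downloaded file's bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return ReuseRecord(doi=doi, version=version, sha256=digest)


# Example with stand-in bytes instead of a real download.
record = make_record("10.1234/example", "v2", b"id,score\n1,10\n")
print(record.doi_url)
print(record.version, record.sha256[:12])
```

  Keeping the checksum alongside the DOI and version label lets a later reader confirm that they are working from the same bytes, even if the file has since been updated.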
  
  The guidance provided by NIH additionally describes characteristics, specific to storing human data, that high-quality data repositories should exhibit. The following section lists the NIH’s desired characteristics, with the text that follows each point describing how LDbase meets these standards.
Additional Considerations for Repositories Storing Human Data (even if de-identified)
The additional characteristics outlined in this section are intended for repositories storing human data, which are also expected to exhibit the characteristics outlined in Section I, particularly with respect to confidentiality, security, and integrity. These characteristics also apply to repositories that store only de-identified human data, as preventing re-identification is often not possible, thus requiring additional considerations to protect privacy and security.
1. Fidelity to Consent: Employs documented procedures to restrict dataset access and use to those that are consistent with participant consent (such as for use only within the context of research on a specific disease or condition) and changes in consent.
  LDbase allows any data or associated files to be embargoed and given restricted access, and provides the option to do so both before and after the data or files have been shared on the site. For restricted data, LDbase provides the option to contact project administrators directly to request access. As such, there are clear ways to restrict the use of data so that it remains consistent with participant consent.
2. Restricted Use Compliant: Employs documented procedures to communicate and enforce data use restrictions, such as preventing reidentification or redistribution to unauthorized users.
  LDbase is an open data sharing repository of deidentified data, and does not specialize in restricted use enforcement (as most data repositories do not). LDbase has documentation to support investigators' choices about data use agreements. LDbase allows for data use restrictions via an embargoed data feature, but does not police these functions.
3. Privacy: Implements and provides documentation of appropriate approaches (e.g., tiered access, credentialing of data users, security safeguards against potential breaches) to protect human subjects’ data from inappropriate access.
  LDbase does not accept any human subjects data in audio/visual format or containing human genetic data that can be easily reidentified. Rather, we encourage and provide resources for the sharing of deidentified data with minimal risk for reidentification.
4. Plan for Breach: Has security measures that include a response plan for detected data breaches. (including old versions that might have deidentification mistakes)
  LDbase encourages proper data management, deidentification, and uploading practices to minimize any potential issues that could result from a data breach. Given the already deidentified nature of the data on LDbase, the risks associated with data breaches are low. However, if a data breach were to occur, Section 11 of the LDbase Terms of Use lays out how LDbase will respond.
5. Download Control: Controls and audits access to and download of datasets (if download is permitted).
  For embargoed data, users must be registered with a real email address to get access to the data. Data-sharing investigators control who is given access to embargoed data. LDbase does not police these functions. Openly shared data can be downloaded by a non-registered user, and due to GDPR regulations, we do not track this use.
6. Violations: Has procedures for addressing violations of terms-of-use by users and data mismanagement by the repository.
  LDbase has procedures listed in its Terms of Use to address violations and mismanagement of data by its users. Specifically, LDbase will automatically terminate the license granted to any user who engages in unauthorized use of any of LDbase’s services or breaches the Terms of Use in any way. The Terms of Use also clearly lay out on whom the liability for such violations will fall.
7. Request Review: Makes use of an established and transparent process for reviewing data access requests.
  LDbase has an embargo system that investigators can choose to use. Any registered user can use the system's request process to contact project investigators and ask for access to embargoed data. Decisions on whether to grant those requests are up to the project investigators; LDbase takes no part in them.
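  The access rules described in items 1, 5, and 7 above can be summarized in a few lines of code. The following Python is only an illustrative sketch of those rules as stated in this document, not LDbase's implementation; the names `SharedFile`, `approve`, and `can_download` and the example email address are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SharedFile:
    """Hypothetical model of a file shared within a project."""
    name: str
    embargoed: bool = False
    # Registered emails the project investigators have approved.
    approved: Set[str] = field(default_factory=set)


def approve(shared: SharedFile, email: str) -> None:
    """Investigators, not the repository, decide who gets access."""
    shared.approved.add(email)


def can_download(shared: SharedFile, email: Optional[str]) -> bool:
    """Apply the rules described above.

    email is None for a visitor who is not registered.
    """
    if not shared.embargoed:
        # Openly shared files can be downloaded by anyone.
        return True
    # Embargoed files need a registered user whom the investigators approved.
    return email is not None and email in shared.approved


open_file = SharedFile("codebook.pdf")
restricted = SharedFile("scores.csv", embargoed=True)
approve(restricted, "reuser@example.edu")

print(can_download(open_file, None))
print(can_download(restricted, None))
print(can_download(restricted, "reuser@example.edu"))
```

  The sketch keeps the approval decision in `approve`, which only the investigators would call, to mirror the point that LDbase itself takes no part in deciding access requests.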