Omics Discovery Index (OmicsDI)

Notes given in the application form

Eligibility criteria

  • Must be an ELIXIR Service (i.e. part of an existing ELIXIR Node’s Service Delivery Plan, or is an ELIXIR Commissioned Service), or is in the official process/commitment of becoming one. (Required)
  • Must have evidence that it supports an interoperability activity, and has been deployed. (Required)
  • Must indicate how it supports the FAIR Principles. (Required)
  • Should fit into the EIP service framework in the ELIXIR 2019-2023 Scientific Programme for data interoperability or other activities relevant to the ELIXIR mission.

Additional notes

  • Please complete this form by adding information for your Interoperability Resource in the appropriate section below. Consult with Recommended Interoperability Resource (RIR) selection criteria documentation on details for each section below.
  • Where a panel/question is not relevant to your Interoperability Resource, please leave it blank or mark as “not applicable”, optionally with a brief explanation as to why.
  • Word limit guidance is noted for free text fields.
  • Please include URLs to external resources, where useful.
  • Any questions, contact Sirarat Sarntivijai (sirarat.sarntivijai@elixir-europe.org).

1. Resource facilitation to scientific research

a. Interoperability Resource: Briefly describe the function of the Interoperability Resource

The Omics Discovery Index (OmicsDI) provides a metadata search portal for omics datasets from genomics, transcriptomics, proteomics, metabolomics, and systems biology models. As of October 2019, it indexes >450,000 datasets from 20 partner repositories on four continents.

Beyond discovery, OmicsDI supports prototype impact assessment for datasets, quantifying data re-use in other resources based on citations, data reprocessing and data integration, similar to well-established bibliometrics for publications.

Through improved discovery, impact metrics, and credit attribution for open datasets, OmicsDI aims to support recognition of data as highly relevant scientific output in its own right, complementary to publications.

b. Scope statement: describe the scope, and the users of the resource. How is the Interoperability Resource positioned with respect to other similar Interoperability Resources? Include the base URL and, if relevant, the introductory or “about” page URL.

The OmicsDI target community is scientists interested in re-using existing omics data from their peers, as well as in maximising the findability of their own public data. While many sources of open data exist, they are often not connected, and finding all relevant data for a research topic is a major challenge. Even for the more limited domain of omics dataset discovery, the challenge is significant. Technology-specific communities such as transcriptomics, proteomics, and metabolomics have made great progress in co-ordinating data resources and data curation activities within their domains, but data discovery across domain borders remains difficult, requiring the user to search multiple entry points for data resources or resource collaborations such as ProteomeXchange or IMEx in interactomics.

Originally funded by the US NIH BD2K program, OmicsDI set out to address this challenge and to establish a valuable community tool similar to PubMed or EuropePMC, but for omics datasets. The same BD2K program also supported the development of a similar resource, bioCADDIE/DataMed (Chen et al. 2018). DataMed has a wider remit and aims to be a “biomedical data search engine”. However, development of DataMed appears to have stopped with the end of BD2K funding; the last update on the website https://datamed.org/ dates from July 2017. Google indexes datasets that are marked up with the appropriate schema.org elements, but currently does not offer advanced services based on this markup. Thus, to our knowledge, OmicsDI is currently the only major resource aiming to be a “PubMed/EuropePMC for omics datasets”.

c. Resource url

https://www.omicsdi.org/

d. Inter-organisational recognition: does the Interoperability Resource have community recognition? (e.g. demonstrated through a collaboration, geographical diversity in the source of the submissions, international diversity of delivery partners and/or funders)

OmicsDI started as a close collaboration with the international ProteomeXchange and MetabolomeXchange consortia, and builds on workflows developed by them. OmicsDI workflows are based on a "push" data model: the metadata indexed by OmicsDI is actively provided in OmicsDI format by its 20 partner databases on four continents. While inclusion of a new resource is an interactive process between the source data provider and the OmicsDI team, this active data provision is a strong indication of the support of omics data providers for OmicsDI. It also leaves responsibility for updates, rich detail, and data quality with the source resources. Crucially, OmicsDI integrates only the metadata; the actual data remains with the data providers. Partners therefore do not "lose" web hits, and primary data access remains in the high-quality context of the source data resource, rather than being reduced to a "lowest common denominator", as is often the case in databases that integrate the actual data from multiple resources.

OmicsDI was originally funded by the US NIH BD2K program, and is currently supported by EMBL-EBI core funding, as well as two ELIXIR Implementation Studies, BioContainers and Galaxy. In the BioContainers project, we aim to develop a locally deployable version of the OmicsDI open source code; in the Galaxy project, we aim to integrate automated OmicsDI data discovery into Galaxy workflows. A UK BBSRC BBR grant proposal is currently under evaluation.

2. Community

a. Community impact: If applicable, provide documented evidence of community impact (e.g., publication citations, API calls, projects using the resource, etc.)

The partner-contributed OmicsDI content has increased from <82,000 datasets from 11 repositories in January 2017 to >450,000 datasets from 20 resources on four continents in January 2020.

Web/API usage of OmicsDI has increased approximately 10-fold in two years, both in unique hosts per month (1,839 (1/2018); 2,802 (7/2018); 9,352 (1/2019); 14,381 (7/2019); 22,249 (12/2019)) and in web hits per month (252,160 (1/2018); 2,300,554 (12/2019); only hits from outside EMBL-EBI, crawler-filtered).
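The "approximately 10-fold" figure can be checked directly from the monthly numbers quoted above. The following is a minimal sketch; the variable and function names are illustrative only, and the values are copied from the text.

```python
# Monthly usage figures as quoted in the text above.
unique_hosts = {
    "2018-01": 1839,
    "2018-07": 2802,
    "2019-01": 9352,
    "2019-07": 14381,
    "2019-12": 22249,
}
# Web hits: only hits from outside EMBL-EBI, crawler-filtered.
web_hits = {"2018-01": 252160, "2019-12": 2300554}


def fold_change(series):
    """Return the latest value divided by the earliest value.

    Keys are "YYYY-MM" strings, so sorting them gives chronological order.
    """
    months = sorted(series)
    return series[months[-1]] / series[months[0]]


hosts_fold = fold_change(unique_hosts)
hits_fold = fold_change(web_hits)

print(f"unique hosts per month: {hosts_fold:.1f}-fold increase")
print(f"web hits per month:     {hits_fold:.1f}-fold increase")
```

Unique hosts grew roughly 12-fold and web hits roughly 9-fold between January 2018 and December 2019, which brackets the approximate 10-fold figure.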

The OmicsDI reference publication (Perez-Riverol et al. 2017) already has 57 citations (Google Scholar) and is in the 98th percentile according to Altmetric. The very recent “metrics” publication (Perez-Riverol et al. 2019) already has 8 citations and, with an Altmetric score of 62, is in the 96th percentile.

OmicsDI is an early adopter of schema.org: all dataset views have been marked up with schema.org since 2017, and OmicsDI is mentioned in a Google blog post from January 2017 as a life science example for dataset providers (https://ai.googleblog.com/2017/01/facilitating-discovery-of-public.html). We have also contributed to ELIXIR Bioschemas development, and are ready to switch to Bioschemas markup as soon as it is finalised.
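For illustration, schema.org markup of a dataset view can be sketched as a JSON-LD "Dataset" object. This is a minimal sketch and not the markup OmicsDI actually emits: the function name dataset_jsonld, the selection of properties, and the title and description strings are assumptions made for this example. The page URL pattern is inferred from an OmicsDI dataset URL (https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-21339).

```python
import json


def dataset_jsonld(repository, accession, name, description):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    Hypothetical helper: the properties chosen here are standard schema.org
    terms, but the real OmicsDI markup may use a different set.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": accession,
        "name": name,
        "description": description,
        # URL pattern inferred from an OmicsDI dataset page address.
        "url": f"https://www.omicsdi.org/dataset/{repository}/{accession}",
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": "OmicsDI",
            "url": "https://www.omicsdi.org/",
        },
    }


markup = dataset_jsonld(
    repository="arrayexpress-repository",
    accession="E-GEOD-21339",
    name="Placeholder dataset title",              # placeholder, not real metadata
    description="Placeholder dataset description",  # placeholder, not real metadata
)

# A web page would embed this string in a <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```

Search engines that understand schema.org can read such a block from the dataset page without any resource-specific parser.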

b. Potential usage: Describe other systems that could use this candidate resource, but currently do not.

OmicsDI is a comparatively young resource, and is still far from reaching its full potential. The rapid growth of user numbers over the last two years shows no sign of a plateau. The major innovation of impact metrics for datasets is still at a very early stage in terms of coverage and was only published in August 2019, but has high potential for use at the institutional level to assess the impact of grants, core facilities, etc.

With the upcoming integration with Galaxy (ELIXIR Implementation Study), we aim to facilitate data reuse and discoverability for ELIXIR Compute Platform and Galaxy users.

c. Outreach & support: Provide resource support publication(s)/user documentation(s) describing the Interoperability Resource (e.g. scientific journal publications, community preprints, resource user’s documentations etc.), resource dissemination plan (e.g. workshops, conference presentations), and other equal-opportunity research support (if applicable).

The OmicsDI reference publication (Perez-Riverol et al. 2017) provides extensive supplementary material which describes, among other topics, a) how a new data resource can join OmicsDI; b) the OmicsDI data format and a public validator tool; c) the public OmicsDI RESTful API. Updates and additions to this documentation are published through our OmicsDI Blog (http://blog.omicsdi.org/) and the online API documentation at https://www.omicsdi.org/help/api .

A new publication specifically on the OmicsDI API is in preparation.

@OmicsDI maintains an active Twitter presence with 195 followers.

Since 2018, OmicsDI has been included as a lecture in the annual EMBL-EBI course "Introduction to Multiomics Data Integration and Visualisation" (https://www.ebi.ac.uk/training/events/2020/introduction-multiomics-data…). In addition, it has so far been presented in 10 talks, seminars, and webinars, including to the EMBL-EBI Industry group, the US NIH BD2K assembly, HUPO, and ISMB.

d. Dependency of other resources: How is this resource critical to the user(s)? Do other resources depend on the resource described here to provide downstream service? Please list, or provide a link to a diagram.

The following resources use OmicsDI web services online or in batch mode for inclusion in their own resource:

The Shanghai Institute for Bioinformation Sciences (http://english.sibs.cas.cn/) currently operates a mirror of OmicsDI (https://www.biosino.org/OmicsDI/) that retrieves its data through web service calls to the EBI instance. With local grant support, this mirror will gradually be developed into an independent instance with additional Chinese datasets.

Since 2018, Google Dataset Search has been indexing all data from OmicsDI using the schema.org representation (https://toolbox.google.com/datasetsearch).

3. Quality of resource

a. Uptime: Average percentage uptime/month during the last 12 months, response time of the resource. In case of ontology/standards production, interval of update/release, adaptability of ontology design patterns to evolving data. Provide information where applicable: uptime of resource, software release cycle (please state week/month etc), update frequency.

OmicsDI daily average uptime is 99.6% (EMBL-EBI internal monitor, 05/2019-12/2019).
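To put the uptime figure in more tangible terms, it can be converted into expected downtime. This is a back-of-the-envelope sketch only; it assumes downtime is spread evenly over time and uses a 30-day month, neither of which is stated by the monitoring data.

```python
UPTIME_PERCENT = 99.6      # daily average uptime quoted above
MINUTES_PER_DAY = 24 * 60
DAYS_PER_MONTH = 30        # assumption: nominal 30-day month

downtime_fraction = (100 - UPTIME_PERCENT) / 100
downtime_minutes_per_day = downtime_fraction * MINUTES_PER_DAY
downtime_hours_per_month = downtime_minutes_per_day * DAYS_PER_MONTH / 60

print(f"about {downtime_minutes_per_day:.1f} minutes of downtime per day")
print(f"about {downtime_hours_per_month:.1f} hours of downtime per 30-day month")
```

At 99.6% uptime this corresponds to just under six minutes per day, or just under three hours per month.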

The average response time is 2.5 s for the OmicsDI home page and 4 s for a randomly chosen data page (https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-21339).

Updates: OmicsDI partners push updates to OmicsDI by providing updated data files in a defined location. This data is then automatically validated and indexed, normally within the same day.
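The validate-then-index step described above can be sketched generically as follows. This is a simplified illustration and not the OmicsDI implementation: the record fields, the required-field list, and the in-memory index are hypothetical and do not reflect the real OmicsDI data format or validator.

```python
# Hypothetical minimal set of required metadata fields (not the OmicsDI schema).
REQUIRED_FIELDS = ("accession", "repository", "title")


def validate(record):
    """Return a list of problems found in a record; an empty list means valid."""
    return [f"missing field: {field}" for field in REQUIRED_FIELDS if not record.get(field)]


def ingest(records, index):
    """Validate each pushed record and index the valid ones.

    Valid records are stored under (repository, accession), so a re-pushed
    record replaces the earlier version. Invalid records are returned together
    with their problems so they can be reported back to the provider.
    """
    rejected = []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            index[(record["repository"], record["accession"])] = record
    return rejected


index = {}
pushed = [
    {"accession": "E-GEOD-21339", "repository": "arrayexpress-repository",
     "title": "Placeholder title"},
    {"accession": "", "repository": "arrayexpress-repository",
     "title": "Record pushed without an accession"},
]
rejected = ingest(pushed, index)
print(f"{len(index)} indexed, {len(rejected)} rejected")
```

Keying the index on repository and accession keeps the source resource in control of its own records, in line with the push model.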

b. Accessibility: what are resource retrieval mechanisms? Does the resource provide web-based user interface, application programmable interface (API), containers, and/or other channels? Please list resource access mechanism, provide URLs as applicable.

OmicsDI is accessible interactively through its web interface https://www.omicsdi.org/, as well as through the API, documented at https://www.omicsdi.org/help/api . The web interface is entirely built on top of the API, so all functionality is exposed by the API. The source data files contributed by the partner databases are accessible at http://ftp.ebi.ac.uk/pub/databases/omicsdi/ .

c. Maintenance quality: Is there a maintenance SOP or plan, reflecting sustainability and scalability? Does it align with guidelines for sustainable software development? Please include a resource commitment statement (description text or URL).

Sustainability: OmicsDI is a relatively small project, currently with one full-time developer, plus support from the EMBL-EBI systems group, and supervision/backup by the project lead, approximately 0.25 FTE. The reliance on data updates being pushed to OmicsDI by the external partners allows us to keep the project lean, but also makes OmicsDI vulnerable to outdated data, a constant challenge in an integrative resource. EMBL-EBI central support, together with the relatively small project size/cost and its embedding in the larger EMBL-EBI Molecular Systems Cluster, provides some resilience to funding loss.

Scalability is assured through instantiation in the EMBL-EBI HPC environment, where additional compute/storage requirements can be accommodated flexibly. With support from the ELIXIR BioContainers project, we are currently developing a cloud-deployable version of OmicsDI, based on Kubernetes. This will allow scalable deployment independent of the specific EMBL-EBI infrastructure.

d. Support quality: Please list support mechanisms (e.g., point of contact, request ticketing, resource’s response time where a solution is identified, etc.), and methods to collect user feedback. If available, list tutorial documentations or tutorial materials and format, including linking on the ELIXIR’s Training Portal (TeSS) (or other training platforms) where applicable.

The regularly updated OmicsDI Blog (http://blog.omicsdi.org/) provides help and documentation for key topics, among them the API documentation.

OmicsDI has a request tracker managed through the EMBL-EBI central request tracking system: https://www.ebi.ac.uk/support/index.php?query=omics-di . Automated ticket notifications reach the PI, project lead, and developer; the response is co-ordinated by the project lead.

4. Legal framework, funding, and governance

a. Legal framework: What are the resource’s license/terms of use? Can the license facilitate Open Science? Please include the url for the license the resource uses.

OmicsDI is open source (Apache 2); all source code is on GitHub. A licence statement is provided per repository, for example https://github.com/OmicsDI/ddi-web-service/blob/master/LICENSE . While OmicsDI metadata is entirely open, some of the partner resources might contain protected, typically human, data. For example, EGA metadata is discoverable in OmicsDI, but users will still need to follow established EGA procedures to gain access to the actual data. The integration of metadata on both open and protected access data resources is one of the strong points of OmicsDI, as it helps to increase the often very limited discoverability of protected access data.

b. Privacy/Ethics policy: If applicable, is there a publicly available privacy policy in which use and security around personal data are described (e.g. the EU General Data Protection Regulation (GDPR), ELIXIR Ethics Policy, other relevant ELIXIR Policies)? Please include the url of the privacy/ethics policy, if applicable.

OmicsDI inherits the general EMBL-EBI terms of use, as linked from the landing page.

c. Funding & sustainability plan: List of funding sources supporting the resource, and sustainability plan.

OmicsDI is a relatively small project, currently with one full-time developer, plus support from the EMBL-EBI systems group, and supervision/backup by the project lead, approximately 0.25 FTE, plus 0.1 FTE PI time.

OmicsDI was originally funded by the US NIH BD2K program, and is currently supported by EMBL-EBI core funding, as well as two ELIXIR Implementation Studies, BioContainers and Galaxy. In the BioContainers project, we aim to develop a locally deployable version of the OmicsDI open source code; in the Galaxy project, we aim to integrate automated OmicsDI data discovery into Galaxy workflows. A UK BBSRC BBR grant proposal is currently under evaluation.

d. Governance: Describe the Resource’s QA/QC plan that guarantees similar quality governance to that of ELIXIR. Please link SAB members, if applicable.

As a relatively small project, OmicsDI does not have a specific SAB. However, as part of the EMBL-EBI service portfolio, it is subject to review and governance by the EMBL-EBI governance structure, see https://www.ebi.ac.uk/about.