Machine Learning Focus Group


Machine learning (ML) is a discipline that enables researchers to use computers to make sense of large, complex datasets.

With the decreasing costs of high-throughput technologies, vast amounts of omics data are being generated and made accessible to the scientific community. Analysing these complex, high-volume datasets is challenging, and classical statistical methods are often insufficient to fully unlock their potential. ML can thus be very useful for mining large omics datasets to uncover new insights that advance the life sciences.

The ELIXIR ML Focus Group was established in October 2019 to meet the emerging need for ML expertise within its network.

Goals

ML standards

This area includes aspects such as controlled terminology/ontology and services for ML model description and sharing, alignment to the ELIXIR Tools and Interoperability Platforms, as well as defining best practices for ML-related reviewing.

ML and reproducibility

This area focuses on defining best practices for developing, sharing and reusing ML approaches (including, but not limited to, ML models, algorithms, frameworks and protocols, including the DOME recommendations), while building on the existing approaches in the ELIXIR Tools Platform.

Benchmarking of ML tools

To facilitate clear and objective comparison of ML-based tools, it's important to establish a benchmarking protocol; this may include datasets, protocols and services offered by the ELIXIR Tools Platform.

Training for ML

Addressing gaps in ML knowledge is a priority for this group. A key area of focus will be designing and producing training resources for the ELIXIR community, based on the standards and approaches established by the ELIXIR Training Platform.

Integration across ELIXIR Communities

ML is a core competency relevant to many ELIXIR activities, and is clearly aligned to several funding opportunities. This group will consistently align and coordinate these efforts across all relevant ELIXIR groups, including the Federated Human Data Community and the Data Platform.

Task force 1: DOME recommendations crowdsourcing annotation

Publishing the DOME recommendations was the Focus Group’s first major output. These are a set of community-wide recommendations for reporting supervised ML-based analyses of biological studies. Broad adoption of these recommendations will help improve ML assessment and reproducibility.

This Task force builds on this output to perform a community-driven, crowdsourcing annotation effort for publications to produce a corpus of well-annotated articles.

This Task force is connected to the DOME Strategic Implementation Study: A framework to standardise ML in Life Sciences, in the context of which the DOME Registry and the DOME Wizard have been implemented.

Task force 2: Review and collection of gold standard datasets

AI-ready datasets are needed for training and benchmarking ML methods. The main goal of this Task force is to run a comprehensive effort around gold standard datasets, collecting both candidate papers with datasets in selected domains (e.g. genomics, transcriptomics, proteomics) and datasets that can be applied in life sciences ML, specifically in supervised learning applications on omics data.

The main areas of activity are:

  • Reviewing scientific literature to identify datasets previously used for ML benchmarking
  • Defining appropriate schemas and file formats for facilitating data sharing and interoperability
  • Collecting metadata reporting information on data sources and annotation
  • Establishing standard validation procedures to define fair validation sets
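
The page does not specify what a "fair" validation set means in practice. One common reading, assumed here, is that related samples (for example, all measurements from the same source) must not appear in both the training and the validation set. The sketch below illustrates that idea with a group-aware split; the function name `group_split`, the sample layout and the `source_*` labels are hypothetical and not part of any Task force procedure.

```python
import random
from collections import defaultdict


def group_split(samples, group_of, validation_fraction=0.2, seed=0):
    """Split samples so that no group appears on both sides.

    validation_fraction applies to the number of groups, not samples.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle group keys reproducibly, then reserve some for validation.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_validation = max(1, round(len(keys) * validation_fraction))
    validation_keys = set(keys[:n_validation])

    train = [s for k in keys if k not in validation_keys for s in groups[k]]
    validation = [s for k in keys if k in validation_keys for s in groups[k]]
    return train, validation


# Hypothetical samples: (source identifier, measurement id)
samples = [
    ("source_a", 1), ("source_a", 2), ("source_b", 3), ("source_c", 4),
    ("source_c", 5), ("source_d", 6), ("source_e", 7),
]
train, validation = group_split(samples, group_of=lambda s: s[0])

# No source identifier is shared between the two sets.
shared = {s[0] for s in train} & {s[0] for s in validation}
print(len(train), len(validation), shared)
```

Splitting by group rather than by individual sample is one way to avoid leakage between sets; other criteria (such as sequence similarity thresholds) would need a different grouping function.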

Task force 3: Synthetic data

Synthetic data is artificially generated information that mimics real patient data while preserving privacy and confidentiality. In life sciences, synthetic data gives researchers access to vast and diverse datasets, which can accelerate drug discovery, disease modelling and personalised medicine.

The Task Force on synthetic data is focused on developing frameworks, guidelines and best practices for generating, evaluating and using synthetic data in life sciences. This synthetic data can be generated by ML methods and/or used to build high-performance ML models. Recent accomplishments include:

  • Establishing a synthetic dataset catalogue aligned with the EDAM ontology
  • Defining a metadata model for the publication of FAIR synthetic datasets
  • Establishing a synthetic data registry
  • Conducting a scoping review focused on synthetic data evaluation metrics
  • Implementing a community survey to gauge the utilisation of synthetic data

The Task Force organised a dedicated workshop on Advancing Synthetic Data Generation and Dissemination for Life Sciences at the 22nd European Conference on Computational Biology (ECCB 2024). Finally, coordinating members of the Task Force are leading several work packages within SYNTHIA (GA 101172872), a pioneering project under the Innovative Health Initiative (IHI) program, a European public-private partnership focused on advancing healthcare innovation. This important involvement underscores the essential role of the Task Force in advancing synthetic data while also significantly strengthening industry engagement within ELIXIR.

Focus Group outputs

Task force 1:

In July 2021, the ML Focus Group published the DOME recommendations in Nature Methods. DOME is a set of community-wide recommendations for reporting supervised machine learning–based analyses applied to biological studies. Broad adoption of these recommendations will improve machine learning assessment and reproducibility.

Going beyond a standard, the DOME recommendations can facilitate reproducibility in ML by clearly defining the steps involved. As such, they can also be used in a training capacity, assisting the design and implementation of ML studies in the life sciences. More information is available on the DOME-ML website.
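
As a rough illustration of how the recommendations structure a report, the sketch below checks an annotation against the four areas the DOME acronym stands for (Data, Optimization, Model, Evaluation). The field names inside each section and the helper `missing_sections` are hypothetical; they are not the schema used by the DOME Registry or the DOME Wizard.

```python
# The four areas covered by the DOME recommendations.
DOME_SECTIONS = ("data", "optimization", "model", "evaluation")


def missing_sections(annotation: dict) -> list:
    """Return the DOME sections that are absent or empty in an annotation."""
    return [section for section in DOME_SECTIONS if not annotation.get(section)]


# Hypothetical annotation of a single study; all field names are illustrative.
annotation = {
    "data": {"provenance": "public repository", "splits": "train/test"},
    "optimization": {"algorithm": "random forest"},
    "model": {},  # nothing reported yet
    "evaluation": {"metrics": ["accuracy"]},
}

print(missing_sections(annotation))
```

A check like this only confirms that each area has been addressed; it says nothing about the quality of what was reported.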

The DOME Registry has also been implemented, currently capturing the curated annotation of more than 180 published papers (and more than 20 that are currently under review). The annotation input system is an integral part of the registry. It has been implemented by reusing the existing Data Stewardship Wizard, and is now available as the DOME Wizard. See the preprint article on the DOME Registry.

Beyond the infrastructure, a particular effort has been made to have the DOME recommendations adopted by specific communities. Key examples:

Task force 2:

We addressed the task of identifying and annotating gold standard datasets for ML through the 2022 and 2023 ELIXIR Europe BioHackathons, and a 2024 online Hackathon, coordinated by EMBL-EBI's BioModels repository. We addressed challenges associated with the accessibility, reproducibility and reuse of ML models in life sciences and medicine, noting that these models are scattered across various platforms and often lack sufficient metadata annotation for informed reuse. 

This resulted in the definition of a formalised protocol for the FAIReR (Findable, Accessible, Interoperable, Reusable and Reproducible) sharing of ML models, their metadata and their validation via BioModels. This protocol consists of eight essential steps: sharing model training code, providing dataset information, reproducing figures, reporting model evaluation metrics, sharing trained models, including Dockerfiles, providing model metadata and ensuring FAIR dissemination. 
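
The eight steps listed above lend themselves to a simple checklist. The following is a minimal sketch, assuming a submission is tracked as a set of completed steps; the class `ModelSubmission` and its methods are hypothetical and are not part of BioModels or of the published protocol.

```python
from dataclasses import dataclass, field

# The eight steps named in the protocol description above.
FAIRER_STEPS = (
    "share model training code",
    "provide dataset information",
    "reproduce figures",
    "report model evaluation metrics",
    "share trained models",
    "include Dockerfiles",
    "provide model metadata",
    "ensure FAIR dissemination",
)


@dataclass
class ModelSubmission:
    """Hypothetical record of which protocol steps a submission covers."""

    name: str
    completed: set = field(default_factory=set)

    def mark_done(self, step: str) -> None:
        # Reject anything that is not one of the eight protocol steps.
        if step not in FAIRER_STEPS:
            raise ValueError(f"unknown step: {step!r}")
        self.completed.add(step)

    def missing_steps(self) -> list:
        # Keep the protocol's order so the report is easy to read.
        return [s for s in FAIRER_STEPS if s not in self.completed]

    def is_complete(self) -> bool:
        return not self.missing_steps()


submission = ModelSubmission(name="example-model")
submission.mark_done("share model training code")
submission.mark_done("share trained models")
print(submission.missing_steps())
```

Keeping the steps in a fixed tuple means the list of outstanding steps always follows the order in which the protocol names them.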

In BioModelsML: Building a FAIR and reproducible collection of machine learning models in life sciences and medicine for easy reuse, we report on a pilot implementation to curate diverse ML models, demonstrating the feasibility of this approach. 

Through incentivised community participation, we aim to build a comprehensive public collection of FAIR ML models in or linked from the BioModels repository. By applying these measures, the protocol aims to enhance the reproducibility and reusability of ML models, reduce the effort needed to reimplement them, maximise their impact and significantly accelerate advancements in life sciences and medicine.

Task force 3:

The synthetic data activities have yielded several significant outcomes, detailed in key publications by group members. 

The 2023 article Infrastructure for Synthetic Health Data outlines the development of essential infrastructure to generate and manage synthetic health data, addressing critical challenges related to data privacy, data quality and data applicability. This work reflects our commitment to engage with the ELIXIR community to advance its ML resources and expertise.

These advancements in ML for synthetic data generation are particularly important in life sciences, for example in infectious disease research, as explored in the 2024 publication Synthetic data: How could it be used for infectious disease research? This work highlights the potential of synthetic data to revolutionise the field by providing high-quality, diverse datasets to model and predict disease outbreaks and evaluate interventions.

These outcomes highlight the pivotal role of Task force 3 in harnessing and advancing synthetic data technologies, providing essential tools and methodologies that bolster research capabilities across multiple life science domains, fully aligning with the ELIXIR work program objectives.
 

Group chairs

Fotis Psomopoulos
(ELIXIR Greece)
Silvio Tosatto
(ELIXIR Italy)
Leyla Garcia