Compact Data Structures for Computational Genomics

Gagie, Travis (PI)

Dalhousie University

Projet: Research project

Description

The largest genomic databases now occupy hundreds of terabytes uncompressed. They are an invaluable resource for researchers and physicians but a challenge for computer scientists because standard tools become impractical at such scales. We can compress these databases extremely well because they are repetitive, but we must still modify our tools to work with their compressed representations. For example, mapping assembly is a basic task in bioinformatics: indexing a reference genome we have already assembled and then, for each read of a new genome, quickly finding the substrings of the reference it matches most closely. Even when a genome is 99.9% the same as the reference, though, genomes are so big that the 0.1% difference is enough to cause many thousands of reads to remain unmapped. It helps significantly to index a "pan-genome" reference representing many genomes but popular tools for read-mapping cannot handle databases containing more than a few genomes, because they do not take advantage of the genomes’ similarity. We recently devised a tool called the r-index that can index hundreds or thousands of human genomes using reasonable memory space. We have released a basic practical implementation, which will serve as a starting point for this project. Adding new functionalities to the r-index is more difficult than adding them to current read-mappers, however, because now we need to compress even small auxiliary data structures. First, we will extend the r-index to find maximal exact matches and combine it with recent software that stores a pan-genome as a variation graph. At the moment that software tries to index a graph but it can just as well index a genomic database and store a mapping from each genome to a path in the graph. This way, given a read, we first map it to matching substrings in the genomes and then map those to paths in the variation graph. Second, we will extend the r-index such that, when there are many matches for a read in our genomic database that all map to the same path in the graph, it will return only summary statistics instead of reporting them individually. Such queries are essentially document listing and document counting over highly repetitive collections, something I have previously studied. Third, we will investigate further Wheeler graphs, a framework for data structures based on the Burrows-Wheeler Transform. Such data structures include the FM-index underlying popular read-mappers; some compact representations of de Bruijn graphs; and variation-graph indexes. Our framework has already inspired a method for further compressing the r-index and for compressed indexing of readsets. We will work to integrate practical implementations of an extended r-index into commercial and research sequencing pipelines. I am confident this project will help computer scientists meet some of the challenges posed by the flood of genomic data, and thus help biologists and physicians realize its potential.

Statut	Actif
Date de début/de fin réelle	1/1/23 → …

Financement

Natural Sciences and Engineering Research Council of Canada: 25 196,00 $ US

ASJC Scopus Subject Areas

Genetics
Molecular Biology
Physics and Astronomy(all)
Chemistry(all)
Agricultural and Biological Sciences(all)
Engineering(all)
Management of Technology and Innovation

Accéder au projet

https://open.canada.ca/data/en/dataset/c1b0f627-8c29-427c-ab73-33968ad9176e?id=RGPIN-2020-07185

Compact Data Structures for Computational Genomics

Détails sur le projet

Description

Financement

ASJC Scopus Subject Areas

Accéder au projet