Looking for Darwin in genomic sequences: Validity and success depends on the relationship between model and data

Christopher T. Jones, Edward Susko, Joseph P. Bielawski

Research output: Chapter in Book/Report/Conference proceeding › Chapter

2 Citations (Scopus)

Abstract

Codon substitution models (CSMs) are commonly used to infer the history of natural selection for a set of protein-coding sequences, often with the explicit goal of detecting the signature of positive Darwinian selection. However, the validity and success of CSMs used in conjunction with the maximum likelihood (ML) framework are sometimes challenged with claims that the approach might too often support false conclusions. In this chapter, we use a case study approach to identify four legitimate statistical difficulties associated with inference of evolutionary events using CSMs: (1) model misspecification, (2) low information content, (3) the confounding of processes, and (4) phenomenological load (PL). While past criticisms of CSMs can be connected to these issues, the historical critiques were often misdirected or overstated because they failed to recognize that the success of any model-based approach depends on the relationship between model and data. Here, we explore this relationship and provide a candid assessment of the limitations of CSMs in extracting historical information from extant sequences. To aid in this assessment, we provide a brief overview of: (1) a more realistic way of thinking about the process of codon evolution, framed in terms of population genetic parameters, and (2) a novel presentation of the ML statistical framework. We then divide the development of CSMs into two broad phases of scientific activity and show that the latter phase is characterized by increases in model complexity that can sometimes negatively impact inference of evolutionary mechanisms. Such problems are not yet widely appreciated by the users of CSMs. These problems can be avoided by using a model that is appropriate for the data, but understanding the relationship between the data and a fitted model is a difficult task.
We argue that the only way to properly understand that relationship is to perform in silico experiments using a generating process that can mimic the data as closely as possible. The mutation-selection modeling framework (MutSel) is presented as the basis of such a generating process. We contend that if complex CSMs continue to be developed for testing explicit mechanistic hypotheses, then additional analyses such as those described here (e.g., penalized LRTs and estimation of PL) will need to be applied alongside the more traditional inferential methods.

Original language: English
Title of host publication: Methods in Molecular Biology
Publisher: Humana Press Inc.
Pages: 399-426
Number of pages: 28
Publication status: Published - 2019

Publication series

Name: Methods in Molecular Biology
Volume: 1910
ISSN (Print): 1064-3745
ISSN (Electronic): 1940-6029

Bibliographical note

Publisher Copyright:
© The Author(s) 2019.

ASJC Scopus Subject Areas

  • Molecular Biology
  • Genetics

PubMed: MeSH publication types

  • Journal Article
