Looking for Darwin in genomic sequences: Validity and success depends on the relationship between model and data

Christopher T. Jones; Edward Susko; Joseph P. Bielawski

doi:10.1007/978-1-4939-9074-0_13

Medicine

Research output: Chapter in Book/Report/Conference proceeding › Chapter

2 Citations (Scopus)

Abstract

Codon substitution models (CSMs) are commonly used to infer the history of natural selection for a set of protein-coding sequences, often with the explicit goal of detecting the signature of positive Darwinian selection. However, the validity and success of CSMs used in conjunction with the maximum likelihood (ML) framework are sometimes challenged with claims that the approach might too often support false conclusions. In this chapter, we use a case study approach to identify four legitimate statistical difficulties associated with inference of evolutionary events using CSMs: (1) model misspecification, (2) low information content, (3) the confounding of processes, and (4) phenomenological load (PL). While past criticisms of CSMs can be connected to these issues, the historical critiques were often misdirected or overstated because they failed to recognize that the success of any model-based approach depends on the relationship between model and data. Here, we explore this relationship and provide a candid assessment of the limitations of CSMs in extracting historical information from extant sequences. To aid in this assessment, we provide a brief overview of (1) a more realistic way of thinking about the process of codon evolution, framed in terms of population genetic parameters, and (2) a novel presentation of the ML statistical framework. We then divide the development of CSMs into two broad phases of scientific activity and show that the latter phase is characterized by increases in model complexity that can sometimes negatively impact inference of evolutionary mechanisms. Such problems are not yet widely appreciated by the users of CSMs. These problems can be avoided by using a model that is appropriate for the data, but understanding the relationship between the data and a fitted model is a difficult task. We argue that the only way to properly understand that relationship is to perform in silico experiments using a generating process that can mimic the data as closely as possible. The mutation-selection modeling framework (MutSel) is presented as the basis of such a generating process. We contend that if complex CSMs continue to be developed for testing explicit mechanistic hypotheses, then additional analyses such as those described here (e.g., penalized LRTs and estimation of PL) will need to be applied alongside the more traditional inferential methods.

Original language	English
Title of host publication	Methods in Molecular Biology
Publisher	Humana Press Inc.
Pages	399-426
Number of pages	28
DOIs	https://doi.org/10.1007/978-1-4939-9074-0_13
Publication status	Published - 2019

Publication series

Name	Methods in Molecular Biology
Volume	1910
ISSN (Print)	1064-3745
ISSN (Electronic)	1940-6029

Bibliographical note

Publisher Copyright:
© The Author(s) 2019.

ASJC Scopus Subject Areas

Molecular Biology
Genetics

PubMed: MeSH publication types

Journal Article

Access to Document

10.1007/978-1-4939-9074-0_13

Cite this

Jones, C. T., Susko, E., & Bielawski, J. P. (2019). Looking for Darwin in genomic sequences: Validity and success depends on the relationship between model and data. In Methods in Molecular Biology (pp. 399-426). (Methods in Molecular Biology; Vol. 1910). Humana Press Inc. https://doi.org/10.1007/978-1-4939-9074-0_13

@inbook{a2fd3696dcfd4794ab88aa5f08b0a04c,

title = "Looking for {D}arwin in genomic sequences: Validity and success depends on the relationship between model and data",

author = "Jones, {Christopher T.} and Edward Susko and Bielawski, {Joseph P.}",

note = "Publisher Copyright: {\textcopyright} The Author(s) 2019.",

year = "2019",

doi = "10.1007/978-1-4939-9074-0_13",

language = "English",

series = "Methods in Molecular Biology",

publisher = "Humana Press Inc.",

pages = "399--426",

booktitle = "Methods in Molecular Biology",

}

TY - CHAP

T1 - Looking for Darwin in genomic sequences

T2 - Validity and success depends on the relationship between model and data

AU - Jones, Christopher T.

AU - Susko, Edward

AU - Bielawski, Joseph P.

PY - 2019

Y1 - 2019

UR - http://www.scopus.com/inward/record.url?scp=85068843553&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068843553&partnerID=8YFLogxK

U2 - 10.1007/978-1-4939-9074-0_13

DO - 10.1007/978-1-4939-9074-0_13

M3 - Chapter

C2 - 31278672

AN - SCOPUS:85068843553

T3 - Methods in Molecular Biology

SP - 399

EP - 426

BT - Methods in Molecular Biology

PB - Humana Press Inc.

ER -