Emerging Data Standards for Phylogenomics Research
Christian M Zmasek
Abstract
The term phylogenomics was initially used to describe the application of phylogenetic information for gene function analysis (Eisen, 1998). More recently, the expression has also been employed to describe attempts to reconstruct the evolutionary history of species based on whole genome analyses (for example, Dunn et al., 2008), as well as various types of studies involving the intersection of genomics and phylogenetics.
A common feature of phylogenomic analyses is the need to annotate biological entities, such as molecular sequences and phylogenetic trees, with data fields such as sequence identifiers, taxonomic data, and (possibly multiple) support values. Simple examples are gene trees that have been reconciled with species trees. The nodes of such trees are associated with at least sequence identifiers and taxonomic data, and may also record whether they represent gene duplications or speciation events. Phylogeographic studies are closely related examples: they involve phylogenetic trees whose nodes carry both taxonomic and geographical information. Currently, there is no widely accepted data standard for phylogenomic information. Instead, individual research groups generally develop their own ad hoc approaches to managing their data. This practice hampers data storage, submission, retrieval, and exchange, as well as (meta-)analyses.
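To make the annotation requirement concrete, the following minimal sketch models a reconciled gene tree whose nodes carry a sequence identifier, taxonomic data, an event type, and several support values. The class name Node, its fields, the helper count_events, and all identifiers are hypothetical and chosen for illustration only; they are not taken from any of the proposed standards.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of an annotated gene tree (hypothetical layout)."""
    sequence_id: Optional[str] = None   # e.g. a database accession (leaves)
    taxon: Optional[str] = None         # scientific name or taxonomy ID
    event: Optional[str] = None         # "duplication" or "speciation"
    # Possibly multiple support values, keyed by method name.
    support: Dict[str, float] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def count_events(node: Node, kind: str) -> int:
    """Count nodes in the subtree rooted at `node` whose event equals `kind`."""
    own = 1 if node.event == kind else 0
    return own + sum(count_events(child, kind) for child in node.children)


# Placeholder identifiers and taxa, not real accessions or species.
leaf_a = Node(sequence_id="seqA", taxon="Species one")
leaf_b = Node(sequence_id="seqB", taxon="Species two")
leaf_c = Node(sequence_id="seqC", taxon="Species one")

speciation = Node(
    event="speciation",
    support={"bootstrap": 95.0, "posterior": 0.99},
    children=[leaf_a, leaf_b],
)
root = Node(
    event="duplication",
    support={"bootstrap": 88.0},
    children=[speciation, leaf_c],
)

print(count_events(root, "duplication"), count_events(root, "speciation"))
```

Every research group that invents its own variant of such a structure also has to write its own readers and writers for it, which is the interoperability problem a shared standard is meant to remove.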
Recently, several standards have been proposed to meet the data exchange, annotation, and metadata vocabulary needs of phylogenomics research. Following an overview of these, the application of one such proposed standard (phyloXML; http://www.phyloxml.org) is discussed as a practical example and model, in the context of a phylogenomics study relating phylogenetic data, genomic data, and protein architectures.
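As an illustration of what such a standard can express, the sketch below uses Python's standard xml.etree.ElementTree module to write a small reconciled gene tree in a phyloXML-style layout. The element names (phylogeny, clade, confidence, events, duplications, speciations, taxonomy, scientific_name, sequence, name) and the namespace are given as they are understood from the phyloXML schema; the output has not been validated against that schema, and the helper functions el and leaf, as well as all sequence and species names, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; check against the schema published at phyloxml.org.
NS = "http://www.phyloxml.org"
ET.register_namespace("", NS)


def el(parent, tag, text=None, **attrs):
    """Append a namespaced child element with optional text and attributes."""
    child = ET.SubElement(parent, f"{{{NS}}}{tag}", attrs)
    if text is not None:
        child.text = text
    return child


def leaf(parent, seq_name, species):
    """Append a leaf clade annotated with taxonomic and sequence data."""
    clade = el(parent, "clade")
    el(clade, "name", seq_name)
    taxonomy = el(clade, "taxonomy")
    el(taxonomy, "scientific_name", species)
    sequence = el(clade, "sequence")
    el(sequence, "name", seq_name)
    return clade


root = ET.Element(f"{{{NS}}}phyloxml")
phylogeny = el(root, "phylogeny", rooted="true")

# Root of the gene tree: a duplication node with one support value.
dup = el(phylogeny, "clade")
el(dup, "confidence", "88", type="bootstrap")
el(el(dup, "events"), "duplications", "1")

# Child node: a speciation with two different support values.
spec = el(dup, "clade")
el(spec, "confidence", "95", type="bootstrap")
el(spec, "confidence", "0.99", type="posterior")
el(el(spec, "events"), "speciations", "1")
leaf(spec, "seqA", "Species one")
leaf(spec, "seqB", "Species two")

leaf(dup, "seqC", "Species one")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the annotations are ordinary XML elements, any XML-aware tool can read the sequence names, taxonomic data, support values, and event types back without a custom parser.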
References:
Eisen JA (1998). Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research, 8, 163-167.
Dunn CW et al. (2008). Broad phylogenomic sampling improves resolution of the animal tree of life. Nature, 452, 745-749.