We will ultimately want to be able to partition our data: assign one model to one subset of sites and a different model to the remaining sites. This tutorial adopts the definition of the term partition as a wall or set of walls that subdivide something. In this case, a partition refers to the way in which sites are divided into subsets, each getting its own model. The individual subsets are often (confusingly) also called partitions, but I will use the term data subset to represent a “room” and reserve partition for the “walls” separating the rooms. A partition or a partitioning will be used to denote a particular number and placement of walls.
For example, suppose we want to partition the sequences for a protein coding gene into first, second, and third codon positions. We will use three subset commands to do this, either on the command line or (more reasonably) in our strom.conf file:
subset = first:1-60\3
subset = second:2-60\3
subset = third:3-60\3
These subset
commands provide arbitrary names (first
, second
, and third
) for the three data subsets composing the partition, and also indicate which sites fall in each subset. The \3
indicates the step size, also known as the stride, which is by default 1. Thus, 2-60\3
means 2, 2+3, 2+6, 2+9, ...
.
The Strom
class will take responsibility for grabbing the first:1-60\3
, second:2-60\3
, and third:3-60\3
strings from the command line or strom.conf file; the Partition
class handles interpreting these strings. The Partition
class will store a vector of tuples, called _subset_ranges
, with each tuple comprising four integers: begin site (inclusive), end site (inclusive), step size, and subset index. Each tuple in _subset_ranges
assigns a subset index (0, 1, 2, …, _num_subsets
-1) to a chunk of sites, where a chunk can be either a single site (begin=end), a range of contiguous sites (end > begin, step = 1), or a range of non-contiguous sites that can be defined by stepping from begin to end by steps greater than 1.
For example, assume the following subset
statements are present in the strom.conf file:
subset = a:1-1234
subset = b:1235,2001-2050
subset = c:1236-2000
subset = d:2051-3000\2
subset = e:2052-3000\2
The _subset_ranges
vector would end up looking like this (sorted by beginning site, not subset index):
index tuple
0 ( 1, 1234, 1, 0)
1 (1235, 1235, 1, 1)
2 (1236, 2000, 1, 2)
3 (2001, 2050, 1, 1)
4 (2051, 3000, 2, 3)
5 (2052, 3000, 2, 4)
The _subset_names
vector holds the subset names defined in the part
commands:
index name
0 "a"
1 "b"
2 "c"
3 "d"
4 "e"
The above examples of partition definitions are fine as long as we can assume that data is always in the form of sequences of nucleotides. What if we wanted to interpret the nucleotide sequences as codons rather than individual nucleotides? It would also be helpful to be explicit about other possible data types, such as protein (amino acid sequences) or discrete morphological data. To handle these cases, we will add an optional data type specification to our subset definition. If the data type is omitted by the user, then that subset is assumed to contain nucleotide sequence data.
For example, the following definitions are possible:
subset = rbcL[codon,plantplastid]:1-20
subset = tufA[protein]:21-40
subset = morph[standard]:41-45
subset = rDNA:46-1845
The subset named rbcL
holds data for 20 codon “sites” (but note that the 20 codon sites correspond to 60 nucleotide sites in the data file). The associated genetic code is supplied after the word codon
(the comma separator is required).
The subset named tufA
holds data for 20 amino acid sites.
The subset labeled morph
holds data from 5 discrete morphological characters.
Finally, the subset rDNA
holds data for 1800 nucleotide sites (the nucleotide data type is assumed because the data type was not specified).
We will implement the subset
command in strom.hpp as a command line (or control file) option, but for now let’s create the DataType
and GeneticCode
classes that will handle storing part of the information provided by the user in these subset
commands. Both of these classes will assist the Partition
class (next up), which parses the string that the user supplies to the subset command and stores all information about data partitioning.
GeneticCode
is a lightweight object that stores information about the genetic codes that are recognized by the program. Its main function is to translate an index into the list of all 64 codons into an index into a list of codons that are relevant to the particular genetic code assumed. Stop codons are not considered valid states in a codon model and are thus removed, so (for example) the standard (“universal”) genetic code has only 61 states even though there are 64 possible codons (3 are stop codons). GeneticCode objects also provide the amino acid corresponding to each codon, which is important in determining synonymous vs. non-synonymous transitions when constructing the instantaneous rate matrix.
Create a new C++ class named GeneticCode
and add it to your project as the header file genetic_code.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).
#pragma once
#include "xstrom.hpp"
#include <boost/algorithm/string.hpp>
namespace strom {
class Data;
class Model;
class QMatrix;
class GeneticCode {
friend class Data;
friend class Model;
friend class QMatrix;
public:
typedef std::map<int, int> genetic_code_map_t;
typedef std::map<char, unsigned> amino_acid_map_t;
typedef std::vector<unsigned> amino_acid_vect_t;
typedef std::vector<std::string> codon_vect_t;
typedef std::vector<char> amino_acid_symbol_vect_t;
typedef std::map<std::string, std::string> genetic_code_definitions_t;
typedef std::vector<std::string> genetic_code_names_t;
GeneticCode();
GeneticCode(std::string name);
~GeneticCode();
std::string getGeneticCodeName() const;
void useGeneticCode(std::string name);
unsigned getNumNonStopCodons() const;
int getStateCode(int triplet_index) const;
char getAminoAcidAbbrev(unsigned aa_index) const;
void copyCodons(std::vector<std::string> & codon_vect) const;
void copyAminoAcids(std::vector<unsigned> & aa_vect) const;
static genetic_code_names_t getRecognizedGeneticCodeNames();
static bool isRecognizedGeneticCodeName(const std::string & name);
static void ensureGeneticCodeNameIsValid(const std::string & name);
private:
void buildGeneticCodeTranslators();
std::string _genetic_code_name;
genetic_code_map_t _genetic_code_map;
amino_acid_map_t _amino_acid_map;
amino_acid_vect_t _amino_acids;
codon_vect_t _codons;
const amino_acid_symbol_vect_t _all_amino_acids = {'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'};
const std::vector<std::string> _all_codons = {"AAA", "AAC", "AAG", "AAT", "ACA", "ACC", "ACG", "ACT", "AGA", "AGC", "AGG", "AGT", "ATA", "ATC", "ATG", "ATT", "CAA", "CAC", "CAG", "CAT", "CCA", "CCC", "CCG", "CCT", "CGA", "CGC", "CGG", "CGT", "CTA", "CTC", "CTG", "CTT", "GAA", "GAC", "GAG", "GAT", "GCA", "GCC", "GCG", "GCT", "GGA", "GGC", "GGG", "GGT", "GTA", "GTC", "GTG", "GTT", "TAA", "TAC", "TAG", "TAT", "TCA", "TCC", "TCG", "TCT", "TGA", "TGC", "TGG", "TGT", "TTA", "TTC", "TTG", "TTT"};
static genetic_code_definitions_t _definitions;
public:
typedef std::shared_ptr<GeneticCode> SharedPtr;
};
// member function bodies go here
}
Two constructors are provided: the first accepts no arguments and sets itself up to represent the standard genetic code; the second accepts a genetic code name string and sets itself to represent that genetic code. Both constructors call the member function buildGeneticCodeTranslators
(described below) to do all of the work except setting _genetic_code_name
.
inline GeneticCode::GeneticCode() {
//std::cout << "Constructing a standard GeneticCode" << std::endl;
useGeneticCode("standard");
}
inline GeneticCode::GeneticCode(std::string name) {
//std::cout << "Constructing a " << name << " GeneticCode" << std::endl;
useGeneticCode(name);
}
inline GeneticCode::~GeneticCode() {
//std::cout << "Destroying a GeneticCode" << std::endl;
}
This function can be used to obtain the name of the genetic code currently represented by this object.
inline std::string GeneticCode::getGeneticCodeName() const {
return _genetic_code_name;
}
This function can be used to change which genetic code the GeneticCode
object represents after construction. This is in fact the same method that the constructors use.
inline void GeneticCode::useGeneticCode(std::string name) {
_genetic_code_name = name;
buildGeneticCodeTranslators();
}
This returns the number of codons that are not stop codons in the genetic code represented by this GeneticCode
object. This number is simply the length of the _codons
vector because that vector stores only non-stop codons.
inline unsigned GeneticCode::getNumNonStopCodons() const {
return (unsigned)_codons.size();
}
Given an index into a vector of all 64 codons (triplet_index
), this function returns the index of the same triplet in the _codons
vector, which is a list of all non-stop codons for the genetic code represented by this GeneticCode
object. This function will be used to map the 64-state codes used by the Nexus Class Library (NCL) onto the states used by our codon models (codon models do not include the stop codons as states).
inline int GeneticCode::getStateCode(int triplet_index) const {
return _genetic_code_map.at(triplet_index);
}
This function returns the single-letter abbreviation for an amino acid given its (0-offset) index in the _all_amino_acids
vector.
inline char GeneticCode::getAminoAcidAbbrev(unsigned aa_index) const {
return _all_amino_acids[aa_index];
}
This function copies its own_codons
vector to the supplied vector reference codon_vect
. This function will later be used by codon models, which need to know all valid codon states for the appropriate genetic code.
inline void GeneticCode::copyCodons(std::vector<std::string> & codon_vect) const {
codon_vect.resize(_codons.size());
std::copy(_codons.begin(), _codons.end(), codon_vect.begin());
}
This function copies its own_amino_acids
vector to the supplied vector reference aa_vect
. This function will later be used by codon models, which need to know the amino acid for each codon state for the appropriate genetic code in order to determine whether a particular transition represents a synonymous or nonsynonymous substitution.
inline void GeneticCode::copyAminoAcids(std::vector<unsigned> & aa_vect) const {
aa_vect.resize(_amino_acids.size());
std::copy(_amino_acids.begin(), _amino_acids.end(), aa_vect.begin());
}
This function sets up all data members based on the current value of the data member _genetic_code_name. The data member _amino_acid_map
provides the index of a particular amino acid into the _all_amino_acids
vector given its single letter abbreviation. The _codons
and _amino_acids
data members are vectors containing all non-stop codons and the corresponding amino acids, respectively, for the chosen genetic code. Finally, _genetic_code_map provides the codon state code for a nucleotide triplet index. For example, in the standard genetic code, TTT is the 64th triplet but its state code is 61 because 3 triplets represent stop codons and are not counted as states.
inline void GeneticCode::buildGeneticCodeTranslators() {
_amino_acid_map.clear();
for (unsigned i = 0; i < 20; ++i) {
char aa = _all_amino_acids[i];
_amino_acid_map[aa] = i;
}
ensureGeneticCodeNameIsValid(_genetic_code_name);
std::string gcode_aa = _definitions[_genetic_code_name]; // e.g. "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
int k = 0;
int state_code = 0;
_codons.clear();
_amino_acids.clear();
_genetic_code_map.clear();
for (char ch : gcode_aa) {
if (ch != '*') {
_genetic_code_map[k] = state_code++;
_codons.push_back(_all_codons[k]);
_amino_acids.push_back(_amino_acid_map[ch]);
}
++k;
}
}
This is a static member function (which means it can be called even if no objects of the GeneticCode class have been constructed) that builds and returns a vector of genetic code names. The genetic code names used are the keys of the _definitions
map, which, as a static variable, must be instantiated in a source code file (not a header file). In our case, the only source code file is main.cpp, which is where all static variables are initialized.
inline GeneticCode::genetic_code_names_t GeneticCode::getRecognizedGeneticCodeNames() {
genetic_code_names_t names;
for (auto it = _definitions.begin(); it != _definitions.end(); ++it)
names.push_back(it->first);
return names;
}
This is a static function that returns true
if the supplied string name
represents a valid genetic code name stored as a key in the (static) _definitions
vector. Note that name
is converted to lower case before the lookup because all keys in _definitions
are lower case.
inline bool GeneticCode::isRecognizedGeneticCodeName(const std::string & name) {
std::string lcname = name;
boost::to_lower(lcname);
return (_definitions.find(lcname) != _definitions.end());
}
This is a static member function that checks whether the supplied string name
represents a recognized genetic code. If not, a list of valid genetic code names is printed to standard out and an XStrom
exception is thrown.
inline void GeneticCode::ensureGeneticCodeNameIsValid(const std::string & name) {
if (!isRecognizedGeneticCodeName(name)) {
auto valid_genetic_code_names = getRecognizedGeneticCodeNames();
std::cout << "Recognized genetic codes:\n";
for (std::string name : valid_genetic_code_names) {
std::cout << " " << name << "\n";
}
std::cout << std::endl;
throw XStrom(boost::format("%s is not a recognized genetic code") % name);
}
}
DataType
is a lightweight object that stores the type of data for a partition along with a small amount of associated information, such as the number of discrete states for a data type and the genetic code in the case of the codon data type. A DataType
object will be associated with each data subset defined in the Partition
object.
Create a new C++ class named DataType
and add it to your project as the header file datatype.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).
#pragma once
#include "genetic_code.hpp"
#include <boost/format.hpp>
namespace strom {
class DataType {
public:
DataType();
~DataType();
void setNucleotide();
void setCodon();
void setProtein();
void setStandard();
bool isNucleotide() const;
bool isCodon() const;
bool isProtein() const;
bool isStandard() const;
void setStandardNumStates(unsigned nstates);
void setGeneticCodeFromName(std::string genetic_code_name);
void setGeneticCode(GeneticCode::SharedPtr gcode);
unsigned getDataType() const;
unsigned getNumStates() const;
std::string getDataTypeAsString() const;
const GeneticCode::SharedPtr getGeneticCode() const;
static std::string translateDataTypeToString(unsigned datatype);
private:
unsigned _datatype;
unsigned _num_states;
GeneticCode::SharedPtr _genetic_code;
};
// member function bodies go here
}
The only thing out of the ordinary here is the fact that the constructor sets the data type to be nucleotide by default. This is done by calling the setNucleotide
member function.
inline DataType::DataType() : _datatype(0), _num_states(0) {
//std::cout << "Creating a DataType object" << std::endl;
setNucleotide();
}
inline DataType::~DataType() {
//std::cout << "Destroying a DataType object" << std::endl;
}
The four functions setNucleotide
, setCodon
, setProtein
, and setStandard
are used to change the data type to one of the four recognized types: nucleotide, codon, protein, and standard. The data member _datatype
is set to a unique value for each data type, and _num_states
is set to the appropriate number of states. The codon data type makes no sense without an associated genetic code, which is set to standard
(i.e. the “universal” code) by default and can be changed later using the setGeneticCode
member function. The number of states for the standard data type (not to be confused with the standard genetic code!) is set to 2 initially and can be dhanged later using the setStandardNumStates
member function.
inline void DataType::setNucleotide() {
_datatype = 1;
_num_states = 4;
_genetic_code = nullptr;
}
inline void DataType::setCodon() {
_datatype = 2;
_genetic_code = GeneticCode::SharedPtr(new GeneticCode("standard"));
_num_states = _genetic_code->getNumNonStopCodons();
}
inline void DataType::setProtein() {
_datatype = 3;
_num_states = 20;
_genetic_code = nullptr;
}
inline void DataType::setStandard() {
_datatype = 4;
_num_states = 2;
_genetic_code = nullptr;
}
The four member functions isNucleotide
, isCodon
, isProtein
, and isStandard
allow you to query the data type to determine which of the four possible data types it represents.
inline bool DataType::isNucleotide() const {
return (_datatype == 1);
}
inline bool DataType::isCodon() const {
return (_datatype == 2);
}
inline bool DataType::isProtein() const {
return (_datatype == 3);
}
inline bool DataType::isStandard() const {
return (_datatype == 4);
}
The setGeneticCodeFromName
member function takes a string argument representing the genetic code to associate with the codon data type. The string suppled should be one of the 17 names returned by the GeneticCode::getRecognizedGeneticCodeNames
function: standard
, vertmito
, yeastmito
, moldmito
, invertmito
, ciliate
, echinomito
, euplotid
, plantplastid
, altyeast
, ascidianmito
, altflatwormmito
, blepharismamacro
, chlorophyceanmito
, trematodemito
, scenedesmusmito
, and thraustochytriummito
. These are defined as the keys of the static map GeneticCode::_definitions
, which is initialized in the main.cpp file.
The setGeneticCode
member function allows you to set the genetic code using an existing GeneticCode object by passing in a shared pointer to the existing object.
The setStandardNumStates
member function takes an integer argument representing the number of discrete character states for a standard data type and sets _num_states
to that value. Note that it also sets _datatype
, which means you can use this function instead of setStandard
, which sets _num_states
to 2.
inline void DataType::setGeneticCodeFromName(std::string genetic_code_name) {
assert(isCodon());
_genetic_code = GeneticCode::SharedPtr(new GeneticCode(genetic_code_name));
}
inline void DataType::setGeneticCode(GeneticCode::SharedPtr gcode) {
assert(isCodon());
assert(gcode);
_genetic_code = gcode;
}
inline void DataType::setStandardNumStates(unsigned nstates) {
_datatype = 4;
_num_states = nstates;
_genetic_code = nullptr;
}
These are accessor functions that return the values stored in the private data members _datatype
, _num_states
, and _genetic_code
, respectively.
inline unsigned DataType::getDataType() const {
return _datatype;
}
inline unsigned DataType::getNumStates() const {
return _num_states;
}
inline const GeneticCode::SharedPtr DataType::getGeneticCode() const {
assert(isCodon());
return _genetic_code;
}
The translateDataTypeToString
function is declared static, which means it can be called without reference to any particular DataType
object. It is used to translate the arbitrary integer code for a datatype into a meaningful name (e.g., “nucleotide”, “codon”, “protein”, or “standard”).
The getDataTypeAsString
function simply calls translateDataTypeToString
, passing its stored _datatype
value. This function can be used to query a particular DataType
object for the name of its particular data type. For the codon data type, the string returned also specifies the name of the associated genetic code.
inline std::string DataType::getDataTypeAsString() const {
std::string s = translateDataTypeToString(_datatype);
if (isCodon())
s += boost::str(boost::format(",%s") % _genetic_code->getGeneticCodeName());
return s;
}
inline std::string DataType::translateDataTypeToString(unsigned datatype) {
assert(datatype == 1 || datatype == 2 || datatype == 3 || datatype == 4);
if (datatype == 1)
return std::string("nucleotide");
else if (datatype == 2)
return std::string("codon");
else if (datatype == 3)
return std::string("protein");
else
return std::string("standard");
}