9.1 Create the DataType and GeneticCode classes

(Win version)

< 9.0 | 9.1 | 9.2 >

Defining a data partition

We will ultimately want to be able to partition our data: assign one model to one subset of sites and a different model to the remaining sites. This tutorial adopts the definition of the term partition as a wall or set of walls that subdivide something. In this case, a partition refers to the way in which sites are divided into subsets, each getting its own model. The individual subsets are often (confusingly) also called partitions, but I will use the term data subset to represent a “room” and reserve partition for the “walls” separating the rooms. A partition or a partitioning will be used to denote a particular number and placement of walls.

For example, suppose we want to partition the sequences for a protein coding gene into first, second, and third codon positions. We will use three subset commands to do this, either on the command line or (more reasonably) in our strom.conf file:

subset = first:1-60\3
subset = second:2-60\3
subset = third:3-60\3

These subset commands provide arbitrary names (first, second, and third) for the three data subsets composing the partition, and also indicate which sites fall in each subset. The \3 indicates the step size, also known as the stride, which is by default 1. Thus, 2-60\3 means 2, 2+3, 2+6, 2+9, ....

The Strom class will take responsibility for grabbing the first:1-60\3, second:2-60\3, and third:3-60\3 strings from the command line or strom.conf file; the Partition class handles interpreting these strings. The Partition class will store a vector of tuples, called _subset_ranges, with each tuple comprising four integers: begin site (inclusive), end site (inclusive), step size, and subset index. Each tuple in _subset_ranges assigns a subset index (0, 1, 2, …, _num_subsets-1) to a chunk of sites, where a chunk can be either a single site (begin=end), a range of contiguous sites (end > begin, step = 1), or a range of non-contiguous sites that can be defined by stepping from begin to end by steps greater than 1.

For example, assume the following subset statements are present in the strom.conf file:

subset = a:1-1234
subset = b:1235,2001-2050
subset = c:1236-2000
subset = d:2051-3000\2
subset = e:2052-3000\2

The _subset_ranges vector would end up looking like this (sorted by beginning site, not subset index):

index          tuple
  0     (   1, 1234, 1, 0)
  1     (1235, 1235, 1, 1)
  2     (1236, 2000, 1, 2)
  3     (2001, 2050, 1, 1)
  4     (2051, 3000, 2, 3)
  5     (2052, 3000, 2, 4)

The _subset_names vector holds the subset names defined in the part commands:

index   name
  0     "a"
  1     "b"
  2     "c"
  3     "d"
  4     "e"

Specifying the data type for a subset

The above examples of partition definitions are fine as long as we can assume that data is always in the form of sequences of nucleotides. What if we wanted to interpret the nucleotide sequences as codons rather than individual nucleotides? It would also be helpful to be explicit about other possible data types, such as protein (amino acid sequences) or discrete morphological data. To handle these cases, we will add an optional data type specification to our subset definition. If the data type is omitted by the user, then that subset is assumed to contain nucleotide sequence data.

For example, the following definitions are possible:

subset = rbcL[codon,plantplastid]:1-20
subset = tufA[protein]:21-40
subset = morph[standard]:41-45
subset = rDNA:46-1845

The subset named rbcL holds data for 20 codon “sites” (but note that the 20 codon sites correspond to 60 nucleotide sites in the data file). The associated genetic code is supplied after the word codon (the comma separator is required).

The subset named tufA holds data for 20 amino acid sites.

The subset labeled morph holds data from 5 discrete morphological characters.

Finally, the subset rDNA holds data for 1800 nucleotide sites (the nucleotide data type is assumed because the data type was not specified).

We will implement the subset command in strom.hpp as a command line (or control file) option, but for now let’s create the DataType and GeneticCode classes that will handle storing part of the information provided by the user in these subset commands. Both of these classes will assist the Partition class (next up), which parses the string that the user supplies to the subset command and stores all information about data partitioning.

GeneticCode class

GeneticCode is a lightweight object that stores information about the genetic codes that are recognized by the program. Its main function is to translate an index into the list of all 64 codons into an index into a list of codons that are relevant to the particular genetic code assumed. Stop codons are not considered valid states in a codon model and are thus removed, so (for example) the standard (“universal”) genetic code has only 61 states even though there are 64 possible codons (3 are stop codons). GeneticCode objects also provide the amino acid corresponding to each codon, which is important in determining synonymous vs. non-synonymous transitions when constructing the instantaneous rate matrix.

Create a new C++ class named GeneticCode and add it to your project as the header file genetic_code.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).

#pragma once    

#include "xstrom.hpp"
#include &lt;boost/algorithm/string.hpp&gt;

namespace strom {

    class Data;
    class Model;
    class QMatrix;

    class GeneticCode {

        friend class Data;
        friend class Model;
        friend class QMatrix;

        public:

            typedef std::map&lt;int, int&gt;                  genetic_code_map_t;
            typedef std::map&lt;char, unsigned&gt;            amino_acid_map_t;
            typedef std::vector&lt;unsigned&gt;               amino_acid_vect_t;
            typedef std::vector&lt;std::string&gt;            codon_vect_t;
            typedef std::vector&lt;char&gt;                   amino_acid_symbol_vect_t;
            typedef std::map&lt;std::string, std::string&gt;  genetic_code_definitions_t;
            typedef std::vector&lt;std::string&gt;            genetic_code_names_t;
            
        
                                                GeneticCode();
                                                GeneticCode(std::string name);
                                                ~GeneticCode();
        
            std::string                         getGeneticCodeName() const;
            void                                useGeneticCode(std::string name);
        
            unsigned                            getNumNonStopCodons() const;
            int                                 getStateCode(int triplet_index) const;
            char                                getAminoAcidAbbrev(unsigned aa_index) const;

            void                                copyCodons(std::vector&lt;std::string&gt; & codon_vect) const;
            void                                copyAminoAcids(std::vector&lt;unsigned&gt; & aa_vect) const;

            static genetic_code_names_t         getRecognizedGeneticCodeNames();
            static bool                         isRecognizedGeneticCodeName(const std::string & name);
            static void                         ensureGeneticCodeNameIsValid(const std::string & name);

        private:

            void                                buildGeneticCodeTranslators();

            std::string                         _genetic_code_name;
        
            genetic_code_map_t                  _genetic_code_map;
            amino_acid_map_t                    _amino_acid_map;
        
            amino_acid_vect_t                   _amino_acids;
            codon_vect_t                        _codons;
        
            const amino_acid_symbol_vect_t      _all_amino_acids = {'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'};
            const std::vector&lt;std::string&gt;      _all_codons = {"AAA", "AAC", "AAG", "AAT", "ACA", "ACC", "ACG", "ACT", "AGA", "AGC", "AGG", "AGT", "ATA", "ATC", "ATG", "ATT", "CAA", "CAC", "CAG", "CAT", "CCA", "CCC", "CCG", "CCT", "CGA", "CGC", "CGG", "CGT", "CTA", "CTC", "CTG", "CTT", "GAA", "GAC", "GAG", "GAT", "GCA", "GCC", "GCG", "GCT", "GGA", "GGC", "GGG", "GGT", "GTA", "GTC", "GTG", "GTT", "TAA", "TAC", "TAG", "TAT", "TCA", "TCC", "TCG", "TCT", "TGA", "TGC", "TGG", "TGT", "TTA", "TTC", "TTG", "TTT"};

            static genetic_code_definitions_t   _definitions;

        public:

            typedef std::shared_ptr&lt;GeneticCode&gt; SharedPtr;
    };

    // member function bodies go here
    
}   

Constructors and destructor

Two constructors are provided: the first accepts no arguments and sets itself up to represent the standard genetic code; the second accepts a genetic code name string and sets itself to represent that genetic code. Both constructors call the member function buildGeneticCodeTranslators (described below) to do all of the work except setting _genetic_code_name.

    inline GeneticCode::GeneticCode() { 
        //std::cout &lt;&lt; "Constructing a standard GeneticCode" &lt;&lt; std::endl;
        useGeneticCode("standard");
    }

    inline GeneticCode::GeneticCode(std::string name) {
        //std::cout &lt;&lt; "Constructing a " &lt;&lt; name &lt;&lt; " GeneticCode" &lt;&lt; std::endl;
        useGeneticCode(name);
    }

    inline GeneticCode::~GeneticCode() {
        //std::cout &lt;&lt; "Destroying a GeneticCode" &lt;&lt; std::endl;
    } 

The getGeneticCodeName member function

This function can be used to obtain the name of the genetic code currently represented by this object.

    inline std::string GeneticCode::getGeneticCodeName() const { 
        return _genetic_code_name;
    } 

The useGeneticCode member function

This function can be used to change which genetic code the GeneticCode object represents after construction. This is in fact the same method that the constructors use.

    inline void GeneticCode::useGeneticCode(std::string name) { 
        _genetic_code_name = name;
        buildGeneticCodeTranslators();
    } 

The getNumNonStopCodons member function

This returns the number of codons that are not stop codons in the genetic code represented by this GeneticCode object. This number is simply the length of the _codons vector because that vector stores only non-stop codons.

    inline unsigned GeneticCode::getNumNonStopCodons() const { 
        return (unsigned)_codons.size();
    } 

The getStateCode member function

Given an index into a vector of all 64 codons (triplet_index), this function returns the index of the same triplet in the _codons vector, which is a list of all non-stop codons for the genetic code represented by this GeneticCode object. This function will be used to map the 64-state codes used by the Nexus Class Library (NCL) onto the states used by our codon models (codon models do not include the stop codons as states).

    inline int GeneticCode::getStateCode(int triplet_index) const { 
        return _genetic_code_map.at(triplet_index);
    } 

The getAminoAcidAbbrev member function

This function returns the single-letter abbreviation for an amino acid given its (0-offset) index in the _all_amino_acids vector.

    inline char GeneticCode::getAminoAcidAbbrev(unsigned aa_index) const { 
        return _all_amino_acids[aa_index];
    } 

The copyCodons member function

This function copies its own_codons vector to the supplied vector reference codon_vect. This function will later be used by codon models, which need to know all valid codon states for the appropriate genetic code.

    inline void GeneticCode::copyCodons(std::vector&lt;std::string&gt; & codon_vect) const { 
        codon_vect.resize(_codons.size());
        std::copy(_codons.begin(), _codons.end(), codon_vect.begin());
    } 

The copyAminoAcids member function

This function copies its own_amino_acids vector to the supplied vector reference aa_vect. This function will later be used by codon models, which need to know the amino acid for each codon state for the appropriate genetic code in order to determine whether a particular transition represents a synonymous or nonsynonymous substitution.

    inline void GeneticCode::copyAminoAcids(std::vector&lt;unsigned&gt; & aa_vect) const { 
        aa_vect.resize(_amino_acids.size());
        std::copy(_amino_acids.begin(), _amino_acids.end(), aa_vect.begin());
    } 

The buildGeneticCodeTranslators member function

This function sets up all data members based on the current value of the data member _genetic_code_name. The data member _amino_acid_map provides the index of a particular amino acid into the _all_amino_acids vector given its single letter abbreviation. The _codons and _amino_acids data members are vectors containing all non-stop codons and the corresponding amino acids, respectively, for the chosen genetic code. Finally, _genetic_code_map provides the codon state code for a nucleotide triplet index. For example, in the standard genetic code, TTT is the 64th triplet but its state code is 61 because 3 triplets represent stop codons and are not counted as states.

    inline void GeneticCode::buildGeneticCodeTranslators() { 
        _amino_acid_map.clear();
        for (unsigned i = 0; i &lt; 20; ++i) {
            char aa = _all_amino_acids[i];
            _amino_acid_map[aa] = i;
        }

        ensureGeneticCodeNameIsValid(_genetic_code_name);
        std::string gcode_aa = _definitions[_genetic_code_name];  // e.g. "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
        
        int k = 0;
        int state_code = 0;
        _codons.clear();
        _amino_acids.clear();
        _genetic_code_map.clear();
        for (char ch : gcode_aa) {
            if (ch != '*') {
                _genetic_code_map[k] = state_code++;
                _codons.push_back(_all_codons[k]);
                _amino_acids.push_back(_amino_acid_map[ch]);
            }
            ++k;
        }
    } 

The getRecognizedGeneticCodeNames member function

This is a static member function (which means it can be called even if no objects of the GeneticCode class have been constructed) that builds and returns a vector of genetic code names. The genetic code names used are the keys of the _definitions map, which, as a static variable, must be instantiated in a source code file (not a header file). In our case, the only source code file is main.cpp, which is where all static variables are initialized.

    inline GeneticCode::genetic_code_names_t GeneticCode::getRecognizedGeneticCodeNames() { 
        genetic_code_names_t names;
        for (auto it = _definitions.begin(); it != _definitions.end(); ++it)
            names.push_back(it-&gt;first);
        return names;
    } 

The isRecognizedGeneticCodeName member function

This is a static function that returns true if the supplied string name represents a valid genetic code name stored as a key in the (static) _definitions vector. Note that name is converted to lower case before the lookup because all keys in _definitions are lower case.

    inline bool GeneticCode::isRecognizedGeneticCodeName(const std::string & name) { 
        std::string lcname = name;
        boost::to_lower(lcname);
        return (_definitions.find(lcname) != _definitions.end());
    } 

The ensureGeneticCodeNameIsValid member function

This is a static member function that checks whether the supplied string name represents a recognized genetic code. If not, a list of valid genetic code names is printed to standard out and an XStrom exception is thrown.

    inline void GeneticCode::ensureGeneticCodeNameIsValid(const std::string & name) { 
        if (!isRecognizedGeneticCodeName(name)) {
            auto valid_genetic_code_names = getRecognizedGeneticCodeNames();
            std::cout &lt;&lt; "Recognized genetic codes:\n";
            for (std::string name : valid_genetic_code_names) {
                std::cout &lt;&lt; "  " &lt;&lt; name &lt;&lt; "\n";
            }
            std::cout &lt;&lt; std::endl;
            throw XStrom(boost::format("%s is not a recognized genetic code") % name);
        }
    } 

DataType class

DataType is a lightweight object that stores the type of data for a partition along with a small amount of associated information, such as the number of discrete states for a data type and the genetic code in the case of the codon data type. A DataType object will be associated with each data subset defined in the Partition object.

Create a new C++ class named DataType and add it to your project as the header file datatype.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).

#pragma once    

#include "genetic_code.hpp"
#include &lt;boost/format.hpp&gt;

namespace strom {

    class DataType {
        public:
                                            DataType();
                                            ~DataType();

            void                            setNucleotide();
            void                            setCodon();
            void                            setProtein();
            void                            setStandard();
        
            bool                            isNucleotide() const;
            bool                            isCodon() const;
            bool                            isProtein() const;
            bool                            isStandard() const;

            void                            setStandardNumStates(unsigned nstates);
            void                            setGeneticCodeFromName(std::string genetic_code_name);
            void                            setGeneticCode(GeneticCode::SharedPtr gcode);

            unsigned                        getDataType() const;
            unsigned                        getNumStates() const;
            std::string                     getDataTypeAsString() const;
            const GeneticCode::SharedPtr    getGeneticCode() const;
            
            static std::string              translateDataTypeToString(unsigned datatype);

        private:

            unsigned                        _datatype;
            unsigned                        _num_states;
            GeneticCode::SharedPtr          _genetic_code;
    };
    
    // member function bodies go here
    
}   

Constructor and destructor

The only thing out of the ordinary here is the fact that the constructor sets the data type to be nucleotide by default. This is done by calling the setNucleotide member function.

    inline DataType::DataType() : _datatype(0), _num_states(0) {    
        //std::cout &lt;&lt; "Creating a DataType object" &lt;&lt; std::endl;
        setNucleotide();
    }
    
    inline DataType::~DataType() {
        //std::cout &lt;&lt; "Destroying a DataType object" &lt;&lt; std::endl;
    }    

Setters

The four functions setNucleotide, setCodon, setProtein, and setStandard are used to change the data type to one of the four recognized types: nucleotide, codon, protein, and standard. The data member _datatype is set to a unique value for each data type, and _num_states is set to the appropriate number of states. The codon data type makes no sense without an associated genetic code, which is set to standard (i.e. the “universal” code) by default and can be changed later using the setGeneticCode member function. The number of states for the standard data type (not to be confused with the standard genetic code!) is set to 2 initially and can be dhanged later using the setStandardNumStates member function.

    inline void DataType::setNucleotide() {    
        _datatype = 1;
        _num_states = 4;
        _genetic_code = nullptr;
    }
    
    inline void DataType::setCodon() {
        _datatype = 2;
        _genetic_code = GeneticCode::SharedPtr(new GeneticCode("standard"));
        _num_states = _genetic_code-&gt;getNumNonStopCodons();
    }
    
    inline void DataType::setProtein() {
        _datatype = 3;
        _num_states = 20;
        _genetic_code = nullptr;
    }
    
    inline void DataType::setStandard() {
        _datatype = 4;
        _num_states = 2;
        _genetic_code = nullptr;
    }   

Querying the data type

The four member functions isNucleotide, isCodon, isProtein, and isStandard allow you to query the data type to determine which of the four possible data types it represents.

    inline bool DataType::isNucleotide() const {    
        return (_datatype == 1);
    }

    inline bool DataType::isCodon() const {
        return (_datatype == 2);
    }

    inline bool DataType::isProtein() const {
        return (_datatype == 3);
    }

    inline bool DataType::isStandard() const {
        return (_datatype == 4);
    }   

Changing the default genetic code and (for standard datatype) number of states

The setGeneticCodeFromName member function takes a string argument representing the genetic code to associate with the codon data type. The string suppled should be one of the 17 names returned by the GeneticCode::getRecognizedGeneticCodeNames function: standard, vertmito, yeastmito, moldmito, invertmito, ciliate, echinomito, euplotid, plantplastid , altyeast, ascidianmito, altflatwormmito, blepharismamacro, chlorophyceanmito , trematodemito, scenedesmusmito, and thraustochytriummito. These are defined as the keys of the static map GeneticCode::_definitions, which is initialized in the main.cpp file.

The setGeneticCode member function allows you to set the genetic code using an existing GeneticCode object by passing in a shared pointer to the existing object.

The setStandardNumStates member function takes an integer argument representing the number of discrete character states for a standard data type and sets _num_states to that value. Note that it also sets _datatype, which means you can use this function instead of setStandard, which sets _num_states to 2.

    inline void DataType::setGeneticCodeFromName(std::string genetic_code_name) {   
        assert(isCodon());
        _genetic_code = GeneticCode::SharedPtr(new GeneticCode(genetic_code_name));
    }
    
    inline void DataType::setGeneticCode(GeneticCode::SharedPtr gcode) {
        assert(isCodon());
        assert(gcode);
        _genetic_code = gcode;
    }

    inline void DataType::setStandardNumStates(unsigned nstates) {
        _datatype = 4;
        _num_states = nstates;
        _genetic_code = nullptr;
    }   

The getDataType, getNumStates, and getGeneticCode member functions

These are accessor functions that return the values stored in the private data members _datatype, _num_states, and _genetic_code, respectively.

    inline unsigned DataType::getDataType() const {   
        return _datatype;
    }
    
    inline unsigned DataType::getNumStates() const {
        return _num_states;
    }
    
    inline const GeneticCode::SharedPtr DataType::getGeneticCode() const {
        assert(isCodon());
        return _genetic_code;
    }   

The getDataTypeAsString and translateDataTypeToString functions

The translateDataTypeToString function is declared static, which means it can be called without reference to any particular DataType object. It is used to translate the arbitrary integer code for a datatype into a meaningful name (e.g., “nucleotide”, “codon”, “protein”, or “standard”).

The getDataTypeAsString function simply calls translateDataTypeToString, passing its stored _datatype value. This function can be used to query a particular DataType object for the name of its particular data type. For the codon data type, the string returned also specifies the name of the associated genetic code.

    inline std::string DataType::getDataTypeAsString() const {   
        std::string s = translateDataTypeToString(_datatype);
        if (isCodon())
            s += boost::str(boost::format(",%s") % _genetic_code-&gt;getGeneticCodeName());
        return s;
    }
    
    inline std::string DataType::translateDataTypeToString(unsigned datatype) {
        assert(datatype == 1 || datatype == 2 || datatype == 3 || datatype == 4);
        if (datatype == 1)
            return std::string("nucleotide");
        else if (datatype == 2)
            return std::string("codon");
        else if (datatype == 3)
            return std::string("protein");
        else
            return std::string("standard");
    }   

< 9.0 | 9.1 | 9.2 >