11.5 Adding the Model to the Likelihood class

(Win version)


Before we can test the new Model class, we must integrate it into the Likelihood class.

Start by adding this include at the top of the likelihood.hpp file:

#pragma once    

#include &lt;map&gt;
#include &lt;boost/algorithm/string.hpp&gt;
#include &lt;boost/format.hpp&gt;
#include &lt;boost/shared_ptr.hpp&gt;
#include &lt;boost/range/adaptor/reversed.hpp&gt;
#include "libhmsbeagle/beagle.h"
#include "tree.hpp"
#include "data.hpp"
<span style="color:#0000ff"><strong>#include "model.hpp"</strong></span>
#include "xstrom.hpp"

Add getModel and setModel member functions

Add these two declarations to the public section of the Likelihood class declaration:

            Model::SharedPtr                        getModel();                     
            void                                    setModel(Model::SharedPtr m);   

Here are the bodies of these two functions. It is assumed that the setModel function will only be called before BeagleLib is instantiated, hence the assertion that the length of the _instances vector is zero.

    inline Model::SharedPtr Likelihood::getModel() {  
        return _model;
    }
    
    inline void Likelihood::setModel(Model::SharedPtr m) {
        assert(_instances.size() == 0); // can't change model after initBeagleLib called
        _model = m;
    }   

Add an invarmodel data member to the InstanceInfo class

A data member invarmodel should be added to the declaration of the InstanceInfo struct in the private section of the Likelihood class declaration.

            struct InstanceInfo {   
                int handle;
                int resourcenumber;
                std::string resourcename;
                unsigned nstates;
                unsigned nratecateg;
                unsigned npatterns;
                unsigned partial_offset;
                unsigned tmatrix_offset;
                <span style="color:#0000ff"><strong>bool invarmodel;</strong></span>
                std::vector&lt;unsigned&gt; subsets;
                
                <span style="color:#0000ff"><strong>InstanceInfo() : handle(-1), resourcenumber(-1), resourcename(""), nstates(0), nratecateg(0), npatterns(0), partial_offset(0), tmatrix_offset(0), invarmodel(false) {}</strong></span>
            };   

This bool will keep track of whether the model for a particular instance has an extra zero-rate category for invariable sites.

Be sure to also initialize the new variable in the InstanceInfo constructor.

New data member _model

Add a new data member _model to the private part of the Likelihood class declaration.

            Model::SharedPtr                        _model;

Initialize _model in the clear function:

        _model = Model::SharedPtr(new Model());

Modify the initBeagleLib function

Add the assert highlighted in blue to ensure that _model has been assigned prior to initBeagleLib being called.

Instead of assuming 1 rate category (i.e. equal rates), make the changes and additions shown in blue. Note that invariable sites models add an additional rate category, so different instances of BeagleLib would be needed to handle a 4-category GTR+G model versus a 4-category GTR+I+G model. We need some way to distinguish an I+G model from a G model having the same number of rate categories. The way this is done here is to make the number of rate categories negative for I+G models and positive for G models.

The last change in blue simply adds information about invariable sites status when reporting on newly-created instances.

    inline void Likelihood::initBeagleLib() {   
        assert(_data);
        <span style="color:#0000ff"><strong>assert(_model);</strong></span>

        // Close down any existing BeagleLib instances
        finalizeBeagleLib(true);

        _ntaxa = _data-&gt;getNumTaxa();
        
        unsigned nsubsets = _data-&gt;getNumSubsets();
        std::set&lt;instance_pair_t&gt; nstates_ncateg_combinations;
        std::map&lt;instance_pair_t, std::vector&lt;unsigned&gt; &gt; subsets_for_pair;
        for (unsigned subset = 0; subset &lt; nsubsets; subset++) {
            // Create a pair comprising number of states and number of rate categories
            unsigned nstates = _data-&gt;getNumStatesForSubset(subset);
            <span style="color:#0000ff"><strong>bool invar_model = _model-&gt;getSubsetIsInvarModel(subset);</strong></span>
            <span style="color:#0000ff"><strong>int nrates = (invar_model ? -1 : 1)*_model-&gt;getSubsetNumCateg(subset);</strong></span>
            instance_pair_t p = std::make_pair(nstates, nrates);
            
            // Add combo to set
            nstates_ncateg_combinations.insert(p);
            subsets_for_pair[p].push_back(subset);
        }

        // Create one instance for each distinct nstates-nrates combination
        _instances.clear();
        for (auto p : nstates_ncateg_combinations) {
            newInstance(p.first, p.second, subsets_for_pair[p]);
            
            InstanceInfo & info = *_instances.rbegin();
            <span style="color:#0000ff"><strong>std::cout &lt;&lt; boost::str(boost::format("Created BeagleLib instance %d (%d states, %d rate%s, %d subset%s, %s invar. sites model)") % info.handle % info.nstates % info.nratecateg % (info.nratecateg == 1 ? "" : "s") % info.subsets.size() % (info.subsets.size() == 1 ? "" : "s") % (info.invarmodel ? "is" : "not")) &lt;&lt; std::endl;</strong></span>
        }
        
        if (_ambiguity_equals_missing)
            setTipStates();
        else
            setTipPartials();
        setPatternWeights();
        setPatternPartitionAssignments();
    }   
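The grouping logic above can be tried outside of the Likelihood class. The following is a minimal sketch, not the tutorial's code: the SubsetSpec struct and the functions makeInstanceKey and groupSubsets are hypothetical stand-ins for the information that Data and Model provide, and instance_pair_t is assumed to be a pair of unsigned and int. It shows how putting the sign on the number of rate categories keeps a G model and an I+G model with the same category count in separate groups.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for per-subset settings (not one of the tutorial's classes)
struct SubsetSpec {
    unsigned nstates;   // e.g. 4 for nucleotides, 61 for codons
    unsigned ncateg;    // number of rate categories
    bool     invar;     // true if the subset uses an invariable sites model
};

// Assumed to match the pair type used by Likelihood
typedef std::pair<unsigned, int> instance_pair_t;

// Encode invariable sites status in the sign of the rate category count
inline instance_pair_t makeInstanceKey(const SubsetSpec & spec) {
    assert(spec.ncateg > 0);    // the sign trick cannot represent zero categories
    int nrates = (spec.invar ? -1 : 1)*static_cast<int>(spec.ncateg);
    return std::make_pair(spec.nstates, nrates);
}

// Collect the subset indices that share a key; each distinct key
// corresponds to one BeagleLib instance in initBeagleLib
inline std::map<instance_pair_t, std::vector<unsigned> > groupSubsets(const std::vector<SubsetSpec> & specs) {
    std::map<instance_pair_t, std::vector<unsigned> > groups;
    for (unsigned s = 0; s < specs.size(); ++s) {
        groups[makeInstanceKey(specs[s])].push_back(s);
    }
    return groups;
}
```

Given four subsets (4-state G with 4 categories, 4-state I+G with 4 categories, another 4-state G with 4 categories, and a 61-state equal-rates subset), groupSubsets returns three groups, with the two G subsets sharing the key (4, 4) and the I+G subset alone under (4, -4).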

Modifications to the newInstance member function

If the incoming nrates is a negative number, it means that this is an invariable sites model, and the actual number of rate categories handled by BeagleLib should be the positive version of this, i.e. -nrates. The changes in blue below set the new invarmodel data member of the InstanceInfo struct accordingly and make the number of categories positive if the incoming nrates was negative.

    inline void Likelihood::newInstance(unsigned nstates, int nrates, std::vector&lt;unsigned&gt; & subset_indices) { 
        unsigned num_subsets = (unsigned)subset_indices.size();
        
        <span style="color:#0000ff"><strong>bool is_invar_model = (nrates &lt; 0 ? true : false);</strong></span>
        <span style="color:#0000ff"><strong>unsigned ngammacat = (unsigned)(is_invar_model ? -nrates : nrates);</strong></span>
        
        unsigned num_patterns = 0;
        for (auto s : subset_indices) {
            num_patterns += _data-&gt;getNumPatternsInSubset(s);
        }
        
        unsigned num_internals = calcNumInternalsInFullyResolvedTree();

        // add 1 to num_edges so that subroot node will have a tmatrix, root tip's tmatrix is never used
        unsigned num_edges = calcNumEdgesInFullyResolvedTree();
        unsigned num_nodes = num_edges + 1;
        unsigned num_transition_probs = num_nodes*num_subsets;
        
        long requirementFlags = 0;

        long preferenceFlags = BEAGLE_FLAG_PRECISION_SINGLE | BEAGLE_FLAG_THREADING_CPP;
        if (_prefer_gpu)
            preferenceFlags |= BEAGLE_FLAG_PROCESSOR_GPU;
        else
            preferenceFlags |= BEAGLE_FLAG_PROCESSOR_CPU;
        
        BeagleInstanceDetails instance_details;
        unsigned npartials = num_internals + _ntaxa;
        unsigned nsequences = 0;
        if (_ambiguity_equals_missing) {
            npartials -= _ntaxa;
            nsequences += _ntaxa;
        }
        
        int inst = beagleCreateInstance(
             _ntaxa,                           // tips
             npartials,                        // partials
             nsequences,                       // sequences
             nstates,                          // states
             num_patterns,                     // patterns (total across all subsets that use this instance)
             num_subsets,                      // models (one for each distinct eigen decomposition)
             num_subsets*num_transition_probs, // transition matrices (one for each node in each subset)
             ngammacat,                        // rate categories
             0,                                // scale buffers 
             NULL,                             // resource restrictions
             0,                                // length of resource list
             preferenceFlags,                  // preferred flags
             requirementFlags,                 // required flags
             &instance_details);               // pointer for details
        
        if (inst &lt; 0) {
            // beagleCreateInstance returns one of the following:
            //   valid instance (0, 1, 2, ...)
            //   error code (negative integer)
            throw XStrom(boost::str(boost::format("Likelihood init function failed to create BeagleLib instance. BeagleLib error code was %d (%s)") % inst % _beagle_error[inst]));
        }
        
        InstanceInfo info;
        info.handle         = inst;
        info.resourcenumber = instance_details.resourceNumber;
        info.resourcename   = instance_details.resourceName;
        info.nstates        = nstates;
        info.nratecateg     = ngammacat;
        <span style="color:#0000ff"><strong>info.invarmodel     = is_invar_model;</strong></span>
        info.subsets        = subset_indices;
        info.npatterns      = num_patterns;
        info.partial_offset = num_internals;
        info.tmatrix_offset = num_nodes;
        _instances.push_back(info);
    } 
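Here is the decoding step in isolation, as a small sketch. The function name decodeRateCategories is hypothetical and is used only for illustration. One consequence worth noting: a model with invariable sites but a single rate category (presumably how a pure I model would be represented) arrives as -1.

```cpp
#include <cassert>
#include <utility>

// Recover (is_invar_model, ngammacat) from a signed rate category count
inline std::pair<bool, unsigned> decodeRateCategories(int nrates) {
    assert(nrates != 0);    // zero has no sign, so it cannot be decoded
    bool is_invar_model = (nrates < 0);
    unsigned ngammacat = static_cast<unsigned>(is_invar_model ? -nrates : nrates);
    return std::make_pair(is_invar_model, ngammacat);
}
```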

Unconstrain the setTipPartials function

Remove the 2 lines below from setTipPartials because we now allow models to have different numbers of states.

            if (info.nstates != 4)  
                throw XStrom(boost::format("This program can handle only 4-state DNA/RNA data. You specified data having %d states for at least one data subset.") % info.nstates);  

Simplify the setAmongSiteRateHeterogenetity function

The setAmongSiteRateHeterogenetity member function can be simplified now because most of the work can be done by member functions of the Model class. Replace the current version of setAmongSiteRateHeterogenetity with this version.

    inline void Likelihood::setAmongSiteRateHeterogenetity() {  
        assert(_instances.size() &gt; 0);
        int code = 0;
        
        // Loop through all instances
        for (auto & info : _instances) {

            // Loop through all subsets assigned to this instance
            unsigned instance_specific_subset_index = 0;
            for (unsigned s : info.subsets) {
                code = _model-&gt;setBeagleAmongSiteRateVariationRates(info.handle, s, instance_specific_subset_index);
                if (code != 0)
                    throw XStrom(boost::str(boost::format("Failed to set category rates for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
            
                code = _model-&gt;setBeagleAmongSiteRateVariationProbs(info.handle, s, instance_specific_subset_index);
                if (code != 0)
                    throw XStrom(boost::str(boost::format("Failed to set category probabilities for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
                    
                ++instance_specific_subset_index;
            }
        }
    }   

Simplify setModelRateMatrix function

We can also rewrite setModelRateMatrix now using member functions of Model.

    inline void Likelihood::setModelRateMatrix() { 
        // Loop through all instances
        for (auto & info : _instances) {

            // Loop through all subsets assigned to this instance
            unsigned instance_specific_subset_index = 0;
            for (unsigned s : info.subsets) {
                int code = _model-&gt;setBeagleStateFrequencies(info.handle, s, instance_specific_subset_index);
                if (code != 0)
                    throw XStrom(boost::str(boost::format("Failed to set state frequencies for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));

                code = _model-&gt;setBeagleEigenDecomposition(info.handle, s, instance_specific_subset_index);
                if (code != 0)
                    throw XStrom(boost::str(boost::format("Failed to set eigen decomposition for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
                
                ++instance_specific_subset_index;
            }
        }
    } 

Modify the defineOperations function to accommodate subset relative rates

It is possible to specify a different relative rate for each partition subset. This allows, for example, 3rd codon position sites to evolve at a faster rate than 1st or 2nd position sites. Subset-specific rates modify edge lengths. If a subset has a subset relative rate of 2, it is as if, for purposes of calculating the likelihood, the tree were twice as large as the tree read in from the file (i.e. every edge in the tree is twice as long as the corresponding edge in the tree file version). The functions defineOperations and queueTMatrixRecalculation are where edge lengths are used, and the following code snippets highlight the lines that need to change in order to use the subset relative rates stored by the model.

Here is the line that needs to be changed in defineOperations:

    inline void Likelihood::defineOperations(Tree::SharedPtr t) {
        assert(_instances.size() &gt; 0);
        assert(t);
        assert(t-&gt;isRooted() == _rooted);

        <span style="color:#0000ff"><strong>_relrate_normalizing_constant = _model-&gt;calcNormalizingConstantForSubsetRelRates();</strong></span>
        //... 

And here are the lines that need to be changed in queueTMatrixRecalculation:

    inline void Likelihood::queueTMatrixRecalculation(Node * nd) {  
        <span style="color:#0000ff"><strong>Model::subset_relrate_vect_t & subset_relrates = _model-&gt;getSubsetRelRates();</strong></span>
        for (auto & info : _instances) {
            unsigned instance_specific_subset_index = 0;
            for (unsigned s : info.subsets) {
                <span style="color:#0000ff"><strong>double subset_relative_rate = subset_relrates[s]/_relrate_normalizing_constant;</strong></span>

                unsigned tindex = getTMatrixIndex(nd, info, instance_specific_subset_index);
                _pmatrix_index[info.handle].push_back(tindex);
                _edge_lengths[info.handle].push_back(nd-&gt;_edge_length*subset_relative_rate);
                _eigen_indices[info.handle].push_back(s);
                _category_rate_indices[info.handle].push_back(s);

                ++instance_specific_subset_index;
            }
        }
    }   
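To make the effect of dividing by the normalizing constant concrete, here is a minimal standalone sketch. It assumes that the constant is the site-weighted mean of the subset relative rates, so that the average rate over all sites stays at 1. That definition, and the helper names calcRelRateNormalizingConstant and effectiveEdgeLength, are illustrative assumptions and are not taken from the Model class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Site-weighted mean of the subset relative rates (an assumed definition;
// the Model class may compute its normalizing constant differently)
inline double calcRelRateNormalizingConstant(const std::vector<double> & relrates, const std::vector<unsigned> & subset_sizes) {
    assert(relrates.size() == subset_sizes.size());
    double weighted_sum = 0.0;
    double total_sites = 0.0;
    for (std::size_t s = 0; s < relrates.size(); ++s) {
        weighted_sum += subset_sizes[s]*relrates[s];
        total_sites  += subset_sizes[s];
    }
    assert(total_sites > 0.0);
    return weighted_sum/total_sites;
}

// Edge length as seen by one subset, mirroring the value pushed onto
// _edge_lengths in queueTMatrixRecalculation
inline double effectiveEdgeLength(double edge_length, double relrate, double normalizing_constant) {
    assert(normalizing_constant > 0.0);
    return edge_length*relrate/normalizing_constant;
}
```

With three equal-sized subsets having relative rates 1, 1 and 4, the constant is 2, so an edge of length 0.1 behaves like an edge of length 0.05 for the first two subsets and 0.2 for the third. The average over all sites is still 0.1.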

Modify calcInstanceLogLikelihood to accommodate invariable sites models

Accommodating invariable sites adds some complexity to the calculation of the log likelihood, because we cannot simply rely on BeagleLib to do all the work if an invariable sites model is in effect. (Strictly speaking, we could use BeagleLib for this, but doing so would be very inefficient: it would require full site likelihood calculations for the zero-rate category, which is trivial to handle outside of BeagleLib.) In the case of an I or I+G model, BeagleLib will handle everything except the zero-rate category. The code highlighted in blue below takes site log likelihoods calculated by BeagleLib and modifies them according to the proportion of invariable sites if an invariable sites model is being used.

    inline double Likelihood::calcInstanceLogLikelihood(InstanceInfo & info, Tree::SharedPtr t) {   
        // ... 

        if (code != 0) {
            throw XStrom(boost::str(boost::format("failed to calculate edge log-likelihoods in calcInstanceLogLikelihood. BeagleLib error code was %d (%s)") % code % _beagle_error[code]));
        }
        
        <span style="color:#0000ff"><strong>if (info.invarmodel) {</strong></span>
            <span style="color:#0000ff"><strong>auto monomorphic = _data-&gt;getMonomorphic();</strong></span>
            <span style="color:#0000ff"><strong>auto counts = _data-&gt;getPatternCounts();</strong></span>
            <span style="color:#0000ff"><strong>std::vector&lt;double&gt; site_log_likelihoods(info.npatterns, 0.0);</strong></span>
            <span style="color:#0000ff"><strong>double * siteLogLs = &site_log_likelihoods[0];</strong></span>

            <span style="color:#0000ff"><strong>beagleGetSiteLogLikelihoods(info.handle, siteLogLs);</strong></span>

            <span style="color:#0000ff"><strong>// Loop through all subsets assigned to this instance</strong></span>
            <span style="color:#0000ff"><strong>double lnL = 0.0;</strong></span>
            <span style="color:#0000ff"><strong>unsigned i = 0;</strong></span>
            <span style="color:#0000ff"><strong>for (unsigned s : info.subsets) {</strong></span>
                <span style="color:#0000ff"><strong>const ASRV & asrv = _model-&gt;getASRV(s);</strong></span>
                <span style="color:#0000ff"><strong>const QMatrix & qmatrix = _model-&gt;getQMatrix(s);</strong></span>
                <span style="color:#0000ff"><strong>const double * freq = qmatrix.getStateFreqs();</strong></span>
                
                <span style="color:#0000ff"><strong>double pinvar = *(asrv.getPinvarSharedPtr());</strong></span>
                <span style="color:#0000ff"><strong>assert(pinvar &gt;= 0.0 && pinvar &lt;= 1.0);</strong></span>

                <span style="color:#0000ff"><strong>if (pinvar == 0.0) {</strong></span>
                    <span style="color:#0000ff"><strong>// log likelihood for this subset is equal to the sum of site log-likelihoods</strong></span>
                    <span style="color:#0000ff"><strong>auto interval = _data-&gt;getSubsetBeginEnd(s);</strong></span>
                    <span style="color:#0000ff"><strong>for (unsigned p = interval.first; p &lt; interval.second; p++) {</strong></span>
                        <span style="color:#0000ff"><strong>lnL += counts[p]*site_log_likelihoods[i++];</strong></span>
                    <span style="color:#0000ff"><strong>}</strong></span>
                <span style="color:#0000ff"><strong>}</strong></span>
                <span style="color:#0000ff"><strong>else {</strong></span>
                    <span style="color:#0000ff"><strong>// Loop through all patterns in this subset</strong></span>
                    <span style="color:#0000ff"><strong>double log_pinvar = log(pinvar);</strong></span>
                    <span style="color:#0000ff"><strong>double log_one_minus_pinvar = log(1.0 - pinvar);</strong></span>
                    <span style="color:#0000ff"><strong>auto interval = _data-&gt;getSubsetBeginEnd(s);</strong></span>
                    <span style="color:#0000ff"><strong>for (unsigned p = interval.first; p &lt; interval.second; p++) {</strong></span>
                        <span style="color:#0000ff"><strong>// Loop through all states for this pattern</strong></span>
                        <span style="color:#0000ff"><strong>double invar_like = 0.0;</strong></span>
                        <span style="color:#0000ff"><strong>if (monomorphic[p] &gt; 0) {</strong></span>
                            <span style="color:#0000ff"><strong>for (unsigned k = 0; k &lt; info.nstates; ++k) {</strong></span>
                                <span style="color:#0000ff"><strong>Data::state_t x = (Data::state_t)1 &lt;&lt; k;</strong></span>
                                <span style="color:#0000ff"><strong>double condlike = (x & monomorphic[p] ? 1.0 : 0.0);</strong></span>
                                <span style="color:#0000ff"><strong>double basefreq = freq[k];</strong></span>
                                <span style="color:#0000ff"><strong>invar_like += condlike*basefreq;</strong></span>
                            <span style="color:#0000ff"><strong>}</strong></span>
                        <span style="color:#0000ff"><strong>}</strong></span>
                        <span style="color:#0000ff"><strong>double site_lnL = site_log_likelihoods[i++];</strong></span>
                        <span style="color:#0000ff"><strong>double log_like_term = log_one_minus_pinvar + site_lnL;</strong></span>
                        <span style="color:#0000ff"><strong>if (invar_like &gt; 0.0) {</strong></span>
                            <span style="color:#0000ff"><strong>double log_invar_term = log_pinvar + log(invar_like);</strong></span>
                            <span style="color:#0000ff"><strong>double site_log_like = (log_like_term + log(1.0 + exp(log_invar_term - log_like_term)));</strong></span>
                            <span style="color:#0000ff"><strong>lnL += counts[p]*site_log_like;</strong></span>
                        <span style="color:#0000ff"><strong>}</strong></span>
                        <span style="color:#0000ff"><strong>else {</strong></span>
                            <span style="color:#0000ff"><strong>lnL += counts[p]*log_like_term;</strong></span>
                        <span style="color:#0000ff"><strong>}</strong></span>
                    <span style="color:#0000ff"><strong>}</strong></span>
                <span style="color:#0000ff"><strong>}</strong></span>
            <span style="color:#0000ff"><strong>}</strong></span>
            <span style="color:#0000ff"><strong>log_likelihood = lnL;</strong></span>
        <span style="color:#0000ff"><strong>}</strong></span>

        return log_likelihood;
    } 

If the substitution rate is assumed to equal zero, then the likelihood is non-zero only if a site is constant because a variable site implies that at least one substitution occurred, which would be impossible if the rate were 0.0. Furthermore, we can assume that the likelihood of the entire tree given the root state is simply 1.0 because all leaves will have the root state with probability 1.0 if the rate is 0.0. The likelihood of the tree under this zero-rate category is thus just the probability of the root state, which is just the equilibrium frequency of the root state. Remember that the Data class has a vector data member _monomorphic that stores the state present at every monomorphic pattern (and stores 0 for sites that are not even potentially constant). This allows us to calculate the invariable site component of the likelihood, invar_like, using this simple loop over states:

                            for (unsigned k = 0; k &lt; info.nstates; ++k) {   
                                Data::state_t x = (Data::state_t)1 &lt;&lt; k;
                                double condlike = (x & monomorphic[p] ? 1.0 : 0.0);
                                double basefreq = freq[k];
                                invar_like += condlike*basefreq;
                            }   
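The same loop can be exercised on its own. The sketch below is a standalone version under two assumptions: that state_t is an unsigned integer bit field in which bit k set means state k is possible, and that the state frequencies are available as a vector. The function name calcInvarLike is hypothetical. If more than one bit is set in the mask (which we assume can happen when ambiguity codes are present), the frequencies of all compatible states are summed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint64_t state_t;  // assumed bit field type; bit k set means state k is possible

// Likelihood of one pattern under the zero-rate category: the sum of the
// frequencies of every state flagged in the mask, or 0.0 for a variable
// pattern (mask equal to 0)
inline double calcInvarLike(state_t monomorphic_mask, const std::vector<double> & freqs) {
    assert(freqs.size() <= 64);     // one bit per state must fit in state_t
    double invar_like = 0.0;
    for (std::size_t k = 0; k < freqs.size(); ++k) {
        state_t x = static_cast<state_t>(1) << k;
        if (x & monomorphic_mask)
            invar_like += freqs[k];
    }
    return invar_like;
}
```

For example, with frequencies 0.1, 0.2, 0.3 and 0.4 and a pattern that is constant for the third state (mask 4), invar_like is 0.3. A variable pattern (mask 0) gives 0.0.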

The overall site likelihood is calculated (using pseudocode) as follows:

L = pinvar*invar_like + (1 - pinvar)*site_like

The site log likelihood is then:

log(L) = log{pinvar*invar_like + (1 - pinvar)*site_like}

This creates a small predicament because BeagleLib provides log(site_like) for us, yet the formula above involves site_like, not log(site_like). If we remove the log by exponentiating log(site_like), we run afoul of the underflow problem. The trick is to factor out (1 - pinvar)*site_like before changing to log scale:

L = pinvar*invar_like + (1 - pinvar)*site_like
  = (1 - pinvar)*site_like*(pinvar*invar_like/((1 - pinvar)*site_like) + 1)

Taking the log of both sides leads to the formula used in the code:

                            double log_invar_term = log_pinvar + log(invar_like); 
                            double site_log_like = (log_like_term + log(1.0 + exp(log_invar_term - log_like_term))); 
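The following standalone sketch can be used to check this formula numerically. The function name calcSiteLogLike is hypothetical. It differs from the code above in one respect that is our addition, not the tutorial's: it factors out whichever of the two terms is larger and uses std::log1p, so the argument of exp is never positive and cannot overflow. When (1 - pinvar)*site_like is the larger term, the result is the same expression as in the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// log{pinvar*invar_like + (1 - pinvar)*site_like}, computed from
// log(site_like) without exponentiating log(site_like) by itself
inline double calcSiteLogLike(double pinvar, double invar_like, double log_site_like) {
    assert(pinvar > 0.0 && pinvar < 1.0);
    double log_like_term = std::log(1.0 - pinvar) + log_site_like;
    if (invar_like <= 0.0)
        return log_like_term;   // variable site: the zero-rate category contributes nothing
    double log_invar_term = std::log(pinvar) + std::log(invar_like);

    // Variant of the tutorial's formula: factor out the larger of the two terms
    double hi = std::max(log_like_term, log_invar_term);
    double lo = std::min(log_like_term, log_invar_term);
    return hi + std::log1p(std::exp(lo - hi));
}
```

For moderate values the result agrees with the direct calculation: with pinvar = 0.2, invar_like = 0.25 and site_like = 0.01, the site likelihood is 0.2*0.25 + 0.8*0.01 = 0.058. With log(site_like) = -800, exponentiating directly would underflow to zero, whereas the function still returns a finite value.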

Asserting that the model has been specified in calcLogLikelihood

In addition to asserting that _data exists, we should now also assert that _model exists before attempting to calculate the likelihood.

    inline double Likelihood::calcLogLikelihood(Tree::SharedPtr t) {    
        assert(_instances.size() &gt; 0);
        
        if (!_using_data)
            return 0.0;

        // Must call setData and setModel before calcLogLikelihood
        assert(_data);
        <span style="color:#0000ff"><strong>assert(_model);</strong></span>
        
        if (t-&gt;_is_rooted)
            throw XStrom("This version of the program can only compute likelihoods for unrooted trees");

        // Assuming "root" is leaf 0
        assert(t-&gt;_root-&gt;_number == 0 && t-&gt;_root-&gt;_left_child == t-&gt;_preorder[0] && !t-&gt;_preorder[0]-&gt;_right_sib);

        setModelRateMatrix();
        setAmongSiteRateHeterogenetity();
        defineOperations(t);
        updateTransitionMatrices();
        calculatePartials();

        double log_likelihood = 0.0;
        for (auto & info : _instances) {
            log_likelihood += calcInstanceLogLikelihood(info, t);
        }
        
        return log_likelihood;
    }