Before we can test the new Model
class, we must integrate it into the Likelihood
class.
Start by adding this include at the top of the likelihood.hpp file:
#pragma once
#include <map>
#include <boost/algorithm/string.hpp>
#include <boost/format.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/range/adaptor/reversed.hpp>
#include "libhmsbeagle/beagle.h"
#include "tree.hpp"
#include "data.hpp"
<span style="color:#0000ff"><strong>#include "model.hpp"</strong></span>
#include "xstrom.hpp"
Add these two declarations to the public section of the Likelihood
class declaration:
Model::SharedPtr getModel();
void setModel(Model::SharedPtr m);
Here are the bodies of these two functions. It is assumed that the setModel
function will only be called before BeagleLib is instantiated, hence the assertion that the length of the _instances
vector is zero.
inline Model::SharedPtr Likelihood::getModel() {
return _model;
}
inline void Likelihood::setModel(Model::SharedPtr m) {
assert(_instances.size() == 0); // can't change model after initBeagleLib called
_model = m;
}
A data member invarmodel
should be added to the declaration of the InstanceInfo
struct in the private section of the Likelihood
class declaration.
struct InstanceInfo {
int handle;
int resourcenumber;
std::string resourcename;
unsigned nstates;
unsigned nratecateg;
unsigned npatterns;
unsigned partial_offset;
unsigned tmatrix_offset;
<span style="color:#0000ff"><strong>bool invarmodel;</strong></span>
std::vector<unsigned> subsets;
<span style="color:#0000ff"><strong>InstanceInfo() : handle(-1), resourcenumber(-1), resourcename(""), nstates(0), nratecateg(0), npatterns(0), partial_offset(0), tmatrix_offset(0), invarmodel(false) {}</strong></span>
};
This bool
will keep track of whether the model for a particular instance has an extra zero-rate category for invariable sites.
Be sure to also initialize the new variable in the InstanceInfo
constructor.
Add a new data member _model
to the private part of the Likelihood
class declaration.
Model::SharedPtr _model;
Initialize _model
in the clear
function:
_model = Model::SharedPtr(new Model());
Add the assert
highlighted in blue to ensure that _model
has been assigned prior to initBeagleLib
being called.
Instead of assuming 1 rate category (i.e. equal rates), make the changes and additions shown in blue. Note that invariable sites models add an additional rate category, so different instances of BeagleLib would be needed to handle a 4-category GTR+G model versus a 4-category GTR+I+G model. We need some way to distinguish an I+G model from a G model having the same number of rate categories. The way this is done here is to make the number of rate categories negative for I+G models and positive for G models.
The last change in blue simply adds information about invariable sites status when reporting on newly-created instances.
inline void Likelihood::initBeagleLib() {
assert(_data);
<span style="color:#0000ff"><strong>assert(_model);</strong></span>
// Close down any existing BeagleLib instances
finalizeBeagleLib(true);
_ntaxa = _data->getNumTaxa();
unsigned nsubsets = _data->getNumSubsets();
std::set<instance_pair_t> nstates_ncateg_combinations;
std::map<instance_pair_t, std::vector<unsigned> > subsets_for_pair;
for (unsigned subset = 0; subset < nsubsets; subset++) {
// Create a pair comprising number of states and number of rate categories
unsigned nstates = _data->getNumStatesForSubset(subset);
<span style="color:#0000ff"><strong>bool invar_model = _model->getSubsetIsInvarModel(subset);</strong></span>
<span style="color:#0000ff"><strong>int nrates = (invar_model ? -1 : 1)*_model->getSubsetNumCateg(subset);</strong></span>
instance_pair_t p = std::make_pair(nstates, nrates);
// Add combo to set
nstates_ncateg_combinations.insert(p);
subsets_for_pair[p].push_back(subset);
}
// Create one instance for each distinct nstates-nrates combination
_instances.clear();
for (auto p : nstates_ncateg_combinations) {
newInstance(p.first, p.second, subsets_for_pair[p]);
InstanceInfo & info = *_instances.rbegin();
<span style="color:#0000ff"><strong>std::cout << boost::str(boost::format("Created BeagleLib instance %d (%d states, %d rate%s, %d subset%s, %s invar. sites model)") % info.handle % info.nstates % info.nratecateg % (info.nratecateg == 1 ? "" : "s") % info.subsets.size() % (info.subsets.size() == 1 ? "" : "s") % (info.invarmodel ? "is" : "not")) << std::endl;</strong></span>
}
if (_ambiguity_equals_missing)
setTipStates();
else
setTipPartials();
setPatternWeights();
setPatternPartitionAssignments();
}
If the incoming nrates
is a negative number, it means that this is an invariable sites model, and the actual number of rate categories handled by BeagleLib should be the positive version of this, i.e. -nrates
. The changes in blue below set the new invarmodel
data member of the InstanceInfo
struct accordingly and make the number of categories positive if the incoming nrates
was negative.
inline void Likelihood::newInstance(unsigned nstates, int nrates, std::vector<unsigned> & subset_indices) {
unsigned num_subsets = (unsigned)subset_indices.size();
<span style="color:#0000ff"><strong>bool is_invar_model = (nrates < 0 ? true : false);</strong></span>
<span style="color:#0000ff"><strong>unsigned ngammacat = (unsigned)(is_invar_model ? -nrates : nrates);</strong></span>
unsigned num_patterns = 0;
for (auto s : subset_indices) {
num_patterns += _data->getNumPatternsInSubset(s);
}
unsigned num_internals = calcNumInternalsInFullyResolvedTree();
// add 1 to num_edges so that subroot node will have a tmatrix, root tip's tmatrix is never used
unsigned num_edges = calcNumEdgesInFullyResolvedTree();
unsigned num_nodes = num_edges + 1;
unsigned num_transition_probs = num_nodes*num_subsets;
long requirementFlags = 0;
long preferenceFlags = BEAGLE_FLAG_PRECISION_SINGLE | BEAGLE_FLAG_THREADING_CPP;
if (_prefer_gpu)
preferenceFlags |= BEAGLE_FLAG_PROCESSOR_GPU;
else
preferenceFlags |= BEAGLE_FLAG_PROCESSOR_CPU;
BeagleInstanceDetails instance_details;
unsigned npartials = num_internals + _ntaxa;
unsigned nsequences = 0;
if (_ambiguity_equals_missing) {
npartials -= _ntaxa;
nsequences += _ntaxa;
}
int inst = beagleCreateInstance(
_ntaxa, // tips
npartials, // partials
nsequences, // sequences
nstates, // states
num_patterns, // patterns (total across all subsets that use this instance)
num_subsets, // models (one for each distinct eigen decomposition)
num_subsets*num_transition_probs, // transition matrices (one for each node in each subset)
ngammacat, // rate categories
0, // scale buffers
NULL, // resource restrictions
0, // length of resource list
preferenceFlags, // preferred flags
requirementFlags, // required flags
&instance_details); // pointer for details
if (inst < 0) {
// beagleCreateInstance returns one of the following:
// valid instance (0, 1, 2, ...)
// error code (negative integer)
throw XStrom(boost::str(boost::format("Likelihood init function failed to create BeagleLib instance (BeagleLib error code was %d)") % _beagle_error[inst]));
}
InstanceInfo info;
info.handle = inst;
info.resourcenumber = instance_details.resourceNumber;
info.resourcename = instance_details.resourceName;
info.nstates = nstates;
info.nratecateg = ngammacat;
<span style="color:#0000ff"><strong>info.invarmodel = is_invar_model;</strong></span>
info.subsets = subset_indices;
info.npatterns = num_patterns;
info.partial_offset = num_internals;
info.tmatrix_offset = num_nodes;
_instances.push_back(info);
}
Remove the 2 lines below from setTipPartials
because we now allow models to have different numbers of states.
if (info.nstates != 4)
throw XStrom(boost::format("This program can handle only 4-state DNA/RNA data. You specified data having %d states for at least one data subset.") % info.nstates);
The setAmongSiteRateHeterogenetity
member function can be simplified now because most of the work can be done by member functions of the Model
class. Replace the current version of setAmongSiteRateHeterogenetity
with this version.
inline void Likelihood::setAmongSiteRateHeterogenetity() {
assert(_instances.size() > 0);
int code = 0;
// Loop through all instances
for (auto & info : _instances) {
// Loop through all subsets assigned to this instance
unsigned instance_specific_subset_index = 0;
for (unsigned s : info.subsets) {
code = _model->setBeagleAmongSiteRateVariationRates(info.handle, s, instance_specific_subset_index);
if (code != 0)
throw XStrom(boost::str(boost::format("Failed to set category rates for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
code = _model->setBeagleAmongSiteRateVariationProbs(info.handle, s, instance_specific_subset_index);
if (code != 0)
throw XStrom(boost::str(boost::format("Failed to set category probabilities for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
++instance_specific_subset_index;
}
}
}
We can also rewrite setModelRateMatrix
now using member functions of Model
.
inline void Likelihood::setModelRateMatrix() {
// Loop through all instances
for (auto & info : _instances) {
// Loop through all subsets assigned to this instance
unsigned instance_specific_subset_index = 0;
for (unsigned s : info.subsets) {
int code = _model->setBeagleStateFrequencies(info.handle, s, instance_specific_subset_index);
if (code != 0)
throw XStrom(boost::str(boost::format("Failed to set state frequencies for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
code = _model->setBeagleEigenDecomposition(info.handle, s, instance_specific_subset_index);
if (code != 0)
throw XStrom(boost::str(boost::format("Failed to set eigen decomposition for BeagleLib instance %d. BeagleLib error code was %d (%s)") % info.handle % code % _beagle_error[code]));
++instance_specific_subset_index;
}
}
}
It is possible to specify different relative rates for each partition subset. This allows the model to, for example, allow 3rd codon position sites to evolve at a faster rate than 1st or 2nd position sites. Subset-specific rates modify edge lengths. If a subset has a subset relative rate of 2, it is as if, for purposes of calculating the likelihood, the tree was twice as large as the tree read in from the file (i.e. every edge in the tree is twice as long as the corresponding edge in the tree file version). The functions defineOperations
and queueTMatrixRecalculation
are where edge lengths are used, and the following code snippets highlight the lines that need to change in order to use the subset relative rates stored by the model.
Here is the line that needs to be changed in defineOperations
:
inline void Likelihood::defineOperations(Tree::SharedPtr t) {
assert(_instances.size() > 0);
assert(t);
assert(t->isRooted() == _rooted);
<span style="color:#0000ff"><strong>_relrate_normalizing_constant = _model->calcNormalizingConstantForSubsetRelRates();</strong></span>
//...
And here are the lines that needs to be changed in queueTMatrixRecalculation
:
inline void Likelihood::queueTMatrixRecalculation(Node * nd) {
<span style="color:#0000ff"><strong>Model::subset_relrate_vect_t & subset_relrates = _model->getSubsetRelRates();</strong></span>
for (auto & info : _instances) {
unsigned instance_specific_subset_index = 0;
for (unsigned s : info.subsets) {
<span style="color:#0000ff"><strong>double subset_relative_rate = subset_relrates[s]/_relrate_normalizing_constant;</strong></span>
unsigned tindex = getTMatrixIndex(nd, info, instance_specific_subset_index);
_pmatrix_index[info.handle].push_back(tindex);
_edge_lengths[info.handle].push_back(nd->_edge_length*subset_relative_rate);
_eigen_indices[info.handle].push_back(s);
_category_rate_indices[info.handle].push_back(s);
++instance_specific_subset_index;
}
}
}
Adding the capability to accommodate invariable sites in models adds some complexity to the calculation of the log likelihood. We cannot simply rely on BeagleLib to do all the work if an invariable sites model is in effect. (That’s not exactly true, we could use BeagleLib for this, but doing so would be very inefficient as it would require full site likelihood calculations for the zero-rate case, which is a trivial calculation if done outside of BeagleLib). In the case of an I or I+G model, BeagleLib will handle everything except the zero-rate category. The code highlighted in blue below takes site log likelihoods calculated by BeagleLib and modifies them according to the proportion of invariable sites if an invariable sites model is being used.
inline double Likelihood::calcInstanceLogLikelihood(InstanceInfo & info, Tree::SharedPtr t) {
// ...
if (code != 0) {
throw XStrom(boost::str(boost::format("failed to calculate edge log-likelihoods in calcInstanceLogLikelihood. BeagleLib error code was %d (%s)") % code % _beagle_error[code]));
}
<span style="color:#0000ff"><strong>if (info.invarmodel) {</strong></span>
<span style="color:#0000ff"><strong>auto monomorphic = _data->getMonomorphic();</strong></span>
<span style="color:#0000ff"><strong>auto counts = _data->getPatternCounts();</strong></span>
<span style="color:#0000ff"><strong>std::vector<double> site_log_likelihoods(info.npatterns, 0.0);</strong></span>
<span style="color:#0000ff"><strong>double * siteLogLs = &site_log_likelihoods[0];</strong></span>
<span style="color:#0000ff"><strong></strong></span>
<span style="color:#0000ff"><strong>beagleGetSiteLogLikelihoods(info.handle, siteLogLs);</strong></span>
<span style="color:#0000ff"><strong></strong></span>
<span style="color:#0000ff"><strong>// Loop through all subsets assigned to this instance</strong></span>
<span style="color:#0000ff"><strong>double lnL = 0.0;</strong></span>
<span style="color:#0000ff"><strong>unsigned i = 0;</strong></span>
<span style="color:#0000ff"><strong>for (unsigned s : info.subsets) {</strong></span>
<span style="color:#0000ff"><strong>const ASRV & asrv = _model->getASRV(s);</strong></span>
<span style="color:#0000ff"><strong>const QMatrix & qmatrix = _model->getQMatrix(s);</strong></span>
<span style="color:#0000ff"><strong>const double * freq = qmatrix.getStateFreqs();</strong></span>
<span style="color:#0000ff"><strong></strong></span>
<span style="color:#0000ff"><strong>double pinvar = *(asrv.getPinvarSharedPtr());</strong></span>
<span style="color:#0000ff"><strong>assert(pinvar >= 0.0 && pinvar <= 1.0);</strong></span>
<span style="color:#0000ff"><strong></strong></span>
<span style="color:#0000ff"><strong>if (pinvar == 0.0) {</strong></span>
<span style="color:#0000ff"><strong>// log likelihood for this subset is equal to the sum of site log-likelihoods</strong></span>
<span style="color:#0000ff"><strong>auto interval = _data->getSubsetBeginEnd(s);</strong></span>
<span style="color:#0000ff"><strong>for (unsigned p = interval.first; p < interval.second; p++) {</strong></span>
<span style="color:#0000ff"><strong>lnL += counts[p]*site_log_likelihoods[i++];</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>else {</strong></span>
<span style="color:#0000ff"><strong>// Loop through all patterns in this subset</strong></span>
<span style="color:#0000ff"><strong>double log_pinvar = log(pinvar);</strong></span>
<span style="color:#0000ff"><strong>double log_one_minus_pinvar = log(1.0 - pinvar);</strong></span>
<span style="color:#0000ff"><strong>auto interval = _data->getSubsetBeginEnd(s);</strong></span>
<span style="color:#0000ff"><strong>for (unsigned p = interval.first; p < interval.second; p++) {</strong></span>
<span style="color:#0000ff"><strong>// Loop through all states for this pattern</strong></span>
<span style="color:#0000ff"><strong>double invar_like = 0.0;</strong></span>
<span style="color:#0000ff"><strong>if (monomorphic[p] > 0) {</strong></span>
<span style="color:#0000ff"><strong>for (unsigned k = 0; k < info.nstates; ++k) {</strong></span>
<span style="color:#0000ff"><strong>Data::state_t x = (Data::state_t)1 << k;</strong></span>
<span style="color:#0000ff"><strong>double condlike = (x & monomorphic[p] ? 1.0 : 0.0);</strong></span>
<span style="color:#0000ff"><strong>double basefreq = freq[k];</strong></span>
<span style="color:#0000ff"><strong>invar_like += condlike*basefreq;</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>double site_lnL = site_log_likelihoods[i++];</strong></span>
<span style="color:#0000ff"><strong>double log_like_term = log_one_minus_pinvar + site_lnL;</strong></span>
<span style="color:#0000ff"><strong>if (invar_like > 0.0) {</strong></span>
<span style="color:#0000ff"><strong>double log_invar_term = log_pinvar + log(invar_like);</strong></span>
<span style="color:#0000ff"><strong>double site_log_like = (log_like_term + log(1.0 + exp(log_invar_term - log_like_term)));</strong></span>
<span style="color:#0000ff"><strong>lnL += counts[p]*site_log_like;</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>else {</strong></span>
<span style="color:#0000ff"><strong>lnL += counts[p]*log_like_term;</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>log_likelihood = lnL;</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
return log_likelihood;
}
If the substitution rate is assumed to equal zero, then the likelihood is non-zero only if a site is constant because a variable site implies that at least one substitution occurred, which would be impossible if the rate were 0.0. Furthermore, we can assume that the likelihood of the entire tree given the root state is simply 1.0 because all leaves will have the root state with probability 1.0 if the rate is 0.0. The likelihood of the tree under this zero-rate category is thus just the probability of the root state, which is just the equilibrium frequency of the root state. Remember that the Data
class has a vector data member _monomorphic
that stores the state present at every monomorphic pattern (and stores 0 for sites that are not even potentially constant). This allows us to calculate the invariable site component of the likelihood, invar_like
, using this simple loop over states:
for (unsigned k = 0; k < info.nstates; ++k) {
Data::state_t x = (Data::state_t)1 << k;
double condlike = (x & monomorphic[p] ? 1.0 : 0.0);
double basefreq = freq[k];
invar_like += condlike*basefreq;
}
The overall site likelihood is calculated (using pseudocode) as follows:
L = pinvar*invar_like + (1 - pinvar)*site_like
The site log likelihood is then:
log(L) = log{pinvar*invar_like + (1 - pinvar)*site_like}
This creates a small predicament because BeagleLib provides log(site_like
) for us, yet the formula above involves site_like
, not log(site_like
). If we remove the log by exponentiating log(site_like
), we run afoul of the underflow problem. The trick is to factor out (1 - pinvar)*site_like
before changing to log scale:
L = pinvar*invar_like + (1 - pinvar)*site_like
= (1 - pinvar)*site_like*(pinvar*invar_like/((1 - pinvar)*site_like) + 1)
Taking the log of both sides leads to the formula used in the code:
double log_invar_term = log_pinvar + log(invar_like);
double site_log_like = (log_like_term + log(1.0 + exp(log_invar_term - log_like_term)));
In addition to asserting that _data
exists, we should now also assert that _model
exists before attempting to calculate the likelihood.
inline double Likelihood::calcLogLikelihood(Tree::SharedPtr t) {
assert(_instances.size() > 0);
if (!_using_data)
return 0.0;
// Must call setData and setModel before calcLogLikelihood
assert(_data);
<span style="color:#0000ff"><strong>assert(_model);</strong></span>
if (t->_is_rooted)
throw XStrom("This version of the program can only compute likelihoods for unrooted trees");
// Assuming "root" is leaf 0
assert(t->_root->_number == 0 && t->_root->_left_child == t->_preorder[0] && !t->_preorder[0]->_right_sib);
setModelRateMatrix();
setAmongSiteRateHeterogenetity();
defineOperations(t);
updateTransitionMatrices();
calculatePartials();
double log_likelihood = 0.0;
for (auto & info : _instances) {
log_likelihood += calcInstanceLogLikelihood(info, t);
}
return log_likelihood;
}