To test the Likelihood class, we need both a data set and a tree. You have already created a data file named rbcL.nex. Below is a tree file containing the maximum likelihood tree for the data in rbcL.nex; the log-likelihood of this tree under the JC69 model is -278.83767.
Create a file containing the text below and name it rbcLjc.tre. This file does not need to be in your project, so you can use any text editor (e.g. BBEdit) to create it. Here are the contents of the file:
#nexus
begin trees;
translate
1 Atractomorpha_echinata,
2 Bracteacoccus_aerius,
3 Bracteacoccus_minor,
4 Chlorotetraedron_incus,
5 Chromochloris_zofingiensis,
6 Kirchneriella_aperta,
7 Mychonastes_homosphaera,
8 Neochloris_aquatica,
9 Ourococcus_multisporus,
10 Pediastrum_duplex,
11 Pseudomuriella_schumacherensis,
12 Rotundella_rotunda,
13 Scenedesmus_obliquus,
14 Stigeoclonium_helveticum
;
tree 'maxlike' = [&U] (1:0.052781,((((((((((2:0,3:0.051745):0.016986,5:0.070562):0.017071,(6:0.047517,11:0.067138):0.025270):0.016985,9:0):0.016955,4:0.034406):0.016955,7:0.051745):0,13:0):0.016955,12:0.016955):0.034406,8:0):0,10:0):0.052781,14:0.071604);
end;
You will need to specify the data file and tree file either on the command line (using --datafile rbcL.nex --treefile rbcLjc.tre) or in a text file named strom.conf containing these two lines:
datafile = rbcL.nex
treefile = rbcLjc.tre
Add the likelihood.hpp include, the _likelihood data member, and the two new flags _use_gpu and _ambig_missing to the Strom class declaration in strom.hpp, so that it looks like this:
#pragma once

#include <iostream>
#include "data.hpp"
#include "likelihood.hpp"
#include "tree_summary.hpp"
#include "partition.hpp"
#include <boost/program_options.hpp>
#include <boost/filesystem.hpp>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>

namespace strom {

    class Strom {
        public:
            Strom();
            ~Strom();

            void clear();
            void processCommandLineOptions(int argc, const char * argv[]);
            void run();

        private:
            std::string            _data_file_name;
            std::string            _tree_file_name;
            Partition::SharedPtr   _partition;

            Data::SharedPtr        _data;
            Likelihood::SharedPtr  _likelihood;
            TreeSummary::SharedPtr _tree_summary;

            bool                   _use_gpu;
            bool                   _ambig_missing;

            static std::string     _program_name;
            static unsigned        _major_version;
            static unsigned        _minor_version;
    };
Initialize _use_gpu and _ambig_missing in the clear function:
    inline void Strom::clear() {
        _data_file_name = "";
        _tree_file_name = "";
        _tree_summary   = nullptr;
        _partition.reset(new Partition());
        _use_gpu        = true;
        _ambig_missing  = true;
        _data           = nullptr;
        _likelihood     = nullptr;
    }
In Strom::processCommandLineOptions, add a required() call to the treefile command line option, and add options for gpu and ambigmissing:
    inline void Strom::processCommandLineOptions(int argc, const char * argv[]) {
        std::vector<std::string> partition_subsets;
        boost::program_options::variables_map vm;
        boost::program_options::options_description desc("Allowed options");
        desc.add_options()
            ("help,h", "produce help message")
            ("version,v", "show program version")
            ("datafile,d", boost::program_options::value(&_data_file_name)->required(), "name of a data file in NEXUS format")
            ("treefile,t", boost::program_options::value(&_tree_file_name)->required(), "name of a tree file in NEXUS format")
            ("subset", boost::program_options::value(&partition_subsets), "a string defining a partition subset, e.g. 'first:1-1234\\3' or 'default[codon:standard]:1-3702'")
            ("gpu", boost::program_options::value(&_use_gpu)->default_value(true), "use GPU if available")
            ("ambigmissing", boost::program_options::value(&_ambig_missing)->default_value(true), "treat all ambiguities as missing data")
        ;
        boost::program_options::store(boost::program_options::parse_command_line(argc, argv, desc), vm);
        try {
            const boost::program_options::parsed_options & parsed = boost::program_options::parse_config_file< char >("strom.conf", desc, false);
            boost::program_options::store(parsed, vm);
        }
        catch(boost::program_options::reading_file & x) {
            std::cout << "Note: configuration file (strom.conf) not found" << std::endl;
        }
        boost::program_options::notify(vm);

        // If user specified --help on command line, output usage summary and quit
        if (vm.count("help") > 0) {
            std::cout << desc << "\n";
            std::exit(1);
        }

        // If user specified --version on command line, output version and quit
        if (vm.count("version") > 0) {
            std::cout << boost::str(boost::format("This is %s version %d.%d") % _program_name % _major_version % _minor_version) << std::endl;
            std::exit(1);
        }

        // If user specified --subset on command line, break specified partition subset
        // definition into name and character set string and add to _partition
        if (vm.count("subset") > 0) {
            _partition.reset(new Partition());
            for (auto s : partition_subsets) {
                _partition->parseSubsetDefinition(s);
            }
        }
    }
You could remove the required() call from all command line options; I use it here only because it means program_options takes care of informing the user if they forget to specify a data or tree file on the command line or in the strom.conf file. Because the program, as currently configured, will only work if given both a data file and a tree file, it makes sense to require these options.
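If you did remove required(), you would need to check for the two options yourself once the command line and config file have been parsed. A minimal sketch of that alternative check, placed just after the boost::program_options::notify(vm) call in processCommandLineOptions (this check is not part of the tutorial's code), might look like this:

        // Hypothetical alternative to ->required(): bail out manually if either
        // file name was not supplied on the command line or in strom.conf
        if (vm.count("datafile") == 0 || vm.count("treefile") == 0) {
            std::cout << "Error: both a data file and a tree file must be specified" << std::endl;
            std::cout << desc << std::endl;
            std::exit(1);
        }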
Replace the body of the Strom::run function in the file strom.hpp with this revised version:
    inline void Strom::run() {
        std::cout << "Starting..." << std::endl;
        std::cout << "Current working directory: " << boost::filesystem::current_path() << std::endl;

        try {
            std::cout << "\n*** Reading and storing the data in the file " << _data_file_name << std::endl;
            _data = Data::SharedPtr(new Data());
            _data->setPartition(_partition);
            _data->getDataFromFile(_data_file_name);

            // Report information about data partition subsets
            unsigned nsubsets = _data->getNumSubsets();
            std::cout << "\nNumber of taxa: " << _data->getNumTaxa() << std::endl;
            std::cout << "Number of partition subsets: " << nsubsets << std::endl;
            for (unsigned subset = 0; subset < nsubsets; subset++) {
                DataType dt = _partition->getDataTypeForSubset(subset);
                std::cout << "  Subset " << (subset+1) << " (" << _data->getSubsetName(subset) << ")" << std::endl;
                std::cout << "    data type: " << dt.getDataTypeAsString() << std::endl;
                std::cout << "    sites: " << _data->calcSeqLenInSubset(subset) << std::endl;
                std::cout << "    patterns: " << _data->getNumPatternsInSubset(subset) << std::endl;
                std::cout << "    ambiguity: " << (_ambig_missing || dt.isCodon() ? "treated as missing data (faster)" : "handled appropriately (slower)") << std::endl;
            }

            // Create the Likelihood object first so that it can be queried below
            // for the BeagleLib version and the resources BeagleLib can use
            _likelihood = Likelihood::SharedPtr(new Likelihood());

            std::cout << "\n*** Resources available to BeagleLib " << _likelihood->beagleLibVersion() << ":\n";
            std::cout << _likelihood->availableResources() << std::endl;

            std::cout << "\n*** Creating the likelihood calculator" << std::endl;
            _likelihood->setPreferGPU(_use_gpu);
            _likelihood->setAmbiguityEqualsMissing(_ambig_missing);
            _likelihood->setData(_data);
            _likelihood->initBeagleLib();

            std::cout << "\n*** Reading and storing the first tree in the file " << _tree_file_name << std::endl;
            _tree_summary = TreeSummary::SharedPtr(new TreeSummary());
            _tree_summary->readTreefile(_tree_file_name, 0);
            Tree::SharedPtr tree = _tree_summary->getTree(0);
            if (tree->numLeaves() != _data->getNumTaxa())
                throw XStrom(boost::format("Number of taxa in tree (%d) does not equal the number of taxa in the data matrix (%d)") % tree->numLeaves() % _data->getNumTaxa());

            std::cout << "\n*** Calculating the likelihood of the tree" << std::endl;
            double lnL = _likelihood->calcLogLikelihood(tree);
            std::cout << boost::str(boost::format("log likelihood = %.5f") % lnL) << std::endl;
            std::cout << " (expecting -278.83767)" << std::endl;
        }
        catch (XStrom & x) {
            std::cerr << "Strom encountered a problem:\n " << x.what() << std::endl;
        }

        std::cout << "\nFinished!" << std::endl;
    }
The defineOperations member function body above accesses private data members of both the Tree and Node classes. To avoid compile-time errors, you will thus need to declare the Likelihood class to be a friend of both Tree and Node. In the node.hpp file, uncomment the two lines that forward-declare Likelihood and make it a friend of Node:
#pragma once

#include <string>
#include <vector>
#include <iostream>
#include "split.hpp"

namespace strom {

    class Tree;
    class TreeManip;
    class Likelihood;
    //class Updater;

    class Node {
        friend class Tree;
        friend class TreeManip;
        friend class Likelihood;
        //friend class Updater;
Uncomment the same 2 lines in the tree.hpp file:
#pragma once

#include <memory>
#include <iostream>
#include "node.hpp"

namespace strom {

    class TreeManip;
    class Likelihood;
    //class Updater;

    class Tree {
        friend class TreeManip;
        friend class Likelihood;
        //friend class Updater;
Our Likelihood class depends on BeagleLib to do all the heavy lifting with respect to calculating transition probabilities and the log-likelihood itself. Our next task is thus to install BeagleLib and tell our program where to find the shared library for linking and at run time.
Start by downloading the file beagle-lib-master.zip from the BeagleLib GitHub site. Unzip it and move the resulting directory, beagle-lib-master, to your libraries directory, which I will assume is ~/Documents/libraries.
Open a terminal window and navigate into your beagle-lib-master directory:
cd ~/Documents/libraries/beagle-lib-master
Now build the library, specifying the following on the configure command line:
--prefix=$HOME to make the compiled library end up in $HOME/lib
--with-jdk=no because we will not need the Java version of the library (without this, you will run into problems if Java is not installed)
CXXFLAGS="-std=c++11" to compile according to the C++11 standard (without this, you may have trouble linking the library to your executable, which will use C++11 features)
--enable-static to create static library files that can be linked directly into your executable
./autogen.sh
./configure --prefix=$HOME --with-jdk=no CXXFLAGS="-std=c++11" --enable-static
make
make install
If the ./autogen.sh command generates an error (autoreconf: not found), you will need to install GNU autotools. The easiest way to do this is to first install Homebrew:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
and then use Homebrew to install autoconf and automake:
brew install autoconf
brew install automake
You can ignore (for the moment at least) the “OpenCL not found” and “NVIDIA CUDA nvcc compiler not found” warnings issued by configure. We will only be using the CPU version of the library in this tutorial. If you later want to take advantage of GPU resources, you will need to install either OpenCL or NVIDIA CUDA and recompile BeagleLib.
Move all files in ~/lib whose names begin with “libhmsbeagle” and end with the .a file name extension into your ~/lib/static folder.
Once you have compiled BeagleLib, you need to tell Xcode about it so that it can be linked into your executable.
In the Xcode main menu, choose View > Navigators > Show Project Navigator, then click on the blue project icon labeled strom at the top of the Project Navigator. Click on the Build Phases tab. Click the + sign in Link Binary With Libraries and (after clicking the Add Other… button) navigate to the directory ~/lib/static and select the libhmsbeagle.a, libhmsbeagle-cpu.a, and libhmsbeagle-cpu-sse.a files. This tells Xcode where to find the compiled library code so that the library can be linked into your program’s executable file.
Go ahead and run strom now. I’ll discuss the output a bit at a time in the sections below.
The run function first creates a Data object and stores a shared pointer to it in the private data member _data. After informing the Data object about any partition that the user set up, the run function reads the file whose name is stored in _data_file_name (rbcL.nex) and spits out a summary of the data therein. We have not set up data partitioning, so this summary just reports a single data subset containing 60 sites compressed into 21 distinct site patterns:
*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA
Number of taxa: 14
Number of partition subsets: 1
Subset 1 (default)
data type: nucleotide
sites: 60
patterns: 21
The run function next creates a Likelihood object and stores a shared pointer to that object in the private data member _likelihood. It then calls the Likelihood::beagleLibVersion function to show which version of BeagleLib we’re using and the Likelihood::availableResources function to show a summary of resources available to BeagleLib, which on my laptop is just the default CPU plugin of BeagleLib. Next, the Likelihood::setData function is called to make the compressed data from the Data object available. Finally, the Likelihood::initBeagleLib function is called to construct the BeagleLib machinery needed to do the likelihood calculations:
*** Resources available to BeagleLib 3.1.2:
0: CPU
*** Creating the likelihood calculator
Created BeagleLib instance 0 (4 states, 1 rate, 1 subset)
Next, a TreeSummary object is created and stored in the private data member _tree_summary, and its readTreefile function is called to read the tree file whose name is stored in _tree_file_name (rbcLjc.tre), which you just created. A Tree shared pointer named tree is created and pointed at the first (and only) tree read from rbcLjc.tre. This results in the following output:
*** Reading and storing the first tree in the file rbcLjc.tre
Warning:
A TAXA block should be read before the TREES block (but no TAXA block was found). Taxa will be inferred from their usage in the TREES block.
at line 5, column (approximately) 1 (file position 33)
storing implied block: TAXA
storing read block: TREES
If the TAXA block warning produced by NCL bothers you, you have a couple of choices: you can either (1) add a TAXA block to the tree file so that NCL no longer needs to infer the taxa, or (2) tell NCL to suppress warnings when the tree file is read.
The first option has the advantage that it preserves the ability of NCL to warn you of possible problems, and is arguably the better solution. To take this route, add the following TAXA block to the rbcLjc.tre file between the opening #nexus line and the begin trees; statement:
begin taxa;
dimensions ntax=14;
taxlabels
Atractomorpha_echinata
Bracteacoccus_aerius
Bracteacoccus_minor
Chlorotetraedron_incus
Chromochloris_zofingiensis
Kirchneriella_aperta
Mychonastes_homosphaera
Neochloris_aquatica
Ourococcus_multisporus
Pediastrum_duplex
Pseudomuriella_schumacherensis
Rotundella_rotunda
Scenedesmus_obliquus
Stigeoclonium_helveticum;
end;
Implementing the second option involves changing the line in your TreeSummary::readTreefile function (file tree_summary.hpp) that constructs the MultiFormatReader:
    inline void TreeSummary::readTreefile(const std::string filename, unsigned skip) {
        TreeManip tm;
        Split::treeid_t splitset;

        // See http://phylo.bio.ku.edu/ncldocs/v2.1/funcdocs/index.html for NCL documentation
        MultiFormatReader nexusReader(-1, NxsReader::WARNINGS_TO_STDERR);
        try {
            nexusReader.ReadFilepath(filename.c_str(), MultiFormatReader::NEXUS_FORMAT);
        }
        catch(...) {
            nexusReader.DeleteBlocksFromFactories();
            throw;
        }
Instead of specifying NxsReader::WARNINGS_TO_STDERR here, you can specify NxsReader::IGNORE_WARNINGS. The alternatives are listed in the WarningHandlingMode enum in nxsreader.h (a source file in the Nexus Class Library). The only two other possibilities are NxsReader::WARNINGS_TO_STDOUT and NxsReader::WARNINGS_ARE_ERRORS (the latter is not advisable because it will cause your program to abort at even the slightest warning).
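For example, the quieter alternative simply swaps the second constructor argument; the rest of readTreefile stays the same:

        // Alternative: suppress NCL warnings entirely rather than writing them to stderr
        MultiFormatReader nexusReader(-1, NxsReader::IGNORE_WARNINGS);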
Finally, the calcLogLikelihood function of the Likelihood object is called to obtain the log-likelihood. This is output along with a hard-coded statement of what the log-likelihood is expected to be, so that it is easy to confirm that the program is working:
*** Calculating the likelihood of the tree
log likelihood = -278.83767
(expecting -278.83767)
It is, admittedly, a little silly to let the user enter any data and tree file name on the command line (or strom.conf file) and then print out a pre-ordained expectation of the result; however, our only purpose here is to test whether the likelihood machinery we’ve set up is working: our program is not ready to be distributed for general use yet!
Most of the body of the run function is wrapped in a try block. If anything goes wrong and an XStrom exception is thrown, the program jumps immediately to the corresponding catch block and the message stored in the exception object is displayed.
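If you have been following the tutorial from the start, XStrom was defined in an earlier step (xstrom.hpp). For readers dipping in here, below is a minimal sketch of an exception class that would satisfy the two uses seen in run (construction from a boost::format and the what() call in the catch block); the tutorial’s actual class may differ in its details:

#pragma once

#include <string>
#include <exception>
#include <boost/format.hpp>

namespace strom {

    // Minimal stand-in for the tutorial's XStrom exception class: it only
    // needs to carry a message and report it via what().
    class XStrom : public std::exception {
        public:
            XStrom(const std::string & s) : _msg(s) {}
            XStrom(const boost::format & f) : _msg(boost::str(f)) {}
            virtual ~XStrom() {}
            const char * what() const noexcept {return _msg.c_str();}

        private:
            std::string _msg;
    };

}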
If we partition the data but use the same model for all subsets, we should get the same total log-likelihood. Add 3 lines to your strom.conf file to partition the data into 3 equal-sized subsets, each containing the sites corresponding to a different codon position (the \3 stride syntax is unpacked in the sketch following the listing):
datafile = rbcL.nex
treefile = rbcLjc.tre
subset = first:1-60\3
subset = second:2-60\3
subset = third:3-60\3
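The begin-end\stride syntax selects every stride-th site: first:1-60\3 picks sites 1, 4, 7, …, 58, and third:3-60\3 picks sites 3, 6, 9, …, 60. The standalone sketch below is purely illustrative (expandRange is a hypothetical helper, not part of the tutorial’s Partition class); it simply shows the arithmetic behind that syntax:

#include <iostream>
#include <vector>

// Illustrative only: expand a begin-end\stride site range (e.g. 1-60\3)
// into the individual site numbers it selects.
std::vector<unsigned> expandRange(unsigned begin_site, unsigned end_site, unsigned stride) {
    std::vector<unsigned> sites;
    for (unsigned site = begin_site; site <= end_site; site += stride)
        sites.push_back(site);
    return sites;
}

int main() {
    // first:1-60\3 selects sites 1, 4, 7, ..., 58 (20 sites in all)
    for (unsigned site : expandRange(1, 60, 3))
        std::cout << site << " ";
    std::cout << std::endl;
    return 0;
}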
Now run your program again. You should see the following output, which lists details for each of the three subsets. Note that there are more site patterns in total now (7 + 5 + 17 = 29, versus 21 before) because site patterns cannot be combined if the same pattern occurs in different subsets. The log-likelihood is nevertheless the same as in the unpartitioned case. We will need to add the ability to specify a substitution model (the next step in the tutorial) before we can apply different models to different subsets.
Partition subset first comprises sites 1-60\3 and has type nucleotide
Partition subset second comprises sites 2-60\3 and has type nucleotide
Partition subset third comprises sites 3-60\3 and has type nucleotide
Starting...
Current working directory: "/home/paul/Documents/strom/distr"
*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA
Number of taxa: 14
Number of partition subsets: 3
Subset 1 (first)
data type: nucleotide
sites: 20
patterns: 7
Subset 2 (second)
data type: nucleotide
sites: 20
patterns: 5
Subset 3 (third)
data type: nucleotide
sites: 20
patterns: 17
*** Resources available to BeagleLib 3.1.2:
0: CPU
*** Creating the likelihood calculator
Created BeagleLib instance 0 (4 states, 1 rate, 3 subsets)
*** Reading and storing the first tree in the file rbcLjc.tre
Warning:
A TAXA block should be read before the TREES block (but no TAXA block was found). Taxa will be inferred from their usage in the TREES block.
at line 5, column (approximately) 1 (file position 33)
storing implied block: TAXA
storing read block: TREES
*** Calculating the likelihood of the tree
log likelihood = -278.83767
(expecting -278.83767)
Finished!