10.2 Test Likelihood

(Mac version)

< 10.1 | 10.2 | 11.0 >

Create a tree file

To test the Likelihood class, we need both a data set and a tree. You have already created a data file named rbcL.nex. Here is a tree file containing the maximum likelihood tree for the data in rbcL.nex. The log-likelihood of this tree under the JC69 model is -278.83767.

Create a file containing the text below and name it rbcLjc.tre. This file does not need to be in your project, so you can use any text editor to create it (e.g. BBEdit).

Here are the contents of the file:

#nexus 

begin trees;
	translate
		1 Atractomorpha_echinata,
		2 Bracteacoccus_aerius,
		3 Bracteacoccus_minor,
		4 Chlorotetraedron_incus,
		5 Chromochloris_zofingiensis,
		6 Kirchneriella_aperta,
		7 Mychonastes_homosphaera,
		8 Neochloris_aquatica,
		9 Ourococcus_multisporus,
		10 Pediastrum_duplex,
		11 Pseudomuriella_schumacherensis,
		12 Rotundella_rotunda,
		13 Scenedesmus_obliquus,
		14 Stigeoclonium_helveticum
		;
tree 'maxlike' = [&U] (1:0.052781,((((((((((2:0,3:0.051745):0.016986,5:0.070562):0.017071,(6:0.047517,11:0.067138):0.025270):0.016985,9:0):0.016955,4:0.034406):0.016955,7:0.051745):0,13:0):0.016955,12:0.016955):0.034406,8:0):0,10:0):0.052781,14:0.071604);
end;

Specifying the data and tree files

You will need to specify the data file and tree file either on the command line (using --datafile rbcL.nex --treefile rbcLjc.tre) or create a text file named strom.conf containing these two lines:

datafile = rbcL.nex
treefile = rbcLjc.tre

Modifying the Strom class

Add the highlighted lines to the Strom class declaration in strom.hpp.

#pragma once    

#include &lt;iostream&gt;
#include "data.hpp"
<span style="color:#0000ff"><strong>#include "likelihood.hpp"</strong></span>
#include "tree_summary.hpp"
#include "partition.hpp"
#include &lt;boost/program_options.hpp&gt;
#include &lt;boost/filesystem.hpp&gt;
#include &lt;boost/algorithm/string/split.hpp&gt;
#include &lt;boost/algorithm/string/classification.hpp&gt;

namespace strom {

    class Strom {
        public:
                                                    Strom();
                                                    ~Strom();

            void                                    clear();
            void                                    processCommandLineOptions(int argc, const char * argv[]);
            void                                    run();
        
        private:

            std::string                             _data_file_name;
            std::string                             _tree_file_name;
            Partition::SharedPtr                    _partition;

            Data::SharedPtr                         _data; 
            <span style="color:#0000ff"><strong>Likelihood::SharedPtr                   _likelihood;</strong></span>
            TreeSummary::SharedPtr                  _tree_summary;

            <span style="color:#0000ff"><strong>bool                                    _use_gpu;</strong></span>
            <span style="color:#0000ff"><strong>bool                                    _ambig_missing;</strong></span>

            static std::string                      _program_name;
            static unsigned                         _major_version;
            static unsigned                         _minor_version;

    };
    

Initialize gpu and ambigmissing in the clear function:

    inline void Strom::clear() {    
        _data_file_name = "";
        _tree_file_name = "";
        _tree_summary   = nullptr;
        _partition.reset(new Partition());
        <span style="color:#0000ff"><strong>_use_gpu        = true;</strong></span>
        <span style="color:#0000ff"><strong>_ambig_missing  = true;</strong></span>
        _data = nullptr; 
        _likelihood = nullptr;
    }   

Add the required() call in Strom::processCommandLineOptions to the treefile command line option and add options for gpu and ambigmissing:

    inline void Strom::processCommandLineOptions(int argc, const char * argv[]) {   
        std::vector&lt;std::string&gt; partition_subsets;
        boost::program_options::variables_map vm;
        boost::program_options::options_description desc("Allowed options");    
        desc.add_options()
            ("help,h", "produce help message")
            ("version,v", "show program version")
            ("datafile,d",  boost::program_options::value(&_data_file_name)-&gt;required(), "name of a data file in NEXUS format")
            <span style="color:#0000ff"><strong>("treefile,t",  boost::program_options::value(&_tree_file_name)-&gt;required(), "name of a tree file in NEXUS format")</strong></span>
            ("subset",  boost::program_options::value(&partition_subsets), "a string defining a partition subset, e.g. 'first:1-1234\3' or 'default[codon:standard]:1-3702'")
            <span style="color:#0000ff"><strong>("gpu",           boost::program_options::value(&_use_gpu)-&gt;default_value(true),                "use GPU if available")</strong></span>
            <span style="color:#0000ff"><strong>("ambigmissing",  boost::program_options::value(&_ambig_missing)-&gt;default_value(true),          "treat all ambiguities as missing data")</strong></span>
        ;
        boost::program_options::store(boost::program_options::parse_command_line(argc, argv, desc), vm);
        try {
            const boost::program_options::parsed_options & parsed = boost::program_options::parse_config_file&lt; char &gt;("strom.conf", desc, false);
            boost::program_options::store(parsed, vm);
        }
        catch(boost::program_options::reading_file & x) {
            std::cout &lt;&lt; "Note: configuration file (strom.conf) not found" &lt;&lt; std::endl;
        }
        boost::program_options::notify(vm);

        // If user specified --help on command line, output usage summary and quit
        if (vm.count("help") &gt; 0) {
            std::cout &lt;&lt; desc &lt;&lt; "\n";
            std::exit(1);
        }

        // If user specified --version on command line, output version and quit
        if (vm.count("version") &gt; 0) {
            std::cout &lt;&lt; boost::str(boost::format("This is %s version %d.%d") % _program_name % _major_version % _minor_version) &lt;&lt; std::endl;
            std::exit(1);
        }
    
        // If user specified --subset on command line, break specified partition subset 
        // definition into name and character set string and add to _partition
        if (vm.count("subset") &gt; 0) {
            _partition.reset(new Partition());
            for (auto s : partition_subsets) {
                _partition-&gt;parseSubsetDefinition(s);
            }
        }
    }   

You could remove the required() call from all command line options; I only use it here because this means program_options takes care of informing the user if they forget to specify a data or tree file on the command line or in the strom.conf file. Because the program, as currently configured, will only work if given both a data file and a tree file, it makes sense to require these options.

Replace the body of the Strom::run function in the file strom.hpp with this revised version.

    inline void Strom::run() {  
        std::cout &lt;&lt; "Starting..." &lt;&lt; std::endl;
        std::cout &lt;&lt; "Current working directory: " &lt;&lt; boost::filesystem::current_path() &lt;&lt; std::endl;

        try {
            std::cout &lt;&lt; "\n*** Reading and storing the data in the file " &lt;&lt; _data_file_name &lt;&lt; std::endl;
            _data = Data::SharedPtr(new Data());
            _data-&gt;setPartition(_partition);
            _data-&gt;getDataFromFile(_data_file_name);

            // Report information about data partition subsets
            unsigned nsubsets = _data-&gt;getNumSubsets();
            std::cout &lt;&lt; "\nNumber of taxa: " &lt;&lt; _data-&gt;getNumTaxa() &lt;&lt; std::endl;
            std::cout &lt;&lt; "Number of partition subsets: " &lt;&lt; nsubsets &lt;&lt; std::endl;
            for (unsigned subset = 0; subset &lt; nsubsets; subset++) {
                DataType dt = _partition-&gt;getDataTypeForSubset(subset);
                std::cout &lt;&lt; "  Subset " &lt;&lt; (subset+1) &lt;&lt; " (" &lt;&lt; _data-&gt;getSubsetName(subset) &lt;&lt; ")" &lt;&lt; std::endl;
                std::cout &lt;&lt; "    data type: " &lt;&lt; dt.getDataTypeAsString() &lt;&lt; std::endl;
                std::cout &lt;&lt; "    sites:     " &lt;&lt; _data-&gt;calcSeqLenInSubset(subset) &lt;&lt; std::endl;
                std::cout &lt;&lt; "    patterns:  " &lt;&lt; _data-&gt;getNumPatternsInSubset(subset) &lt;&lt; std::endl;
                std::cout &lt;&lt; "    ambiguity: " &lt;&lt; (_ambig_missing || dt.isCodon() ? "treated as missing data (faster)" : "handled appropriately (slower)") &lt;&lt; std::endl;
                }
            
            std::cout &lt;&lt; "\n*** Resources available to BeagleLib " &lt;&lt; _likelihood-&gt;beagleLibVersion() &lt;&lt; ":\n";
            std::cout &lt;&lt; _likelihood-&gt;availableResources() &lt;&lt; std::endl;

            std::cout &lt;&lt; "\n*** Creating the likelihood calculator" &lt;&lt; std::endl;
            _likelihood = Likelihood::SharedPtr(new Likelihood());
            _likelihood-&gt;setPreferGPU(_use_gpu);
            _likelihood-&gt;setAmbiguityEqualsMissing(_ambig_missing);
            _likelihood-&gt;setData(_data);
            _likelihood-&gt;initBeagleLib();

            std::cout &lt;&lt; "\n*** Reading and storing the first tree in the file " &lt;&lt; _tree_file_name &lt;&lt; std::endl;
            _tree_summary = TreeSummary::SharedPtr(new TreeSummary());
            _tree_summary-&gt;readTreefile(_tree_file_name, 0);
            Tree::SharedPtr tree = _tree_summary-&gt;getTree(0);

            if (tree-&gt;numLeaves() != _data-&gt;getNumTaxa())
                throw XStrom(boost::format("Number of taxa in tree (%d) does not equal the number of taxa in the data matrix (%d)") % tree-&gt;numLeaves() % _data-&gt;getNumTaxa());

            std::cout &lt;&lt; "\n*** Calculating the likelihood of the tree" &lt;&lt; std::endl;
            double lnL = _likelihood-&gt;calcLogLikelihood(tree);
            std::cout &lt;&lt; boost::str(boost::format("log likelihood = %.5f") % lnL) &lt;&lt; std::endl;
            std::cout &lt;&lt; "      (expecting -278.83767)" &lt;&lt; std::endl;
        }
        catch (XStrom & x) {
            std::cerr &lt;&lt; "Strom encountered a problem:\n  " &lt;&lt; x.what() &lt;&lt; std::endl;
        }

        std::cout &lt;&lt; "\nFinished!" &lt;&lt; std::endl;
    }    

Make Likelihood a friend of Tree and Node

The defineOperations member function body above accesses private data members of both the Tree and Node classes. To avoid compile-time errors, you will thus need to declare the Likelihood class to be a friend of both Tree and Node. In the node.hpp file, you should uncomment the 2 lines highlighted below:

#pragma once    

#include &lt;string&gt;
#include &lt;vector&gt;
#include  &lt;iostream&gt;
#include "split.hpp"

namespace strom {

    class Tree;
    class TreeManip;
    <span style="color:#0000ff"><strong>class Likelihood;</strong></span>
    //class Updater;

    class Node {
        friend class Tree;
        friend class TreeManip;
        <span style="color:#0000ff"><strong>friend class Likelihood;</strong></span>
        //friend class Updater;
        

Uncomment the same 2 lines in the tree.hpp file:

#pragma once    

#include &lt;memory&gt;
#include &lt;iostream&gt;
#include "node.hpp"

namespace strom {

    class TreeManip;
    <span style="color:#0000ff"><strong>class Likelihood;</strong></span>
    //class Updater;

    class Tree {

            friend class TreeManip;
            <span style="color:#0000ff"><strong>friend class Likelihood;</strong></span>
            //friend class Updater;
            

Installing BeagleLib

Our Likelihood class depends on BeagleLib to do all the heavy lifting with respect to calculating transition probabilities and the log-likelihood itself. Our next task is thus to install BeagleLib and tell our program where to find the shared library for linking and at run time.

Download BeagleLib

Start by downloading the file beagle-lib-master.zip from the BeagleLib GitHub site. Unzip the library and move the resulting directory, beagle-lib-master, to your libraries directory, which I will assume is ~/Documents/libraries.

Open a terminal window and navigate into your beagle-lib-master directory:

cd ~/Documents/libraries/beagle-lib-master

Building BeagleLib

Now build the library, specifying the following on the command line:

./autogen.sh
./configure --prefix=$HOME --with-jdk=no CXXFLAGS="-std=c++11" --enable-static
make
make install

If the ./autogen.sh command generates an error (autoreconf: not found), you will need to install Gnu autotools and autoreconf. The easiest way to do this is to first install Homebrew:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

and then use homebrew to install autotools:

brew install autoconf
brew install automake

You can ignore (for the moment at least) the “OpenCL not found” and “NVIDIA CUDA nvcc compiler not found” warnings issued by configure. We will only be using the CPU version of the library for this tutorial. If you later want to take advantage of GPU resources, you will need to supply either OpenCL or NVIDIA and recompile BeagleLib.

Move all files in ~/lib that have names beginning with “libhmsbeagle” and having an *.a file name extension into your ~/lib/static folder.

Tell Xcode about BeagleLib

Once you have compiled BeagleLib, you need to tell Xcode about it so that it can be linked into your executable.

In the Xcode main menu, choose View > Navigators > Show Project Navigator, then click on the blue project icon :blueproject: labeled strom at the top of the Project Navigator. Click on the Build Phases tab. Click the + sign in Link Binary With Libraries and (after clicking the Add Other… button) navigate to the directory ~/lib/static and select the libhmsbeagle.a, libhmsbeagle-cpu.a, and libhmsbeagle-cpu-sse.a files. This informs Xcode where to find the compiled library code for purposes of linking the library into your program’s executable file.

Run strom

Go ahead and run strom now. I’ll discuss the output a bit at a time in the sections below.

Reading and storing the data

The run function first creates a Data object and stores a shared pointer to it in the private data member _data. After informing the Data object about any partition that the user set up, the run function reads the file whose name is stored in _data_file_name (rbcL.nex) and spits out a summary of the data therein. We have not set up data partitioning so, this summary just reports a single data subset containing 60 sites compressed into 21 distinct site patterns:

*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA

Number of taxa: 14
Number of partition subsets: 1
  Subset 1 (default)
    data type: nucleotide
    sites:     60
    patterns:  21

Creating a likelihood calculator

The run function next creates a Likelihood object and stores a shared pointer to that object in the private data member _likelihood. It then calls the Likelihood::beagleLibVersion function to show which version of BeagleLib we’re using and the Likelihood::availableResources function to show a summary of resources available to BeagleLib, which on my laptop is just the default CPU plugin of BeagleLib. Next, the Likelihood::setData function is called to make the compressed data from the Data object available. Finally, the Likelihood::initBeagleLib function is called to construct the BeagleLib machinery needed to do the likelihood calculations:

*** Resources available to BeagleLib 3.1.2:
0: CPU

*** Creating the likelihood calculator
Created BeagleLib instance 0 (4 states, 1 rate, 1 subset)

Reading and storing a tree

Next, a TreeSummary object is created and stored in the shared pointer private data member _tree_summary and its readTreefile function is called to read the tree file whose name is _tree_file_name (rbcLjc.tre), which you just created. A Tree shared pointer named tree is created and pointed to the first (and only) tree read from rbcLjc.tre. This results in the following output:

*** Reading and storing the first tree in the file rbcLjc.tre

Warning:  
 A TAXA block should be read before the TREES block (but no TAXA block was found).  Taxa will be inferred from their usage in the TREES block.
at line 5, column (approximately) 1 (file position 33)
storing implied block: TAXA
storing read block: TREES

TAXA block warning

If the TAXA block warning produced by NCL bothers you, you have a couple of choices:

  1. You can provide a TAXA block in your tree file rbcLjc.tre
  2. You can suppress warnings when you read the tree file.

The first option has the advantage that it preserves the ability of NCL to warn you of possible problems, and is arguably the better solution. To take this route, add the following TAXA block to the rbcLjc.tre file between the opening #nexus line and the begin trees; statement:

begin taxa;
  dimensions ntax=14;
  taxlabels
		Atractomorpha_echinata
		Bracteacoccus_aerius
		Bracteacoccus_minor
		Chlorotetraedron_incus
		Chromochloris_zofingiensis
		Kirchneriella_aperta
		Mychonastes_homosphaera
		Neochloris_aquatica
		Ourococcus_multisporus
		Pediastrum_duplex
		Pseudomuriella_schumacherensis
		Rotundella_rotunda
		Scenedesmus_obliquus
		Stigeoclonium_helveticum;
end;

Implementing the second option involves changing this (highlighted) line in your TreeSummary::readTreefile function (file tree_summary.hpp):

    inline void TreeSummary::readTreefile(const std::string filename, unsigned skip) {  
        TreeManip tm;
        Split::treeid_t splitset;

        // See http://phylo.bio.ku.edu/ncldocs/v2.1/funcdocs/index.html for NCL documentation

        <span style="color:#0000ff"><strong>MultiFormatReader nexusReader(-1, NxsReader::WARNINGS_TO_STDERR);</strong></span>
        try {
            nexusReader.ReadFilepath(filename.c_str(), MultiFormatReader::NEXUS_FORMAT);
        }
        catch(...) {
            nexusReader.DeleteBlocksFromFactories();
            throw;
        }   

Instead of specifying NxsReader::WARNINGS_TO_STDERR here, you can instead specify NxsReader::IGNORE_WARNINGS. The alternatives are specified in the WarningHandlingMode enum in nxsreader.h (a source file in the Nexus Class Library). The only two other possibilities are NxsReader::WARNINGS_TO_STDOUT and NxsReader::WARNINGS_ARE_ERRORS (not advisable because it will cause your program to abort at even the slightest warning).

Calculating the likelihood of the tree

Finally, the calcLogLikelihood function of the Likelihood object is called to obtain the log-likelihood. This is output along with a hard-coded statement of what the log-likelihood is expected to be so that it is easy to confirm that the program is working:

*** Calculating the likelihood of the tree
log likelihood = -278.83767
      (expecting -278.83767)

It is, admittedly, a little silly to let the user enter any data and tree file name on the command line (or strom.conf file) and then print out a pre-ordained expectation of the result; however, our only purpose here is to test whether the likelihood machinery we’ve set up is working: our program is not ready to be distributed for general use yet!

Catching exceptions

Most of the body of the run function is wrapped in a try block. If anything goes wrong and an XStrom exception is thrown, the program will jump immediately to the corresponding catch block and the message stored by the exception object will be displayed.

Test likelihood calculation on partitioned data

If we partition the data but use the same model for all subsets, we should get the same total log-likelihood. Add 3 lines to your strom.conf file to partition the data into 3 equal-sized subsets each containing sites corresponding to a different codon position:

datafile = rbcL.nex
treefile = rbcLjc.tre
subset = first:1-60\3
subset = second:2-60\3
subset = third:3-60\3

Now run your program again. You should see the following output, which lists details for the three subsets (note that there are more total site patterns now because site patterns cannot be combined if the same pattern is seen in different subsets). The log-likelihood is the same as for the unpartitioned case. We will need to add the ability to specify a substitution model (the next step in the tutorial) in order to apply different models to different subsets.

Partition subset first comprises sites 1-60\3 and has type nucleotide
Partition subset second comprises sites 2-60\3 and has type nucleotide
Partition subset third comprises sites 3-60\3 and has type nucleotide
Starting...
Current working directory: "/home/paul/Documents/strom/distr"

*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA

Number of taxa: 14
Number of partition subsets: 3
  Subset 1 (first)
    data type: nucleotide
    sites:     20
    patterns:  7
  Subset 2 (second)
    data type: nucleotide
    sites:     20
    patterns:  5
  Subset 3 (third)
    data type: nucleotide
    sites:     20
    patterns:  17

*** Resources available to BeagleLib 3.1.2:
0: CPU

*** Creating the likelihood calculator
Created BeagleLib instance 0 (4 states, 1 rate, 3 subsets)

*** Reading and storing the first tree in the file rbcLjc.tre

Warning:
 A TAXA block should be read before the TREES block (but no TAXA block was found).  Taxa will be inferred from their usage in the TREES block.
at line 5, column (approximately) 1 (file position 33)
storing implied block: TAXA
storing read block: TREES

*** Calculating the likelihood of the tree
log likelihood = -278.83767
      (expecting -278.83767)

Finished!

< 10.1 | 10.2 | 11.0 >