10.2 Test Likelihood

(Win version)

< 10.1 | 10.2 | 11.0 >

Create a tree file

To test the Likelihood class, we need both a data set and a tree. You have already created a data file named rbcL.nex. Here is a tree file containing the maximum likelihood tree for the data in rbcL.nex. The log-likelihood of this tree under the JC69 model is -278.83767.

Create a file containing the text below inside your Data Files filter and name it rbcLjc.tre.

Here are the contents of the file:

#nexus 

begin trees;
	translate
		1 Atractomorpha_echinata,
		2 Bracteacoccus_aerius,
		3 Bracteacoccus_minor,
		4 Chlorotetraedron_incus,
		5 Chromochloris_zofingiensis,
		6 Kirchneriella_aperta,
		7 Mychonastes_homosphaera,
		8 Neochloris_aquatica,
		9 Ourococcus_multisporus,
		10 Pediastrum_duplex,
		11 Pseudomuriella_schumacherensis,
		12 Rotundella_rotunda,
		13 Scenedesmus_obliquus,
		14 Stigeoclonium_helveticum
		;
tree 'maxlike' = [&U] (1:0.052781,((((((((((2:0,3:0.051745):0.016986,5:0.070562):0.017071,(6:0.047517,11:0.067138):0.025270):0.016985,9:0):0.016955,4:0.034406):0.016955,7:0.051745):0,13:0):0.016955,12:0.016955):0.034406,8:0):0,10:0):0.052781,14:0.071604);
end;

Be sure that the rbcLjc.tre file was saved in your project directory (i.e. the same directory in which you saved rbcL.nex).

Specifying the data and tree files

You will need to specify the data file and tree file either on the command line (using --datafile rbcL.nex --treefile rbcLjc.tre) or create a text file named strom.conf containing these two lines:

datafile = rbcL.nex
treefile = rbcLjc.tre

Modifying the Strom class

Add the highlighted lines to the Strom class declaration in strom.hpp.

#pragma once    

#include &lt;iostream&gt;
#include "data.hpp"
<span style="color:#0000ff"><strong>#include "likelihood.hpp"</strong></span>
#include "tree_summary.hpp"
#include "partition.hpp"
#include &lt;boost/program_options.hpp&gt;
#include &lt;boost/filesystem.hpp&gt;
#include &lt;boost/algorithm/string/split.hpp&gt;
#include &lt;boost/algorithm/string/classification.hpp&gt;

namespace strom {

    class Strom {
        public:
                                                    Strom();
                                                    ~Strom();

            void                                    clear();
            void                                    processCommandLineOptions(int argc, const char * argv[]);
            void                                    run();
        
        private:

            std::string                             _data_file_name;
            std::string                             _tree_file_name;
            Partition::SharedPtr                    _partition;

            Data::SharedPtr                         _data; 
            <span style="color:#0000ff"><strong>Likelihood::SharedPtr                   _likelihood;</strong></span>
            TreeSummary::SharedPtr                  _tree_summary;

            <span style="color:#0000ff"><strong>bool                                    _use_gpu;</strong></span>
            <span style="color:#0000ff"><strong>bool                                    _ambig_missing;</strong></span>

            static std::string                      _program_name;
            static unsigned                         _major_version;
            static unsigned                         _minor_version;

    };
    

Initialize gpu and ambigmissing in the clear function:

    inline void Strom::clear() {    
        _data_file_name = "";
        _tree_file_name = "";
        _tree_summary   = nullptr;
        _partition.reset(new Partition());
        <span style="color:#0000ff"><strong>_use_gpu        = true;</strong></span>
        <span style="color:#0000ff"><strong>_ambig_missing  = true;</strong></span>
        _data = nullptr; 
        _likelihood = nullptr;
    }   

Add the required() call in Strom::processCommandLineOptions to the treefile command line option and add options for gpu and ambigmissing:

    inline void Strom::processCommandLineOptions(int argc, const char * argv[]) {   
        std::vector&lt;std::string&gt; partition_subsets;
        boost::program_options::variables_map vm;
        boost::program_options::options_description desc("Allowed options");    
        desc.add_options()
            ("help,h", "produce help message")
            ("version,v", "show program version")
            ("datafile,d",  boost::program_options::value(&_data_file_name)-&gt;required(), "name of a data file in NEXUS format")
            <span style="color:#0000ff"><strong>("treefile,t",  boost::program_options::value(&_tree_file_name)-&gt;required(), "name of a tree file in NEXUS format")</strong></span>
            ("subset",  boost::program_options::value(&partition_subsets), "a string defining a partition subset, e.g. 'first:1-1234\3' or 'default[codon:standard]:1-3702'")
            <span style="color:#0000ff"><strong>("gpu",           boost::program_options::value(&_use_gpu)-&gt;default_value(true),                "use GPU if available")</strong></span>
            <span style="color:#0000ff"><strong>("ambigmissing",  boost::program_options::value(&_ambig_missing)-&gt;default_value(true),          "treat all ambiguities as missing data")</strong></span>
        ;
        boost::program_options::store(boost::program_options::parse_command_line(argc, argv, desc), vm);
        try {
            const boost::program_options::parsed_options & parsed = boost::program_options::parse_config_file&lt; char &gt;("strom.conf", desc, false);
            boost::program_options::store(parsed, vm);
        }
        catch(boost::program_options::reading_file & x) {
            std::cout &lt;&lt; "Note: configuration file (strom.conf) not found" &lt;&lt; std::endl;
        }
        boost::program_options::notify(vm);

        // If user specified --help on command line, output usage summary and quit
        if (vm.count("help") &gt; 0) {
            std::cout &lt;&lt; desc &lt;&lt; "\n";
            std::exit(1);
        }

        // If user specified --version on command line, output version and quit
        if (vm.count("version") &gt; 0) {
            std::cout &lt;&lt; boost::str(boost::format("This is %s version %d.%d") % _program_name % _major_version % _minor_version) &lt;&lt; std::endl;
            std::exit(1);
        }
    
        // If user specified --subset on command line, break specified partition subset 
        // definition into name and character set string and add to _partition
        if (vm.count("subset") &gt; 0) {
            _partition.reset(new Partition());
            for (auto s : partition_subsets) {
                _partition-&gt;parseSubsetDefinition(s);
            }
        }
    }   

You could remove the required() call from all command line options; I only use it here because this means program_options takes care of informing the user if they forget to specify a data or tree file on the command line or in the strom.conf file. Because the program, as currently configured, will only work if given both a data file and a tree file, it makes sense to require these options.

Replace the body of the Strom::run function in the file strom.hpp with this revised version.

    inline void Strom::run() {  
        std::cout &lt;&lt; "Starting..." &lt;&lt; std::endl;
        std::cout &lt;&lt; "Current working directory: " &lt;&lt; boost::filesystem::current_path() &lt;&lt; std::endl;

        try {
            std::cout &lt;&lt; "\n*** Reading and storing the data in the file " &lt;&lt; _data_file_name &lt;&lt; std::endl;
            _data = Data::SharedPtr(new Data());
            _data-&gt;setPartition(_partition);
            _data-&gt;getDataFromFile(_data_file_name);

            // Report information about data partition subsets
            unsigned nsubsets = _data-&gt;getNumSubsets();
            std::cout &lt;&lt; "\nNumber of taxa: " &lt;&lt; _data-&gt;getNumTaxa() &lt;&lt; std::endl;
            std::cout &lt;&lt; "Number of partition subsets: " &lt;&lt; nsubsets &lt;&lt; std::endl;
            for (unsigned subset = 0; subset &lt; nsubsets; subset++) {
                DataType dt = _partition-&gt;getDataTypeForSubset(subset);
                std::cout &lt;&lt; "  Subset " &lt;&lt; (subset+1) &lt;&lt; " (" &lt;&lt; _data-&gt;getSubsetName(subset) &lt;&lt; ")" &lt;&lt; std::endl;
                std::cout &lt;&lt; "    data type: " &lt;&lt; dt.getDataTypeAsString() &lt;&lt; std::endl;
                std::cout &lt;&lt; "    sites:     " &lt;&lt; _data-&gt;calcSeqLenInSubset(subset) &lt;&lt; std::endl;
                std::cout &lt;&lt; "    patterns:  " &lt;&lt; _data-&gt;getNumPatternsInSubset(subset) &lt;&lt; std::endl;
                std::cout &lt;&lt; "    ambiguity: " &lt;&lt; (_ambig_missing || dt.isCodon() ? "treated as missing data (faster)" : "handled appropriately (slower)") &lt;&lt; std::endl;
                }
            
            std::cout &lt;&lt; "\n*** Resources available to BeagleLib " &lt;&lt; _likelihood-&gt;beagleLibVersion() &lt;&lt; ":\n";
            std::cout &lt;&lt; _likelihood-&gt;availableResources() &lt;&lt; std::endl;

            std::cout &lt;&lt; "\n*** Creating the likelihood calculator" &lt;&lt; std::endl;
            _likelihood = Likelihood::SharedPtr(new Likelihood());
            _likelihood-&gt;setPreferGPU(_use_gpu);
            _likelihood-&gt;setAmbiguityEqualsMissing(_ambig_missing);
            _likelihood-&gt;setData(_data);
            _likelihood-&gt;initBeagleLib();

            std::cout &lt;&lt; "\n*** Reading and storing the first tree in the file " &lt;&lt; _tree_file_name &lt;&lt; std::endl;
            _tree_summary = TreeSummary::SharedPtr(new TreeSummary());
            _tree_summary-&gt;readTreefile(_tree_file_name, 0);
            Tree::SharedPtr tree = _tree_summary-&gt;getTree(0);

            if (tree-&gt;numLeaves() != _data-&gt;getNumTaxa())
                throw XStrom(boost::format("Number of taxa in tree (%d) does not equal the number of taxa in the data matrix (%d)") % tree-&gt;numLeaves() % _data-&gt;getNumTaxa());

            std::cout &lt;&lt; "\n*** Calculating the likelihood of the tree" &lt;&lt; std::endl;
            double lnL = _likelihood-&gt;calcLogLikelihood(tree);
            std::cout &lt;&lt; boost::str(boost::format("log likelihood = %.5f") % lnL) &lt;&lt; std::endl;
            std::cout &lt;&lt; "      (expecting -278.83767)" &lt;&lt; std::endl;
        }
        catch (XStrom & x) {
            std::cerr &lt;&lt; "Strom encountered a problem:\n  " &lt;&lt; x.what() &lt;&lt; std::endl;
        }

        std::cout &lt;&lt; "\nFinished!" &lt;&lt; std::endl;
    }    

Make Likelihood a friend of Tree and Node

The defineOperations member function body above accesses private data members of both the Tree and Node classes. To avoid compile-time errors, you will thus need to declare the Likelihood class to be a friend of both Tree and Node. In the node.hpp file, you should uncomment the 2 lines highlighted below:

#pragma once    

#include &lt;string&gt;
#include &lt;vector&gt;
#include  &lt;iostream&gt;
#include "split.hpp"

namespace strom {

    class Tree;
    class TreeManip;
    <span style="color:#0000ff"><strong>class Likelihood;</strong></span>
    //class Updater;

    class Node {
        friend class Tree;
        friend class TreeManip;
        <span style="color:#0000ff"><strong>friend class Likelihood;</strong></span>
        //friend class Updater;
        

Uncomment the same 2 lines in the tree.hpp file:

#pragma once    

#include &lt;memory&gt;
#include &lt;iostream&gt;
#include "node.hpp"

namespace strom {

    class TreeManip;
    <span style="color:#0000ff"><strong>class Likelihood;</strong></span>
    //class Updater;

    class Tree {

            friend class TreeManip;
            <span style="color:#0000ff"><strong>friend class Likelihood;</strong></span>
            //friend class Updater;
            

Installing BeagleLib

Our Likelihood class depends on BeagleLib to do all the heavy lifting with respect to calculating transition probabilities and the log-likelihood itself. Our next task is thus to install BeagleLib and tell our program where to find the shared library for linking and at run time.

Start by downloading the file beagle-lib-master.zip from the BeagleLib GitHub site. Unzip the library (using 7-zip) and move the resulting directory, beagle-lib-master, to your libraries directory (and delete the original zip file). The directory you just created will be referred to hereafter as BEAGLELIB_ROOT. On my computer, BEAGLELIB_ROOT corresponds to this directory:

C:\Users\Paul Lewis\Documents\libraries\beagle-lib-master

Navigate to BEAGLELIB_ROOT and open the BEAGLELIB_ROOT\project\beagle-vs-2017\libhmsbeagle.sln Visual Studio solution. Visual Studio Community (VSC19) will issues some warnings and will say “One or more projects in the solution were not loaded correctly…”, but then will give you the opportunity to upgrade, which you should do.

Click on Solution Explorer tab on the left side of VSC19, then click once on the libhsmbeagle project to select it.

Choose Build libhmsbeagle under the Build menu.

The last line of the output window should say “Build: 2 succeeded, 0 failed, 0 up-to-date, 0 skipped”. If instead it says “Build: 1 succeeded, 1 failed, 0 up-to-date, 0 skipped”, it probably means that you do not have Java installed. In this case, you will need to prevent VSC19 from attempting to build the JNI part of the libhmsbeagle project. Expand (using the right-pointing triangle symbol) the libhmsbeagle project within the libhmsbeagle solution in the Solution Explorer. Within the libhmsbeagle project you should see a libhmsbeagle folder containing 2 subfolders named JNI and plugin, along with files named beagle.cpp, beagle.h, BeagleImpl.h and platform.h. Expand the JNI folder and for each of the 2 files inside (beagle_BeagleJNIWrapper.cpp and beagle_BeagleJNIWrapper.h), right-click the file, choose Properties, then choose the value yes for Excluded From Build. When you are done, both of these files will have a tiny red minus sign symbol beside their name in the Solution Explorer.

Once compilation is completed, you should find a BEAGLELIB_ROOT\project\beagle-vs-2017\x64\Debug folder containing several files, including libhmsbeagle.lib hmsbeagle64D.dll, and hmsbeagle-cpu64D-31.dll. The two dynamic link library (dll) files contain code that is loaded when needed by the operating system (that is, most of the functionality in BeagleLib will not be in our program strom, but strom has access to it when it needs it). We will tell our strom project about the library (lib) file because it allows strom to know what functions are available to it in the DLLs.

Before leaving the libhmsbeagle Solution, switch to Release using the Solution Configuration dropdown at the top and build again to create a Release folder and release versions of the library.

Copy the debug versions of the dynamic link libraries (DLLs) to the location of your strom debug executable (and the release versions of the DLLs to the location of your strom release executable) so that they will be found at run time. The two debug-version DLLs that you need to copy are named hmsbeagle64D.dll and hmsbeagle-cpu64D-31.dll and are located in BEAGLELIB_ROOT/project/beagle-vs-2017/x64/Debug. Likewise, the two release-version DLLs that you need to copy are named hmsbeagle64.dll and hmsbeagle-cpu64-31.dll and are located in BEAGLELIB_ROOT\project\beagle-vs-2017\x64\Release.The places they should be copied to are the x64\Debug and x64\Release directories, respectively, inside your strom solution directory. For example, on my computer, this is what my strom Debug and Release directories look like after copying these files (note that I am only showing the directory paths and the two files added to each directory; there are many other files in each directory, including the strom executable):

C:\Users\Paul Lewis\source\repos\strom\x64\Debug
    hmsbeagle64D.dll
    hmsbeagle-cpu64D-31.dll
C:\Users\Paul Lewis\source\repos\strom\x64\Release
    hmsbeagle64.dll
    hmsbeagle-cpu64-31.dll

You can now close the BEAGLELIB_ROOT\project\beagle-vs-2017\libhmsbeagle.sln solution and open the strom solution again.

In the Solution Explorer within the strom solution, right-click on the strom project and choose Properties. In the VC++ Directories section, add BEAGLELIB_ROOT to Include Directories (remember that we are using BEAGLELIB_ROOT to stand for the full path to the directory containing the BeagleLib code, so you should not type the word BEAGLELIB_ROOT directly; instead, use the button to locate the BeagleLib directory).

While still in the VC++ Directories section, add BEAGLELIB_ROOT\project\beagle-vs-2017\Debug to Library Directories.

In the Linker > Input section, add libhmsbeagle.lib to Additional Dependencies.

Run strom

Go ahead and run strom now. I’ll discuss the output a bit at a time in the sections below.

Reading and storing the data

The run function first creates a Data object and stores a shared pointer to it in the private data member _data. After informing the Data object about any partition that the user set up, the run function reads the file whose name is stored in _data_file_name (rbcL.nex) and spits out a summary of the data therein. We have not set up data partitioning so, this summary just reports a single data subset containing 60 sites compressed into 21 distinct site patterns:

*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA

Number of taxa: 14
Number of partition subsets: 1
  Subset 1 (default)
    data type: nucleotide
    sites:     60
    patterns:  21

Creating a likelihood calculator

The run function next creates a Likelihood object and stores a shared pointer to that object in the private data member _likelihood. It then calls the Likelihood::beagleLibVersion function to show which version of BeagleLib we’re using and the Likelihood::availableResources function to show a summary of resources available to BeagleLib, which on my laptop is just the default CPU plugin of BeagleLib. Next, the Likelihood::setData function is called to make the compressed data from the Data object available. Finally, the Likelihood::initBeagleLib function is called to construct the BeagleLib machinery needed to do the likelihood calculations:

*** Resources available to BeagleLib 3.1.2:
0: CPU

*** Creating the likelihood calculator
Created BeagleLib instance 0 (4 states, 1 rate, 1 subset)

Reading and storing a tree

Next, a TreeSummary object is created and stored in the shared pointer private data member _tree_summary and its readTreefile function is called to read the tree file whose name is _tree_file_name (rbcLjc.tre), which you just created. A Tree shared pointer named tree is created and pointed to the first (and only) tree read from rbcLjc.tre. This results in the following output:

*** Reading and storing the first tree in the file rbcLjc.tre

Warning:  
 A TAXA block should be read before the TREES block (but no TAXA block was found).  Taxa will be inferred from their usage in the TREES block.
at line 5, column (approximately) 1 (file position 33)
storing implied block: TAXA
storing read block: TREES

TAXA block warning

If the TAXA block warning produced by NCL bothers you, you have a couple of choices:

  1. You can provide a TAXA block in your tree file rbcLjc.tre
  2. You can suppress warnings when you read the tree file.

The first option has the advantage that it preserves the ability of NCL to warn you of possible problems, and is arguably the better solution. To take this route, add the following TAXA block to the rbcLjc.tre file between the opening #nexus line and the begin trees; statement:

begin taxa;
  dimensions ntax=14;
  taxlabels
		Atractomorpha_echinata
		Bracteacoccus_aerius
		Bracteacoccus_minor
		Chlorotetraedron_incus
		Chromochloris_zofingiensis
		Kirchneriella_aperta
		Mychonastes_homosphaera
		Neochloris_aquatica
		Ourococcus_multisporus
		Pediastrum_duplex
		Pseudomuriella_schumacherensis
		Rotundella_rotunda
		Scenedesmus_obliquus
		Stigeoclonium_helveticum;
end;

Implementing the second option involves changing this (highlighted) line in your TreeSummary::readTreefile function (file tree_summary.hpp):

    inline void TreeSummary::readTreefile(const std::string filename, unsigned skip) {  
        TreeManip tm;
        Split::treeid_t splitset;

        // See http://phylo.bio.ku.edu/ncldocs/v2.1/funcdocs/index.html for NCL documentation

        <span style="color:#0000ff"><strong>MultiFormatReader nexusReader(-1, NxsReader::WARNINGS_TO_STDERR);</strong></span>
        try {
            nexusReader.ReadFilepath(filename.c_str(), MultiFormatReader::NEXUS_FORMAT);
        }
        catch(...) {
            nexusReader.DeleteBlocksFromFactories();
            throw;
        }   

Instead of specifying NxsReader::WARNINGS_TO_STDERR here, you can instead specify NxsReader::IGNORE_WARNINGS. The alternatives are specified in the WarningHandlingMode enum in nxsreader.h (a source file in the Nexus Class Library). The only two other possibilities are NxsReader::WARNINGS_TO_STDOUT and NxsReader::WARNINGS_ARE_ERRORS (not advisable because it will cause your program to abort at even the slightest warning).

Calculating the likelihood of the tree

Finally, the calcLogLikelihood function of the Likelihood object is called to obtain the log-likelihood. This is output along with a hard-coded statement of what the log-likelihood is expected to be so that it is easy to confirm that the program is working:

*** Calculating the likelihood of the tree
log likelihood = -278.83767
      (expecting -278.83767)

It is, admittedly, a little silly to let the user enter any data and tree file name on the command line (or strom.conf file) and then print out a pre-ordained expectation of the result; however, our only purpose here is to test whether the likelihood machinery we’ve set up is working: our program is not ready to be distributed for general use yet!

Catching exceptions

Most of the body of the run function is wrapped in a try block. If anything goes wrong and an XStrom exception is thrown, the program will jump immediately to the corresponding catch block and the message stored by the exception object will be displayed.

Test likelihood calculation on partitioned data

If we partition the data but use the same model for all subsets, we should get the same total log-likelihood. Add 3 lines to your strom.conf file to partition the data into 3 equal-sized subsets each containing sites corresponding to a different codon position:

datafile = rbcL.nex
treefile = rbcLjc.tre
subset = first:1-60\3
subset = second:2-60\3
subset = third:3-60\3

Now run your program again. You should see the following output, which lists details for the three subsets (note that there are more total site patterns now because site patterns cannot be combined if the same pattern is seen in different subsets). The log-likelihood is the same as for the unpartitioned case. We will need to add the ability to specify a substitution model (the next step in the tutorial) in order to apply different models to different subsets.

Partition subset first comprises sites 1-60\3 and has type nucleotide
Partition subset second comprises sites 2-60\3 and has type nucleotide
Partition subset third comprises sites 3-60\3 and has type nucleotide
Starting...
Current working directory: "/home/paul/Documents/strom/distr"

*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA

Number of taxa: 14
Number of partition subsets: 3
  Subset 1 (first)
    data type: nucleotide
    sites:     20
    patterns:  7
  Subset 2 (second)
    data type: nucleotide
    sites:     20
    patterns:  5
  Subset 3 (third)
    data type: nucleotide
    sites:     20
    patterns:  17

*** Resources available to BeagleLib 3.1.2:
0: CPU

*** Creating the likelihood calculator
Created BeagleLib instance 0 (4 states, 1 rate, 3 subsets)

*** Reading and storing the first tree in the file rbcLjc.tre

Warning:
 A TAXA block should be read before the TREES block (but no TAXA block was found).  Taxa will be inferred from their usage in the TREES block.
at line 5, column (approximately) 1 (file position 33)
storing implied block: TAXA
storing read block: TREES

*** Calculating the likelihood of the tree
log likelihood = -278.83767
      (expecting -278.83767)

Finished!

< 10.1 | 10.2 | 11.0 >