17.5 Testing TreeUpdater and TreeLengthUpdater

(Mac version)

< 17.4 | 17.5 | 18.0 >

Run the program using this strom.conf file.

datafile    = rbcl10.nex    
treefile    = rbcl10.tre

subset      = first:1-1314\3
subset      = second:2-1314\3
subset      = third:3-1314\3

statefreq   = default:0.294667,0.189172,0.206055,0.310106
rmatrix     = default:0.06082,0.27887,0.06461,0.06244,0.48492,0.04834
ratevar     = default:1.0
ratevar     = third:1.0
pinvar      = default:0.1
pinvar      = third:0.1
ncateg      = default:4
relrate     = default:1,2,3
niter       = 10000
samplefreq  = 10
printfreq   = 100
expectedLnL = -6676.396     

This time take a look at the file trees.tre and note how the tree topology and edge lengths vary during the run. You may wish to load the tree file into the program FigTree to view the trees.

Exploring the prior

A good way to test Bayesian MCMC software is to allow the program to explore the prior rather than the posterior. Because the prior is known, this provides a good sanity check of the MCMC machinery.

Modify the Strom::run function, adding just the one line highlighted in bold, blue text. The default behavior is to use data if it is available (see the Likelihood::clear function). The line we’re adding sets Likelihood::_using_data to false, which causes the Likelihood::calcLogLikelihood to always return 0.0 rather than computing the log-likelihood.

    inline void Strom::run() {  
        std::cout &lt;&lt; "Starting..." &lt;&lt; std::endl;
        std::cout &lt;&lt; "Current working directory: " &lt;&lt; boost::filesystem::current_path() &lt;&lt; std::endl;

        try {
            std::cout &lt;&lt; "\n*** Reading and storing the data in the file " &lt;&lt; _data_file_name &lt;&lt; std::endl;
            _data = Data::SharedPtr(new Data());
            _data-&gt;setPartition(_partition);
            _data-&gt;getDataFromFile(_data_file_name);
            
            _model-&gt;setSubsetNumPatterns(_data-&gt;calcNumPatternsVect());
            _model-&gt;setSubsetSizes(_partition-&gt;calcSubsetSizes());
            _model-&gt;activate();

            // Report information about data partition subsets
            unsigned nsubsets = _data-&gt;getNumSubsets();
            std::cout &lt;&lt; "\nNumber of taxa: " &lt;&lt; _data-&gt;getNumTaxa() &lt;&lt; std::endl;
            std::cout &lt;&lt; "Number of partition subsets: " &lt;&lt; nsubsets &lt;&lt; std::endl;
            for (unsigned subset = 0; subset &lt; nsubsets; subset++) {
                DataType dt = _partition-&gt;getDataTypeForSubset(subset);
                std::cout &lt;&lt; "  Subset " &lt;&lt; (subset+1) &lt;&lt; " (" &lt;&lt; _data-&gt;getSubsetName(subset) &lt;&lt; ")" &lt;&lt; std::endl;
                std::cout &lt;&lt; "    data type: " &lt;&lt; dt.getDataTypeAsString() &lt;&lt; std::endl;
                std::cout &lt;&lt; "    sites:     " &lt;&lt; _data-&gt;calcSeqLenInSubset(subset) &lt;&lt; std::endl;
                std::cout &lt;&lt; "    patterns:  " &lt;&lt; _data-&gt;getNumPatternsInSubset(subset) &lt;&lt; std::endl;
                std::cout &lt;&lt; "    ambiguity: " &lt;&lt; (_ambig_missing || dt.isCodon() ? "treated as missing data (faster)" : "handled appropriately (slower)") &lt;&lt; std::endl;
                }

            std::cout &lt;&lt; "\n*** Model description" &lt;&lt; std::endl;
            std::cout &lt;&lt; _model-&gt;describeModel() &lt;&lt; std::endl;

            std::cout &lt;&lt; "\n*** Creating the likelihood calculator" &lt;&lt; std::endl;
            _likelihood = Likelihood::SharedPtr(new Likelihood());
            _likelihood-&gt;setPreferGPU(_use_gpu);
            _likelihood-&gt;setAmbiguityEqualsMissing(_ambig_missing);
            _likelihood-&gt;setData(_data);
            _likelihood-&gt;useUnderflowScaling(_use_underflow_scaling);
            _likelihood-&gt;setModel(_model);
            _likelihood-&gt;initBeagleLib();

            std::cout &lt;&lt; "\n*** Resources available to BeagleLib " &lt;&lt; _likelihood-&gt;beagleLibVersion() &lt;&lt; ":\n";
            std::cout &lt;&lt; "Preferred resource: " &lt;&lt; (_use_gpu ? "GPU" : "CPU") &lt;&lt; std::endl;
            std::cout &lt;&lt; "Available resources:" &lt;&lt; std::endl;
            std::cout &lt;&lt; _likelihood-&gt;availableResources() &lt;&lt; std::endl;
            std::cout &lt;&lt; "Resources used:" &lt;&lt; std::endl;
            std::cout &lt;&lt; _likelihood-&gt;usedResources() &lt;&lt; std::endl;

            std::cout &lt;&lt; "\n*** Reading and storing tree number " &lt;&lt; (_model-&gt;getTreeIndex() + 1) &lt;&lt; " in the file " &lt;&lt; _tree_file_name &lt;&lt; std::endl;
            _tree_summary = TreeSummary::SharedPtr(new TreeSummary());
            _tree_summary-&gt;readTreefile(_tree_file_name, 0);
            Tree::SharedPtr tree = _tree_summary-&gt;getTree(_model-&gt;getTreeIndex());
            std::string newick = _tree_summary-&gt;getNewick(_model-&gt;getTreeIndex());

            if (tree-&gt;numLeaves() != _data-&gt;getNumTaxa())
                throw XStrom(boost::format("Number of taxa in tree (%d) does not equal the number of taxa in the data matrix (%d)") % tree-&gt;numLeaves() % _data-&gt;getNumTaxa());

            std::cout &lt;&lt; "\n*** Calculating the likelihood of the tree" &lt;&lt; std::endl;
            TreeManip tm(tree);
            tm.selectAllPartials();
            tm.selectAllTMatrices();
            double lnL = _likelihood-&gt;calcLogLikelihood(tree);
            tm.deselectAllPartials();
            tm.deselectAllTMatrices();
            std::cout &lt;&lt; boost::str(boost::format("log likelihood = %.5f") % lnL) &lt;&lt; std::endl;
            
            if (_expected_log_likelihood != 0.0) 
                std::cout &lt;&lt; boost::str(boost::format("      (expecting %.3f)") % _expected_log_likelihood) &lt;&lt; std::endl;
            
            // Create a Lot object that generates (pseudo)random numbers   
            _lot = Lot::SharedPtr(new Lot);
            _lot-&gt;setSeed(_random_seed);

            // Create an output manager and open output files
            _output_manager.reset(new OutputManager);
            
            <span style="color:#0000ff"><strong>_likelihood-&gt;useStoredData(false);</strong></span>

            // Create a Chain object and take _num_iter steps
            Chain chain;
            unsigned num_free_parameters = chain.createUpdaters(_model, _lot, _likelihood);
            if (num_free_parameters &gt; 0) {
                _output_manager-&gt;outputConsole(boost::str(boost::format("\n%12s %12s %12s %12s") % "iteration" % "logLike" % "logPrior" % "TL"));
                _output_manager-&gt;openTreeFile("trees.tre", _data);
                _output_manager-&gt;openParameterFile("params.txt", _model);
                chain.setTreeFromNewick(newick);
                chain.start();
                sample(0, chain);
                for (unsigned iteration = 1; iteration &lt;= _num_iter; ++iteration) {
                    chain.nextStep(iteration);
                    sample(iteration, chain);
                }
                chain.stop();
                
                // Close output files
                _output_manager-&gt;closeTreeFile();
                _output_manager-&gt;closeParameterFile();
                
                // Summarize updater efficiency
                _output_manager-&gt;outputConsole("\nChain summary:");
                std::vector&lt;std::string&gt; names = chain.getUpdaterNames();
                std::vector&lt;double&gt; lambdas    = chain.getLambdas();
                std::vector&lt;double&gt; accepts    = chain.getAcceptPercentages();
                std::vector&lt;unsigned&gt; nupdates = chain.getNumUpdates();
                unsigned n = (unsigned)names.size();
                _output_manager-&gt;outputConsole(boost::str(boost::format("%30s %15s %15s %15s") % "Updater" % "Tuning Param." % "Accept %" % "No. Updates"));
                for (unsigned i = 0; i &lt; n; ++i) {
                    _output_manager-&gt;outputConsole(boost::str(boost::format("%30s %15.3f %15.1f %15d") % names[i] % lambdas[i] % accepts[i] % nupdates[i]));
                }         
            }
            else {
                _output_manager-&gt;outputConsole("MCMC skipped because there are no free parameters in the model");
            }
        }
        catch (XStrom & x) {
            std::cerr &lt;&lt; "Strom encountered a problem:\n  " &lt;&lt; x.what() &lt;&lt; std::endl;
        }

        std::cout &lt;&lt; "\nFinished!" &lt;&lt; std::endl;
    }   

Bump up the number of iterations (niter) in the strom.conf file to 100,000 in order to get an extremely good sample:

niter = 100000

Run the program again and view the results in Tracer.

Check the tree length prior

Exponential distribution with mean 10 In Tracer, bring the panel labeled “Marginal Density” to the front and click on “TL” in the Traces list. We specified the tree length prior to be Exponential with mean 10. To the right is a plot of this density for comparison. Because Tracer approximates the density using a generic kernel density estimator, which does not take into account the fact that an exponential 0.1 density must have height 0.1 at 0.0, the density shown in Tracer will not exactly fit the plot, but it should be close except for the gap on the left end.

Switch to the “Estimates” panel while still selecting TL in Traces: the mean of the distribution should be close to 10 and the variance close to 100.

Check the subset relative rates

The subset relative rates have a prior distribution related to Dirichlet(1,1,1). The difference lies in the fact that it is the product of subset rates and subset proportions that is distributed as Dirichlet(1,1,1), not the rates themselves. Select all three subset relative rates (m-0, m-1, and m-2) under Traces on the left. These three densities should be nearly identical (“Marginal Density” panel), and they should all have means close to 1.0 and variances close to 0.5.

Check the state frequencies

The state frequencies have prior Dirichlet(1,1,1,1). This is a flat Dirichlet distribution. Each component of a flat Dirichlet has variance equal to

variance of one component = (1/n)*(1 - 1/n)/(n + 1)

where n is the number of dimensions (i.e. the number of Dirichlet parameters). For nucleotide state frequencies, n = 4 so the variance of one component should equal (1/4)(3/4)/5 = 3/80 = 0.0375. Being flat, each state frequency has the same probability as any other state frequency, so the mean for each should be 1/4 = 0.25.

Select all 4 state frequencies for any subset (e.g. piA-0, piC-0, piG-0, and piT-0) under Traces on the left. These 4 densities should be nearly identical (“Marginal Density” panel), and they should all have means close to 1/4 and variances close to 0.0375.

Check the exchangeabilities

The exchangeabilities have prior Dirichlet(1,1,1,1,1,1). Select all 6 exchangeabilities for any subset (e.g. rAC-0, rAG-0, rAT-0, rCG-0, rCT-0, and rGT-0) under Traces on the left. These 6 densities should be nearly identical (“Marginal Density” panel), and they should all have means close to 0.167 and variances close to 0.0198. These values can be obtained using the formula given above with n = 6 rather than n = 4.

Check pinvar

The pinvar parameter has prior Beta(1,1), which is just a special case of a flat Dirichlet distribution. Hence, using n = 2, we find that pinvar should have mean 0.5 and variance (1/2)(1/2)/3 = 1/12 = 0.0833. Again, Tracer may not do well near 0 and 1 with this density due to the fact that it is using a “black-box” density estimator that doesn’t know that the density should drop to exactly 0.0 at pinvar values of 0.0 and 1.0.

Check the rate variance

Finally, the rate variance parameter has prior Exponential(1), which has mean 1 and variance 1.

On to heated chains

The next step of the tutorial will introduce multiple chains (Metropolis coupling) to the MCMC analysis to improve mixing. We will also add a program option (usedata) to make ignoring data (exploring the prior) easy.