9.3 Create the Data class

(Linux version)

< 9.2 | 9.3 | 9.4 >

Create a new C++ class named Data and add it to your project as the header file data.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).

#pragma once    

#include &lt;fstream&gt;
#include &lt;regex&gt;
#include &lt;string&gt;
#include &lt;vector&gt;
#include &lt;numeric&gt;
#include &lt;limits&gt;
#include &lt;map&gt;
#include &lt;boost/format.hpp&gt;
#include "genetic_code.hpp"
#include "datatype.hpp"
#include "partition.hpp"
#include "xstrom.hpp"
#include "ncl/nxsmultiformat.h"
#include &lt;boost/algorithm/string/join.hpp&gt;

namespace strom {

    class Data {
        public:
            typedef std::vector&lt;std::string&gt;            taxon_names_t;
            typedef unsigned long long                  state_t;
            typedef std::vector&lt;state_t&gt;                pattern_vect_t;
            typedef std::vector&lt;state_t&gt;                monomorphic_vect_t;
            typedef std::vector&lt;int&gt;                    partition_key_t;
            typedef std::map&lt;pattern_vect_t,unsigned&gt;   pattern_map_t;
            typedef std::vector&lt;pattern_vect_t&gt;         data_matrix_t;
            typedef std::vector&lt;pattern_map_t&gt;          pattern_map_vect_t;
            typedef std::vector&lt;double&gt;                 pattern_counts_t;
            typedef std::vector&lt;unsigned&gt;               subset_end_t;
            typedef std::vector&lt;unsigned&gt;               npatterns_vect_t;
            typedef std::pair&lt;unsigned, unsigned&gt;       begin_end_pair_t;
            typedef std::shared_ptr&lt;Data&gt;               SharedPtr;

                                                        Data();
                                                        ~Data();
        
            Partition::SharedPtr                        getPartition();
            void                                        setPartition(Partition::SharedPtr partition);

            void                                        getDataFromFile(const std::string filename);

            unsigned                                    getNumSubsets() const;
            std::string                                 getSubsetName(unsigned subset) const;

            unsigned                                    getNumTaxa() const;
            const taxon_names_t &                       getTaxonNames() const;

            unsigned                                    getNumPatterns() const;
            npatterns_vect_t                            calcNumPatternsVect() const;
            unsigned                                    getNumPatternsInSubset(unsigned subset) const;
            unsigned                                    getNumStatesForSubset(unsigned subset) const;
            unsigned                                    calcSeqLen() const;
            unsigned                                    calcSeqLenInSubset(unsigned subset) const;
            const data_matrix_t &                       getDataMatrix() const;
            begin_end_pair_t                            getSubsetBeginEnd(unsigned subset) const;
            const pattern_counts_t &                    getPatternCounts() const;
            const monomorphic_vect_t &                  getMonomorphic() const;
            const partition_key_t &                     getPartitionKey() const;


            void                                        clear();

        private:

            unsigned                                    storeTaxonNames(NxsTaxaBlock * taxaBlock, unsigned taxa_block_index);
            unsigned                                    storeData(unsigned ntax, unsigned nchar, NxsCharactersBlock * charBlock, NxsCharactersBlock::DataTypesEnum datatype);
            unsigned                                    buildSubsetSpecificMaps(unsigned ntaxa, unsigned seqlen, unsigned nsubsets);
            void                                        updatePatternMap(Data::pattern_vect_t & pattern, unsigned subset);
            void                                        compressPatterns();

            Partition::SharedPtr                        _partition;
            pattern_counts_t                            _pattern_counts;
            monomorphic_vect_t                          _monomorphic;
            partition_key_t                             _partition_key;
            pattern_map_vect_t                          _pattern_map_vect;
            taxon_names_t                               _taxon_names;
            data_matrix_t                               _data_matrix;
            subset_end_t                                _subset_end;
    };

    // Member function bodies go below here but above the right curly bracket that ends the namespace block
    
}   

The class declaration above defines a number of types that are introduced to simplify member function prototypes and data member definitions that follow:

taxon_names_t is used for variables that store a vector of taxon names
state_t is used for storing discrete character state codes
pattern_vect_t is used for variables that store site patterns (vectors of state codes)
monomorphic_vect_t is used for variables that store information about the state present at each constant site pattern
partition_key_t is used for variables that map site patterns to partition subsets
pattern_map_t is used for variables that map pattern counts to site patterns
data_matrix_t is used for variables that store vectors of site patterns
pattern_map_vect_t is used for a vector of pattern maps, one for each subset
pattern_counts_t is used for variables that store the number of sites having each data pattern
subset_end_t is used for a variable that stores (one past) the last pattern in each subset
npatterns_vect_t is used to store a vector of counts of the number of distinct data patterns in each subset
begin_end_pair_t is used to store a std::pair containing the beginning and ending pattern (index into the columns of _data_matrix) for a given subset
SharedPtr is a shared pointer to an object of type Data

These type definitions make it simpler to define variables of these types and to pass such variables into functions. Most of these are used at the bottom of the class declaration to declare variables that will store information read from a data file (_pattern_counts, _partition_key, _pattern_map_vect, _taxon_names, _data_matrix, and _subset_end).

Constructor and destructor

Here are the bodies of the constructor and destructor. As usual, the only thing that we have these functions do is to report when an object of the Data class is created or destroyed, and these lines have been commented out (you can uncomment them at any time for debugging purposes).

    inline Data::Data() {   
        //std::cout &lt;&lt; "Creating a Data object" &lt;&lt; std::endl;
        clear();
    }

    inline Data::~Data() {
        //std::cout &lt;&lt; "Destroying a Data object" &lt;&lt; std::endl;
    }   

The setPartition member function

The Data class needs access to the partition information that the user supplied in the strom.conf file. This information is stored in a Partition object, and this function provides a way to assign a Partition object to this Data object.

    inline void Data::setPartition(Partition::SharedPtr partition) {    
        _partition = partition;
    }    

Obtaining partition information

The next 3 member functions provide access: (1) to the Partition object (getPartition, through a shared pointer); (2) to the number of data subsets in the partition (getNumSubsets); and (3) to the names assigned to the data subsets by the user (getSubsetName).

    inline Partition::SharedPtr Data::getPartition() {    
        return _partition;
    }

    inline unsigned Data::getNumSubsets() const {
        return (_partition ? _partition-&gt;getNumSubsets() : 1);
    }
    
    inline std::string Data::getSubsetName(unsigned subset) const {
        return _partition ? _partition-&gt;getSubsetName(subset) : std::string("default");
    }    

Accessors

Next come 5 member functions (getPartitionKey, getPatternCounts, getMonomorphic, getTaxonNames, and getDataMatrix) that are all accessors.

    inline const Data::partition_key_t & Data::getPartitionKey() const {    
        return _partition_key;
    }
    
    inline const Data::pattern_counts_t & Data::getPatternCounts() const {
        return _pattern_counts;
    }
    
    inline const Data::monomorphic_vect_t & Data::getMonomorphic() const {
        return _monomorphic;
    }

    inline const Data::taxon_names_t & Data::getTaxonNames() const {
        return _taxon_names;
    }

    inline const Data::data_matrix_t & Data::getDataMatrix() const {
        return _data_matrix;
    }    

The getSubsetBeginEnd member function

The getSubsetBeginEnd function returns a std::pair in which the first and second elements are the beginning and ending pattern index in _data_matrix for the subset index supplied as an argument. The beginning index is the index to the first pattern in the focal subset, while the ending index is one beyond the last pattern in the subset (i.e. similar to C++ iterators).

    inline Data::begin_end_pair_t Data::getSubsetBeginEnd(unsigned subset) const {    
        assert(_subset_end.size() &gt; subset);
        if (subset == 0)
            return std::make_pair(0, _subset_end[0]);
        else
            return std::make_pair(_subset_end[subset-1], _subset_end[subset]);
    }    

The clear member function

Here is the body of the clear function, which simply empties data members that are vectors or maps.

    inline void Data::clear() {    
        _partition_key.clear();
        _pattern_counts.clear();
        _monomorphic.clear();
        _pattern_map_vect.clear();
        _taxon_names.clear();
        _data_matrix.clear();
        _subset_end.clear();
    }    

The getNumPatterns member function

The function getNumPatterns returns the total number of patterns stored in all subsets (which is simply the length of any row of the _data_matrix vector after the data have been compressed into unique patterns and pattern counts).

    inline unsigned Data::getNumPatterns() const {    
        if (_data_matrix.size() &gt; 0)
            return (unsigned)_data_matrix[0].size();
        else
            return 0;
    }    

The calcNumPatternsVect member function

The calcNumPatternsVect function returns a vector containing the number of patterns in each subset. The name begins with “calc” to indicate that the return value is a vector that is constructed on the fly (and thus involves more work than simply returning a stored value).

    inline Data::npatterns_vect_t Data::calcNumPatternsVect() const {    
        unsigned nsubsets = (unsigned)_subset_end.size();
        std::vector&lt;unsigned&gt; num_patterns_vect(nsubsets, 0);
        for (unsigned s = 0; s &lt; nsubsets; s++)
            num_patterns_vect[s] = getNumPatternsInSubset(s);
        return num_patterns_vect;
    }    

The getNumStatesForSubset member function

The getNumStatesForSubset function returns the number of states characterizing a specified data subset (e.g. 4 for nucleotide data, 61 for codons using the standard code, 20 for protein data, etc.).

    inline unsigned Data::getNumStatesForSubset(unsigned subset) const {    
        DataType data_type = _partition-&gt;getDataTypeForSubset(subset);
        return data_type.getNumStates();
    }    

The getNumPatternsInSubset member function

The getNumPatternsInSubset function returns the number of distinct patterns stored in a specified subset.

    inline unsigned Data::getNumPatternsInSubset(unsigned subset) const {    
        assert(_subset_end.size() &gt; subset);
        return (unsigned)_subset_end[subset] - (subset == 0 ? 0 : _subset_end[subset-1]);
    }    

The getNumTaxa member function

The getNumTaxa function is an accessor that simply returns the number of taxa.

    inline unsigned Data::getNumTaxa() const {    
        return (unsigned)_taxon_names.size();
    }    

The calcSeqLen member function

The function calcSeqLen returns the total number of sites stored, but it cannot simply return the number of elements in a row of the _data_matrix vector because that would equal the number of unique patterns, not the number of sites, assuming that compressPatterns has been called. The std::accumulate function sums the values in a vector between two elements delimited by the iterators provided in the first two arguments. The vector begin accumulated is _pattern_counts. The third argument supplies the initial value (zero in this case).

    inline unsigned Data::calcSeqLen() const {    
        return std::accumulate(_pattern_counts.begin(), _pattern_counts.end(), 0);
    }    

The calcSeqLenInSubset member function

The function calcSeqLenInSubset returns the number of sites stored in the subset whose index is provided. We again use the std::accumulate function, but need to provide it with the correct begin and end iterators. The indices representing the first and (one beyond the) last pattern in the subset is returned as the std::pair x by getSubsetBeginEnd. The iterator to the beginning of the needed range is equal to _pattern_counts.begin() plus x.first. The end iterator is obtained by advancing _pattern_counts.begin() by x.second units.

For example, if the first subset contains 5 patterns, then the first pattern in the second subset is obtained by moving an iterator 5 positions away from the iterator positioned at _pattern_counts.begin(). Note that _pattern_counts.begin() is not moved at all if the focal subset is the first subset.

    inline unsigned Data::calcSeqLenInSubset(unsigned subset) const {    
        begin_end_pair_t s = getSubsetBeginEnd(subset);
        return std::accumulate(_pattern_counts.begin() + s.first, _pattern_counts.begin() + s.second, 0);
    }    

The buildSubsetSpecificMaps member function

This function fills a map associating site patterns (keys) with counts (values). There is one such map constructed for each partition subset, and the data member _pattern_map_vect is a vector of such pattern maps. A site pattern is a vector of states present in all taxa. For example, if there were 4 taxa, then AAAA would be a common site pattern, as would CCCC, GGGG, and TTTT. These constant site patterns (same state present in all taxa) are more common in most data sets than variable site patterns such as AAGG or ACGT. Because the likelihood is identical for two instances of the exact same site pattern, it makes sense to just store such patterns once (and calculate the likelihood for them once) but keep track of how many instances there were of the pattern in the original data set. Such pooling cannot be done across partition subsets, however, because normally some component of the model differs among subsets and thus the likelihood must be computed separately for the same site pattern in different subsets. Hence, we compress each data subset into a pattern map and store a separate map for each subset in _pattern_map_vect.

The buildSubsetSpecificMaps function first obtains the vector of site ranges from the Partition object and clears _pattern_map_vect in preparation for refilling it from scratch. Each tuple in the tuples vector specifies a range of sites along with the index of the subset to which that range of sites belongs. The site loop constructs a pattern (vector of ntaxa states) from each site in the range and adds that pattern to the map stored at _pattern_map_vect[site_subset]. The work of adding the pattern to the correct map is handle by the function updatePatternMap (see below).

A final loop computes and stores the total number of site patterns across all data subsets in the variable npatterns, which is the returned value of this function.

    inline unsigned Data::buildSubsetSpecificMaps(unsigned ntaxa, unsigned seqlen, unsigned nsubsets) {    
        pattern_vect_t pattern(ntaxa);

        _pattern_map_vect.clear();
        _pattern_map_vect.resize(nsubsets);
        
        const Partition::partition_t & tuples = _partition-&gt;getSubsetRangeVect();
        for (auto & t : tuples) {
            unsigned site_begin  = std::get&lt;0&gt;(t);
            unsigned site_end    = std::get&lt;1&gt;(t);
            unsigned site_skip   = std::get&lt;2&gt;(t);
            unsigned site_subset = std::get&lt;3&gt;(t);
            for (unsigned site = site_begin; site &lt;= site_end; site += site_skip) {
                // Copy site into pattern
                for (unsigned taxon = 0; taxon &lt; ntaxa; ++taxon) {
                    pattern[taxon] = _data_matrix[taxon][site-1];
                }
                
                // Add this pattern to _pattern_map_vect element corresponding to subset site_subset
                updatePatternMap(pattern, site_subset);
            }
        }
        
        // Tally total number of patterns across all subsets
        unsigned npatterns = 0;
        for (auto & map : _pattern_map_vect) {
            npatterns += (unsigned)map.size();
        }
        
        return npatterns;
    }    

The updatePatternMap member function

This private member function adds 1 to the count stored for the supplied pattern in the supplied partition subset in the _pattern_map_vect vector. The function first asks (using the lower_bound function of the standard map class) what is the first key that equals pattern? If there is no entry for the supplied pattern, the lower_bound function will return _pattern_map_vect[subset].end(), in which case pattern will be added as a key to _pattern_map_vect[subset] with the value initially set to 1 (because this is the first time this pattern has been seen in this subset). If the lower_bound function returns an iterator not equal to _pattern_map_vect[subset].end(), then the iterator is situated at an actual map item and the count associated with this key is incremented by 1.

    inline void Data::updatePatternMap(Data::pattern_vect_t & pattern, unsigned subset) {    
        // If pattern is not already in pattern_map, insert it and set value to 1.
        // If it does exist, increment its current value.
        // (see item 24, p. 110, in Meyers' Efficient STL for more info on the technique used here)
        pattern_map_t::iterator lowb = _pattern_map_vect[subset].lower_bound(pattern);
        if (lowb != _pattern_map_vect[subset].end() && !(_pattern_map_vect[subset].key_comp()(pattern, lowb-&gt;first))) {
            // this pattern has already been seen
            lowb-&gt;second += 1;
        }
        else
            {
            // this pattern has not yet been seen
            _pattern_map_vect[subset].insert(lowb, pattern_map_t::value_type(pattern, 1));
        }
    }    

The compressPatterns member function

The member function compressPatterns calls buildSubsetSpecificMaps to store all data currently in _data_matrix in _pattern_map_vect, then it replaces the contents of _data_matrix with just the unique patterns and stores the number of sites exhibiting each pattern in the _pattern_counts data member. Once all the data has been thus compressed into unique site patterns and their counts, the _pattern_map_vect is cleared as it is now storing redundant information.

You will notice that before any real work is done, some sanity checks are performed. The first check is to make sure there is data stored (it makes no sense to compress an empty data matrix). Second, the number of stored sites is obtained and used to finalize the Partition object. The call to Partition::finalize performs other checks, such as confirming that the number of sites specified by the user-defined data partition is the same as the number of sites stored in _data_matrix, and that no sites have failed to be assigned to any partition subset.

Some of the details (e.g. the determination of which patterns are constant) will be postponed until the way state codes are stored is explained.

    inline void Data::compressPatterns() {    
        // Perform sanity checks
        if (_data_matrix.empty())
            throw XStrom("Attempted to compress an empty data matrix");
        
        unsigned ntaxa = (unsigned)_data_matrix.size();
        unsigned seqlen = (unsigned)_data_matrix[0].size();
        
        // Finalize partition
        unsigned nsubsets = getNumSubsets();
        _subset_end.resize(nsubsets);
        _partition-&gt;finalize(seqlen);

        // Compact the data, storing it in _pattern_map_vect
        unsigned npatterns = buildSubsetSpecificMaps(ntaxa, seqlen, nsubsets);
        _pattern_counts.assign(npatterns, 0);
        _monomorphic.assign(npatterns, 0);
        _partition_key.assign(npatterns, -1);

        // Rebuild _data_matrix to hold compact data, storing counts in _pattern_counts
        _data_matrix.resize(ntaxa);
        for (auto & row : _data_matrix) {
            row.resize(npatterns);
        }

        unsigned p = 0; 
        for (unsigned subset = 0; subset &lt; nsubsets; subset++) {
            for (auto & pc : _pattern_map_vect[subset]) {
                _pattern_counts[p] = pc.second; // record how many sites have pattern p
                _partition_key[p] = subset;     // record the subset to which pattern p belongs
                
                state_t constant_state = pc.first[0];
                unsigned t = 0;
                for (auto sc : pc.first) {
                    assert(sc &gt; 0);
                    constant_state &= sc;
                    _data_matrix[t][p] = sc;
                    ++t;
                }
                // constant_state equals 0 if polymorphic or state code of state present if monomorphic
                _monomorphic[p] = constant_state;
                ++p;
            }   
            
            _subset_end[subset] = p;

            // Everything for this subset has been transferred to _data_matrix and _pattern_counts,
            // so we can now free this memory
            _pattern_map_vect[subset].clear();
        }
    }    

The storeTaxonNames member function

This is a helper function used by the getDataFromFile member function (discussed below). It stores the taxon labels found in a Nexus taxa block in the vector _taxon_names. It is possible for there to be more than one taxa block in a data file. If more than one taxa block is encountered, this function checks that the order and labels for taxa in the second and subsequent taxa blocks are identical to those stored from the first taxa block. If not, an XStrom exception is thrown.

    inline unsigned Data::storeTaxonNames(NxsTaxaBlock * taxaBlock, unsigned taxa_block_index) {    
        unsigned ntax = 0;
        if (taxa_block_index == 0) {
            // First taxa block encountered in the file
            _taxon_names.clear();
            for (auto s : taxaBlock-&gt;GetAllLabels())
                _taxon_names.push_back(s);
            ntax = (unsigned)_taxon_names.size();
            _data_matrix.resize(ntax);
        }
        else {
            // Second (or later) taxa block encountered in the file
            // Check to ensure taxa block is identical to the first one
            for (auto s : taxaBlock-&gt;GetAllLabels()) {
                if (_taxon_names[ntax++] != s)
                    throw XStrom(boost::format("Taxa block %d in data file is not identical to first taxa block read") % (taxa_block_index+1));
            }
        }
        
        return ntax;
    }    

The storeData member function

This is a helper function used by the getDataFromFile member function (discussed below). It stores the data from a Nexus data or character block and ensures that the data type specified in any partition definition is consistent with the data type of the data/characters block. Details of how data is stored in _data_matrix are provided in the documentation for getDataFromFile.

    inline unsigned Data::storeData(unsigned ntax, unsigned nchar_before, NxsCharactersBlock * charBlock, NxsCharactersBlock::DataTypesEnum datatype) {    
        unsigned seqlen = 0;
        
        // Find the data type for the partition subset containing the first site in this NxsCharactersBlock
        // Assumes that all sites in any given NxsCharactersBlock have the same type (i.e. mixed not allowed)
        assert(_partition);
        unsigned subset_index = _partition-&gt;findSubsetForSite(nchar_before + 1); // remember that sites begin at 1, not 0, in partition definitions
        DataType dt = _partition-&gt;getDataTypeForSubset(subset_index);

        // Determine number of states and bail out if data type not handled
        // 1 = standard, 2 = dna, 3 = rna, 4 = nucleotide, 5 = protein, 6 = continuous, 7 = codon, 8 = mixed
        NxsCharactersBlock * block = charBlock;
        if (datatype == NxsCharactersBlock::dna || datatype == NxsCharactersBlock::rna || datatype == NxsCharactersBlock::nucleotide) {
            if (dt.isCodon()) {
                // Create a NxsCharactersBlock containing codons rather than nucleotides
                block = NxsCharactersBlock::NewCodonsCharactersBlock(
                    charBlock,
                    true,   // map partial ambiguities to completely missing (note: false is not yet implemented in NCL)
                    true,   // gaps to missing
                    true,   // inactive characters treated as missing
                    NULL,   // if non-NULL, specifies the indices of the positions in the gene
                    NULL);  // if non-NULL, specifies a pointer to a NxsCharactersBlock that contains all non-coding positions in gene
            }
            else {
                if (!dt.isNucleotide())
                    throw XStrom(boost::format("Partition subset has data type \"%s\" but data read from file has data type \"nucleotide\"") % dt.getDataTypeAsString());
            }
        }
        else if (datatype == NxsCharactersBlock::protein) {
            if (!dt.isProtein())
                throw XStrom(boost::format("Partition subset has data type \"%s\" but data read from file has data type \"protein\"") % dt.getDataTypeAsString());
        }
        else if (datatype == NxsCharactersBlock::standard) {
            if (!dt.isStandard())
                throw XStrom(boost::format("Partition subset has data type \"%s\" but data read from file has data type \"standard\"") % dt.getDataTypeAsString());
            assert(charBlock-&gt;GetSymbols());
            std::string symbols = std::string(charBlock-&gt;GetSymbols());
            dt.setStandardNumStates((unsigned)symbols.size());
        }
        else {
            // ignore block because data type is not one that is supported
            return nchar_before;
        }
        
        unsigned num_states = dt.getNumStates();
        
        // Make sure all states can be accommodated in a variable of type state_t   
        unsigned bits_in_state_t = 8*sizeof(state_t);
        if (num_states &gt; bits_in_state_t)
            throw XStrom(boost::format("This program can only process data types with fewer than %d states") % bits_in_state_t);   
        
        // Copy data matrix from NxsCharactersBlock object to _data_matrix
        // Loop through all taxa, processing one row from block for each taxon
        for (unsigned t = 0; t &lt; ntax; ++t) {

            const NxsDiscreteStateRow & row = block-&gt;GetDiscreteMatrixRow(t);
            if (seqlen == 0)
                seqlen = (unsigned)row.size();
            _data_matrix[t].resize(nchar_before + seqlen);
            
            // Loop through all sites/characters in row corresponding to taxon t
            unsigned k = nchar_before;
            for (int raw_state_code : row) {
                // For codon model, raw_state_code ranges from 0-63, but deletion of stop codons means fewer state codes
                state_t state = std::numeric_limits&lt;state_t&gt;::max(); // complete ambiguity, all bits set
                bool complete_ambiguity = (!dt.isCodon() && raw_state_code == (int)num_states);
                bool all_missing_or_gaps = (raw_state_code &lt; 0);
                if ((!complete_ambiguity) && (!all_missing_or_gaps)) {
                    int state_code = raw_state_code;
                    if (dt.isCodon())
                        state_code = dt.getGeneticCode()-&gt;getStateCode(raw_state_code);

                    if (state_code &lt; (int)num_states) {
                        state = (state_t)1 &lt;&lt; state_code;
                    }
                    else {
                        // incomplete ambiguity (NCL state code &gt; num_states)
                        const NxsDiscreteDatatypeMapper      * mapper = block-&gt;GetDatatypeMapperForChar(k - nchar_before);
                        const std::set&lt;NxsDiscreteStateCell&gt; & state_set = mapper-&gt;GetStateSetForCode(raw_state_code);
                        state = 0;
                        for (auto s : state_set) {
                             state |= (state_t)1 &lt;&lt; s;
                        }
                    }
                }
                _data_matrix[t][k++] = state;
            }
        }
        
        return seqlen;
    }    

The getDataFromFile member function

This member function makes use of the NCL (Nexus Class Library) to read and parse the data file. The advantage of using the NCL is that we do not have to deal with all the vagaries in format that people (and programs) use when creating NEXUS-formatted data file. The comment at the beginning of the function provides the web address of the primary documentation for the NCL. Please refer to this web site if you want a more complete explanation of the NCL functions used here. I will explain the salient features below the code block.

    inline void Data::getDataFromFile(const std::string filename) {    
        // See http://phylo.bio.ku.edu/ncldocs/v2.1/funcdocs/index.html for documentation
        //
        // -1 means "process all blocks found" (this is a bit field and -1 fills the bit field with 1s)
        // Here are the bits (and nexus blocks) that are defined:
        //     enum NexusBlocksToRead
        //     {
        //         NEXUS_TAXA_BLOCK_BIT = 0x01,
        //         NEXUS_TREES_BLOCK_BIT = 0x02,
        //         NEXUS_CHARACTERS_BLOCK_BIT = 0x04,
        //         NEXUS_ASSUMPTIONS_BLOCK_BIT = 0x08,
        //         NEXUS_SETS_BLOCK_BIT = 0x10,
        //         NEXUS_UNALIGNED_BLOCK_BIT = 0x20,
        //         NEXUS_DISTANCES_BLOCK_BIT = 0x40,
        //         NEXUS_UNKNOWN_BLOCK_BIT = 0x80
        //     };
        MultiFormatReader nexusReader(-1, NxsReader::WARNINGS_TO_STDERR);
        try {
            nexusReader.ReadFilepath(filename.c_str(), MultiFormatReader::NEXUS_FORMAT);
        }
        catch(...) {    
            nexusReader.DeleteBlocksFromFactories();
            throw;
        }   

        // Commit to storing new data
        clear();

        // Ensure that Data::setPartition was called before reading data
        assert(_partition);

        int numTaxaBlocks = nexusReader.GetNumTaxaBlocks();
        if (numTaxaBlocks == 0)
            throw XStrom("No taxa blocks were found in the data file");
            
        unsigned cum_nchar = 0; 
        for (int i = 0; i &lt; numTaxaBlocks; ++i) {
            NxsTaxaBlock * taxaBlock = nexusReader.GetTaxaBlock(i);
            unsigned ntax = storeTaxonNames(taxaBlock, i);
            const unsigned numCharBlocks = nexusReader.GetNumCharactersBlocks(taxaBlock);
            for (unsigned j = 0; j &lt; numCharBlocks; ++j) {
                NxsCharactersBlock * charBlock = nexusReader.GetCharactersBlock(taxaBlock, j);
                NxsCharactersBlock::DataTypesEnum datatype = charBlock-&gt;GetOriginalDataType();
                cum_nchar += storeData(ntax, cum_nchar, charBlock, datatype);
            }
        }   

        // No longer any need to store raw data from nexus file
        nexusReader.DeleteBlocksFromFactories();

        // Compress _data_matrix so that it holds only unique patterns (counts stored in _pattern_counts)
        if (_data_matrix.empty()) {
            std::cout &lt;&lt; "No data were stored from the file \"" &lt;&lt; filename &lt;&lt; "\"" &lt;&lt; std::endl;
            clear();
        }
        else {
            compressPatterns();
        }
    }    

We use the NCL’s MultiFormatReader class to create an object that can parse a NEXUS-formatted data file. MultiFormatReader is a class defined by the Nexus Class Library (NCL). Your program knows about it because of the #include "ncl/nxsmultiformat.h" at the top of the file.

The actual code for the functions and classes provided by the NCL will not be stored in your program’s executable file; it will be loaded on demand from a separate Dynamic Link Library (provided by the file libncl.so) after your program begins to execute.

The ReadFilepath function is used to read in the data file whose name is specified in the parameter filename. Note that we’ve supplied the argument MultiFormatReader::NEXUS_FORMAT to the ReadFilepath function: the NCL can read other common data file formats besides NEXUS, so you could modify your program to read (for example) FASTA-formatted data files by supplying a different argument (i.e. MultiFormatReader::FASTA_DNA_FORMAT). (A listing of supported formats may be found in the nxsmultiformat.h header file.)

The catch block,

        catch(...) {    
            nexusReader.DeleteBlocksFromFactories();
            throw;
        }   

is executed only if the NCL encountered a problem reading the file. The ellipsis (…) indicates that any exception thrown by ReadFilepath will be caught here. The catch code deletes all data stored thus far by calling the DeleteBlocksFromFactories() function and then re-throws the exception so that your program can catch and report the problem.

When the NCL reads a NEXUS data file, it looks for a DATA or CHARACTERS block and stores the data therein. If the creator of the data file has added a TAXA block, the NCL will associate the sequence data from the DATA or CHARACTERS block with the taxon names from that TAXA block. If no TAXA block was found in the file, the NCL will create a virtual TAXA block using the taxon names in the DATA or CHARACTERS block. In any case, our function first enumerates all TAXA blocks (real or vitual) found and then, for each TAXA block, we can enumerate all DATA or CHARACTER blocks stored under that TAXA block.

The main loop is shown below:

        unsigned cum_nchar = 0; 
        for (int i = 0; i &lt; numTaxaBlocks; ++i) {
            NxsTaxaBlock * taxaBlock = nexusReader.GetTaxaBlock(i);
            unsigned ntax = storeTaxonNames(taxaBlock, i);
            const unsigned numCharBlocks = nexusReader.GetNumCharactersBlocks(taxaBlock);
            for (unsigned j = 0; j &lt; numCharBlocks; ++j) {
                NxsCharactersBlock * charBlock = nexusReader.GetCharactersBlock(taxaBlock, j);
                NxsCharactersBlock::DataTypesEnum datatype = charBlock-&gt;GetOriginalDataType();
                cum_nchar += storeData(ntax, cum_nchar, charBlock, datatype);
            }
        }   

For the first TAXA block read, the taxon names are stored (by the storeTaxonNames member function) in the data member _taxon_names. For subsequent TAXA blocks, storeTaxonNames will simply check to make sure that taxon names are identical to those provided in the first TAXA block.

For each CHARACTER block read (DATA blocks will also be returned by the GetNumCharactersBlocks function), the data in the matrix is stored in the data member _data_matrix by the member function storeData.

How NCL stores data

The NCL stores DNA data not as A, C, G, and T, but instead as state codes: 0 for A, 1 for C, 2 for G, and 3 for T. An entry in the NEXUS data file that specifies complete ambiguity (e.g. N or {ACGT}) would be stored as 4. Missing data and gaps are stored as -1 and -2, respectively. Ambiguities that do not represent completely missing data are stored as a value larger than 4, and the function NxsDiscreteDatatypeMapper::GetStateSetForCode must be used to unpack this value into a set of states considered possible (see the code block for the storeData member function to see how this is done).

How our program stores data

Our program stores the state for each taxon/site combination as a bit field. Consider storing the DNA states A, C, G, T, ? (completely missing), R (i.e. a purine, either A or G), and Y (a pyrimidine, either C or T). We need only 4 bits (each holding a 0 or 1) to store any possible DNA state.

DNA  binary  decimal
 A    0001      1
 C    0010      2
 G    0100      4
 T    1000      8
 ?    1111     15
 R    0101      5
 Y    1010     10

Thus, if NCL stores 3 (for T), we will store 8 instead. If NCL stores 4 (missing data), we will store 15 instead. This allows us to handle any type of ambiguity thrown at us. Keep in mind, however, that our state codes are different than NCL’s state codes.

You will remember that our program allows a subset to contain codons rather than nucleotides. There are 64 possible codons, so in order to have a bit for each possible codon state, it is necessary to use a variable that occupies 64 bits of memory. For this reason, our program defines state_t to be unsigned long, which (conveniently) occupies 64 bits on at 64-bit computer. As a result, our program will only be able to accommodate data subsets of type codon if it is compiled on a 64-bit computer. The following lines inside storeData check to make sure that the number of states does not exceed the number of bits in a variable of type state_t. If so, an exception is thrown:

        // Make sure all states can be accommodated in a variable of type state_t   
        unsigned bits_in_state_t = 8*sizeof(state_t);
        if (num_states &gt; bits_in_state_t)
            throw XStrom(boost::format("This program can only process data types with fewer than %d states") % bits_in_state_t);   

Compressing the data

The final statement in this function calls the compressPatterns() member function, which eliminates duplicate patterns from _data_matrix and stores the counts of each pattern in _pattern_counts. If each sequence has 1000 nucleotides in the data file, then, before calling compressPatterns(), each row of _data_matrix would be a vector containing 1000 state codes and _pattern_counts would be a vector containing 1000 elements each equalling 1. Suppose that there are only 180 distinct site patterns. After calling compressPatterns(), each row of _data_matrix would now have only 180 state codes and _pattern_counts would be 180 elements long (and now many of these numbers are greater than 1).

Now that you understand how states are stored, this loop in the compressPatterns member function, which visits each state code (sc) in a given pattern (pc.first), can be explained:

        unsigned p = 0; 
        for (unsigned subset = 0; subset &lt; nsubsets; subset++) {
            for (auto & pc : _pattern_map_vect[subset]) {
                _pattern_counts[p] = pc.second; // record how many sites have pattern p
                _partition_key[p] = subset;     // record the subset to which pattern p belongs
                
                state_t constant_state = pc.first[0];
                unsigned t = 0;
                for (auto sc : pc.first) {
                    assert(sc &gt; 0);
                    constant_state &= sc;
                    _data_matrix[t][p] = sc;
                    ++t;
                }
                // constant_state equals 0 if polymorphic or state code of state present if monomorphic
                _monomorphic[p] = constant_state;
                ++p;
            }   

The local variable p keeps track of the pattern index over all subsets. Within each subset, pc is set in turn to each stored pair from _pattern_map_vect[subset]. The first member of each pair is a vector states representing the pattern itself; the second member of each pair is the count of the number of sites having that pattern.

For each pattern, sc is set in turn to each state in the pattern and stored in _data_matrix row t, column p, where t is the index of the state within the pattern (t stands for “taxon” because a pattern is just the state possessed by each taxon for a particular site). Before this sc loop begins, the local variable constant_state is set equal to the first state in the pattern. For each subsequent state, constant_state is updated using a bitwise AND operation. If constant_state makes it through all states in the pattern and is still greater than 0, it means that the pattern is at least potentially constant (a constant or monomorphic pattern is one in which all taxa have the same state).

For example, suppose the pattern for a 4-taxon problem is “AAAA”. Here is the succession of values possessed by constant_state (showing only the relevant 4 bits):

A 0001
A 0001 = 0001 & 0001
A 0001 = 0001 & 0001
A 0001 = 0001 & 0001

Note that constant_state equals 1 in the end. Now consider a variable pattern, “AAGG”:

A 0001
A 0001 = 0001 & 0001
G 0000 = 0001 & 0100
G 0000 = 0000 & 0100

Variable patterns set all bits to zero in constant_state. What about a pattern involving some missing data, “CC??”?

C 0010
C 0010 = 0010 & 0010
? 0010 = 0010 & 1111
? 0010 = 0010 & 1111

This pattern is considered constant for state C because all the unambiguous states are C. Consider one final example, “RR??”:

R 0101
R 0101 = 0101 & 0101
? 0101 = 0101 & 1111
? 0101 = 0101 & 1111

In this case, constant_state ends up being greater than 0, so it is not a variable site, yet there is no unambiguous state in the entire pattern. In this case, the pattern is consistant with a constant state containing all As or all Gs.

The element of _monomorphic corresponding to pattern index p is set to the final value of constant_state. Thus _monomorphic can later used to determine for which states this pattern is potentially constant. This information will be used later to compute the likelihood for the invariable sites rate heterogeneity model.

Releasing memory

The _pattern_map_vect data member is used by compressPatterns() but is cleared before that function returns. It serves as a temporary work space, and does not hold onto its contents once _data_matrix and _pattern_counts are rebuilt. For the same reason (we have all the data stored in our own data structures now), the function MultiFormatReader::DeleteBlocksFromFactories is called to delete data stored by NCL.