Create a new C++ class named Data
and add it to your project as the header file data.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).
#pragma once
#include <fstream>
#include <regex>
#include <string>
#include <vector>
#include <numeric>
#include <limits>
#include <map>
#include <boost/format.hpp>
#include "genetic_code.hpp"
#include "datatype.hpp"
#include "partition.hpp"
#include "xstrom.hpp"
#include "ncl/nxsmultiformat.h"
#include <boost/algorithm/string/join.hpp>
namespace strom {
class Data {
public:
typedef std::vector<std::string> taxon_names_t;
typedef unsigned long long state_t;
typedef std::vector<state_t> pattern_vect_t;
typedef std::vector<state_t> monomorphic_vect_t;
typedef std::vector<int> partition_key_t;
typedef std::map<pattern_vect_t,unsigned> pattern_map_t;
typedef std::vector<pattern_vect_t> data_matrix_t;
typedef std::vector<pattern_map_t> pattern_map_vect_t;
typedef std::vector<double> pattern_counts_t;
typedef std::vector<unsigned> subset_end_t;
typedef std::vector<unsigned> npatterns_vect_t;
typedef std::pair<unsigned, unsigned> begin_end_pair_t;
typedef std::shared_ptr<Data> SharedPtr;
Data();
~Data();
Partition::SharedPtr getPartition();
void setPartition(Partition::SharedPtr partition);
void getDataFromFile(const std::string filename);
unsigned getNumSubsets() const;
std::string getSubsetName(unsigned subset) const;
unsigned getNumTaxa() const;
const taxon_names_t & getTaxonNames() const;
unsigned getNumPatterns() const;
npatterns_vect_t calcNumPatternsVect() const;
unsigned getNumPatternsInSubset(unsigned subset) const;
unsigned getNumStatesForSubset(unsigned subset) const;
unsigned calcSeqLen() const;
unsigned calcSeqLenInSubset(unsigned subset) const;
const data_matrix_t & getDataMatrix() const;
begin_end_pair_t getSubsetBeginEnd(unsigned subset) const;
const pattern_counts_t & getPatternCounts() const;
const monomorphic_vect_t & getMonomorphic() const;
const partition_key_t & getPartitionKey() const;
void clear();
private:
unsigned storeTaxonNames(NxsTaxaBlock * taxaBlock, unsigned taxa_block_index);
unsigned storeData(unsigned ntax, unsigned nchar, NxsCharactersBlock * charBlock, NxsCharactersBlock::DataTypesEnum datatype);
unsigned buildSubsetSpecificMaps(unsigned ntaxa, unsigned seqlen, unsigned nsubsets);
void updatePatternMap(Data::pattern_vect_t & pattern, unsigned subset);
void compressPatterns();
Partition::SharedPtr _partition;
pattern_counts_t _pattern_counts;
monomorphic_vect_t _monomorphic;
partition_key_t _partition_key;
pattern_map_vect_t _pattern_map_vect;
taxon_names_t _taxon_names;
data_matrix_t _data_matrix;
subset_end_t _subset_end;
};
// Member function bodies go below here but above the right curly bracket that ends the namespace block
}
The class declaration above defines a number of types that are introduced to simplify member function prototypes and data member definitions that follow:
taxon_names_t
is used for variables that store a vector of taxon namesstate_t
is used for storing discrete character state codespattern_vect_t
is used for variables that store site patterns (vectors of state codes)monomorphic_vect_t
is used for variables that store information about the state present at each constant site patternpartition_key_t
is used for variables that map site patterns to partition subsetspattern_map_t
is used for variables that map pattern counts to site patternsdata_matrix_t
is used for variables that store vectors of site patternspattern_map_vect_t
is used for a vector of pattern maps, one for each subsetpattern_counts_t
is used for variables that store the number of sites having each data patternsubset_end_t
is used for a variable that stores (one past) the last pattern in each subsetnpatterns_vect_t
is used to store a vector of counts of the number of distinct data patterns in each subsetbegin_end_pair_t
is used to store a std::pair
containing the beginning and ending pattern (index into the columns of _data_matrix
) for a given subsetSharedPtr
is a shared pointer to an object of type Data
These type definitions make it simpler to define variables of these types and to pass such variables into functions. Most of these are used at the bottom of the class declaration to declare variables that will store information read from a data file (_pattern_counts
, _partition_key
, _pattern_map_vect
, _taxon_names
, _data_matrix
, and _subset_end
).
Here are the bodies of the constructor and destructor. As usual, the only thing that we have these functions do is to report when an object of the Data
class is created or destroyed, and these lines have been commented out (you can uncomment them at any time for debugging purposes).
inline Data::Data() {
//std::cout << "Creating a Data object" << std::endl;
clear();
}
inline Data::~Data() {
//std::cout << "Destroying a Data object" << std::endl;
}
The Data
class needs access to the partition information that the user supplied in the strom.conf file. This information is stored in a Partition
object, and this function provides a way to assign a Partition
object to this Data
object.
inline void Data::setPartition(Partition::SharedPtr partition) {
_partition = partition;
}
The next 3 member functions provide access: (1) to the Partition
object (getPartition
, through a shared pointer); (2) to the number of data subsets in the partition (getNumSubsets
); and (3) to the names assigned to the data subsets by the user (getSubsetName
).
inline Partition::SharedPtr Data::getPartition() {
return _partition;
}
inline unsigned Data::getNumSubsets() const {
return (_partition ? _partition->getNumSubsets() : 1);
}
inline std::string Data::getSubsetName(unsigned subset) const {
return _partition ? _partition->getSubsetName(subset) : std::string("default");
}
Next come 5 member functions (getPartitionKey
, getPatternCounts
, getMonomorphic
, getTaxonNames
, and getDataMatrix
) that are all accessors.
inline const Data::partition_key_t & Data::getPartitionKey() const {
return _partition_key;
}
inline const Data::pattern_counts_t & Data::getPatternCounts() const {
return _pattern_counts;
}
inline const Data::monomorphic_vect_t & Data::getMonomorphic() const {
return _monomorphic;
}
inline const Data::taxon_names_t & Data::getTaxonNames() const {
return _taxon_names;
}
inline const Data::data_matrix_t & Data::getDataMatrix() const {
return _data_matrix;
}
The getSubsetBeginEnd
function returns a std::pair
in which the first
and second
elements are the beginning and ending pattern index in _data_matrix
for the subset index supplied as an argument. The beginning index is the index to the first pattern in the focal subset, while the ending index is one beyond the last pattern in the subset (i.e. similar to C++ iterators).
inline Data::begin_end_pair_t Data::getSubsetBeginEnd(unsigned subset) const {
assert(_subset_end.size() > subset);
if (subset == 0)
return std::make_pair(0, _subset_end[0]);
else
return std::make_pair(_subset_end[subset-1], _subset_end[subset]);
}
Here is the body of the clear
function, which simply empties data members that are vectors or maps.
inline void Data::clear() {
_partition_key.clear();
_pattern_counts.clear();
_monomorphic.clear();
_pattern_map_vect.clear();
_taxon_names.clear();
_data_matrix.clear();
_subset_end.clear();
}
The function getNumPatterns
returns the total number of patterns stored in all subsets (which is simply the length of any row of the _data_matrix
vector after the data have been compressed into unique patterns and pattern counts).
inline unsigned Data::getNumPatterns() const {
if (_data_matrix.size() > 0)
return (unsigned)_data_matrix[0].size();
else
return 0;
}
The calcNumPatternsVect
function returns a vector containing the number of patterns in each subset. The name begins with “calc” to indicate that the return value is a vector that is constructed on the fly (and thus involves more work than simply returning a stored value).
inline Data::npatterns_vect_t Data::calcNumPatternsVect() const {
unsigned nsubsets = (unsigned)_subset_end.size();
std::vector<unsigned> num_patterns_vect(nsubsets, 0);
for (unsigned s = 0; s < nsubsets; s++)
num_patterns_vect[s] = getNumPatternsInSubset(s);
return num_patterns_vect;
}
The getNumStatesForSubset
function returns the number of states characterizing a specified data subset (e.g. 4 for nucleotide data, 61 for codons using the standard code, 20 for protein data, etc.).
inline unsigned Data::getNumStatesForSubset(unsigned subset) const {
DataType data_type = _partition->getDataTypeForSubset(subset);
return data_type.getNumStates();
}
The getNumPatternsInSubset
function returns the number of distinct patterns stored in a specified subset.
inline unsigned Data::getNumPatternsInSubset(unsigned subset) const {
assert(_subset_end.size() > subset);
return (unsigned)_subset_end[subset] - (subset == 0 ? 0 : _subset_end[subset-1]);
}
The getNumTaxa
function is an accessor that simply returns the number of taxa.
inline unsigned Data::getNumTaxa() const {
return (unsigned)_taxon_names.size();
}
The function calcSeqLen
returns the total number of sites stored, but it cannot simply return the number of elements in a row of the _data_matrix
vector because that would equal the number of unique patterns, not the number of sites, assuming that compressPatterns
has been called. The std::accumulate
function sums the values in a vector between two elements delimited by the iterators provided in the first two arguments. The vector begin accumulated is _pattern_counts
. The third argument supplies the initial value (zero in this case).
inline unsigned Data::calcSeqLen() const {
return std::accumulate(_pattern_counts.begin(), _pattern_counts.end(), 0);
}
The function calcSeqLenInSubset
returns the number of sites stored in the subset whose index is provided. We again use the std::accumulate
function, but need to provide it with the correct begin and end iterators. The indices representing the first and (one beyond the) last pattern in the subset is returned as the std::pair
x
by getSubsetBeginEnd
. The iterator to the beginning of the needed range is equal to _pattern_counts.begin()
plus x.first
. The end iterator is obtained by advancing _pattern_counts.begin()
by x.second
units.
For example, if the first subset contains 5 patterns, then the first pattern in the second subset is obtained by moving an iterator 5 positions away from the iterator positioned at _pattern_counts.begin()
. Note that _pattern_counts.begin()
is not moved at all if the focal subset is the first subset.
inline unsigned Data::calcSeqLenInSubset(unsigned subset) const {
begin_end_pair_t s = getSubsetBeginEnd(subset);
return std::accumulate(_pattern_counts.begin() + s.first, _pattern_counts.begin() + s.second, 0);
}
This function fills a map associating site patterns (keys) with counts (values). There is one such map constructed for each partition subset, and the data member _pattern_map_vect
is a vector of such pattern maps. A site pattern is a vector of states present in all taxa. For example, if there were 4 taxa, then AAAA would be a common site pattern, as would CCCC, GGGG, and TTTT. These constant site patterns (same state present in all taxa) are more common in most data sets than variable site patterns such as AAGG or ACGT. Because the likelihood is identical for two instances of the exact same site pattern, it makes sense to just store such patterns once (and calculate the likelihood for them once) but keep track of how many instances there were of the pattern in the original data set. Such pooling cannot be done across partition subsets, however, because normally some component of the model differs among subsets and thus the likelihood must be computed separately for the same site pattern in different subsets. Hence, we compress each data subset into a pattern map and store a separate map for each subset in _pattern_map_vect
.
The buildSubsetSpecificMaps
function first obtains the vector of site ranges from the Partition
object and clears _pattern_map_vect
in preparation for refilling it from scratch. Each tuple in the tuples
vector specifies a range of sites along with the index of the subset to which that range of sites belongs. The site
loop constructs a pattern (vector of ntaxa
states) from each site in the range and adds that pattern to the map stored at _pattern_map_vect[site_subset]
. The work of adding the pattern to the correct map is handle by the function updatePatternMap
(see below).
A final loop computes and stores the total number of site patterns across all data subsets in the variable npatterns
, which is the returned value of this function.
inline unsigned Data::buildSubsetSpecificMaps(unsigned ntaxa, unsigned seqlen, unsigned nsubsets) {
pattern_vect_t pattern(ntaxa);
_pattern_map_vect.clear();
_pattern_map_vect.resize(nsubsets);
const Partition::partition_t & tuples = _partition->getSubsetRangeVect();
for (auto & t : tuples) {
unsigned site_begin = std::get<0>(t);
unsigned site_end = std::get<1>(t);
unsigned site_skip = std::get<2>(t);
unsigned site_subset = std::get<3>(t);
for (unsigned site = site_begin; site <= site_end; site += site_skip) {
// Copy site into pattern
for (unsigned taxon = 0; taxon < ntaxa; ++taxon) {
pattern[taxon] = _data_matrix[taxon][site-1];
}
// Add this pattern to _pattern_map_vect element corresponding to subset site_subset
updatePatternMap(pattern, site_subset);
}
}
// Tally total number of patterns across all subsets
unsigned npatterns = 0;
for (auto & map : _pattern_map_vect) {
npatterns += (unsigned)map.size();
}
return npatterns;
}
This private member function adds 1 to the count stored for the supplied pattern in the supplied partition subset in the _pattern_map_vect
vector. The function first asks (using the lower_bound
function of the standard map
class) what is the first key that equals pattern
? If there is no entry for the supplied pattern
, the lower_bound
function will return _pattern_map_vect[subset].end()
, in which case pattern
will be added as a key to _pattern_map_vect[subset]
with the value initially set to 1 (because this is the first time this pattern has been seen in this subset). If the lower_bound
function returns an iterator not equal to _pattern_map_vect[subset].end()
, then the iterator is situated at an actual map item and the count associated with this key is incremented by 1.
inline void Data::updatePatternMap(Data::pattern_vect_t & pattern, unsigned subset) {
// If pattern is not already in pattern_map, insert it and set value to 1.
// If it does exist, increment its current value.
// (see item 24, p. 110, in Meyers' Efficient STL for more info on the technique used here)
pattern_map_t::iterator lowb = _pattern_map_vect[subset].lower_bound(pattern);
if (lowb != _pattern_map_vect[subset].end() && !(_pattern_map_vect[subset].key_comp()(pattern, lowb->first))) {
// this pattern has already been seen
lowb->second += 1;
}
else
{
// this pattern has not yet been seen
_pattern_map_vect[subset].insert(lowb, pattern_map_t::value_type(pattern, 1));
}
}
The member function compressPatterns
calls buildSubsetSpecificMaps
to store all data currently in _data_matrix
in _pattern_map_vect
, then it replaces the contents of _data_matrix
with just the unique patterns and stores the number of sites exhibiting each pattern in the _pattern_counts
data member. Once all the data has been thus compressed into unique site patterns and their counts, the _pattern_map_vect
is cleared as it is now storing redundant information.
You will notice that before any real work is done, some sanity checks are performed. The first check is to make sure there is data stored (it makes no sense to compress an empty data matrix). Second, the number of stored sites is obtained and used to finalize the Partition
object. The call to Partition::finalize
performs other checks, such as confirming that the number of sites specified by the user-defined data partition is the same as the number of sites stored in _data_matrix
, and that no sites have failed to be assigned to any partition subset.
Some of the details (e.g. the determination of which patterns are constant) will be postponed until the way state codes are stored is explained.
inline void Data::compressPatterns() {
// Perform sanity checks
if (_data_matrix.empty())
throw XStrom("Attempted to compress an empty data matrix");
unsigned ntaxa = (unsigned)_data_matrix.size();
unsigned seqlen = (unsigned)_data_matrix[0].size();
// Finalize partition
unsigned nsubsets = getNumSubsets();
_subset_end.resize(nsubsets);
_partition->finalize(seqlen);
// Compact the data, storing it in _pattern_map_vect
unsigned npatterns = buildSubsetSpecificMaps(ntaxa, seqlen, nsubsets);
_pattern_counts.assign(npatterns, 0);
_monomorphic.assign(npatterns, 0);
_partition_key.assign(npatterns, -1);
// Rebuild _data_matrix to hold compact data, storing counts in _pattern_counts
_data_matrix.resize(ntaxa);
for (auto & row : _data_matrix) {
row.resize(npatterns);
}
unsigned p = 0;
for (unsigned subset = 0; subset < nsubsets; subset++) {
for (auto & pc : _pattern_map_vect[subset]) {
_pattern_counts[p] = pc.second; // record how many sites have pattern p
_partition_key[p] = subset; // record the subset to which pattern p belongs
state_t constant_state = pc.first[0];
unsigned t = 0;
for (auto sc : pc.first) {
assert(sc > 0);
constant_state &= sc;
_data_matrix[t][p] = sc;
++t;
}
// constant_state equals 0 if polymorphic or state code of state present if monomorphic
_monomorphic[p] = constant_state;
++p;
}
_subset_end[subset] = p;
// Everything for this subset has been transferred to _data_matrix and _pattern_counts,
// so we can now free this memory
_pattern_map_vect[subset].clear();
}
}
This is a helper function used by the getDataFromFile
member function (discussed below). It stores the taxon labels found in a Nexus taxa block in the vector _taxon_names
. It is possible for there to be more than one taxa block in a data file. If more than one taxa block is encountered, this function checks that the order and labels for taxa in the second and subsequent taxa blocks are identical to those stored from the first taxa block. If not, an XStrom
exception is thrown.
inline unsigned Data::storeTaxonNames(NxsTaxaBlock * taxaBlock, unsigned taxa_block_index) {
unsigned ntax = 0;
if (taxa_block_index == 0) {
// First taxa block encountered in the file
_taxon_names.clear();
for (auto s : taxaBlock->GetAllLabels())
_taxon_names.push_back(s);
ntax = (unsigned)_taxon_names.size();
_data_matrix.resize(ntax);
}
else {
// Second (or later) taxa block encountered in the file
// Check to ensure taxa block is identical to the first one
for (auto s : taxaBlock->GetAllLabels()) {
if (_taxon_names[ntax++] != s)
throw XStrom(boost::format("Taxa block %d in data file is not identical to first taxa block read") % (taxa_block_index+1));
}
}
return ntax;
}
This is a helper function used by the getDataFromFile
member function (discussed below). It stores the data from a Nexus data or character block and ensures that the data type specified in any partition definition is consistent with the data type of the data/characters block. Details of how data is stored in _data_matrix
are provided in the documentation for getDataFromFile
.
inline unsigned Data::storeData(unsigned ntax, unsigned nchar_before, NxsCharactersBlock * charBlock, NxsCharactersBlock::DataTypesEnum datatype) {
unsigned seqlen = 0;
// Find the data type for the partition subset containing the first site in this NxsCharactersBlock
// Assumes that all sites in any given NxsCharactersBlock have the same type (i.e. mixed not allowed)
assert(_partition);
unsigned subset_index = _partition->findSubsetForSite(nchar_before + 1); // remember that sites begin at 1, not 0, in partition definitions
DataType dt = _partition->getDataTypeForSubset(subset_index);
// Determine number of states and bail out if data type not handled
// 1 = standard, 2 = dna, 3 = rna, 4 = nucleotide, 5 = protein, 6 = continuous, 7 = codon, 8 = mixed
NxsCharactersBlock * block = charBlock;
if (datatype == NxsCharactersBlock::dna || datatype == NxsCharactersBlock::rna || datatype == NxsCharactersBlock::nucleotide) {
if (dt.isCodon()) {
// Create a NxsCharactersBlock containing codons rather than nucleotides
block = NxsCharactersBlock::NewCodonsCharactersBlock(
charBlock,
true, // map partial ambiguities to completely missing (note: false is not yet implemented in NCL)
true, // gaps to missing
true, // inactive characters treated as missing
NULL, // if non-NULL, specifies the indices of the positions in the gene
NULL); // if non-NULL, specifies a pointer to a NxsCharactersBlock that contains all non-coding positions in gene
}
else {
if (!dt.isNucleotide())
throw XStrom(boost::format("Partition subset has data type \"%s\" but data read from file has data type \"nucleotide\"") % dt.getDataTypeAsString());
}
}
else if (datatype == NxsCharactersBlock::protein) {
if (!dt.isProtein())
throw XStrom(boost::format("Partition subset has data type \"%s\" but data read from file has data type \"protein\"") % dt.getDataTypeAsString());
}
else if (datatype == NxsCharactersBlock::standard) {
if (!dt.isStandard())
throw XStrom(boost::format("Partition subset has data type \"%s\" but data read from file has data type \"standard\"") % dt.getDataTypeAsString());
assert(charBlock->GetSymbols());
std::string symbols = std::string(charBlock->GetSymbols());
dt.setStandardNumStates((unsigned)symbols.size());
}
else {
// ignore block because data type is not one that is supported
return nchar_before;
}
unsigned num_states = dt.getNumStates();
// Make sure all states can be accommodated in a variable of type state_t
unsigned bits_in_state_t = 8*sizeof(state_t);
if (num_states > bits_in_state_t)
throw XStrom(boost::format("This program can only process data types with fewer than %d states") % bits_in_state_t);
// Copy data matrix from NxsCharactersBlock object to _data_matrix
// Loop through all taxa, processing one row from block for each taxon
for (unsigned t = 0; t < ntax; ++t) {
const NxsDiscreteStateRow & row = block->GetDiscreteMatrixRow(t);
if (seqlen == 0)
seqlen = (unsigned)row.size();
_data_matrix[t].resize(nchar_before + seqlen);
// Loop through all sites/characters in row corresponding to taxon t
unsigned k = nchar_before;
for (int raw_state_code : row) {
// For codon model, raw_state_code ranges from 0-63, but deletion of stop codons means fewer state codes
state_t state = std::numeric_limits<state_t>::max(); // complete ambiguity, all bits set
bool complete_ambiguity = (!dt.isCodon() && raw_state_code == (int)num_states);
bool all_missing_or_gaps = (raw_state_code < 0);
if ((!complete_ambiguity) && (!all_missing_or_gaps)) {
int state_code = raw_state_code;
if (dt.isCodon())
state_code = dt.getGeneticCode()->getStateCode(raw_state_code);
if (state_code < (int)num_states) {
state = (state_t)1 << state_code;
}
else {
// incomplete ambiguity (NCL state code > num_states)
const NxsDiscreteDatatypeMapper * mapper = block->GetDatatypeMapperForChar(k - nchar_before);
const std::set<NxsDiscreteStateCell> & state_set = mapper->GetStateSetForCode(raw_state_code);
state = 0;
for (auto s : state_set) {
state |= (state_t)1 << s;
}
}
}
_data_matrix[t][k++] = state;
}
}
return seqlen;
}
This member function makes use of the NCL (Nexus Class Library) to read and parse the data file. The advantage of using the NCL is that we do not have to deal with all the vagaries in format that people (and programs) use when creating NEXUS-formatted data file. The comment at the beginning of the function provides the web address of the primary documentation for the NCL. Please refer to this web site if you want a more complete explanation of the NCL functions used here. I will explain the salient features below the code block.
inline void Data::getDataFromFile(const std::string filename) {
// See http://phylo.bio.ku.edu/ncldocs/v2.1/funcdocs/index.html for documentation
//
// -1 means "process all blocks found" (this is a bit field and -1 fills the bit field with 1s)
// Here are the bits (and nexus blocks) that are defined:
// enum NexusBlocksToRead
// {
// NEXUS_TAXA_BLOCK_BIT = 0x01,
// NEXUS_TREES_BLOCK_BIT = 0x02,
// NEXUS_CHARACTERS_BLOCK_BIT = 0x04,
// NEXUS_ASSUMPTIONS_BLOCK_BIT = 0x08,
// NEXUS_SETS_BLOCK_BIT = 0x10,
// NEXUS_UNALIGNED_BLOCK_BIT = 0x20,
// NEXUS_DISTANCES_BLOCK_BIT = 0x40,
// NEXUS_UNKNOWN_BLOCK_BIT = 0x80
// };
MultiFormatReader nexusReader(-1, NxsReader::WARNINGS_TO_STDERR);
try {
nexusReader.ReadFilepath(filename.c_str(), MultiFormatReader::NEXUS_FORMAT);
}
catch(...) {
nexusReader.DeleteBlocksFromFactories();
throw;
}
// Commit to storing new data
clear();
// Ensure that Data::setPartition was called before reading data
assert(_partition);
int numTaxaBlocks = nexusReader.GetNumTaxaBlocks();
if (numTaxaBlocks == 0)
throw XStrom("No taxa blocks were found in the data file");
unsigned cum_nchar = 0;
for (int i = 0; i < numTaxaBlocks; ++i) {
NxsTaxaBlock * taxaBlock = nexusReader.GetTaxaBlock(i);
unsigned ntax = storeTaxonNames(taxaBlock, i);
const unsigned numCharBlocks = nexusReader.GetNumCharactersBlocks(taxaBlock);
for (unsigned j = 0; j < numCharBlocks; ++j) {
NxsCharactersBlock * charBlock = nexusReader.GetCharactersBlock(taxaBlock, j);
NxsCharactersBlock::DataTypesEnum datatype = charBlock->GetOriginalDataType();
cum_nchar += storeData(ntax, cum_nchar, charBlock, datatype);
}
}
// No longer any need to store raw data from nexus file
nexusReader.DeleteBlocksFromFactories();
// Compress _data_matrix so that it holds only unique patterns (counts stored in _pattern_counts)
if (_data_matrix.empty()) {
std::cout << "No data were stored from the file \"" << filename << "\"" << std::endl;
clear();
}
else {
compressPatterns();
}
}
We use the NCL’s MultiFormatReader
class to create an object that can parse a NEXUS-formatted data file. MultiFormatReader
is a class defined by the Nexus Class Library (NCL). Your program knows about it because of the #include "ncl/nxsmultiformat.h"
at the top of the file.
The actual code for the functions and classes provided by the NCL will not be stored in your program’s executable file; it will be loaded on demand from a separate Dynamic Link Library (provided by the file libncl.so) after your program begins to execute.
The ReadFilepath
function is used to read in the data file whose name is specified in the parameter filename
. Note that we’ve supplied the argument MultiFormatReader::NEXUS_FORMAT
to the ReadFilepath
function: the NCL can read other common data file formats besides NEXUS, so you could modify your program to read (for example) FASTA-formatted data files by supplying a different argument (i.e. MultiFormatReader::FASTA_DNA_FORMAT
). (A listing of supported formats may be found in the nxsmultiformat.h header file.)
The catch block,
catch(...) {
nexusReader.DeleteBlocksFromFactories();
throw;
}
is executed only if the NCL encountered a problem reading the file. The ellipsis (…) indicates that any exception thrown by ReadFilepath
will be caught here. The catch
code deletes all data stored thus far by calling the DeleteBlocksFromFactories()
function and then re-throws the exception so that your program can catch and report the problem.
When the NCL reads a NEXUS data file, it looks for a DATA or CHARACTERS block and stores the data therein. If the creator of the data file has added a TAXA block, the NCL will associate the sequence data from the DATA or CHARACTERS block with the taxon names from that TAXA block. If no TAXA block was found in the file, the NCL will create a virtual TAXA block using the taxon names in the DATA or CHARACTERS block. In any case, our function first enumerates all TAXA blocks (real or vitual) found and then, for each TAXA block, we can enumerate all DATA or CHARACTER blocks stored under that TAXA block.
The main loop is shown below:
unsigned cum_nchar = 0;
for (int i = 0; i < numTaxaBlocks; ++i) {
NxsTaxaBlock * taxaBlock = nexusReader.GetTaxaBlock(i);
unsigned ntax = storeTaxonNames(taxaBlock, i);
const unsigned numCharBlocks = nexusReader.GetNumCharactersBlocks(taxaBlock);
for (unsigned j = 0; j < numCharBlocks; ++j) {
NxsCharactersBlock * charBlock = nexusReader.GetCharactersBlock(taxaBlock, j);
NxsCharactersBlock::DataTypesEnum datatype = charBlock->GetOriginalDataType();
cum_nchar += storeData(ntax, cum_nchar, charBlock, datatype);
}
}
For the first TAXA block read, the taxon names are stored (by the storeTaxonNames
member function) in the data member _taxon_names
. For subsequent TAXA blocks, storeTaxonNames
will simply check to make sure that taxon names are identical to those provided in the first TAXA block.
For each CHARACTER block read (DATA blocks will also be returned by the GetNumCharactersBlocks
function), the data in the matrix is stored in the data member _data_matrix
by the member function storeData
.
The NCL stores DNA data not as A
, C
, G
, and T
, but instead as state codes: 0 for A
, 1 for C
, 2 for G
, and 3 for T
. An entry in the NEXUS data file that specifies complete ambiguity (e.g. N
or {ACGT}
) would be stored as 4. Missing data and gaps are stored as -1 and -2, respectively. Ambiguities that do not represent completely missing data are stored as a value larger than 4, and the function NxsDiscreteDatatypeMapper::GetStateSetForCode
must be used to unpack this value into a set of states considered possible (see the code block for the storeData
member function to see how this is done).
Our program stores the state for each taxon/site combination as a bit field. Consider storing the DNA states A, C, G, T, ? (completely missing), R (i.e. a purine, either A or G), and Y (a pyrimidine, either C or T). We need only 4 bits (each holding a 0 or 1) to store any possible DNA state.
DNA binary decimal
A 0001 1
C 0010 2
G 0100 4
T 1000 8
? 1111 15
R 0101 5
Y 1010 10
Thus, if NCL stores 3 (for T), we will store 8 instead. If NCL stores 4 (missing data), we will store 15 instead. This allows us to handle any type of ambiguity thrown at us. Keep in mind, however, that our state codes are different than NCL’s state codes.
You will remember that our program allows a subset to contain codons rather than nucleotides. There are 64 possible codons, so in order to have a bit for each possible codon state, it is necessary to use a variable that occupies 64 bits of memory. For this reason, our program defines state_t
to be unsigned long, which (conveniently) occupies 64 bits on at 64-bit computer. As a result, our program will only be able to accommodate data subsets of type codon if it is compiled on a 64-bit computer. The following lines inside storeData
check to make sure that the number of states does not exceed the number of bits in a variable of type state_t
. If so, an exception is thrown:
// Make sure all states can be accommodated in a variable of type state_t
unsigned bits_in_state_t = 8*sizeof(state_t);
if (num_states > bits_in_state_t)
throw XStrom(boost::format("This program can only process data types with fewer than %d states") % bits_in_state_t);
The final statement in this function calls the compressPatterns()
member function, which eliminates duplicate patterns from _data_matrix
and stores the counts of each pattern in _pattern_counts
. If each sequence has 1000 nucleotides in the data file, then, before calling compressPatterns()
, each row of _data_matrix
would be a vector containing 1000 state codes and _pattern_counts
would be a vector containing 1000 elements each equalling 1. Suppose that there are only 180 distinct site patterns. After calling compressPatterns()
, each row of _data_matrix
would now have only 180 state codes and _pattern_counts
would be 180 elements long (and now many of these numbers are greater than 1).
Now that you understand how states are stored, this loop in the compressPatterns
member function, which visits each state code (sc
) in a given pattern (pc.first
), can be explained:
unsigned p = 0;
for (unsigned subset = 0; subset < nsubsets; subset++) {
for (auto & pc : _pattern_map_vect[subset]) {
_pattern_counts[p] = pc.second; // record how many sites have pattern p
_partition_key[p] = subset; // record the subset to which pattern p belongs
state_t constant_state = pc.first[0];
unsigned t = 0;
for (auto sc : pc.first) {
assert(sc > 0);
constant_state &= sc;
_data_matrix[t][p] = sc;
++t;
}
// constant_state equals 0 if polymorphic or state code of state present if monomorphic
_monomorphic[p] = constant_state;
++p;
}
The local variable p
keeps track of the pattern index over all subsets. Within each subset
, pc
is set in turn to each stored pair from _pattern_map_vect[subset]
. The first member of each pair is a vector states representing the pattern itself; the second member of each pair is the count of the number of sites having that pattern.
For each pattern, sc
is set in turn to each state in the pattern and stored in _data_matrix row t
, column p
, where t
is the index of the state within the pattern (t
stands for “taxon” because a pattern is just the state possessed by each taxon for a particular site). Before this sc
loop begins, the local variable constant_state
is set equal to the first state in the pattern. For each subsequent state, constant_state
is updated using a bitwise AND operation. If constant_state
makes it through all states in the pattern and is still greater than 0, it means that the pattern is at least potentially constant (a constant or monomorphic pattern is one in which all taxa have the same state).
For example, suppose the pattern for a 4-taxon problem is “AAAA”. Here is the succession of values possessed by constant_state
(showing only the relevant 4 bits):
A 0001
A 0001 = 0001 & 0001
A 0001 = 0001 & 0001
A 0001 = 0001 & 0001
Note that constant_state
equals 1 in the end. Now consider a variable pattern, “AAGG”:
A 0001
A 0001 = 0001 & 0001
G 0000 = 0001 & 0100
G 0000 = 0000 & 0100
Variable patterns set all bits to zero in constant_state
. What about a pattern involving some missing data, “CC??”?
C 0010
C 0010 = 0010 & 0010
? 0010 = 0010 & 1111
? 0010 = 0010 & 1111
This pattern is considered constant for state C because all the unambiguous states are C. Consider one final example, “RR??”:
R 0101
R 0101 = 0101 & 0101
? 0101 = 0101 & 1111
? 0101 = 0101 & 1111
In this case, constant_state
ends up being greater than 0, so it is not a variable site, yet there is no unambiguous state in the entire pattern. In this case, the pattern is consistant with a constant state containing all As or all Gs.
The element of _monomorphic
corresponding to pattern index p
is set to the final value of constant_state
. Thus _monomorphic
can later used to determine for which states this pattern is potentially constant. This information will be used later to compute the likelihood for the invariable sites rate heterogeneity model.
The _pattern_map_vect
data member is used by compressPatterns()
but is cleared before that function returns. It serves as a temporary work space, and does not hold onto its contents once _data_matrix
and _pattern_counts
are rebuilt. For the same reason (we have all the data stored in our own data structures now), the function MultiFormatReader::DeleteBlocksFromFactories
is called to delete data stored by NCL.