To test the new Partition
and Data
classes, we’ll need to modify strom.hpp to input partition subset specifications from the user. Begin by including some additional header files in strom.hpp and a shared pointer to a Partition
object in the Strom
class declaration:
#pragma once
#include <iostream>
<span style="color:#0000ff"><strong>#include "data.hpp"</strong></span>
#include "tree_summary.hpp"
<span style="color:#0000ff"><strong>#include "partition.hpp"</strong></span>
#include <boost/program_options.hpp>
<span style="color:#0000ff"><strong>#include <boost/filesystem.hpp></strong></span>
namespace strom {
class Strom {
public:
Strom();
~Strom();
void clear();
void processCommandLineOptions(int argc, const char * argv[]);
void run();
private:
std::string _data_file_name;
std::string _tree_file_name;
<span style="color:#0000ff"><strong>Partition::SharedPtr _partition;</strong></span>
<span style="color:#0000ff"><strong>Data::SharedPtr _data;</strong></span>
TreeSummary::SharedPtr _tree_summary;
static std::string _program_name;
static unsigned _major_version;
static unsigned _minor_version;
};
// member function bodies go here
}
Next, modify the Strom::processCommandLineOptions
function as follows:
inline void Strom::processCommandLineOptions(int argc, const char * argv[]) {
<span style="color:#0000ff"><strong>std::vector<std::string> partition_subsets;</strong></span>
boost::program_options::variables_map vm;
boost::program_options::options_description desc("Allowed options");
desc.add_options()
("help,h", "produce help message")
("version,v", "show program version")
<span style="color:#0000ff"><strong>("datafile,d", boost::program_options::value(&_data_file_name)->required(), "name of a data file in NEXUS format")</strong></span>
<span style="color:#0000ff"><strong>("treefile,t", boost::program_options::value(&_tree_file_name), "name of a tree file in NEXUS format")</strong></span>
<span style="color:#0000ff"><strong>("subset", boost::program_options::value(&partition_subsets), "a string defining a partition subset, e.g. 'first:1-1234\3' or 'default[codon:standard]:1-3702'")</strong></span>
;
boost::program_options::store(boost::program_options::parse_command_line(argc, argv, desc), vm);
try {
const boost::program_options::parsed_options & parsed = boost::program_options::parse_config_file< char >("strom.conf", desc, false);
boost::program_options::store(parsed, vm);
}
catch(boost::program_options::reading_file & x) {
std::cout << "Note: configuration file (strom.conf) not found" << std::endl;
}
boost::program_options::notify(vm);
// If user specified --help on command line, output usage summary and quit
if (vm.count("help") > 0) {
std::cout << desc << "\n";
std::exit(1);
}
// If user specified --version on command line, output version and quit
if (vm.count("version") > 0) {
std::cout << boost::str(boost::format("This is %s version %d.%d") % _program_name % _major_version % _minor_version) << std::endl;
std::exit(1);
}
<span style="color:#0000ff"><strong>// If user specified --subset on command line, break specified partition subset</strong></span>
<span style="color:#0000ff"><strong>// definition into name and character set string and add to _partition</strong></span>
<span style="color:#0000ff"><strong>if (vm.count("subset") > 0) {</strong></span>
<span style="color:#0000ff"><strong>_partition.reset(new Partition());</strong></span>
<span style="color:#0000ff"><strong>for (auto s : partition_subsets) {</strong></span>
<span style="color:#0000ff"><strong>_partition->parseSubsetDefinition(s);</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
}
Note that I’ve moved the required()
call in Strom::processCommandLineOptions
from the treefile command line option to the datafile command line option:
("datafile,d", boost::program_options::value(&_data_file_name)->required(), "name of a data file in NEXUS format")
("treefile,t", boost::program_options::value(&_tree_file_name), "name of a tree file in NEXUS format")
You could remove the required()
call from all command line options; I only use it here because this means program_options
takes care of informing the user if they forget to specify a data file on the command line or in the strom.conf file and, in this particular version, we are testing only whether our program can successfully read in data from a file; we are not testing anything about trees, so the user does not need to specify a tree file to test this version of the program.
The section at the bottom (highlighted in blue) that processes a subset specification provided by the user is worth a little explanation. We start by making sure the data member _partition
, which is a shared pointer to a Partition
, actually points to something. The shared_ptr
reset
function creates a new Partition
object and sets _partition
to point to it. You’ll notice that I’ve highlighted a line at the top of the function that declares a variable named partition_subsets
that is a vector of strings. This vector is provided to the options description for the subset option. If a single string variable were provided, only the last subset specified would be saved. However, because we specified a containiner, Boost Program Options will save all subset options specified by the user! We can thus iterate through this string vector, passing each of these strings to the Partition
parsSubsetDefinition
member function for processing.
Now add the following (highlighted) line to the Strom::clear
function:
inline void Strom::clear() {
_data_file_name = "";
_tree_file_name = "";
_tree_summary = nullptr;
<span style="color:#0000ff"><strong>_partition.reset(new Partition());</strong></span>
_data = nullptr;
}
This line ensures that a newly-constructed Strom
object has a default partition in place.
Before running it, however, you will need to create the file rbcL.nex. This file does not need to be in your project, so you can use any text editor to create it (e.g. BBEdit).
Here is a Nexus-formatted data file containing 60 sites from the large subunit of the gene encoding the enzyme RuBisCO (rbcL). Save this as the contents of rbcL.nex:
#NEXUS
[
Sites 76-135 (counting from the first site after the starting methionine triplet)
from selected taxa in the data used in: K Fučíková, PO Lewis, S Neupane, KG Karol,
and LA Lewis. 2019. Order, please! Uncertainty in the ordinal-level classification
of Chlorophyceae. PeerJ:e6899. DOI:10.7717/peerj.6899 (https://peerj.com/articles/6899/)
]
Begin data;
Dimensions ntax=14 nchar=60;
Format datatype=dna gap=-;
Matrix
Atractomorpha_echinata CCTGATTATGTTGTAAGAGACACTGATATTCTTGCTGCTTTCCGTATGACTCCTCAACCA
Bracteacoccus_aerius CCAGATTACGTAGTTAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCACAACCA
Bracteacoccus_minor CCAGATTACGTAGTTAAAGATACTGACATTTTAGCTGCATTCCGTATGACTCCACAACCA
Chlorotetraedron_incus CCTGATTACGTTATCAAAGATACTGATATTTTAGCAGCATTCCGTATGACTCCACAACCA
Chromochloris_zofingiensis CCTGATTACGTAGTTAAAGATACAGATATTTTAGCAGCTTTCCGTATGACTCCTCAACCA
Kirchneriella_aperta CCTGATTACGTAGTAAGAGAGACTGACATCTTAGCTGCATTCCGTATGACTCCACAACCA
Mychonastes_homosphaera CCAGATTACGTTGTTAAAGATACTGACATCTTAGCAGCATTCCGTATGACTCCACAACCA
Neochloris_aquatica CCAGATTATGTTGTAAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCTCAACCA
Ourococcus_multisporus CCTGATTACGTTGTAAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCACAACCA
Pediastrum_duplex CCAGATTATGTTGTAAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCTCAACCA
Pseudomuriella_schumacherensis CCTGATTACGTAGTAAAAGAAACAGACATTCTAGCTGCATTCCGTATGACTCCTCAACCA
Rotundella_rotunda CCAGATTACGTTGTAAAAGAAACTGATATTTTAGCAGCATTCCGTATGACTCCTCAACCA
Scenedesmus_obliquus CCAGATTACGTTGTAAAAGATACTGATATTTTAGCAGCATTCCGTATGACTCCACAACCA
Stigeoclonium_helveticum CCAGATTATATGGTTAAAGATACTGATATTCTTGCTGCTTTCCGTATGACTCCTCAACCT
;
end;
You will need to specify the data file and partition information either on the command line (using --datafile rbcL.nex --subset first:1-60\3 --subset second:2-60\3 --subset third:3-60\3
) or (better) create a text file named strom.conf containing these lines:
datafile = rbcL.nex
subset = first:1-60\3
subset = second:2-60\3
subset = third:3-60\3
We’ve already arranged for the data file to be in the working directory when strom is run, but users of your program may not know where strom is expecting data files to reside. You can help your future users by getting the program to tell you its current working directory when it starts up. We accomplish this by placing the following (highlighted) line just inside the Strom::run
function:
inline void Strom::run() {
std::cout << "Starting..." << std::endl;
<span style="color:#0000ff"><strong>std::cout << "Current working directory: " << boost::filesystem::current_path() << std::endl;</strong></span>
This line is what required us to compile the Boost file_system
and system
libraries. To use the current_path
function, you’ll note that we included the boost/filesystem.hpp header at the top of the strom.hpp file.
Replace the try
block in the Strom::run
function with the highlighted lines below (and also add the highlighted line reporting the current working directory):
inline void Strom::run() {
std::cout << "Starting..." << std::endl;
<span style="color:#0000ff"><strong>std::cout << "Current working directory: " << boost::filesystem::current_path() << std::endl;</strong></span>
<span style="color:#0000ff"><strong>try {</strong></span>
<span style="color:#0000ff"><strong>std::cout << "\n*** Reading and storing the data in the file " << _data_file_name << std::endl;</strong></span>
<span style="color:#0000ff"><strong>_data = Data::SharedPtr(new Data());</strong></span>
<span style="color:#0000ff"><strong>_data->setPartition(_partition);</strong></span>
<span style="color:#0000ff"><strong>_data->getDataFromFile(_data_file_name);</strong></span>
<span style="color:#0000ff"><strong></strong></span>
<span style="color:#0000ff"><strong>// Report information about data partition subsets</strong></span>
<span style="color:#0000ff"><strong>unsigned nsubsets = _data->getNumSubsets();</strong></span>
<span style="color:#0000ff"><strong>std::cout << "\nNumber of taxa: " << _data->getNumTaxa() << std::endl;</strong></span>
<span style="color:#0000ff"><strong>std::cout << "Number of partition subsets: " << nsubsets << std::endl;</strong></span>
<span style="color:#0000ff"><strong>for (unsigned subset = 0; subset < nsubsets; subset++) {</strong></span>
<span style="color:#0000ff"><strong>DataType dt = _partition->getDataTypeForSubset(subset);</strong></span>
<span style="color:#0000ff"><strong>std::cout << " Subset " << (subset+1) << " (" << _data->getSubsetName(subset) << ")" << std::endl;</strong></span>
<span style="color:#0000ff"><strong>std::cout << " data type: " << dt.getDataTypeAsString() << std::endl;</strong></span>
<span style="color:#0000ff"><strong>std::cout << " sites: " << _data->calcSeqLenInSubset(subset) << std::endl;</strong></span>
<span style="color:#0000ff"><strong>std::cout << " patterns: " << _data->getNumPatternsInSubset(subset) << std::endl;</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
<span style="color:#0000ff"><strong>}</strong></span>
catch (XStrom & x) {
std::cerr << "Strom encountered a problem:\n " << x.what() << std::endl;
}
std::cout << "\nFinished!" << std::endl;
}
The new code creates a Data
object, informs it of the partition set up by the user in the config file or command line, and then reads the data file whose name was provided by the user via the --datafile
command line option or datafile
config file option.
The _definitions
data member of the GeneticCode
class was declared static, so we need to initialize it outside of the GeneticCode
class in a source code (cpp) file. As main.cpp is the only source code file in our project (all others are header files), we must modify main.cpp to initialize _definitions.
#include <iostream>
#include "strom.hpp"
using namespace strom;
// static data member initializations
std::string Strom::_program_name = "strom";
unsigned Strom::_major_version = 1;
unsigned Strom::_minor_version = 0;
const double Node::_smallest_edge_length = 1.0e-12;
<span style="color:#0000ff"><strong>GeneticCode::genetic_code_definitions_t GeneticCode::_definitions = {</strong></span>
<span style="color:#0000ff"><strong>// codon order is alphabetical: i.e. AAA, AAC, AAG, AAT, ACA, ..., TTT</strong></span>
<span style="color:#0000ff"><strong>{"standard", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"vertmito", "KNKNTTTT*S*SMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"yeastmito", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"moldmito", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"invertmito", "KNKNTTTTSSSSMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"ciliate", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVQYQYSSSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"echinomito", "NNKNTTTTSSSSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"euplotid", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSCCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"plantplastid", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"altyeast", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLSLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"ascidianmito", "KNKNTTTTGSGSMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"altflatwormmito", "NNKNTTTTSSSSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVYY*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"blepharismamacro", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*YQYSSSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"chlorophyceanmito", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*YLYSSSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"trematodemito", "NNKNTTTTSSSSMIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSSWCWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"scenedesmusmito", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*YLY*SSS*CWCLFLF"},</strong></span>
<span style="color:#0000ff"><strong>{"thraustochytriummito", "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWC*FLF"}</strong></span>
<span style="color:#0000ff"><strong>};</strong></span>
int main(int argc, const char * argv[]) {
Strom strom;
try {
strom.processCommandLineOptions(argc, argv);
strom.run();
}
catch(std::exception & x) {
std::cerr << "Exception: " << x.what() << std::endl;
std::cerr << "Aborted." << std::endl;
}
catch(...) {
std::cerr << "Exception of unknown type!\n";
}
return 0;
}
After running the program, the output should contain these lines, indicating that there were 14 taxa, 3 partition subsets, 60 sites divided equally into each subset, and a total of 9 + 5 + 17 = 31 data patterns found in the rbcL.nex file:
Partition subset first comprises sites 1-60\3 and has type nucleotide
Partition subset second comprises sites 2-60\3 and has type nucleotide
Partition subset third comprises sites 3-60\3 and has type nucleotide
Starting...
Current working directory: "/home/pol02003/stromtutorial.github.io/steps/step-09/test"
*** Reading and storing the data in the file rbcL.nex
storing implied block: TAXA
storing read block: DATA
Number of taxa: 14
Number of partition subsets: 3
Subset 1 (first)
data type: nucleotide
sites: 20
patterns: 7
Subset 2 (second)
data type: nucleotide
sites: 20
patterns: 5
Subset 3 (third)
data type: nucleotide
sites: 20
patterns: 17
Finished!
To further test the data aspect of your program, create a file datatest.nex containing the DATA block from the rbcL.nex file plus two other blocks: (1) a CHARACTERS block containing the amino acid translation of the rbcL nucleotide data (as a test of reading protein data); and (2) a CHARACTERS block containing 5 (contrived) discrete morphological characters:
#nexus
begin data;
dimensions ntax=14 nchar=60;
format datatype=dna gap=-;
matrix
Atractomorpha_echinata CCTGATTATGTTGTAAGAGACACTGATATTCTTGCTGCTTTCCGTATGACTCCTCAACCA
Bracteacoccus_aerius CCAGATTACGTAGTTAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCACAACCA
Bracteacoccus_minor CCAGATTACCTAGTTAAAGATACTGACATTTTATCTGCATTCCGTATGACTCCACAACCA
Chlorotetraedron_incus CCTGATTACGTTATCAAAGATACTGATATTTTAGCAGCATTCCGTATGACTCCACAACCA
Chromochloris_zofingiensis CCTGATTACGTAGTTAAAGATACAGATATTTTAGCAGCTTTCCGTATGACTCCTCAACCA
Kirchneriella_aperta CCTGATTACGTAGTAAGAGAGACTGACATCTTAGCTGCATTCCGTATGACTCCACAACCA
Mychonastes_homosphaera CCAGATTACGTTGTTAAAGATACTGACATCTTAGCAGCATTCCGTATGACTCCACAACCA
Neochloris_aquatica CCAGATTATGTTGTAAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCTCAACCA
Ourococcus_multisporus CCTGATTACGTTGTAAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCACAACCA
Pediastrum_duplex CCAGATTATGTTGTAAAAGATACTGATATTTTAGCTGCATTCCGTATGACTCCTCAACCA
Pseudomuriella_schumacherensis CCTGATTACGTAGTAAAAGAAACAGACATTCTAGCTGCATTCCGTATGACTCCTCAACCA
Rotundella_rotunda CCAGATTACGTTGTAAAAGAAACTGATATTTTAGCAGCATTCCGTATGACTCCTCAACCA
Scenedesmus_obliquus CCA?ATTACGTTGTAAAAGATACTGATATTTTAGCAGCATTCCGTATGACTCCACAACCA
Stigeoclonium_helveticum CCAGATTATATGGTTAAAGATACTGATATTCTTGCTGCTTTCCGTATGACTCCTCAACCT
;
end;
begin characters;
dimensions nchar=20;
format datatype=protein;
matrix
Atractomorpha_echinata PDYVVRDTDILAAFRMTPQP
Bracteacoccus_aerius PDYVVKDTDILAAFRMTPQP
Bracteacoccus_minor PDYLVKDTDILSAFRMTPQP
Chlorotetraedron_incus PDYVIKDTDILAAFRMTPQP
Chromochloris_zofingiensis PDYVVKDTDILAAFRMTPQP
Kirchneriella_aperta PDYVVRETDILAAFRMTPQP
Mychonastes_homosphaera PDYVVKDTDILAAFRMTPQP
Neochloris_aquatica PDYVVKDTDILAAFRMTPQP
Ourococcus_multisporus PDYVVKDTDILAAFRMTPQP
Pediastrum_duplex PDYVVKDTDILAAFRMTPQP
Pseudomuriella_schumacherensis PDYVVKETDILAAFRMTPQP
Rotundella_rotunda PDYVVKETDILAAFRMTPQP
Scenedesmus_obliquus P?YVVKDTDILAAFRMTPQP
Stigeoclonium_helveticum PDYMVKDTDILAAFRMTPQP
;
end;
begin characters;
dimensions nchar=5;
format Datatype=standard symbols="0123";
matrix
Atractomorpha_echinata 01111
Bracteacoccus_aerius 11111
Bracteacoccus_minor 11111
Chlorotetraedron_incus 01111
Chromochloris_zofingiensis 01111
Kirchneriella_aperta 00111
Mychonastes_homosphaera 00111
Neochloris_aquatica 00011
Ourococcus_multisporus 00002
Pediastrum_duplex 00002
Pseudomuriella_schumacherensis 00003
Rotundella_rotunda 00003
Scenedesmus_obliquus 00003
Stigeoclonium_helveticum 00003
;
end;
Now modify your strom.conf file to look like this:
# datafile = rbcL.nex
# subset = first:1-60\3
# subset = second:2-60\3
# subset = third:3-60\3
datafile = datatest.nex
subset = rbcL[codon,plantplastid]:1-20
subset = rbcL[protein]:21-40
subset = morph[standard]:41-45
Note that I’ve commented out the commands used for reading rbcL.nex by placing a hash (#
) character at the beginning of each line. I’ve added a datafile
command to read datatest.nex and subset
commands to interpret the nucleotide data as codons, the protein data as type protein, and the morphological data as type standard. Note the site specification 1-20 rather than 1-60 for the first subset statement (storing 20 codons for each taxon rather than 60 nucleotides).
Running the program again should now yield the following output:
Partition subset rbcL comprises sites 1-20 and has type codon,plantplastid
Partition subset rbcL comprises sites 21-40 and has type protein
Partition subset morph comprises sites 41-45 and has type standard
Starting...
Current working directory: "/home/pol02003/stromtutorial.github.io/steps/step-09/test"
*** Reading and storing the data in the file datatest.nex
storing implied block: TAXA
storing read block: DATA
storing read block: CHARACTERS
storing read block: CHARACTERS
Number of taxa: 14
Number of partition subsets: 3
Subset 1 (rbcL)
data type: codon,plantplastid
sites: 20
patterns: 20
Subset 2 (rbcL)
data type: protein
sites: 20
patterns: 17
Subset 3 (morph)
data type: standard
sites: 5
patterns: 5
Finished!