The Split
class provides an object that can store the set of taxa that represent descendants of a given Node
object. We will provide a Split
object for every Node
.
Create a new class named Split
in a header file named split.hpp.
#pragma once
#include <vector>
#include <memory>
#include <set>
#include <map>
#include <climits>
#include <cassert>
namespace strom {
class Split {
public:
Split();
Split(const Split & other);
~Split();
Split & operator=(const Split & other);
bool operator==(const Split & other) const;
bool operator<(const Split & other) const;
void clear();
void resize(unsigned nleaves);
typedef unsigned long split_unit_t;
typedef std::vector<split_unit_t> split_t;
typedef std::set<Split> treeid_t;
typedef std::map< treeid_t, std::vector<unsigned> > treemap_t;
typedef std::tuple<unsigned,unsigned,unsigned> split_metrics_t;
split_unit_t getBits(unsigned unit_index) const;
bool getBitAt(unsigned leaf_index) const;
void setBitAt(unsigned leaf_index);
void addSplit(const Split & other);
bool isEquivalent(const Split & other) const;
bool isCompatible(const Split & other) const;
bool conflictsWith(const Split & other) const;
std::string createPatternRepresentation() const;
split_metrics_t getSplitMetrics() const;
private:
split_unit_t _mask;
split_t _bits;
unsigned _bits_per_unit;
unsigned _nleaves;
public:
typedef std::shared_ptr< Split > SharedPtr;
};
// member function bodies go here
}
Here are the bodies of the two constructors and the destructor. Copy these into the split.hpp header file just above the closing curly bracket defining the end of the strom
namespace.
inline Split::Split() {
_mask = 0L;
_nleaves = 0;
_bits_per_unit = (CHAR_BIT)*sizeof(Split::split_unit_t);
clear();
std::cout << "Constructing a Split" << std::endl;
}
inline Split::Split(const Split & other) {
_mask = other._mask;
_nleaves = other._nleaves;
_bits_per_unit = (CHAR_BIT)*sizeof(Split::split_unit_t);
_bits = other._bits;
std::cout << "Constructing a Split by copying an existing split" << std::endl;
}
inline Split::~Split() {
std::cout << "Destroying a Split" << std::endl;
}
The destructor for this class simply reports that a Split
object is being destroyed. There are two constructors, each of which sets the value of the _bits_per_unit
data member to the number of bits in the data type specified by split_unit_t
. The calculation involves multiplying the constant CHAR_BIT
(number of bits in a byte) by the number of bytes in the data type specified by split_unit_t
. (CHAR_BIT
is a constant provided in the climits header file.) This value will determine how many units of that data type will be needed to store a 0 or 1 for each taxon. These units are stored in the vector data member _bits
, which has type split_t
. (Besides split_unit_t
and split_t
, two other types are defined here, namely treeid_t
and treemap_t
, but I will postpone explaining these until later when they are used in the TreeSummary
class)
For example, suppose split_unit_t
is defined to be an integer of type char
, which is 8 bits in length. If there are 10 taxa, then 2 char
units will be required to store each split. A split that specifies that taxa having indices 0, 2, 6, and 9 are descendants of a node would be encoded as follows (note that bits within each unit are counted starting from the right, and because there are 10 taxa and 16 bits total, the leftmost 6 bits of unit 1 will always be set to 0):
6 2 0 9 <-- taxon index
|01000101|00000010| <-- _bits vector
| unit 0 | unit 1 |
In our case, split_unit_t
is defined to be an unsigned long integer
rather than a char
. On my computer, a char
comprises only 8 bits and is the smallest integer type, whereas an unsigned long
has 64 bits. Thus, we can get by with fewer units if we use unsigned long
rather than char
.
The first constructor defined above is the default constructor. It sets _nleaves
to 0 and calls the clear
function to set everything else to their default values. The second constructor is a copy constructor. It sets the _nleaves
data member to the value of _nleaves
in other
, the reference to which is supplied as an argument to the copy constructor. It also copies the _bits
vector from other
. As a result, the new split will be an exact copy of other
.
The clear
function simply sets all bits in all units of the _bits
vector to 0. The for loop assigns the reference u
, in turn, to each unit in _bits
. It is important that the variable u
is a reference: if we used for (auto u : _bits)
instead, then each unit would be copied to u
and any changes made to u
in the body of the loop would affect the copy but not the original!
The L
in 0L
specifies that we are assigning a long integer (long integers typically comprise twice as many bits as an ordinary integer) with value 0 to u
: there is little consequence if we forget the L
because the compiler would automatically convert 0
to 0L
for us, but it is always best to be as explicit as possible.
inline void Split::clear() {
for (auto & u : _bits) {
u = 0L;
}
}
We will eventually need to sort splits, and sorting algorithms need to be able to ask whether one Split
object is less than another Split
object, and also need to know if two Split
objects are identical. Special functions called operators are used to provide this functionality. Once these operators are defined, you can treat Split
objects like ordinary numbers with respect to the operators involved: for example, this code would work:
(Note: this is an example; do not add this code to split.hpp!)
Split a;
Split b;
if (a < b)
std::cout << "Split a is less than Split b";
else if (a == b)
std::cout << "Split a equals Split b";
else
std::cout << "Split a is greater than Split b";
The equals and less-than operators both compare the current object to another Split
object for which the reference other
is provided. Note that we are assuming that two Split
objects are equal if their _bits
vectors are equal, and one Split
object is less than another Split
object if its _bits
vector is less than other._bits
. For the less-than operator, we assert that the two _bits
vectors being compared have the same number of elements (the operator<
defined for standard containers such as std::vector
do not provide any warning if the containers being compared are of different lengths).
We also define an assignment operator that allows us to easily copy one Split
object to another. Note the difference between operator=
(assignment operator) and operator==
(equivalence operator): the first copies one object to another and returns a reference to the new copy, while the second asks whether one object is equal to another object and returns a bool (true or false) result.
inline Split & Split::operator=(const Split & other) {
_nleaves = other._nleaves;
_bits = other._bits;
return *this;
}
inline bool Split::operator==(const Split & other) const {
return (_bits == other._bits);
}
inline bool Split::operator<(const Split & other) const {
assert(_bits.size() == other._bits.size());
return (_bits < other._bits);
}
Given the number of taxa (_nleaves
below), we can use _bits_per_unit
to determine how many units the _bits
vector should have. For example, if _bits_per_unit
was 8, then up to 8 taxa could be represented by a _bits
vector of length 1; however 9 taxa would require the _bits
vector to have length 2. Dividing _nleaves - 1
by _bits_per_unit
and adding 1 provides the correct number of units. If _nleaves
= 8, (8-1)/8 = 0.875, which is truncated to 0 when converted to an integer, so adding 1 to (8-1)/8 yields the correct number of units (1). If _nleaves
= 9, (9-1)/8 = 1.0, which is truncated to 1 when converted to an integer, and adding 1 to (9-1)/8 yields the correct number of units (2). Once nunits
is computed, the function simply resizes _bits
to have length nunits
and then calls the clear
function to ensure that all bits in all units are set to 0.
inline void Split::resize(unsigned nleaves) {
_nleaves = nleaves;
unsigned nunits = 1 + ((nleaves - 1)/_bits_per_unit);
_bits.resize(nunits);
// create mask used to select only those bits used in final unit
unsigned num_unused_bits = nunits*_bits_per_unit - _nleaves;
unsigned num_used_bits = _bits_per_unit - num_unused_bits;
_mask = 0L;
split_unit_t unity = 1;
for (unsigned i = 0; i < num_used_bits; i++)
_mask |= (unity << i);
clear();
}
I’ll first show you the setBitAt
function, then explain it in detail below the code block:
inline void Split::setBitAt(unsigned leaf_index) {
unsigned unit_index = leaf_index/_bits_per_unit;
unsigned bit_index = leaf_index - unit_index*_bits_per_unit;
split_unit_t unity = 1;
split_unit_t bit_to_set = unity << bit_index;
_bits[unit_index] |= bit_to_set;
}
The way to set the 6th bit involves shifting the value 1 to the left 6 places using the bit shift operator. The bit shift operator is made up of two less-than symbols (<<
), and is the same operator that you have used to output text to an output stream such as std::cout
. Here are some examples assuming that the data type char (8 bits) is used:
(Note: this is an example; do not add this code to split.hpp!)
char zero = (char)0;
char one = (char)1;
zero looks like this: |00000000|
one looks like this: |00000001|
one << 0 produces |00000001|
one << 1 produces |00000010|
one << 2 produces |00000100|
one << 3 produces |00001000|
one << 4 produces |00010000|
one << 5 produces |00100000|
one << 6 produces |01000000|
one << 7 produces |10000000|
To set a specific bit in a value that may already have some bits set, use the bitwise OR operator:
(Note: this is an example; do not add this code to split.hpp!)
char x = 13;
x looks like this: |00001101|
char y = 1 << 4
y looks like this: |00010000|
char z = x | y;
z looks like this: |00011101|
x |= y;
x now looks like z: |00011101|
Note that x | y
simply sets all bits that are set in either x
or y
, and x |= y
does the same but assigns the result back to x
. Given that background, the function setBitAt
should now be understandable. Given a leaf_index
(where leaf_index
ranges from 0 to the number of taxa minus 1), the function first determines which unit to modify (unit_index
), then the specific bit within that unit to set (bit_index
), and finally uses the bit shift and bitwise OR operators to set the bit.
Note that, in creating the unity
variable, we cast the value 1 to be of type split_unit_t
before using the bit shift operator (<<
). This casting is important because the value 1 may be, by default, an integer type with fewer bits than split_unit_t
type. If 1 has only 32 bits, but split_unit_t
has 64, and we try 1 << 33
, then the program will set bits but the bits set will necessarily be less than 33 because there are only 32 bits available. Using a version of the value 1 that has 64 bits (by casting it to split_unit_t
) prevents us from shifting the 1 out of the memory allocated to store the value.
The getBits
member function simply returns the unit at the supplied unit_index
. This
accessor function is needed because the _bits
vector is a private data member.
inline Split::split_unit_t Split::getBits(unsigned unit_index) const {
assert(unit_index < _bits.size());
return _bits[unit_index];
}
The getBitAt
member function provides a way to query a split to find out if the bit
corresponding to a particular taxon is set. The taxon is specified using leaf_index
.
See the documentation for the createPatternRepresentation
function below for an explanation
of how this is determined.
inline bool Split::getBitAt(unsigned leaf_index) const {
unsigned unit_index = leaf_index/_bits_per_unit;
unsigned bit_index = leaf_index - unit_index*_bits_per_unit;
split_unit_t unity = 1;
split_unit_t bit_to_set = unity << bit_index;
return (bool)(_bits[unit_index] & bit_to_set);
}
This function creates a union of the taxon set defined by this Split
object and the taxon set defined by other
, then assigns that union to this Split
object. The for loop iterates through all elements of the _bits
vector and bitwise ORs that unit with the corresponding unit from other. This function is useful when constructing splits for interior nodes of a tree: the bits set in an interior node are the union of all sets corresponding to its descendant nodes.
inline void Split::addSplit(const Split & other) {
unsigned nunits = (unsigned)_bits.size();
assert(nunits == other._bits.size());
for (unsigned i = 0; i < nunits; ++i) {
_bits[i] |= other._bits[i];
}
}
This function creates a coded representation of a split in which *
represents bits that are “on” (i.e. set to 1) and -
represents “off” bits (i.e. set to 0).
Here is an example involving 10 taxa where bits_per_unit = 8
and taxa 1, 7 and 9 are “on” and taxa 0, 2-6, and 8 are “off”:
+--------+--------+
|10000010|00000010|
+--------+--------+
_bits[0] _bits[1]
Note the nasty tangle we have to navigate to write out a sensible split representation! First we read unit 0 from right to left: the 1 on the far left of unit 0 corresponds to the “on” bit for taxon 7. After unit 0 has been interpreted, move to unit 1: the rightmost bit in unit 1 (0) corresponds to the “off” bit for taxon 8. This split would be output as follows:
-*-----*-*
taxon 0 ^ ^ taxon 9
Note that we must be careful to avoid looking at the extra bits: the two units together contain 16 bits, yet there are only 10 taxa, so whatever is stored in the last 6 bits of _bits[1]
should not be considered. The algorithm works by letting i
iterate over units, letting j
iterate over the bits within each unit, and testing whether bit j
is set in _bits[i]
.
For example, suppose i=0. Shifting 1 to the left j places yields, as j takes on values from 0 to 7:
1 << j j
-------------
00000001 0
00000010 1
00000100 2
00001000 3
00010000 4
00100000 5
01000000 6
10000000 7
Now consider the result of a bitwise AND between 1 << j
(when j=7
) and the value of _bits[0]
:
10000000 1 << j, where j=7
10000010 unit[0]
-----------------------------------
10000000 (1 << 7) & unit[0]
The bitwise AND operator (&
) yields 1 for a bit position whenever the two numbers being compared both have a 1 in that position and produces 0 otherwise. The bitwise AND operator can thus be used to determine whether any particular bit position contains a 1 or 0.
inline std::string Split::createPatternRepresentation() const {
std::string s;
unsigned ntax_added = 0;
for (unsigned i = 0; i < _bits.size(); ++i) {
for (unsigned j = 0; j < _bits_per_unit; ++j) {
split_unit_t bitmask = ((split_unit_t)1 << j);
bool bit_is_set = ((_bits[i] & bitmask) > (split_unit_t)0);
if (bit_is_set)
s += '*';
else
s += '-';
if (++ntax_added == _nleaves)
break;
}
}
return s;
}
The isEquivalent
member function returns true if this split and other
are equivalent. Equivalence is less strict than equality. Two splits can be equivalent even if they are not equal because of the position of the root. For example, splits a
and b
are both equal and equivalent because a & b == a == b
:
split a: -***---*--
split b: -***---*--
a & b: -***---*-- (equals both a and b)
Splits c
and d
are not equal but they are nevertheless equivalent: they specify the same bipartition of taxa because one is the complement of the other. Thus, equivalence means c & ~d = c
and ~c & d = d
:
split c: -****-*---
split d: *----*-***
~c: *----*-***
~d: -****-*---
c & d: ----------
c & ~d: -****-*--- (equal to c)
~c & d: *----*-*** (equal to d)
Splits e
and f
, on the other hand, are neither equal nor equivalent because e & f
equals neither e
nor f
, e & ~f
does not equal e
, and ~e & f
does not equal f
:
split e: -***---*--
split f: ---***---*
e & f: ---*------ (equals neither e nor f)
e & ~f: -**----*-- (not equal to e)
~e & f: ----**---* (not equal to f)
Here is the body of the isEquivalent
function:
inline bool Split::isEquivalent(const Split & other) const {
unsigned nunits = (unsigned)_bits.size();
assert(nunits > 0);
// polarity 1 means root is on the same side of both splits
// polarity 2 means they are inverted relative to one another
unsigned polarity = 0;
for (unsigned i = 0; i < nunits; ++i) {
split_unit_t a = _bits[i];
split_unit_t b = other._bits[i];
bool a_equals_b = (a == b);
bool a_equals_inverse_b = (a == ~b);
if (i == nunits - 1) {
a_equals_inverse_b = (a == (~b & _mask));
}
bool ok = (a_equals_b || a_equals_inverse_b);
if (ok) {
if (polarity == 0) {
// First unit examined determines polarity
if (a_equals_b)
polarity = 1;
else
polarity = 2;
}
else {
// Polarity determined by first unit used for all subsequent units
if (polarity == 1 && !a_equals_b )
return false;
else if (polarity == 2 && !a_equals_inverse_b )
return false;
}
}
else {
return false;
}
}
// All of the units were equivalent, so that means the splits are equivalent
return true;
}
The isCompatible
member function returns true if this split and other
are compatible , which means that they can co-exist in the same tree. The two splits a
and b
are incompatible if a & b
is nonzero and also not equal to either a
or b
. For example, these two splits
split a: -***---*--
split b: ----***--*
a & b: ----------
are compatible, because a & b == 0
.
The two splits below are also compatible, even though a & b != 0
, because a & b == b
:
split a: -****-*---
split b: --**--*---
a & b: --**--*---
These two splits, on the other hand, are not compatible because a & b != 0
and a & b
is not equal to either a
or b
:
split a: -***---*--
split b: ---***---*
a & b: ---*------
Here is the body of the isCompatible
function:
inline bool Split::isCompatible(const Split & other) const {
for (unsigned i = 0; i < _bits.size(); ++i) {
split_unit_t a = _bits[i];
split_unit_t b = other._bits[i];
split_unit_t a_and_b = (a & b);
bool equals_a = (a_and_b == a);
bool equals_b = (a_and_b == b);
if (a_and_b && !(equals_a || equals_b)) {
// A failure of any unit to be compatible makes the entire split incompatible
return false;
}
}
// None of the units were incompatible, so that means the splits are compatible
return true;
}
The conflictsWith
member function simply calls isCompatible(other)
and returns the opposite.
inline bool Split::conflictsWith(const Split & other) const {
return !isCompatible(other);
}