7.1 The Split class

(Linux version)

< 7.0 | 7.1 | 7.2 >

The Split class provides an object that can store the set of taxa that represent descendants of a given Node object. We will provide a Split object for every Node.

Create a new class named Split in a header file named split.hpp.

#pragma once    

#include &lt;vector&gt;
#include &lt;memory&gt;
#include &lt;set&gt;
#include &lt;map&gt;
#include &lt;climits&gt;
#include &lt;cassert&gt;

namespace strom {

    class Split {
        public:
                                                                Split();
                                                                Split(const Split & other);
                                                                ~Split();

            Split &                                             operator=(const Split & other);
            bool                                                operator==(const Split & other) const;
            bool                                                operator&lt;(const Split & other) const;

            void                                                clear();
            void                                                resize(unsigned nleaves);

            typedef unsigned long                               split_unit_t;
            typedef std::vector&lt;split_unit_t&gt;                   split_t;
            typedef std::set&lt;Split&gt;                             treeid_t;
            typedef std::map&lt; treeid_t, std::vector&lt;unsigned&gt; &gt; treemap_t;
            typedef std::tuple&lt;unsigned,unsigned,unsigned&gt;      split_metrics_t;

            split_unit_t                                        getBits(unsigned unit_index) const;
            bool                                                getBitAt(unsigned leaf_index) const;
            void                                                setBitAt(unsigned leaf_index);
            void                                                addSplit(const Split & other);

            bool                                                isEquivalent(const Split & other) const;
            bool                                                isCompatible(const Split & other) const;
            bool                                                conflictsWith(const Split & other) const;

            std::string                                         createPatternRepresentation() const;
            split_metrics_t                                     getSplitMetrics() const;

        private:

            split_unit_t                                        _mask;
            split_t                                             _bits;
            unsigned                                            _bits_per_unit;
            unsigned                                            _nleaves;

        public:

            typedef std::shared_ptr&lt; Split &gt;                    SharedPtr;
    };

    // member function bodies go here
    
}   

Constructors and destructor

Here are the bodies of the two constructors and the destructor. Copy these into the split.hpp header file just above the closing curly bracket defining the end of the strom namespace.

    inline Split::Split() { 
        _mask = 0L;
        _nleaves = 0;
        _bits_per_unit = (CHAR_BIT)*sizeof(Split::split_unit_t);
        clear();
        std::cout &lt;&lt; "Constructing a Split" &lt;&lt; std::endl;
    } 

    inline Split::Split(const Split & other) {
        _mask = other._mask;
        _nleaves = other._nleaves;
        _bits_per_unit = (CHAR_BIT)*sizeof(Split::split_unit_t);
        _bits = other._bits;
        std::cout &lt;&lt; "Constructing a Split by copying an existing split" &lt;&lt; std::endl;
    }

    inline Split::~Split() {
        std::cout &lt;&lt; "Destroying a Split" &lt;&lt; std::endl;
    } 

The destructor for this class simply reports that a Split object is being destroyed. There are two constructors, each of which sets the value of the _bits_per_unit data member to the number of bits in the data type specified by split_unit_t. The calculation involves multiplying the constant CHAR_BIT (number of bits in a byte) by the number of bytes in the data type specified by split_unit_t. (CHAR_BIT is a constant provided in the climits header file.) This value will determine how many units of that data type will be needed to store a 0 or 1 for each taxon. These units are stored in the vector data member _bits, which has type split_t. (Besides split_unit_t and split_t, two other types are defined here, namely treeid_t and treemap_t, but I will postpone explaining these until later when they are used in the TreeSummary class)

For example, suppose split_unit_t is defined to be an integer of type char, which is 8 bits in length. If there are 10 taxa, then 2 char units will be required to store each split. A split that specifies that taxa having indices 0, 2, 6, and 9 are descendants of a node would be encoded as follows (note that bits within each unit are counted starting from the right, and because there are 10 taxa and 16 bits total, the leftmost 6 bits of unit 1 will always be set to 0):

  6   2 0       9   <-- taxon index
|01000101|00000010| <-- _bits vector
| unit 0 | unit 1 |

In our case, split_unit_t is defined to be an unsigned long integer rather than a char. On my computer, a char comprises only 8 bits and is the smallest integer type, whereas an unsigned long has 64 bits. Thus, we can get by with fewer units if we use unsigned long rather than char.

Default constructor versus copy constructor

The first constructor defined above is the default constructor. It sets _nleaves to 0 and calls the clear function to set everything else to their default values. The second constructor is a copy constructor. It sets the _nleaves data member to the value of _nleaves in other, the reference to which is supplied as an argument to the copy constructor. It also copies the _bits vector from other. As a result, the new split will be an exact copy of other.

The clear member function

The clear function simply sets all bits in all units of the _bits vector to 0. The for loop assigns the reference u, in turn, to each unit in _bits. It is important that the variable u is a reference: if we used for (auto u : _bits) instead, then each unit would be copied to u and any changes made to u in the body of the loop would affect the copy but not the original!

The L in 0L specifies that we are assigning a long integer (long integers typically comprise twice as many bits as an ordinary integer) with value 0 to u: there is little consequence if we forget the L because the compiler would automatically convert 0 to 0L for us, but it is always best to be as explicit as possible.

    inline void Split::clear() { 
        for (auto & u : _bits) {
            u = 0L;
        }
    } 

Operators

We will eventually need to sort splits, and sorting algorithms need to be able to ask whether one Split object is less than another Split object, and also need to know if two Split objects are identical. Special functions called operators are used to provide this functionality. Once these operators are defined, you can treat Split objects like ordinary numbers with respect to the operators involved: for example, this code would work:

(Note: this is an example; do not add this code to split.hpp!)

Split a;
Split b;
if (a < b)
  std::cout << "Split a is less than Split b";
else if (a == b)
  std::cout << "Split a equals Split b";
else
  std::cout << "Split a is greater than Split b";

The equals and less-than operators both compare the current object to another Split object for which the reference other is provided. Note that we are assuming that two Split objects are equal if their _bits vectors are equal, and one Split object is less than another Split object if its _bits vector is less than other._bits. For the less-than operator, we assert that the two _bits vectors being compared have the same number of elements (the operator< defined for standard containers such as std::vector do not provide any warning if the containers being compared are of different lengths).

We also define an assignment operator that allows us to easily copy one Split object to another. Note the difference between operator= (assignment operator) and operator== (equivalence operator): the first copies one object to another and returns a reference to the new copy, while the second asks whether one object is equal to another object and returns a bool (true or false) result.

    inline Split & Split::operator=(const Split & other) { 
        _nleaves = other._nleaves;
        _bits = other._bits;
        return *this;
    }

    inline bool Split::operator==(const Split & other) const { 
        return (_bits == other._bits);
    }

    inline bool Split::operator&lt;(const Split & other) const { 
        assert(_bits.size() == other._bits.size());
        return (_bits &lt; other._bits);
    } 

The resize member function

Given the number of taxa (_nleaves below), we can use _bits_per_unit to determine how many units the _bits vector should have. For example, if _bits_per_unit was 8, then up to 8 taxa could be represented by a _bits vector of length 1; however 9 taxa would require the _bits vector to have length 2. Dividing _nleaves - 1 by _bits_per_unit and adding 1 provides the correct number of units. If _nleaves = 8, (8-1)/8 = 0.875, which is truncated to 0 when converted to an integer, so adding 1 to (8-1)/8 yields the correct number of units (1). If _nleaves = 9, (9-1)/8 = 1.0, which is truncated to 1 when converted to an integer, and adding 1 to (9-1)/8 yields the correct number of units (2). Once nunits is computed, the function simply resizes _bits to have length nunits and then calls the clear function to ensure that all bits in all units are set to 0.

    inline void Split::resize(unsigned nleaves) { 
        _nleaves = nleaves;
        unsigned nunits = 1 + ((nleaves - 1)/_bits_per_unit);
        _bits.resize(nunits);

        // create mask used to select only those bits used in final unit
        unsigned num_unused_bits = nunits*_bits_per_unit - _nleaves;
        unsigned num_used_bits = _bits_per_unit - num_unused_bits;
        _mask = 0L;
        split_unit_t unity = 1;
        for (unsigned i = 0; i &lt; num_used_bits; i++)
            _mask |= (unity &lt;&lt; i);

        clear();
    } 

The setBitAt member function

I’ll first show you the setBitAt function, then explain it in detail below the code block:

    inline void Split::setBitAt(unsigned leaf_index) { 
        unsigned unit_index = leaf_index/_bits_per_unit;
        unsigned bit_index = leaf_index - unit_index*_bits_per_unit;
        split_unit_t unity = 1;
        split_unit_t bit_to_set = unity &lt;&lt; bit_index;
        _bits[unit_index] |= bit_to_set;
    } 

The way to set the 6th bit involves shifting the value 1 to the left 6 places using the bit shift operator. The bit shift operator is made up of two less-than symbols (<<), and is the same operator that you have used to output text to an output stream such as std::cout. Here are some examples assuming that the data type char (8 bits) is used:

(Note: this is an example; do not add this code to split.hpp!)

char zero = (char)0;
char one  = (char)1;
zero looks like this: |00000000|
one looks like this:  |00000001|
one << 0 produces |00000001|
one << 1 produces |00000010|
one << 2 produces |00000100|
one << 3 produces |00001000|
one << 4 produces |00010000|
one << 5 produces |00100000|
one << 6 produces |01000000|
one << 7 produces |10000000|

To set a specific bit in a value that may already have some bits set, use the bitwise OR operator:

(Note: this is an example; do not add this code to split.hpp!)

char x = 13;
x looks like this:     |00001101|
char y = 1 << 4
y looks like this:     |00010000|
char z = x | y;
z looks like this:     |00011101|
x |= y;
x now looks like z:    |00011101|

Note that x | y simply sets all bits that are set in either x or y, and x |= y does the same but assigns the result back to x. Given that background, the function setBitAt should now be understandable. Given a leaf_index (where leaf_index ranges from 0 to the number of taxa minus 1), the function first determines which unit to modify (unit_index), then the specific bit within that unit to set (bit_index), and finally uses the bit shift and bitwise OR operators to set the bit.

A subtle but important point about setting bits

Note that, in creating the unity variable, we cast the value 1 to be of type split_unit_t before using the bit shift operator (<<). This casting is important because the value 1 may be, by default, an integer type with fewer bits than split_unit_t type. If 1 has only 32 bits, but split_unit_t has 64, and we try 1 << 33, then the program will set bits but the bits set will necessarily be less than 33 because there are only 32 bits available. Using a version of the value 1 that has 64 bits (by casting it to split_unit_t) prevents us from shifting the 1 out of the memory allocated to store the value.

The getBits member function

The getBits member function simply returns the unit at the supplied unit_index. This accessor function is needed because the _bits vector is a private data member.

    inline Split::split_unit_t Split::getBits(unsigned unit_index) const { 
        assert(unit_index &lt; _bits.size());
        return _bits[unit_index];
    } 

The getBitAt member function

The getBitAt member function provides a way to query a split to find out if the bit corresponding to a particular taxon is set. The taxon is specified using leaf_index. See the documentation for the createPatternRepresentation function below for an explanation of how this is determined.

    inline bool Split::getBitAt(unsigned leaf_index) const { 
        unsigned unit_index = leaf_index/_bits_per_unit;
        unsigned bit_index = leaf_index - unit_index*_bits_per_unit;
        split_unit_t unity = 1;
        split_unit_t bit_to_set = unity &lt;&lt; bit_index;
        return (bool)(_bits[unit_index] & bit_to_set);
    } 

The addSplit member function

This function creates a union of the taxon set defined by this Split object and the taxon set defined by other, then assigns that union to this Split object. The for loop iterates through all elements of the _bits vector and bitwise ORs that unit with the corresponding unit from other. This function is useful when constructing splits for interior nodes of a tree: the bits set in an interior node are the union of all sets corresponding to its descendant nodes.

    inline void Split::addSplit(const Split & other) { 
        unsigned nunits = (unsigned)_bits.size();
        assert(nunits == other._bits.size());
        for (unsigned i = 0; i &lt; nunits; ++i) {
            _bits[i] |= other._bits[i];
        }
    } 

The createPatternRepresentation member function

This function creates a coded representation of a split in which * represents bits that are “on” (i.e. set to 1) and - represents “off” bits (i.e. set to 0).

Here is an example involving 10 taxa where bits_per_unit = 8 and taxa 1, 7 and 9 are “on” and taxa 0, 2-6, and 8 are “off”:

+--------+--------+
|10000010|00000010|
+--------+--------+
 _bits[0]  _bits[1]

Note the nasty tangle we have to navigate to write out a sensible split representation! First we read unit 0 from right to left: the 1 on the far left of unit 0 corresponds to the “on” bit for taxon 7. After unit 0 has been interpreted, move to unit 1: the rightmost bit in unit 1 (0) corresponds to the “off” bit for taxon 8. This split would be output as follows:

        -*-----*-*
taxon 0 ^        ^ taxon 9

Note that we must be careful to avoid looking at the extra bits: the two units together contain 16 bits, yet there are only 10 taxa, so whatever is stored in the last 6 bits of _bits[1] should not be considered. The algorithm works by letting i iterate over units, letting j iterate over the bits within each unit, and testing whether bit j is set in _bits[i].

For example, suppose i=0. Shifting 1 to the left j places yields, as j takes on values from 0 to 7:

 1 << j     j
-------------
00000001    0
00000010    1
00000100    2
00001000    3
00010000    4
00100000    5
01000000    6
10000000    7

Now consider the result of a bitwise AND between 1 << j (when j=7) and the value of _bits[0]:

10000000    1 << j, where j=7
10000010    unit[0]
-----------------------------------
10000000    (1 << 7) & unit[0]

The bitwise AND operator (&) yields 1 for a bit position whenever the two numbers being compared both have a 1 in that position and produces 0 otherwise. The bitwise AND operator can thus be used to determine whether any particular bit position contains a 1 or 0.

    inline std::string Split::createPatternRepresentation() const { 
        std::string s;
        unsigned ntax_added = 0;
        for (unsigned i = 0; i &lt; _bits.size(); ++i) {
            for (unsigned j = 0; j &lt; _bits_per_unit; ++j) {
                split_unit_t bitmask = ((split_unit_t)1 &lt;&lt; j);
                bool bit_is_set = ((_bits[i] & bitmask) &gt; (split_unit_t)0);
                if (bit_is_set)
                    s += '*';
                else
                    s += '-';
                if (++ntax_added == _nleaves)
                    break;
            }
        }
        return s;
    } 

The isEquivalent member function

The isEquivalent member function returns true if this split and other are equivalent. Equivalence is less strict than equality. Two splits can be equivalent even if they are not equal because of the position of the root. For example, splits a and b are both equal and equivalent because a & b == a == b:

split a: -***---*--
split b: -***---*--
  a & b: -***---*-- (equals both a and b)

Splits c and d are not equal but they are nevertheless equivalent: they specify the same bipartition of taxa because one is the complement of the other. Thus, equivalence means c & ~d = c and ~c & d = d:

split c: -****-*---
split d: *----*-***
     ~c: *----*-***
     ~d: -****-*---
  c & d: ----------
 c & ~d: -****-*--- (equal to c)
 ~c & d: *----*-*** (equal to d)

Splits e and f, on the other hand, are neither equal nor equivalent because e & f equals neither e nor f, e & ~f does not equal e, and ~e & f does not equal f:

split e: -***---*--
split f: ---***---*
  e & f: ---*------ (equals neither e nor f)
 e & ~f: -**----*-- (not equal to e)
 ~e & f: ----**---* (not equal to f)

Here is the body of the isEquivalent function:

    inline bool Split::isEquivalent(const Split & other) const { 
        unsigned nunits = (unsigned)_bits.size();
        assert(nunits &gt; 0);

        // polarity 1 means root is on the same side of both splits
        // polarity 2 means they are inverted relative to one another
        unsigned polarity = 0;
        for (unsigned i = 0; i &lt; nunits; ++i) {
            split_unit_t a = _bits[i];
            split_unit_t b = other._bits[i];
            bool a_equals_b = (a == b);
            bool a_equals_inverse_b = (a == ~b);
            if (i == nunits - 1) {
                a_equals_inverse_b = (a == (~b & _mask));
            }
            bool ok = (a_equals_b || a_equals_inverse_b);
            if (ok) {
                if (polarity == 0) {
                    // First unit examined determines polarity
                    if (a_equals_b)
                        polarity = 1;
                    else
                        polarity = 2;
                }
                else {
                    // Polarity determined by first unit used for all subsequent units
                    if (polarity == 1 && !a_equals_b )
                        return false;
                    else if (polarity == 2 && !a_equals_inverse_b )
                        return false;
                }
            }
            else {
                return false;
            }
        }

        // All of the units were equivalent, so that means the splits are equivalent
        return true;
    } 

The isCompatible member function

The isCompatible member function returns true if this split and other are compatible , which means that they can co-exist in the same tree. The two splits a and b are incompatible if a & b is nonzero and also not equal to either a or b. For example, these two splits

  split a: -***---*--
  split b: ----***--*
    a & b: ----------

are compatible, because a & b == 0.

The two splits below are also compatible, even though a & b != 0, because a & b == b:

  split a: -****-*---
  split b: --**--*---
    a & b: --**--*---

These two splits, on the other hand, are not compatible because a & b != 0 and a & b is not equal to either a or b:

  split a: -***---*--
  split b: ---***---*
    a & b: ---*------

Here is the body of the isCompatible function:

    inline bool Split::isCompatible(const Split & other) const { 
        for (unsigned i = 0; i &lt; _bits.size(); ++i) {
            split_unit_t a = _bits[i];
            split_unit_t b = other._bits[i];
            split_unit_t a_and_b = (a & b);
            bool equals_a   = (a_and_b == a);
            bool equals_b   = (a_and_b == b);
            if (a_and_b && !(equals_a || equals_b)) {
                // A failure of any unit to be compatible makes the entire split incompatible
                return false;
            }
        }

        // None of the units were incompatible, so that means the splits are compatible
        return true;
    }

The conflictsWith member function

The conflictsWith member function simply calls isCompatible(other) and returns the opposite.

    inline bool Split::conflictsWith(const Split & other) const { 
        return !isCompatible(other);
    } 

< 7.0 | 7.1 | 7.2 >