Coalescing ranges that have the same label

Question

I am working with some data that requires keeping track of different types of ranges. There are three key attributes to keep track of:

type (str): the "type" of the range
start (int): the value at which the range starts
stop (int): the value at which the range stops

e.g.

"TypeA"    100    200

I have to deal with a lot of these ranges and as such, storing every fragment is not the most efficient (regarding storage) nor is it the most performant when checking if a value is bounded by some known range.

So ideally overlapping ranges (two or more ranges of the same type with intersecting boundaries) would be concatenated into a singular range e.g.

"TypeA"    100    200
"TypeA"    150    300

becomes

"TypeA"    100    300

Since there are also different types, having something that handles a collection of such ranges would also be useful.

Below are two classes for this task LabeledRange (for a singular range), LabeledRanges for handling a collection of LabeledRanges

When working with large amounts of ranges, even in parallel, these classes end up lagging behind. So I would appreciate review with performance in mind. Additionally, the use of external libraries, e.g. numba is welcomed as I have found it does not work so well with these classes.

Resources

git repo of the source files as well as a jupyter notebook to play in if that helps

playground notebook a look at the current notebook (with output) to see how these classes are used in the wild.

stress test notebook for seeing where this lags. (more info at bottom of this question)

note: there is a branch on the repo called "numba" which tries to get numba to work with this concept... with limited success

Known Feature / bug

The class LabeledRanges has a method simplify which is meant to aggregate overlapping regions. This function can lag greatly.

This method utilizes a helper method called append which is used in the constructor of the class. Because of this, when constructing an instance of LabeledRanges, one may not get the most simplified version (depending on the order of the ranges passed in).

Classes

LabeledRange

from numbers import Number
from copy import copy, deepcopy

class LabeledRange:
    '''
    A helper class for keeping track of the start / stop of a given class in a
    sequence
    '''

    def __init__(self, name:str, start:int, stop:int):
        '''
        Arguments:
            name (str): name of the class
            start (int): the index at which the class starts
            stop (int): the index at which the class stops
        '''
        self.name  = name
        self.start = int(start)
        self.stop  = int(stop)


    '''
    Various conversions from LabeledRange to pythonic types
    '''
    def as_list(self):
        return [self.name, self.start, self.stop]
    def as_str_list(self):
        return [str(e) for e in self.as_list()]
    def as_tuple(self):
        return tuple(self.as_list())
    def as_dict(self):
        return dict(zip(['name', 'start', 'stop'], self.as_list()))
    def as_txt(self, delim='\t', newline='\n', newline_q=True):
        return delim.join(self.as_str_list()) + (newline if newline_q else '')
    def as_csv(self, newline='\n', newline_q=True):
        return self.as_txt(',', newline, newline_q)
    def as_tsv(self, newline='\n', newline_q=True):
        return self.as_txt('\t', newline, newline_q)
    def __hash__(self):
        return hash(self.as_tuple())
    def __repr__(self):
        return '{}{}'.format(self.__class__.__name__, self.as_tuple())
    def __str__(self):
        return self.__repr__()
    def __len__(self):
        return self.stop - self.start
    def __iter__(self):
        return (e for e in self.as_list())
    def __eq__(self, other):
        if not isinstance(other, LabeledRange):
            return False
        return (self.name  == other.name) and \
               (self.start == other.start) and \
               (self.stop  == other.stop)
    def __ne__(self, other):
        return not self.__eq__(other)

    def __contains__(self, other):
        '''
        Arguments:
            other (LabeledRange / int): If other is a LabeledRange, only true
                if other is bounded by self. If other is a number, true if
                self.start <= other <= self.stop
        Returns:
            results (bool)
        '''
        if isinstance(other, Number):
            return self.start <= other <= self.stop
        if not isinstance(other, LabeledRange):
            return False
        if not other.same_q(self):
            return False
        return other.start in self and other.stop in self


    def same_q(self, other):
        '''Whether or not other is of the same class'''
        if not isinstance(other, LabeledRange):
            return False
        return self.name == other.name

    def min(self, other):
        if not self.same_q(other):
            return min([self.start, self.stop])
        return min([self.start, self.stop, other.start, other.stop])

    def max(self, other):
        if not self.same_q(other):
            return max([self.start, self.stop])
        return max([self.start, self.stop, other.start, other.stop])

    def overlap_q(self, other):
        if not self.same_q(other):
            return False
        return any([
            other.start in self, other.stop in self,
            self.start in other, self.stop in other
        ])

    def __add__(self, other):
        '''
        Attempt to combine two ranges together.
        '''
        if not isinstance(other, LabeledRange):
            raise ValueError('{} is not a LabeledRange'.format(other))
        if not self.overlap_q(other):
            return LabeledRanges([deepcopy(self), deepcopy(other)])
        else:
           return LabeledRange(self.name, self.min(other), self.max(other))

    def __iadd__(self, other):
        if self.overlap_q(other):
            self.start = self.min(other)
            self.stop  = self.max(other)
        return self

LabeledRanges



class LabeledRanges:
    def __init__(self, ranges:list=[]):
        self.ranges = ranges

    def classes(self):
        return set([rng.name for rng in self])
    def as_list(self):
        return [rng.as_list() for rng in self]
    def as_tuple(self):
        return tuple([rng.as_tuple() for rng in self])


    @property
    def ranges(self):
        return self._ranges

    @ranges.setter
    def ranges(self, ranges):
        rngs = []
        for rng in ranges:
            if isinstance(rng, LabeledRange):
                rngs.append(rng)
            else:
                rngs.append(LabeledRange(*rng))
        self._ranges = list(set(rngs))

    @ranges.deleter
    def ranges(self):
        del self._ranges


    def __iter__(self):
        return (rng for rng in self.ranges)

    def __getitem__(self, key):
        return self.ranges[key]

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        s = '{}('.format(self.__class__.__name__)
        if len(self.ranges) == 0:
            return s + ')'
        else:
            s += '\n'
        for i, rng in enumerate(self.ranges):
            s += '\t' + repr(rng) + '\n'
        s += ')'
        return s



    def __eq__(self, other):
        if isinstance(other, LabeledRanges):
            return all([rng in other for rng in self.ranges]) and \
                   all([rng in self for rng in other.ranges])
        return False

    def __ne__(self, other):
        return not self.__eq__(other)


    def __contains__(self, other):
        if isinstance(other, str):
            for rng in self:
                if rng.name == other.name:
                    return True
            return False

        if isinstance(other, LabeledRange):
            for rng in self:
                if other in rng:
                    return True
            return False

        if isinstance(other, LabeledRanges):
            for rng in other:
                if not self.__contains__(rng):
                    return False
            return True
        return False

    def overlap_q(self, other):
        for rng in self.ranges:
            if rng.overlap_q(other):
                return True
        return False

    def append(self, other):

        # Append a range
        if isinstance(other, LabeledRange):
            found_q = False
            for rng in self:
                if rng.overlap_q(other):
                    found_q = True
                    rng += other
            if not found_q:
                self.ranges.append(other)

        # Map each range to the above block
        if isinstance(other, LabeledRanges):
            for rng in other:
                self.append(other)

        return self


    def __give__(self, other):
        if isinstance(other, LabeledRange):
            self.append(other)

        if isinstance(other, LabeledRanges):
            for rng in other:
                self.append(rng)

        return self.simplify()

    def simplify(self):
        for rng in self:
            self.append(rng)
        self.ranges = list(set(self.ranges))
        return self

    def __add__(self, other):
        cp = deepcopy(self)
        cp.__give__(other)
        return cp

    def __iadd__(self, other):
        self.__give__(other)
        return self

    def __radd__(self, other):
        if not isinstance(other, LabeledRange) or not isinstance(other, LabeledRanges):
            return self
        self.__iadd__(other)
        return self

Stress Test

This "stress test" is a simplified version of something that I am using this code for.

In short:

there are >1Mil "known" ranges with types and spans
these ranges are read in and converted into LabeledRanges (the then aggregated ranges are then saved in a file that can be more quickly loaded to memory)
there are then ~1Mil "unknown" ranges (corresponding to regions from the previously mentioned "known ranges"
I need to know, per integer, the "types" associated with that integer based on the "known" ranges

The provided stress test notebook for seeing where this lags. (more info at bottom of this question) demonstrates how intensive this is.

In truth, I have many sets of such cases, which I have "sharded" into subsets of the total range to help with performance but it still takes way to long.

make random data

# only python libraries used for conveninece to others looking at this

# custom classes for optimization, from `cloned git repo pip install -e ./`
from lrng import LabeledRange, LabeledRanges

# for making dummy data
from random import randint, choice

# for multiprocessing
import os
from multiprocessing import Pool, Value

# how many range types
types = ['Type {}'.format(t) for t in 'AB']

# ranges to make
num_ranges = 1000000

# minimum range length
min_len_val = 2

# maximum range length
max_len_val = 1000

# total length of the range
total_len = 100000


# generate random ranges
ranges = []
for i in range(num_ranges):
    _type = choice(types)
    _offset = randint(0, total_len-max_len_val)
    _len = randint(min_len_val, max_len_val)
    ranges.append([_type, _offset, _offset+_len])


# simplify could use some performance improvements
# note: using the constructor will not always result in the most simplified version depneding on the order of how 
# ranges are aggregated
known_ranges = LabeledRanges(ranges)#.simplify()


# generate random "unknown ranges"
ranges_to_label = []
for i in range(num_ranges):
    _offset = randint(0, total_len-max_len_val)
    _len = randint(min_len_val, max_len_val)
    ranges_to_label.append([_offset, _offset+_len])

stress test methods

# for printing
_pool_count = Value('i', -1, lock=True)


# whether or not the current "known range" (_range) is relevant for the current unknown stretch
def _keep_range(start, stop, _range):
    _type, range_start, range_stop = _range
    if range_stop < start: return
    if range_start > stop: return
    if not (
        start       <= range_start <= stop       or \
        start       <= range_stop  <= stop       or \
        range_start <= start       <= range_stop or \
        range_start <= stop        <= range_stop
    ):
        return
    return _range

# extract relevant reference ranges
def label(unknown_range, reference_ranges, processes=1, total=None, throttle=50):
    start, stop = unknown_range
    if processes is None: 
        processes = os.cpu_count()
    if processes == 1:
        res = [_keep_range(start, stop, _range) for _range in reference_ranges]
    else:
        with Pool(processes=processes) as pool:
            res = pool.starmap(_keep_range, [(start, stop, _range) for _range in reference_ranges])

    res = list(filter(lambda r : r is not None, res))

    with _pool_count.get_lock():
        _pool_count.value += 1
    if total is not None:
        if _pool_count.value % throttle == 0:
            frac = total
            print('\r                                                               ', flush=True, end='')
            print('\r{}%'.format(_pool_count.value / total * 100), flush=True, end='')

    return LabeledRanges(res)    

# do this en-masse
def label_all(unknown_ranges, reference_ranges, processes=None):
    if processes is None: 
        processes = os.cpu_count()

    with Pool(processes=processes) as pool:
        total = len(unknown_ranges)
        sargs = [(u_rng, reference_ranges, 1, total, 50) for u_rng in unknown_ranges]
        res = pool.starmap(label, sargs)

    print('\r{}%'.format(100), flush=True, end='')
    print('\ndone', flush=False, end='\n')
    return res

call stress test

# this will use all of your CPUs if you do not specify otherwise 
res = label_all(ranges_to_label, known_ranges)

I have skimmed over your notebook but I cannot see a place where performance would matter. Is there a way to create a short example which can be used to test possible performance improvements? — AlexV
– AlexV, Commented Jun 26, 2019 at 12:59
@AlexV stress test notebook for you. I will add relevant info to to question — SumNeuron
– SumNeuron, Commented Jun 26, 2019 at 13:42
@AlexV not sure caching would work for this, but I finished the numba version. I think the best case for performance is to use simplify (or now coalesce from lrng.numba) to reduce the number ranges to compare against. The subsequent problem - labeling a range based off some references - can be improved similarly. — SumNeuron
– SumNeuron, Commented Jul 3, 2019 at 9:26
@AlexV will do maybe this weekend. I'm just disappointed that even with reasonably (but on the larger sized) input that this takes a while even after jumping through the nopython mode numba hoops. I just thought of a hack which will help me for this particular case, but aside from that nothing. I would still appreciate your eye on the _coalesce function to see why reduction requires may not be done in a single pass. It may just be that depending on the ranges and the order, such is the nature of the problem :P Unrelated, know of any libraries that do something similar to this? — SumNeuron
– SumNeuron, Commented Jul 4, 2019 at 18:14

SumNeuron · Accepted Answer · 2019-07-05 15:47:34Z

Like above all code is available at the repo.

Performance was increased by accommodating the requirements of numba's nopython mode.

There are three functions behind coalesce:

coalesce: wrapper function around __coalesce. Main purpose is convert user friendly input (e.g. LabeledRanges) to numba friend input (numpy integer array)
__coalesce: while-loop wrapper over _coalesce until ranges can be simplified no further _coalesce: the "main" function

So lets look at _coalesce

@njit(cache=True)
def _coalesce(ranges):
    '''
    Simplifies the input ranges by merging overlaps ranges.

    Arguments:
        ranges (np.array): a list of ranges, with shape (-1, 3) where each range
            (sublist / row) is a list of length 3 consisting of: `[label, start, stop]`

    Returns:
        ranges (list): a simplified version of the input
    '''
    coalesced = np.array([0][:0]).reshape(-1, 3)
    for i in range(len(ranges)):
        label_a, start_a, stop_a = ranges[i]
        append_flag = True
        for j in range(len(coalesced)):
            label_b, start_b, stop_b = coalesced[j]
            if merge_q(label_a, start_a, stop_a, label_b, start_b, stop_b):
                append_flag = False
                coalesced[j] = [label_a, min(start_a, start_b), max(stop_a, stop_b)]
                break
        if append_flag:
            coalesced = np.concatenate((coalesced, np.array([[label_a, start_a, stop_a]])))
    return coalesced

where

@njit(cache=True)
def merge_q(label_a, start_a, stop_a, label_b, start_b, stop_b):
    '''
    Whether or not range `A` and range `B` can be merged.

    Arguments:
        label_a (int): label of range `A`
        start_a (int): start of range `A`
        stop_a (int): stop of range `A`
        label_b(int): label of range `B`
        start_b(int): start of range `B`
        stop_b (int): stop of range `B`

    Returns:
        answer (bool)
    '''
    if label_a != label_b: # not of same type
        return False
    elif stop_a < start_b: # a does not start and then overlap b
        return False
    elif stop_b < start_a: # b does not start and then overlap a
        return False
    else: # same type and overlap, merge into i, do not append
        return True

To my current understanding the while-loop wrapper __coalesce:

@njit(cache=True)
def __coalesce(ranges):
    '''
    A helper wrapper over the functino `_coalesce` which continues to reduce
    ranges until reduction no longer occurs.

    Arguments:
        ranges (np.array): a list of ranges, with shape (-1, 3) where each range
            (sublist / row) is a list of length 3 consisting of: `[label, start, stop]`

    Returns:
        ranges (list): a simplified version of the input
    '''
    _len = np.inf
    coalesced = ranges
    while _len > len(coalesced):
        coalesced = _coalesce(coalesced)
        _len = len(coalesced)
    return coalesced

is required as the way the results variable, coalesced is made, allows for disjointed ranges to not be merged.

e.g.

'Type A'     50    100
'Type A'    150    250
'Type A'     75    160

The second range does not overlap, causing it to be concatenated to coalesced, but then the third range is merged into the first range.

While sorting by the "start" of each range may help this situation, I believe there are still instances where this further reducible result might occur.

coalesced = np.array([0][:0]).reshape(-1, 3) looks a bit strange to me. coalesced = np.empty((0, 3)) should have the same effect and is clearer in its intent. — AlexV
– AlexV, Commented Jul 5, 2019 at 18:49
@AlexV indeed, but please notice the njit decorator which means the function has to be compatible with numba's nopython mode. Initializing empty arrays in numba is currently buggy as type and shape is unclear. So this gets around that bug — SumNeuron
– SumNeuron, Commented Jul 5, 2019 at 18:50
Have you actually tried this? Since I don't have your original data, I can only confirm that it works for my randomly generated test data. — AlexV
– AlexV, Commented Jul 5, 2019 at 19:25
@AlexV yup. I’ve tried it and there are notebooks at the repo with some dummy data and random data to test all of this — SumNeuron
– SumNeuron, Commented Jul 5, 2019 at 19:26

Stack Exchange Network

Coalescing ranges that have the same label

Resources

Known Feature / bug

Classes

LabeledRange

LabeledRanges

Stress Test

make random data

stress test methods

call stress test

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Coalescing ranges that have the same label

Resources

Known Feature / bug

Classes

LabeledRange

LabeledRanges

Stress Test

make random data

stress test methods

call stress test

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions