I am working with some data that requires keeping track of different types of ranges. There are three key attributes to keep track of:
- type (str): the "type" of the range
 - start (int): the value at which the range starts
 - stop (int): the value at which the range stops
 
e.g.
"TypeA"    100    200
I have to deal with a lot of these ranges and as such, storing every fragment is not the most efficient (regarding storage) nor is it the most performant when checking if a value is bounded by some known range.
So ideally overlapping ranges (two or more ranges of the same type with intersecting boundaries) would be concatenated into a singular range e.g.
"TypeA"    100    200
"TypeA"    150    300
becomes
"TypeA"    100    300
Since there are also different types, having something that handles a collection of such ranges would also be useful.
Below are two classes for this task LabeledRange (for a singular range), LabeledRanges for handling a collection of LabeledRanges
When working with large amounts of ranges, even in parallel, these classes end up lagging behind. So I would appreciate review with performance in mind. Additionally, the use of external libraries, e.g. numba is welcomed as I have found it does not work so well with these classes.
Resources
git repo of the source files as well as a jupyter notebook to play in if that helps
playground notebook a look at the current notebook (with output) to see how these classes are used in the wild.
stress test notebook for seeing where this lags. (more info at bottom of this question)
note: there is a branch on the repo called "numba" which tries to get numba to work with this concept... with limited success
Known Feature / bug
The class LabeledRanges has a method simplify which is meant to aggregate overlapping regions. This function can lag greatly.
This method utilizes a helper method called append which is used in the constructor of the class. Because of this, when constructing an instance of LabeledRanges, one may not get the most simplified version (depending on the order of the ranges passed in).
Classes
LabeledRange
from numbers import Number
from copy import copy, deepcopy
class LabeledRange:
    '''
    A helper class for keeping track of the start / stop of a given class in a
    sequence
    '''
    def __init__(self, name:str, start:int, stop:int):
        '''
        Arguments:
            name (str): name of the class
            start (int): the index at which the class starts
            stop (int): the index at which the class stops
        '''
        self.name  = name
        self.start = int(start)
        self.stop  = int(stop)
    '''
    Various conversions from LabeledRange to pythonic types
    '''
    def as_list(self):
        return [self.name, self.start, self.stop]
    def as_str_list(self):
        return [str(e) for e in self.as_list()]
    def as_tuple(self):
        return tuple(self.as_list())
    def as_dict(self):
        return dict(zip(['name', 'start', 'stop'], self.as_list()))
    def as_txt(self, delim='\t', newline='\n', newline_q=True):
        return delim.join(self.as_str_list()) + (newline if newline_q else '')
    def as_csv(self, newline='\n', newline_q=True):
        return self.as_txt(',', newline, newline_q)
    def as_tsv(self, newline='\n', newline_q=True):
        return self.as_txt('\t', newline, newline_q)
    def __hash__(self):
        return hash(self.as_tuple())
    def __repr__(self):
        return '{}{}'.format(self.__class__.__name__, self.as_tuple())
    def __str__(self):
        return self.__repr__()
    def __len__(self):
        return self.stop - self.start
    def __iter__(self):
        return (e for e in self.as_list())
    def __eq__(self, other):
        if not isinstance(other, LabeledRange):
            return False
        return (self.name  == other.name) and \
               (self.start == other.start) and \
               (self.stop  == other.stop)
    def __ne__(self, other):
        return not self.__eq__(other)
    def __contains__(self, other):
        '''
        Arguments:
            other (LabeledRange / int): If other is a LabeledRange, only true
                if other is bounded by self. If other is a number, true if
                self.start <= other <= self.stop
        Returns:
            results (bool)
        '''
        if isinstance(other, Number):
            return self.start <= other <= self.stop
        if not isinstance(other, LabeledRange):
            return False
        if not other.same_q(self):
            return False
        return other.start in self and other.stop in self
    def same_q(self, other):
        '''Whether or not other is of the same class'''
        if not isinstance(other, LabeledRange):
            return False
        return self.name == other.name
    def min(self, other):
        if not self.same_q(other):
            return min([self.start, self.stop])
        return min([self.start, self.stop, other.start, other.stop])
    def max(self, other):
        if not self.same_q(other):
            return max([self.start, self.stop])
        return max([self.start, self.stop, other.start, other.stop])
    def overlap_q(self, other):
        if not self.same_q(other):
            return False
        return any([
            other.start in self, other.stop in self,
            self.start in other, self.stop in other
        ])
    def __add__(self, other):
        '''
        Attempt to combine two ranges together.
        '''
        if not isinstance(other, LabeledRange):
            raise ValueError('{} is not a LabeledRange'.format(other))
        if not self.overlap_q(other):
            return LabeledRanges([deepcopy(self), deepcopy(other)])
        else:
           return LabeledRange(self.name, self.min(other), self.max(other))
    def __iadd__(self, other):
        if self.overlap_q(other):
            self.start = self.min(other)
            self.stop  = self.max(other)
        return self
LabeledRanges
class LabeledRanges:
    def __init__(self, ranges:list=[]):
        self.ranges = ranges
    def classes(self):
        return set([rng.name for rng in self])
    def as_list(self):
        return [rng.as_list() for rng in self]
    def as_tuple(self):
        return tuple([rng.as_tuple() for rng in self])
    @property
    def ranges(self):
        return self._ranges
    @ranges.setter
    def ranges(self, ranges):
        rngs = []
        for rng in ranges:
            if isinstance(rng, LabeledRange):
                rngs.append(rng)
            else:
                rngs.append(LabeledRange(*rng))
        self._ranges = list(set(rngs))
    @ranges.deleter
    def ranges(self):
        del self._ranges
    def __iter__(self):
        return (rng for rng in self.ranges)
    def __getitem__(self, key):
        return self.ranges[key]
    def __str__(self):
        return self.__repr__()
    def __repr__(self):
        s = '{}('.format(self.__class__.__name__)
        if len(self.ranges) == 0:
            return s + ')'
        else:
            s += '\n'
        for i, rng in enumerate(self.ranges):
            s += '\t' + repr(rng) + '\n'
        s += ')'
        return s
    def __eq__(self, other):
        if isinstance(other, LabeledRanges):
            return all([rng in other for rng in self.ranges]) and \
                   all([rng in self for rng in other.ranges])
        return False
    def __ne__(self, other):
        return not self.__eq__(other)
    def __contains__(self, other):
        if isinstance(other, str):
            for rng in self:
                if rng.name == other.name:
                    return True
            return False
        if isinstance(other, LabeledRange):
            for rng in self:
                if other in rng:
                    return True
            return False
        if isinstance(other, LabeledRanges):
            for rng in other:
                if not self.__contains__(rng):
                    return False
            return True
        return False
    def overlap_q(self, other):
        for rng in self.ranges:
            if rng.overlap_q(other):
                return True
        return False
    def append(self, other):
        # Append a range
        if isinstance(other, LabeledRange):
            found_q = False
            for rng in self:
                if rng.overlap_q(other):
                    found_q = True
                    rng += other
            if not found_q:
                self.ranges.append(other)
        # Map each range to the above block
        if isinstance(other, LabeledRanges):
            for rng in other:
                self.append(other)
        return self
    def __give__(self, other):
        if isinstance(other, LabeledRange):
            self.append(other)
        if isinstance(other, LabeledRanges):
            for rng in other:
                self.append(rng)
        return self.simplify()
    def simplify(self):
        for rng in self:
            self.append(rng)
        self.ranges = list(set(self.ranges))
        return self
    def __add__(self, other):
        cp = deepcopy(self)
        cp.__give__(other)
        return cp
    def __iadd__(self, other):
        self.__give__(other)
        return self
    def __radd__(self, other):
        if not isinstance(other, LabeledRange) or not isinstance(other, LabeledRanges):
            return self
        self.__iadd__(other)
        return self
Stress Test
This "stress test" is a simplified version of something that I am using this code for.
In short:
- there are >1Mil "known" ranges with types and spans
 - these ranges are read in and converted into 
LabeledRanges(the then aggregated ranges are then saved in a file that can be more quickly loaded to memory) - there are then ~1Mil "unknown" ranges (corresponding to regions from the previously mentioned "known ranges"
 - I need to know, per integer, the "types" associated with that integer based on the "known" ranges
 
The provided stress test notebook for seeing where this lags. (more info at bottom of this question) demonstrates how intensive this is.
In truth, I have many sets of such cases, which I have "sharded" into subsets of the total range to help with performance but it still takes way to long.
make random data
# only python libraries used for conveninece to others looking at this
# custom classes for optimization, from `cloned git repo pip install -e ./`
from lrng import LabeledRange, LabeledRanges
# for making dummy data
from random import randint, choice
# for multiprocessing
import os
from multiprocessing import Pool, Value
# how many range types
types = ['Type {}'.format(t) for t in 'AB']
# ranges to make
num_ranges = 1000000
# minimum range length
min_len_val = 2
# maximum range length
max_len_val = 1000
# total length of the range
total_len = 100000
# generate random ranges
ranges = []
for i in range(num_ranges):
    _type = choice(types)
    _offset = randint(0, total_len-max_len_val)
    _len = randint(min_len_val, max_len_val)
    ranges.append([_type, _offset, _offset+_len])
# simplify could use some performance improvements
# note: using the constructor will not always result in the most simplified version depneding on the order of how 
# ranges are aggregated
known_ranges = LabeledRanges(ranges)#.simplify()
# generate random "unknown ranges"
ranges_to_label = []
for i in range(num_ranges):
    _offset = randint(0, total_len-max_len_val)
    _len = randint(min_len_val, max_len_val)
    ranges_to_label.append([_offset, _offset+_len])
stress test methods
# for printing
_pool_count = Value('i', -1, lock=True)
# whether or not the current "known range" (_range) is relevant for the current unknown stretch
def _keep_range(start, stop, _range):
    _type, range_start, range_stop = _range
    if range_stop < start: return
    if range_start > stop: return
    if not (
        start       <= range_start <= stop       or \
        start       <= range_stop  <= stop       or \
        range_start <= start       <= range_stop or \
        range_start <= stop        <= range_stop
    ):
        return
    return _range
# extract relevant reference ranges
def label(unknown_range, reference_ranges, processes=1, total=None, throttle=50):
    start, stop = unknown_range
    if processes is None: 
        processes = os.cpu_count()
    if processes == 1:
        res = [_keep_range(start, stop, _range) for _range in reference_ranges]
    else:
        with Pool(processes=processes) as pool:
            res = pool.starmap(_keep_range, [(start, stop, _range) for _range in reference_ranges])
    res = list(filter(lambda r : r is not None, res))
    with _pool_count.get_lock():
        _pool_count.value += 1
    if total is not None:
        if _pool_count.value % throttle == 0:
            frac = total
            print('\r                                                               ', flush=True, end='')
            print('\r{}%'.format(_pool_count.value / total * 100), flush=True, end='')
    return LabeledRanges(res)    
# do this en-masse
def label_all(unknown_ranges, reference_ranges, processes=None):
    if processes is None: 
        processes = os.cpu_count()
    with Pool(processes=processes) as pool:
        total = len(unknown_ranges)
        sargs = [(u_rng, reference_ranges, 1, total, 50) for u_rng in unknown_ranges]
        res = pool.starmap(label, sargs)
    print('\r{}%'.format(100), flush=True, end='')
    print('\ndone', flush=False, end='\n')
    return res
call stress test
# this will use all of your CPUs if you do not specify otherwise 
res = label_all(ranges_to_label, known_ranges)
    
numbaversion. I think the best case for performance is to usesimplify(or nowcoalescefromlrng.numba) to reduce the number ranges to compare against. The subsequent problem - labeling a range based off some references - can be improved similarly. \$\endgroup\$nopythonmodenumbahoops. I just thought of a hack which will help me for this particular case, but aside from that nothing. I would still appreciate your eye on the_coalescefunction to see why reduction requires may not be done in a single pass. It may just be that depending on the ranges and the order, such is the nature of the problem :P Unrelated, know of any libraries that do something similar to this? \$\endgroup\$