First, some basics of your implementation:
- Type-hint your arguments; I had to read and investigate a bunch to deduce that `tau` is a positive integer
- `(y-x) <= tau` does not need parens
- Save `math.sqrt(n1 * n2)` to a variable instead of repeating it
## Full matrix implementation
This is an `n**2` operation, which is on the order of 64,000,000 elements if your arrays are 8,000 elements long. That's beyond the point where I would have given up and moved to C and something like BLAS. If you actually care about performance, now is the time.
Your solution scales with event density, getting slower as density increases. There is an alternative whose runtime is independent of density:
- Don't call `np.where`
- Don't use set intersection; use vectorized comparison
- Recognize that you can represent your `c_ij` summation as the sum over a partial triangular matrix (`ij` upper, `ji` lower), stopping at the diagonal set by `tau`
- The matrix itself is formed by broadcasting the two event vectors, and receives a mask for the upper and lower halves
- The initial values for `c_**` are the halved trace (diagonal sum) of this broadcast matrix
- For a given `tau` and vector length, you could save time by pre-calculating the masks (not shown in the reference code). This is very important and can reduce the execution time of this method by about two thirds. Many kinds of datasets in the wild have constant dimensions and can benefit from this.
At around 85% ones-density, this method has performance equivalent to the original (for the test vector length of 2000) if you don't pre-calculate the masks.
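The mask pre-calculation is not shown in the reference code below; a minimal sketch, assuming the vector length and `tau` are constant across calls (`triangular_masks` is a hypothetical helper, not part of the reference code):

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=None)
def triangular_masks(N: int, tau: int) -> tuple[np.ndarray, np.ndarray]:
    # True = masked out. The unmasked cells of `upper` are exactly the
    # diagonals k = 1..tau; symmetrically for `lower`.
    all_true = np.ones((N, N), dtype=bool)
    upper = np.tril(all_true) | np.triu(all_true, k=1 + tau)
    lower = np.triu(all_true) | np.tril(all_true, k=-1 - tau)
    return upper, lower
```

`bcast_eps` would then fetch the cached pair with `triangular_masks(N, tau)` instead of rebuilding both masks on every call. (The cached arrays are shared and must not be mutated by callers.)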
## Sparse approach
A library that supports sparse matrices should do better than the above; SciPy has them. Things to note about this approach:
- The input data need to be inverted: 1 for events, 0 everywhere else
- Be careful not to call `np.*` methods; exclusively use sparse methods
- Rather than forming the triangular bands and then summing at the end, it's faster to perform a difference-of-sums using `tau`
- This scales with density, unlike the full-matrix approach, and uses less memory
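The difference-of-sums can be seen on a toy matrix (illustrative values only): the band `1 <= j - i <= tau` is summed as one `triu` sum minus another, rather than by extracting each diagonal.

```python
import numpy as np
import scipy.sparse

tau = 1
matches = scipy.sparse.csr_matrix(np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]))
# Sum of the band 1 <= j - i <= tau, without materialising each diagonal:
upper = scipy.sparse.triu(matches, k=1).sum() - scipy.sparse.triu(matches, k=1 + tau).sum()
print(upper)  # 2
```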
## Sorted logarithmic traversal
Riffing on @vpn's idea a little:
- This doesn't need to be `n**2` if you tell NumPy to correctly apply a binary search
- The binary search can be vectorized via `searchsorted`
- Rather than your Python-native set intersection, you can use `intersect1d`
For very high ones-densities this method remains the fastest, but it scales worse than the full-matrix trace-only method at lower ones-densities (higher event densities).
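A small demonstration of the two NumPy calls this method relies on: `searchsorted` vectorizes the binary search, and `intersect1d` replaces the Python-level set intersection.

```python
import numpy as np

event_index1 = np.array([2, 5, 9])
event_index2 = np.array([4, 9])

# Insertion points of event_index2's values into the sorted event_index1
insertions = np.searchsorted(a=event_index1, v=event_index2)
print(insertions)  # [1 2]

# Indices at which both series have an event
common = np.intersect1d(event_index1, event_index2, assume_unique=True)
print(common)      # [9]
```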
## Full or sparse matrix, traces only
(Yet another) approach:

- works for either a sparse or a full matrix
- uses no masks
- relies only on diagonal trace sums, with no calls to the triangular methods
- has a runtime independent of event density
Particularly for the full matrix, this method is very fast: always faster than the sparse, masked-broadcast, and original methods; and faster than the logarithmic traversal for densities below ~70%.
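The trace-only variant leans entirely on NumPy's ability to sum an arbitrary diagonal; a minimal illustration on a toy match matrix:

```python
import numpy as np

matches = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
])
# np.trace(m, k) sums the k-th diagonal: m[i, i + k] over all valid i
print(np.trace(matches))      # 3: main diagonal, the shared-event count
print(np.trace(matches, 1))   # 2: contributions at lag +1
print(np.trace(matches, -1))  # 0: contributions at lag -1
```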
## Cumulative sum
For completeness, the reference source below includes @KellyBundy's cumulative-sum approach, which is competitive at all but very high densities.
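The core of that trick is a windowed event count built from a padded cumulative sum; a small worked example (using the inverted 1-for-event convention):

```python
import numpy as np

tau = 2
events = np.array([1, 0, 1, 1, 0, 1])  # 1 marks an event

# Edge-padded cumulative sum, so positions near the end stay in range
cs = np.pad(np.cumsum(events), (0, tau), 'edge')

# For each position i: number of events strictly after i but within tau steps
within_tau = cs[tau:] - cs[:-tau]
print(within_tau)  # [1 2 1 1 1 0]
```

In `cumsum_eps`, this windowed count is multiplied elementwise by the other series' event indicator and summed, giving the `c_ij` contribution in one vectorized pass.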
## Output
All method durations (ms)
  d%    trace   sorted   cumsum      old    bcast   sparse sparsetr
50.0     0.70     4.12     0.21   491.82    41.73    80.62    21.62 
55.0     0.64     3.27     0.18   411.49    40.09    65.46    18.25 
60.0     0.58     2.69     0.18   311.74    40.69    50.05    14.60 
65.0     0.58     2.27     0.23   250.12    41.50    35.52    13.45 
70.0     1.03     1.84     0.18   182.37    41.84    24.03    10.20 
75.0     1.34     1.40     0.18   113.49    42.36    16.37     7.31 
80.0     0.61     1.06     0.18    77.04    40.99    11.78     6.11 
85.0     0.81     0.69     0.17    44.68    52.77    10.30     5.95 
90.0     1.56     0.56     0.22    24.50    46.26     6.32     3.94 
95.0     1.50     0.26     0.18     6.27    40.60     4.80     3.48 
96.0     1.77     0.21     0.18     3.71    41.01     4.25     3.31 
96.5     1.79     0.17     0.18     1.89    40.69     4.16     3.24 
97.0     1.79     0.16     0.18     1.75    43.89     4.69     3.30 
97.5     1.87     0.14     0.18     1.16    43.34     4.31     3.22 
98.0     2.10     0.15     0.19     1.22    41.16     4.10     3.33 
98.5     1.75     0.10     0.18     0.42    40.75     3.94     3.15 
99.0     1.73     0.09     0.22     0.16    40.71     3.97     3.11 
99.5     1.78     0.08     0.18     0.09    41.29     4.09     3.42 
Fast method durations (us)
  d%    trace   sorted   cumsum
50.0      604     3993      124 
55.0      556     3504      124 
60.0      577     2895      125 
65.0      569     2440      130 
70.0      556     2041      126 
75.0      574     1372      125 
80.0      556     1078      123 
85.0      562      686      126 
90.0      563      477      128 
95.0      561      238      126 
96.0      558      176      126 
96.5      569      139      124 
97.0      557      137      131 
97.5      568      104      124 
98.0      601      116      128 
98.5      565       61      128 
99.0      567       51      129 
99.5      566       43      127 
## Reference source
import math
from numbers import Real
from timeit import timeit
from typing import Tuple
import numpy as np
from itertools import product
import scipy.sparse
from scipy.sparse import spmatrix
def old_eps(ts_1, ts_2, tau):
    Q_tau = 0
    q_tau = 0
    event_index1, = np.where(np.array(ts_1) == 0)
    n1 = event_index1.shape[0]
    event_index2, = np.where(np.array(ts_2) == 0)
    n2 = event_index2.shape[0]
    if (n1 != 0 and n2 != 0):
        matching_idx = set(event_index1).intersection(event_index2)
        c_ij = c_ji = 0.5 *len(matching_idx)
        for x,y in product(event_index1,event_index2):
            if x-y > 0 and (x-y)<= tau:
                c_ij += 1
            elif y-x > 0 and (y-x) <= tau:
                c_ji += 1
        Q_tau = (c_ij+c_ji)/math.sqrt( n1 * n2 )
        q_tau = (c_ij - c_ji)/math.sqrt( n1 * n2 )
    return Q_tau, q_tau
def bcast_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 1:
        raise ValueError('Vectors must be flat and of the same length')
    N, = ts_1.shape
    events_1 = ts_1 == 0
    events_2 = ts_2 == 0
    n1 = np.sum(events_1)
    n2 = np.sum(events_2)
    if n1 == 0 or n2 == 0:
        return 0, 0
    all_true = np.ones((N, N), dtype=bool)
    upper_mask = np.logical_or(np.tril(all_true), np.triu(all_true, k=+1+tau))
    lower_mask = np.logical_or(np.triu(all_true), np.tril(all_true, k=-1-tau))
    matches = np.logical_and(events_1[np.newaxis, :], events_2[:, np.newaxis])
    matches_u = np.ma.array(matches, mask=upper_mask)
    matches_l = np.ma.array(matches, mask=lower_mask)
    n_matches = np.trace(matches)
    c_ij = c_ji = n_matches / 2
    c_ij += np.sum(matches_u)
    c_ji += np.sum(matches_l)
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau
def trace_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 1:
        raise ValueError('Vectors must be flat and of the same length')
    events_1 = ts_1 == 0
    events_2 = ts_2 == 0
    n1 = np.sum(events_1)
    n2 = np.sum(events_2)
    if n1 == 0 or n2 == 0:
        return 0, 0
    matches = np.logical_and(events_1[np.newaxis, :], events_2[:, np.newaxis])
    n_matches = np.trace(matches)
    c_ij = c_ji = n_matches / 2
    for k in range(1, tau+1):
        c_ij += np.trace(matches, k)
        c_ji += np.trace(matches, -k)
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau
def sorted_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 1:
        raise ValueError('Vectors must be flat and of the same length')
    event_index1, = np.where(np.array(ts_1) == 0)
    event_index2, = np.where(np.array(ts_2) == 0)
    n1, = event_index1.shape
    n2, = event_index2.shape
    if n1 == 0 or n2 == 0:
        return 0, 0
    n_matches, = np.intersect1d(
        event_index1, event_index2, assume_unique=True,
    ).shape
    c_ij = c_ji = n_matches/2
    insertions = np.searchsorted(
        a=event_index1,  # array to pretend insertion into
        v=event_index2,  # values to insert
    )
    for insertion, y in zip(insertions, event_index2):
        i1 = insertion
        if i1 < n1:
            if y == event_index1[i1]:
                i1 += 1
            for x in event_index1[i1:]:
                if x - y > tau:
                    break
                c_ij += 1
        i2 = insertion - 1
        if i2 >= 0:
            for x in event_index1[i2::-1]:
                if y - x > tau:
                    break
                c_ji += 1
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau
def sparse_eps(ts_1: spmatrix, ts_2: spmatrix, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 2 or ts_1.shape[0] != 1:
        raise ValueError('Vectors must be flat and of the same length')
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    matches = ts_2.T * ts_1
    matches_u = scipy.sparse.triu(matches, +1).sum() - scipy.sparse.triu(matches, k=+1+tau).sum()
    matches_l = scipy.sparse.tril(matches, -1).sum() - scipy.sparse.tril(matches, k=-1-tau).sum()
    n_matches = matches.diagonal().sum()
    c_ij = c_ji = n_matches / 2
    c_ij += matches_u
    c_ji += matches_l
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau
def sparsetr_eps(ts_1: spmatrix, ts_2: spmatrix, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 2 or ts_1.shape[0] != 1:
        raise ValueError('Vectors must be flat and of the same length')
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    matches = ts_2.T * ts_1
    n_matches = matches.diagonal().sum()
    c_ij = c_ji = n_matches / 2
    for k in range(1, tau+1):
        c_ij += matches.diagonal(+k).sum()
        c_ji += matches.diagonal(-k).sum()
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau
def cumsum_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    ts_1 = 1 - ts_1
    ts_2 = 1 - ts_2
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    cs1 = np.pad(ts_1.cumsum(), (0, tau), 'edge')
    cs2 = np.pad(ts_2.cumsum(), (0, tau), 'edge')
    c_ij = c_ji = 0.5 * (ts_1 * ts_2).sum()
    c_ij += ((cs1[tau:] - cs1[:-tau]) * ts_2).sum()
    c_ji += ((cs2[tau:] - cs2[:-tau]) * ts_1).sum()
    sqrt = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / sqrt
    q_tau = (c_ij - c_ji) / sqrt
    return Q_tau, q_tau
def test() -> None:
    shape = 2, 2000
    full_ts = np.ones(shape, dtype=np.uint8)
    sparse_ts = scipy.sparse.lil_matrix(shape, dtype=np.uint8)
    density = 0.9
    rand = np.random.default_rng(seed=0).random
    events = rand(shape) > density
    full_ts[events] = 0
    sparse_ts[events] = 1
    # Add some interesting values to test boundary conditions: on the diagonal
    full_ts[:, 5] = 0
    sparse_ts[:, 5] = 1
    Q_exp = 1.9446638724075895
    q_exp = 0.026566446344365977
    methods = (
        (old_eps, full_ts),
        (bcast_eps, full_ts),
        (trace_eps, full_ts),
        (sorted_eps, full_ts),
        (sparse_eps, sparse_ts),
        (sparsetr_eps, sparse_ts),
        (cumsum_eps, full_ts),
    )
    for eps, ts in methods:
        Q_tau, q_tau = eps(*ts, tau=10)
        assert math.isclose(Q_tau, Q_exp, abs_tol=1e-12, rel_tol=0)
        assert math.isclose(q_tau, q_exp, abs_tol=1e-12, rel_tol=0)
def compare(fast: bool) -> None:
    shape = 2, 2000
    rand = np.random.default_rng(seed=0).random
    n = 400 if fast else 1
    methods = (
        trace_eps,
        sorted_eps,
        cumsum_eps,
    )
    if fast:
        print('Fast method durations (us)')
        factor = 1e6
        fmt = '{:8.0f}'
    else:
        print('All method durations (ms)')
        factor = 1e3
        fmt = '{:8.2f}'
        methods += (
            old_eps,
            bcast_eps,
            sparse_eps,
            sparsetr_eps,
        )
    print('  d%', ' '.join(
        f'{method.__name__.removesuffix("_eps"):>8}'
        for method in methods
    ))
    densities = np.hstack((
        np.linspace(0.50, 0.95, 10),
        np.linspace(0.960, 0.995, 8),
    ))
    for density in densities:
        print(f'{100*density:4.1f}', end=' ')
        full_ts = np.ones(shape, dtype=np.uint8)
        sparse_ts = scipy.sparse.lil_matrix(shape, dtype=np.uint8)
        events = rand(shape) > density
        full_ts[events] = 0
        sparse_ts[events] = 1
        inputs = (
            full_ts,
            full_ts,
            full_ts,
        )
        if not fast:
            inputs += (
                full_ts,
                full_ts,
                sparse_ts,
                sparse_ts,
            )
        for eps, ts in zip(methods, inputs):
            t = timeit(lambda: eps(*ts, tau=10), number=n)
            print(fmt.format(t/n*factor), end=' ')
        print()
    print()
if __name__ == '__main__':
    test()
    compare(False)
    compare(True)