I need to improve the performance of a function that calculates the integral of a two-dimensional kernel density estimate (obtained using the function stats.gaussian_kde) where the domain of integration is all points that evaluate below a given value.
I've found that a single line takes up pretty much 100% of the time used by the code:
insample = kernel(sample) < iso
(see the code below). This line stores in the list insample all the values in sample that evaluate in kernel to less than the float iso.
I made this question (slightly modified) 6 months ago over at Stack Overflow and the best answer involved parallelization. I'm hoping to avoid parallelization (gives way too much trouble as can be seen in the question linked above), but I'm open to any improvement in performance achieved by means of virtually any other way (including Cython and any package available).
Here's the minimum working example (MWE):
import numpy as np
from scipy import stats
# Define KDE integration function.
def kde_integration(m1, m2):
# Perform a kernel density estimate (KDE) on the data.
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values, bw_method=None)
# This list will be returned at the end of this function.
out_list = []
# Iterate through all floats in m1, m2 lists and calculate for each one the
# integral of the KDE for the domain of points located *below* the KDE
# value of said float eveluated in the KDE.
for indx, m1_p in enumerate(m1):
# Compute the point below which to integrate.
iso = kernel((m1_p, m2[indx]))
# Sample KDE distribution
sample = kernel.resample(size=100)
# THIS TAKES ALMOST 100% OF THE COMPUTATION TIME.
# Filter the sample.
insample = kernel(sample) < iso
# Monte Carlo Integral.
integral = insample.sum() / float(insample.shape[0])
# Avoid 'nan' and/or 'infinite' values down the line.
integral = integral if integral > 0. else 0.000001
# Append integral value for this point to list that will return.
out_list.append(round(integral, 2))
return out_list
# Generate some random two-dimensional data:
def measure(n):
"Return two coupled measurements."
m1 = np.random.normal(size=n)
m2 = np.random.normal(scale=0.5, size=n)
return m1+m2, m1-m2
# Random data.
m1, m2 = measure(100)
# Call KDE integration function.
kde_integration(m1, m2)
Any help/suggestions/ideas will be very much appreciated.