6
\$\begingroup\$

I have unstructured input-output data, f=f(x). I want to estimate the range of f as a function of x, given limited sampled. So, I tried to discretize the data use small bins in x to estimate the variance of f associated with that bin. After a lot of work, I cobbled together a working code. But, I'm sure there must be a better way. My results are very sensitive to the bin size. Please review my approach and let me know how it can be improved.

from scipy.stats import binned_statistic
import matplotlib.pyplot as plt
import numpy as np

# Choose the number of bins
NBINS = 100

# define custom function for findin range
def myrangefunciton(arrA):
    return(np.max(arrA) - np.min(arrA))

# set seed for reproducability, there is nothing special about this seed
np.random.seed(1)

# generate sample data
x = np.random.normal(3, 1, 1000)
f = np.random.normal(0, np.max([x, np.zeros(x.size)], 0))

# compute ranges of f across x
stats, binEdges, binNum = binned_statistic(x, f,
                                           statistic=myrangefunciton, bins=100)

# scatter data
plt.scatter(x, f, s=0.5, label='Observed f(x) Data', c='k')

# transform the bin edges to centered x values
rangeX = binEdges[:-1] + (binEdges[1] - binEdges[0]) / 2.

# plot ranges to envelop the mean, which is 0.
plt.plot(rangeX, stats / 2, label='Estimated Range', c='blue')
plt.plot(rangeX, -1 * stats / 2, c='blue')

# complete summary plot
plt.title("%i bins" % NBINS)
plt.legend(prop={'size': 4})
plt.xlabel('x')
plt.savefig('rangeEx')
plt.clf()

enter image description here

enter image description here

\$\endgroup\$

1 Answer 1

4
\$\begingroup\$

My results are very sensitive to the bin size

That's going to be difficult to escape. There are some automatic ways to estimate bin size; as an example for your input data and using histogram_bin_edges:

by_size = []
for method in (
    'auto', 'fd', 'doane', 'scott', 'stone', 'rice', 'sturges', 'sqrt',
):
    bins = np.histogram_bin_edges(a=x, bins=method)
    by_size.append((bins.size, bins[1]-bins[0], method))
for size, dx, method in sorted(by_size):
    print(f'{method:>7} bins={size} dx={dx:.3f}')
sturges bins=12 dx=2.471
  doane bins=14 dx=2.091
   rice bins=21 dx=1.359
  scott bins=26 dx=1.087
  stone bins=26 dx=1.087
   sqrt bins=33 dx=0.849
   auto bins=38 dx=0.735
     fd bins=38 dx=0.735

Which one to choose is non-trivial and requires detailed statistical knowledge of your true data model. Let's assume that auto is good enough.

You declare NBINS but never use it. Let's just delete it and use the method above.

Convert module np.max()-like calls to member arr_a.max() calls.

Replace np.max([x, np.zeros(x.size)], 0) with a call to .clip().

Convert from non-reentrant plt. module calls to axis object calls.

There is a way to vectorise binned_statistic but in this context I don't particularly think that it's worth it.

Because your data are binned, it's important that your plot calls use drawstyle='steps-mid'.

I don't actually think that your my_range_function is correct. Notice all of the data points in your plot that have not been included in the range.

All together,

import scipy.stats
import matplotlib.pyplot as plt
import numpy as np


def my_range_function(arr_a: np.ndarray) -> float:
    """custom function for findin range; assume centred on 0"""
    return 2*max(arr_a.max(), -arr_a.min())

# set seed for reproducibility, there is nothing special about this seed
rand = np.random.default_rng(seed=1)

# generate sample data
x = rand.normal(loc=3, scale=1, size=1_000)
f = rand.normal(loc=0, scale=x.clip(min=0))

# compute ranges of f across x
bins = np.histogram_bin_edges(a=x, bins='auto')
stats, bin_edges, bin_num = scipy.stats.binned_statistic(
    x=x, values=f, statistic=my_range_function, bins=bins,
)
np.clip(stats, a_min=0, a_max=None, out=stats)
# transform the bin edges to centred x values
range_x = bins[:-1] + 0.5*(bins[1] - bins[0])

fig, ax = plt.subplots()
ax.scatter(x, f, s=0.5, c='k', label='Observed f(x) Data')

# plot ranges to envelop the mean, which is assumed to be 0.
ax.plot(range_x, 0.5*stats, c='blue', drawstyle='steps-mid', label='Estimated Range')
ax.plot(range_x, -0.5*stats, c='blue', drawstyle='steps-mid')
ax.set_title(f'{bins.size} bins')
ax.legend()
ax.set_xlabel('x')
ax.set_ylabel('f')
plt.show()

range plot

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.