Approach #1
If you are not sure how to decide the near-identical criterion, a well-known one would be based on the distances among the points. With that in mind, some sort of distance-based clustering solution could be a good fit here. So, here's one with sklearn.cluster.AgglomerativeClustering -
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_based_on_distance(a, dist_thresh=10):
    # Merge points whose linkage distance falls below dist_thresh
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=dist_thresh).fit(a)
    # Keep the first-occurring member of each cluster, in original row order
    return a[np.sort(np.unique(clustering.labels_, return_index=True)[1])]
Sample runs -
In [16]: a
Out[16]:
array([[285, 849],
[450, 717],
[399, 715],
[399, 716],
[400, 715],
[450, 716],
[150, 716]])
In [17]: cluster_based_on_distance(a, dist_thresh=10)
Out[17]:
array([[285, 849],
[450, 717],
[399, 715],
[150, 716]])
In [18]: cluster_based_on_distance(a, dist_thresh=100)
Out[18]:
array([[285, 849],
[450, 717],
[150, 716]])
In [19]: cluster_based_on_distance(a, dist_thresh=1000)
Out[19]: array([[285, 849]])
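For reference, the representative-selection step works because np.unique(labels, return_index=True)[1] returns the index of the first occurrence of each cluster label, and sorting those indices restores the original row order. A minimal sketch with hypothetical labels, consistent with the dist_thresh=10 grouping above -

import numpy as np

labels = np.array([0, 1, 2, 2, 2, 1, 3])  # hypothetical cluster labels
first_idx = np.unique(labels, return_index=True)[1]  # first occurrence of each label
print(np.sort(first_idx))  # [0 1 2 6] -> the rows kept from a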
Approach #2
Another one, based on euclidean-distance thresholding with a KDTree -
from scipy.spatial import cKDTree

def cluster_based_on_eucl_distance(a, dist_thresh=10):
    # For each point, query itself and its nearest neighbour
    d, idx = cKDTree(a).query(a, k=2)
    # Lower index within each (point, nearest-neighbour) pair
    min_idx = idx.min(1)
    # Keep points whose nearest neighbour is beyond the threshold...
    mask = d[:,1] > dist_thresh
    # ...and for every close pair/group, keep the lower-indexed member
    mask[min_idx[~mask]] = True
    return a[mask]
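As a quick sanity check (a sketch of the expected behaviour, not a verbatim run), calling it on the same sample array a from above should keep one representative per close group -

out = cluster_based_on_eucl_distance(a, dist_thresh=10)
# Expected: the same four rows Approach #1 returned for dist_thresh=10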
Approach #3
Another one, based on absolute differences along both of the columns -
import numpy as np

def cluster_based_on_either_xydist(a, dist_thresh=10):
    # Pairwise absolute differences along each column
    c0 = np.abs(a[:,0,None]-a[:,0]) < dist_thresh
    c1 = np.abs(a[:,1,None]-a[:,1]) < dist_thresh
    # Two points are "close" if within the threshold along both x and y
    c01 = c0 & c1
    # Drop every point that is close to an earlier-indexed one
    return a[~np.triu(c01,1).any(0)]
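Again as a sketch of the expected behaviour (assumed, not a verbatim run), the sample array from above should reduce to the same representatives -

out = cluster_based_on_either_xydist(a, dist_thresh=10)
# Expected: the same four rows as the dist_thresh=10 runs above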