1
\$\begingroup\$

I'm trying to create an algorithm that finds the closest pair of points (2D) within a set of points. I'm using a divide and conquer method which is explained here Closest Pair of Points Algorithm.

However, my algorithm is really the best when it comes to huge sets of points, it takes a lot more time than it should.

Any idea on how to optimize my code?

from timeit import timeit
from sys import argv
from geo.point import Point

def load_instance(filename):
    """
    loads .mnt file.
    returns list of points.
    """
    with open(filename, "r") as instance_file:
        points = [Point((float(p[0]), float(p[1]))) for p in (l.split(',') for l in instance_file)]

    return points

def distance_points(point1, point2):
  
    p1_x = point1.coordinates[0]
    p1_y = point1.coordinates[1]
    p2_x = point2.coordinates[0]
    p2_y = point2.coordinates[1]

    return (p1_x - p2_x) \
    * (p1_x - p2_x) \
    + (p1_y - p2_y) \
     * (p1_y - p2_y)



def brute(points):
    
    d_min = float('inf')
    n = len(points)
    for i, point1 in enumerate(points):
        for point2 in points[i+1:]:
            d = distance_points(point1, point2)
            if d < d_min:
                d_min = d
                p1 = point1
                p2 = point2

    return p1, p2, d_min

def closest_pair(points_x, points_y):
    n = len(points_x)
    if n <= 3:
        if n == 2: return points_x[0], points_x[1], distance_points(points_x[0], points_x[1])
        return brute(points_x)
    mid = n // 2
    points_x_left = points_x[:mid]
    set_x_left = set(points_x_left)
    points_x_right = points_x[mid:]

    midpoint = points_x[mid].coordinates[0]

    points_y_left = []
    points_y_right = []

    for x in points_y:
        if x in set_x_left:
            points_y_left.append(x)
        else:
            points_y_right.append(x)

    (p1, q1, d1) = closest_pair(points_x_left, points_y_left)
    (p2, q2, d2) = closest_pair(points_x_right, points_y_right)


    if d1 <= d2:
        d = d1
        pair_min = (p1, q1)
    else:
        d = d2
        pair_min = (p2, q2)

    (p3, q3, d3) = closest_pair_boundary(points_x, points_y, d, pair_min)

    if d <= d3:
        return pair_min[0], pair_min[1], d
    else:
        return p3, q3, d3


def closest_pair_boundary(points_x, points_y, d, pair_min):
    global somme
    somme +=1
    n = len(points_x)
    midx = points_x[n//2].coordinates[0]

    s_y = [x for x in points_y if midx - d <= x.coordinates[0] <= midx + d]

    d_min = d
    m = len(s_y)
    for i, point_i in enumerate(s_y[:-1]):
        for point_j in s_y[i+1:min(m, i+7)]:
            p, q = point_i, point_j
            dist = distance_points(p,q)
            if dist < d_min:
                pair_min = p, q
                d_min = dist
    return pair_min[0], pair_min[1], d_min

def print_solution(points):
    points_x = sorted(points, key=lambda p: p.coordinates[0])
    points_y = sorted(points, key=lambda p: p.coordinates[1])
    p1, p2, d_min = closest_pair(points_x, points_y)
    print(f'{p1}; {p2}')

My "Point" class if pretty classic, two coordinates and a function that calculates and returns the distance between two points.

I'm using search in a set instead of a comparison because it proves more effective. Still, my code could do better. Any help? Maybe there are algorithms that are better suited for bigger sets of data? Or maybe some functions I've used don't work well with big sets or big lists. I'm just looking for any insights really.

(I can't import any libraries)

EDIT : Changed distance function, doesn't return square root now. Use of enumerate instead of relying on ranges. Profiler shows most time spent on closest_pair_boundary function. Still room for improvement.

\$\endgroup\$
7
  • 3
    \$\begingroup\$ If you can't import any libraries, what is geo? \$\endgroup\$ Commented Apr 16, 2022 at 22:49
  • \$\begingroup\$ One simple suggestion, do not use nested for loops, in your case you have two level nested loops and it should be written using zip instead, and do not iterate using range and then using indexing to get the element, instead you should iterate over the list and get the element directly, with list slicing if necessary. \$\endgroup\$ Commented Apr 17, 2022 at 5:00
  • \$\begingroup\$ The geo library is just for testing and drawing stuff. I was wondering that maybe the sort() function in Python isn't the best for large sets of data and that maybe I should define a Merge Sort function, what do you guys think? And thanks for the suggestion, I'll implement that and see if I get better results. \$\endgroup\$ Commented Apr 17, 2022 at 15:46
  • \$\begingroup\$ Use a profiler to see where your code is spending its time. sqrt() can be slow, so don't use the actual distance (e.g., sqrt(dx2 + dy2)), use the square of the distance (e.g., dx2 + dy2). That could save lots of sqrt() calls. Take a look at "line sweep" algorithms. \$\endgroup\$ Commented Apr 18, 2022 at 7:07
  • \$\begingroup\$ @RootTwo and @Ξένη Γήινος : I edited my code and added your suggestions. Profiler gives most time spent on closest_points_boundary. \$\endgroup\$ Commented Apr 21, 2022 at 7:28

2 Answers 2

1
\$\begingroup\$

What do you want to optimise it for - what are your bottlenecks right now?

You could profile the code to identify improvement areas, but if you truly are concerned with speed or memory footprint here I think it would be smarter to give pypy or cython a try. Or rewrite the algorithm in e.g. C or C++. Gains from doing lookup in sets or hashmaps over lists, or by using bit operations for comparisons, or by reducing your Point class into something smaller (or using slots, etc.) are likely never going to be as big as going away from "vanilla python".

As an aside, you should consider documenting your code - especially if using complicated algorithms - before asking for feedback. This is currently a bit rough to read.

\$\endgroup\$
1
\$\begingroup\$

Don't write, never publish uncommented/undocumented code.
Python got it right specifying docstrings such that it is easy to copy them with the code, and tedious to copy the code without them.

You followed the suggestion to drop the monotonous sqrt from distance_points()
- I think the name should be adjusted:

def square_euclidian_distance(point1, point2):
    """ Return the sum of squares of the coordinate differences.
    """  
    d_x = point1.coordinates[0] - point2.coordinates[0]
    d_y = point1.coordinates[1] - point2.coordinates[1]
    return d_x * d_x + d_y * d_y
    # return sum((a-b)**2 for a, b in zip(point1.coordinates, point2.coordinates))

A misconception on my part as an example consequence of lacking documentation:

[…], it is awkward for closest_pair() & closest_pair_boundary() to have collections points_x and points_y for parameters without any relation to attributes x and y. […]

- points_x and points_y are the same set of points, ascending by x- and y-coordinate, respectively.

I expected bisection to use the dimension promising to "square" the resulting sub-spaces.

If there is a significance to the 7 in closest_pair_boundary(), it needs to be disclosed: in the code.


As a no-brainer, I'd try to stop bisection (way) above 3, say 7, 15, 31…
Not switching to bisect on y where that promises to work better, stripes may get too narrow to be usefully divided - clearly, width below half of d_min is no help.

brute(points) doesn't try and take advantage of points being in ascending order - of x currently, but for narrow stripes, more could be saved passing points_y and breaking the inner loop on y coordinate too big for a distance below the current min.

Not quite as easy: Does knowing an upper bound on min distance help?

You state most time spent on closest_pair_boundary():
Adding a parameter min_distance_known to closest_pair() may help keeping "the stripe" smaller, more so with an otherwise unfortunately low limit to brute force.
More useful advice has been given in answers to other questions about the same problem, such as not controlling the number of points in the stripe to evaluate by the magic upper limit 7, but by difference in y less than d_min.


Interleaving the bits of two isometric coordinates, the pair leading to the minimum difference is no further apart than \$2 \sqrt 2\$(? 2?) times the distance of the closest pair.

\$\endgroup\$
1
  • \$\begingroup\$ (If predictable, it's somewhat disheartening to find each idea for improvement published somewhere. Like don't throw away information in constructing a 2d_min stripe: compare points from a left and a right d_min stripe to points of the other one.) \$\endgroup\$ Commented Oct 21, 2022 at 5:11

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.