Updated August 2019
Here is another simple implementation that is pretty fast for modest sizes. Input points are assumed to be unique.
```python
def keep_efficient(pts):
    'returns Pareto efficient row subset of pts'
```
We start by sorting the points by the sum of their coordinates. This is useful because:

- For many data distributions, the point with the largest coordinate sum dominates a large number of other points.
- If point `x` has a larger coordinate sum than point `y`, then `y` cannot dominate `x`.
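Based on that description, here is a minimal sketch of `keep_efficient` (my reconstruction, assuming maximization semantics: a point dominates another when it is greater than or equal in every coordinate, with the inputs unique as stated above):

```python
import numpy as np

def keep_efficient(pts):
    'returns Pareto efficient row subset of pts'
    # sort rows by decreasing coordinate sum: a row can only be
    # dominated by rows that appear earlier in this order
    pts = pts[pts.sum(axis=1).argsort()[::-1]]
    undominated = np.ones(pts.shape[0], dtype=bool)
    for i in range(pts.shape[0]):
        if not undominated[i]:
            # anything this row dominates is already marked via its dominator
            continue
        # a later row survives only if it beats row i in some coordinate;
        # since points are unique, anything <= row i everywhere is dominated
        undominated[i + 1:] &= (pts[i + 1:] > pts[i]).any(axis=1)
    return pts[undominated]
```

The sort is what makes the inner filter one-directional: each row only needs to be tested against rows with a larger coordinate sum.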
Here are some benchmarks against Peter's `is_pareto_efficient`, using points drawn from `np.random.randn`.
```
N=10000 d=2
keep_efficient
1.31 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
is_pareto_efficient
6.51 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

N=10000 d=3
keep_efficient
2.3 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
is_pareto_efficient
16.4 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

N=10000 d=4
keep_efficient
4.37 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
is_pareto_efficient
21.1 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

N=10000 d=5
keep_efficient
15.1 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
is_pareto_efficient
110 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

N=10000 d=6
keep_efficient
40.1 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
is_pareto_efficient
279 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

N=10000 d=15
keep_efficient
3.92 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
is_pareto_efficient
5.88 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Convex hull heuristic
I recently looked at this problem and found a useful heuristic that works well when there are many points, the coordinates are distributed independently, and the number of dimensions is small.
The idea is to compute the convex hull of the points. With few dimensions and independently distributed points, the number of vertices of the convex hull is small. Intuitively, we can expect some vertices of the convex hull to dominate many of the original points. Moreover, if a point on the convex hull is not dominated by any other point on the convex hull, then it is also not dominated by any point in the original set.
This gives a simple iterative algorithm. Repeatedly:

1. Compute the convex hull.
2. Save the undominated (Pareto) points from the convex hull.
3. Filter out the points that are dominated by convex hull elements.

I added a few benchmarks to test this. It seems that for some point distributions, this approach gives better asymptotics.
```python
import numpy as np
from scipy import spatial
from functools import reduce

# test points
pts = np.random.rand(10_000_000, 3)


def filter_(pts, pt):
    """
    Get all points in pts that are not Pareto dominated by the point pt
    """
    weakly_worse   = (pts <= pt).all(axis=-1)
    strictly_worse = (pts < pt).any(axis=-1)
    return pts[~(weakly_worse & strictly_worse)]


def get_pareto_undominated_by(pts1, pts2=None):
    """
    Return all points in pts1 that are not Pareto dominated
    by any points in pts2
    """
    if pts2 is None:
        pts2 = pts1
    return reduce(filter_, pts2, pts1)


def get_pareto_frontier(pts):
    """
    Iteratively filter points based on the convex hull heuristic
    """
    pareto_groups = []

    # loop while there are points remaining
    while pts.shape[0]:
        # brute force if there are few points:
        if pts.shape[0] < 10:
            pareto_groups.append(get_pareto_undominated_by(pts))
            break

        # compute vertices of the convex hull
        hull_vertices = spatial.ConvexHull(pts).vertices

        # get corresponding points
        hull_pts = pts[hull_vertices]

        # get points in pts that are not convex hull vertices
        nonhull_mask = np.ones(pts.shape[0], dtype=bool)
        nonhull_mask[hull_vertices] = False
        pts = pts[nonhull_mask]

        # get points in the convex hull that are on the Pareto frontier
        pareto = get_pareto_undominated_by(hull_pts)
        pareto_groups.append(pareto)

        # filter remaining points to keep those not dominated by
        # Pareto points of the convex hull
        pts = get_pareto_undominated_by(pts, pareto)

    return np.vstack(pareto_groups)


# --------------------------------------------------------------------------------
# previous solutions
# --------------------------------------------------------------------------------

def is_pareto_efficient_dumb(costs):
    """
    :param costs: An (n_points, n_costs) array
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        is_efficient[i] = np.all(np.any(costs >= c, axis=1))
    return is_efficient


def is_pareto_efficient(costs):
    """
    :param costs: An (n_points, n_costs) array
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            # Remove dominated points
            is_efficient[is_efficient] = np.any(costs[is_efficient] <= c, axis=1)
    return is_efficient


def dominates(row, rowCandidate):
    return all(r >= rc for r, rc in zip(row, rowCandidate))


def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated


# --------------------------------------------------------------------------------
# benchmarking
# --------------------------------------------------------------------------------

# to accommodate the original non-numpy solution
pts_list = [list(pt) for pt in pts]

import timeit

# print('Old non-numpy solution:\t{}'.format(
#     timeit.timeit('cull(pts_list, dominates)', number=3, globals=globals())))

print('Numpy solution:\t{}'.format(
    timeit.timeit('is_pareto_efficient(pts)', number=3, globals=globals())))

print('Convex hull heuristic:\t{}'.format(
    timeit.timeit('get_pareto_frontier(pts)', number=3, globals=globals())))
```
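As a quick sanity check of the filtering helpers (note that `filter_` treats larger coordinates as better, i.e. maximization), here they are restated so the snippet runs standalone, applied to a small hypothetical array:

```python
import numpy as np
from functools import reduce

def filter_(pts, pt):
    # drop rows that pt weakly dominates (<= everywhere, < somewhere)
    weakly_worse = (pts <= pt).all(axis=-1)
    strictly_worse = (pts < pt).any(axis=-1)
    return pts[~(weakly_worse & strictly_worse)]

def get_pareto_undominated_by(pts1, pts2=None):
    if pts2 is None:
        pts2 = pts1
    return reduce(filter_, pts2, pts1)

pts = np.array([[0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [0.4, 0.4]])
# [0.4, 0.4] is dominated by [0.5, 0.5]; the other three are incomparable
print(get_pareto_undominated_by(pts))
```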
Results:

```
$ python temp.py
```
Original answer
I tried rewriting the same algorithm with a few tweaks. I think most of your problems come from `inputPoints.remove(row)`: it has to search through the list of points, so removing by index would be much more efficient. I also changed the `dominates` function to avoid some redundant comparisons, which can help in higher dimensions.
```python
def dominates(row, rowCandidate):
    return all(r >= rc for r, rc in zip(row, rowCandidate))


def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated
```
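For completeness, here is how this version behaves on a tiny hypothetical maximization example (the implementation is repeated so the snippet runs on its own):

```python
def dominates(row, rowCandidate):
    # row dominates rowCandidate when it is >= in every coordinate
    return all(r >= rc for r, rc in zip(row, rowCandidate))

def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            # route each point to either the survivors or the dominated pile
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated

# [2, 2] dominates every other point in this toy set
cleared, dominated = cull([[1, 2], [2, 1], [0, 0], [2, 2]], dominates)
print(cleared)  # [[2, 2]]
```

Note that `cull` returns both the Pareto-efficient points and the dominated ones, which is handy when you want to inspect what was filtered out.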