Updated August 2019
Here is another simple implementation that is pretty fast for modest sizes. Input points are assumed to be unique.
```python
def keep_efficient(pts):
    'returns Pareto efficient row subset of pts'
```
We start by sorting the points by the sum of their coordinates. This is useful because:

- For many data distributions, the point with the largest coordinate sum dominates a large number of other points.
- If point `x` has a larger coordinate sum than point `y`, then `y` cannot dominate `x`.
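Based on that description, here is a minimal sketch of `keep_efficient` (my reconstruction, assuming maximization semantics: a point dominates another when it is greater than or equal in every coordinate, with the inputs unique as stated above):

```python
import numpy as np

def keep_efficient(pts):
    'returns Pareto efficient row subset of pts'
    # sort rows by decreasing coordinate sum: a row can only be
    # dominated by rows that appear earlier in this order
    pts = pts[pts.sum(axis=1).argsort()[::-1]]
    undominated = np.ones(pts.shape[0], dtype=bool)
    for i in range(pts.shape[0]):
        if not undominated[i]:
            # anything this row dominates is already marked via its dominator
            continue
        # a later row survives only if it beats row i in some coordinate;
        # since points are unique, anything <= row i everywhere is dominated
        undominated[i + 1:] &= (pts[i + 1:] > pts[i]).any(axis=1)
    return pts[undominated]
```

The sort is what makes the inner filter one-directional: each row only needs to be tested against rows with a larger coordinate sum.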
Here are some benchmarks against Peter's `is_pareto_efficient`, using points drawn from `np.random.randn`.
```
N=10000 d=2
keep_efficient
1.31 ms ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
is_pareto_efficient
6.51 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

N=10000 d=3
keep_efficient
2.3 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
is_pareto_efficient
16.4 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

N=10000 d=4
keep_efficient
4.37 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
is_pareto_efficient
21.1 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

N=10000 d=5
keep_efficient
15.1 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
is_pareto_efficient
110 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

N=10000 d=6
keep_efficient
40.1 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
is_pareto_efficient
279 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

N=10000 d=15
keep_efficient
3.92 s ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
is_pareto_efficient
5.88 s ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Convex hull heuristic
I recently looked at this problem and found a useful heuristic that works well when there are many points, the coordinates are distributed independently, and the number of dimensions is small.
The idea is to compute the convex hull of the points. With few dimensions and independently distributed points, the number of vertices of the convex hull is small. Intuitively, we can expect some vertices of the convex hull to dominate many of the original points. Moreover, if a point on the convex hull is not dominated by any other point on the convex hull, then it is also not dominated by any point in the original set.
This gives a simple iterative algorithm. Repeatedly:

1. Compute the convex hull.
2. Save the undominated (Pareto) points from the convex hull.
3. Filter out the points that are dominated by convex hull elements.

I added a few benchmarks to test this. It seems that for some point distributions, this approach gives better asymptotics.
```python
import numpy as np
from scipy import spatial
from functools import reduce

# test points
pts = np.random.rand(10_000_000, 3)


def filter_(pts, pt):
    """
    Get all points in pts that are not Pareto dominated by the point pt
    """
    weakly_worse   = (pts <= pt).all(axis=-1)
    strictly_worse = (pts < pt).any(axis=-1)
    return pts[~(weakly_worse & strictly_worse)]


def get_pareto_undominated_by(pts1, pts2=None):
    """
    Return all points in pts1 that are not Pareto dominated
    by any points in pts2
    """
    if pts2 is None:
        pts2 = pts1
    return reduce(filter_, pts2, pts1)


def get_pareto_frontier(pts):
    """
    Iteratively filter points based on the convex hull heuristic
    """
    pareto_groups = []

    # loop while there are points remaining
    while pts.shape[0]:
        # brute force if there are few points:
        if pts.shape[0] < 10:
            pareto_groups.append(get_pareto_undominated_by(pts))
            break

        # compute vertices of the convex hull
        hull_vertices = spatial.ConvexHull(pts).vertices

        # get corresponding points
        hull_pts = pts[hull_vertices]

        # get points in pts that are not convex hull vertices
        nonhull_mask = np.ones(pts.shape[0], dtype=bool)
        nonhull_mask[hull_vertices] = False
        pts = pts[nonhull_mask]

        # get points in the convex hull that are on the Pareto frontier
        pareto = get_pareto_undominated_by(hull_pts)
        pareto_groups.append(pareto)

        # filter remaining points to keep those not dominated by
        # Pareto points of the convex hull
        pts = get_pareto_undominated_by(pts, pareto)

    return np.vstack(pareto_groups)


# --------------------------------------------------------------------------------
# previous solutions
# --------------------------------------------------------------------------------

def is_pareto_efficient_dumb(costs):
    """
    :param costs: An (n_points, n_costs) array
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        is_efficient[i] = np.all(np.any(costs >= c, axis=1))
    return is_efficient


def is_pareto_efficient(costs):
    """
    :param costs: An (n_points, n_costs) array
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            # Remove dominated points
            is_efficient[is_efficient] = np.any(costs[is_efficient] <= c, axis=1)
    return is_efficient


def dominates(row, rowCandidate):
    return all(r >= rc for r, rc in zip(row, rowCandidate))


def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated


# --------------------------------------------------------------------------------
# benchmarking
# --------------------------------------------------------------------------------

# to accommodate the original non-numpy solution
pts_list = [list(pt) for pt in pts]

import timeit

# print('Old non-numpy solution:\t{}'.format(
#     timeit.timeit('cull(pts_list, dominates)', number=3, globals=globals())))

print('Numpy solution:\t{}'.format(
    timeit.timeit('is_pareto_efficient(pts)', number=3, globals=globals())))

print('Convex hull heuristic:\t{}'.format(
    timeit.timeit('get_pareto_frontier(pts)', number=3, globals=globals())))
```
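As a quick sanity check of the filtering helpers (note that `filter_` treats larger coordinates as better, i.e. maximization), here they are restated so the snippet runs standalone, applied to a small hypothetical array:

```python
import numpy as np
from functools import reduce

def filter_(pts, pt):
    # drop rows that pt weakly dominates (<= everywhere, < somewhere)
    weakly_worse = (pts <= pt).all(axis=-1)
    strictly_worse = (pts < pt).any(axis=-1)
    return pts[~(weakly_worse & strictly_worse)]

def get_pareto_undominated_by(pts1, pts2=None):
    if pts2 is None:
        pts2 = pts1
    return reduce(filter_, pts2, pts1)

pts = np.array([[0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [0.4, 0.4]])
# [0.4, 0.4] is dominated by [0.5, 0.5]; the other three are incomparable
print(get_pareto_undominated_by(pts))
```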
Results:

```
$ python temp.py
```
Original answer
I tried rewriting the same algorithm with a few tweaks. I think most of your problems come from `inputPoints.remove(row)`: it has to search through the list of points, so removing by index would be much more efficient. I also changed the `dominates` function to avoid some redundant comparisons, which can help in higher dimensions.
```python
def dominates(row, rowCandidate):
    return all(r >= rc for r, rc in zip(row, rowCandidate))


def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated
```
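For completeness, here is how this version behaves on a tiny hypothetical maximization example (the implementation is repeated so the snippet runs on its own):

```python
def dominates(row, rowCandidate):
    # row dominates rowCandidate when it is >= in every coordinate
    return all(r >= rc for r, rc in zip(row, rowCandidate))

def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            # route each point to either the survivors or the dominated pile
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated

# [2, 2] dominates every other point in this toy set
cleared, dominated = cull([[1, 2], [2, 1], [0, 0], [2, 2]], dominates)
print(cleared)  # [[2, 2]]
```

Note that `cull` returns both the Pareto-efficient points and the dominated ones, which is handy when you want to inspect what was filtered out.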