How to calculate the partial area under the curve (AUC)

Question

How to calculate the partial area under the curve (AUC)

In scikit-learn, you can calculate the area under the ROC curve for a binary classifier with

roc_auc_score( Y, clf.predict_proba(X)[:,1] )

I'm only interested in the part of the curve where the false positive rate is less than 0.1.

Given such a threshold false positive rate, how can I calculate the AUC for only the part of the curve up to that threshold?

Here is an example with several ROC curves to illustrate:

Scikit learn docs show how to use roc_curve

    >>> import numpy as np
    >>> from sklearn import metrics
    >>> y = np.array([1, 1, 2, 2])
    >>> scores = np.array([0.1, 0.4, 0.35, 0.8])
    >>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
    >>> fpr
    array([ 0. ,  0.5,  0.5,  1. ])
    >>> tpr
    array([ 0.5,  0.5,  1. ,  1. ])
    >>> thresholds
    array([ 0.8 ,  0.4 ,  0.35,  0.1 ])

Is there an easy way to go from this to the partial AUC?

The only problem seems to be how to calculate the tpr value at fpr = 0.1, since roc_curve does not necessarily give you this.

python scikit-learn statistics machine-learning

eleanora Sep 16 '16 at 17:51

5 answers

Ami tavory · Answer 1 · 2016-09-25T13:06:56+0000

Let's say we start with

    import numpy as np
    from sklearn import metrics

Now we set the true y and the predicted scores:

    y = np.array([0, 0, 1, 1])
    scores = np.array([0.1, 0.4, 0.35, 0.8])

(Note that y is shifted down by 1 from your problem. This is immaterial: exactly the same results (fpr, tpr, thresholds, etc.) are obtained whether the labels are 1, 2 or 0, 1, but some sklearn.metrics functions are awkward to use if the labels are not 0, 1.)

Let's see the AUC here:

    >>> metrics.roc_auc_score(y, scores)
    0.75

As in your example:

    >>> fpr, tpr, thresholds = metrics.roc_curve(y, scores)
    >>> fpr, tpr
    (array([ 0. ,  0.5,  0.5,  1. ]), array([ 0.5,  0.5,  1. ,  1. ]))

This gives the following graph:

 plot([0, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 1], [0.5, 1], [1, 1]);

By construction, the ROC for a finite-length y is composed of rectangles:

At a sufficiently high threshold, everything is classified as negative.
As the threshold decreases continuously, at discrete points some negative classifications change to positive.

Thus, for a finite y, the ROC is always a sequence of connected horizontal and vertical lines leading from (0, 0) to (1, 1).

AUC is the sum of these rectangles. Here, as shown above, the AUC is 0.75, since the rectangles have areas of 0.5 * 0.5 + 0.5 * 1 = 0.75.

In some cases, people choose to calculate the AUC by linear interpolation. Say the length of y is much larger than the actual number of points calculated for FPR and TPR. Then linear interpolation is an approximation of what might lie between the computed points. In other cases, people assume that, had y been large enough, the points in between would fall on a straight line. sklearn.metrics does not make this assumption, and to obtain results consistent with sklearn.metrics you must use rectangular rather than trapezoidal summation.
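To make the difference between the two rules concrete, here they are written out. The notation is introduced only for this illustration: the points (x_i, y_i) = (FPR_i, TPR_i) are sorted by FPR, with one point kept per distinct FPR value (the one with the highest TPR).

```latex
\begin{align*}
\mathrm{AUC}_{\text{rect}} &= \sum_{i=0}^{n-2} (x_{i+1} - x_i)\, y_i \\
\mathrm{AUC}_{\text{trap}} &= \sum_{i=0}^{n-2} (x_{i+1} - x_i)\, \frac{y_i + y_{i+1}}{2}
\end{align*}
```

For the points (0, 0.5), (0.5, 1), (1, 1) from the example, the rectangular rule gives 0.5 * 0.5 + 0.5 * 1 = 0.75, while the trapezoidal rule gives 0.5 * 0.75 + 0.5 * 1 = 0.875.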

Let's write our own function to calculate AUC directly from fpr and tpr :

    def auc_from_fpr_tpr(fpr, tpr, trapezoid=False):
        # Keep the last point of each run of equal FPR values, plus the final point.
        inds = [i for (i, (s, e)) in enumerate(zip(fpr[:-1], fpr[1:])) if s != e] + [len(fpr) - 1]
        fpr, tpr = fpr[inds], tpr[inds]
        area = 0
        # list() is needed on Python 3, where zip returns an iterator that cannot be sliced.
        ft = list(zip(fpr, tpr))
        for p0, p1 in zip(ft[:-1], ft[1:]):
            # Width times either the mean height (trapezoid) or the left height (rectangle).
            area += (p1[0] - p0[0]) * ((p1[1] + p0[1]) / 2 if trapezoid else p0[1])
        return area

This function takes the FPR and TPR arrays and an optional parameter indicating whether trapezoidal summation should be used. Running it, we get:

    >>> auc_from_fpr_tpr(fpr, tpr), auc_from_fpr_tpr(fpr, tpr, True)
    (0.75, 0.875)

We get the same result as sklearn.metrics for rectangular summation, and a different, higher result for trapezoidal summation.

So now we just need to see what happens to the FPR/TPR points if we cut the curve off at an FPR of 0.1. We can do this with the bisect module:

    import bisect

    def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
        # Index of the first FPR value >= thresh.
        p = bisect.bisect_left(fpr, thresh)
        fpr = fpr.copy()
        # Shorten the last kept rectangle so that it ends at thresh.
        fpr[p] = thresh
        return fpr[:p + 1], tpr[:p + 1]

How does it work? It simply finds the insertion point of thresh in fpr. Given the properties of the FPR (it must start at 0), the insertion point must lie on a horizontal segment. Thus, all rectangles before it are unaffected, all rectangles after it are removed, and the one containing it is possibly shortened.
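The insertion-point lookup can be tried on a plain list. This is only a small sketch: `insertion_index` is a hypothetical helper name used for illustration, and the FPR values are the ones from the running example.

```python
import bisect

# FPR values from the running example (sorted, non-decreasing).
fpr = [0.0, 0.5, 0.5, 1.0]


def insertion_index(sorted_values, cutoff):
    """Return the index of the first element >= cutoff (hypothetical helper)."""
    return bisect.bisect_left(sorted_values, cutoff)


# 0.1 falls between fpr[0] = 0.0 and fpr[1] = 0.5, so the index is 1.
print(insertion_index(fpr, 0.1))

# With repeated values, bisect_left lands on the first occurrence.
print(insertion_index(fpr, 0.5))
```

Every point before the returned index lies strictly below the cutoff, which is why those rectangles can be kept unchanged.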

Let's apply it:

    >>> fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, 0.1)
    >>> fpr_thresh, tpr_thresh
    (array([ 0. ,  0.1]), array([ 0.5,  0.5]))

Finally, we just need to calculate the AUC from the updated versions:

    >>> auc_from_fpr_tpr(fpr_thresh, tpr_thresh), auc_from_fpr_tpr(fpr_thresh, tpr_thresh, True)
    (0.050000000000000003, 0.050000000000000003)

In this case, both rectangular and trapezoidal summation give the same result. Note that in general they will not. For consistency with sklearn.metrics, you should use the first.

Prune · Answer 2 · 2016-09-16T19:12:32+0000

It depends on whether the FPR is the x-axis or the y-axis (the independent or the dependent variable).

If it is x, the calculation is trivial: compute the area only over the range [0.0, 0.1].

If it is y, you first need to solve the curve for y = 0.1. This splits the x-axis into regions where you need to compute the area under the curve, and regions that are simple rectangles with a height of 0.1.

To illustrate, suppose you find that the function exceeds 0.1 in two ranges: [x1, x2] and [x3, x4]. Calculate the area under the curve over the ranges

    [0, x1]
    [x2, x3]
    [x4, ...]

To this, add the rectangles under y = 0.1 for the two intervals you found:

    area += (x2-x1 + x4-x3) * 0.1

Is that enough to get you moving?

Francis charette migneault · Answer 3 · 2016-09-24T17:16:35+0000

Calculate fpr and tpr only in the range [0.0, 0.1].

You can then use numpy.trapz to evaluate the partial AUC (pAUC) as follows:

 pAUC = numpy.trapz(tpr_array, fpr_array)

This function uses the composite trapezoidal rule to evaluate the area under the curve.
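One way to restrict the curve to [0.0, 0.1] before integrating is to cut it at the cutoff with linear interpolation. The sketch below does this in plain Python so that it is self-contained. `partial_auc_trapezoid` is a hypothetical name, not a library function, and the example points assume a curve that starts at the origin (0, 0).

```python
import bisect


def partial_auc_trapezoid(fpr, tpr, max_fpr):
    """Raw area under the piecewise-linear ROC curve for FPR in [fpr[0], max_fpr].

    Assumes fpr is sorted in ascending order and fpr[0] <= max_fpr <= fpr[-1].
    """
    fpr, tpr = list(fpr), list(tpr)
    if not fpr[0] <= max_fpr <= fpr[-1]:
        raise ValueError("max_fpr must lie within the range of fpr")
    # Index of the first point strictly beyond max_fpr.
    stop = bisect.bisect_right(fpr, max_fpr)
    xs, ys = fpr[:stop], tpr[:stop]
    if stop < len(fpr) and xs[-1] < max_fpr:
        # Linearly interpolate the TPR at exactly max_fpr.
        x0, y0, x1, y1 = fpr[stop - 1], tpr[stop - 1], fpr[stop], tpr[stop]
        xs.append(max_fpr)
        ys.append(y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0))
    # Composite trapezoidal rule over the clipped points.
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


# Example curve, assumed to include the origin.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
print(partial_auc_trapezoid(fpr, tpr, 0.1))  # area of a 0.1-wide strip of height 0.5
```

With NumPy arrays the same two steps could be done with numpy.interp followed by numpy.trapz (available as numpy.trapezoid from NumPy 2.0). Recent scikit-learn releases also accept a max_fpr argument to roc_auc_score. It reports a standardized partial AUC, so its value will generally differ from the raw area computed here.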

Srik · Answer 4 · 2018-02-11T06:36:18+0000

I implemented the current best answer, and it did not give correct results in all circumstances. I reimplemented it and tested the implementation below. I also used the built-in trapezoidal AUC function rather than re-creating it from scratch.

    import bisect

    from sklearn.metrics import auc, roc_curve


    def line(x_coords, y_coords):
        """
        Given a pair of coordinates (x1,y1), (x2,y2), define the line equation.
        Note that this is the entire line vs. the line segment.

        Parameters
        ----------
        x_coords: Numpy array of 2 points corresponding to x1,x2
        y_coords: Numpy array of 2 points corresponding to y1,y2

        Returns
        -------
        (Gradient, intercept) tuple pair
        """
        if (x_coords.shape[0] < 2) or (y_coords.shape[0] < 2):
            raise ValueError('At least 2 points are needed to compute'
                             ' area under curve, but x.shape = %s' % (x_coords.shape,))
        if ((x_coords[0]-x_coords[1]) == 0):
            raise ValueError("gradient is infinity")
        gradient = (y_coords[0]-y_coords[1])/(x_coords[0]-x_coords[1])
        intercept = y_coords[0] - gradient*1.0*x_coords[0]
        return (gradient, intercept)


    def x_val_line_intercept(gradient, intercept, x_val):
        """
        Given an x=x_val vertical line, what is the intersection point of that
        line with the line defined by the gradient and intercept.
        Note: This can be further improved by using line segments.

        Parameters
        ----------
        gradient
        intercept

        Returns
        -------
        (x_val, y) corresponding to the intercepted point. Note that this will
        always return a result. There is no check for whether the x_val is
        within the bounds of the line segment.
        """
        y = gradient*x_val + intercept
        return (x_val, y)


    def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
        """
        Derive the partial ROC curve up to the point given by the fpr threshold.

        Parameters
        ----------
        fpr: Numpy array of the sorted FPR points that represent the entirety of the ROC.
        tpr: Numpy array of the sorted TPR points that represent the entirety of the ROC.
        thresh: The FPR threshold up to which the partial ROC is extracted.

        Returns
        -------
        thresh_fpr: The FPR points that represent the partial ROC up to the fpr threshold.
        thresh_tpr: The TPR points that represent the partial ROC up to the fpr threshold.
        """
        p = bisect.bisect_left(fpr, thresh)
        thresh_fpr = fpr[:p+1].copy()
        thresh_tpr = tpr[:p+1].copy()
        # Replace the last point with the interpolated point at fpr == thresh.
        g, i = line(fpr[p-1:p+1], tpr[p-1:p+1])
        new_point = x_val_line_intercept(g, i, thresh)
        thresh_fpr[p] = new_point[0]
        thresh_tpr[p] = new_point[1]
        return thresh_fpr, thresh_tpr


    def partial_auc_scorer(y_actual, y_pred, decile=1):
        """
        Derive the AUC of the partial ROC curve from FPR=0 to FPR=decile threshold.

        Parameters
        ----------
        y_actual: Numpy array of the actual labels.
        y_pred: Numpy array of the predicted probability scores.
        decile: The FPR threshold up to which the partial ROC is extracted.

        Returns
        -------
        AUC of the partial ROC. A value that ranges from 0 to 1.
        """
        # Keep the last column of each row (the positive-class probability).
        y_pred = list(map(lambda x: x[-1], y_pred))
        fpr, tpr, _ = roc_curve(y_actual, y_pred, pos_label=1)
        fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, decile)
        return auc(fpr_thresh, tpr_thresh)

plmedici · Answer 5 · 2017-08-23T21:02:23+0000

@eleanora I think your impulse to use sklearn's generic metrics.auc method is right (that is what I did). It should be straightforward once you have your sets of tpr and fpr points (and you can use scipy's interpolation methods to approximate exact points in either series).
