AUC and AUPRC for non-informative or “no-skill” risk score

Published:

Doing some math to show that the area under the receiver-operating characteristic curve (AUROC or AUC) for a non-informative or “no-skill” risk score or classifier (e.g., coin flip) is 0.5, and the area under the precision-recall curve (AUPRC) is the outcome prevalence.

Let $X$ be a random variable representing a continuous risk score with survivor function $S_X(x) = P(X > x)$ and $D$ a dichotomous random variable with a $Bernoulli(\pi)$ distribution representing, say, disease status. I would characterize a non-informative or no-skill risk score as one that can’t discriminate between individuals who have and do not have the disease, i.e., $X$ and $D$ are independent given by $P(X\vert D) = P(X)$. This implies that $S_X(c) = P(X > c) = P(X > c\vert D = 1) = P(X > c\vert D=0)$, which I will use.

AUROC for non-informative classifier is 0.5

Recall that points on the ROC curve are given by $\lbrace\left(S_X(c\vert D=0), S_X(c\vert D=1)\right)\rbrace$, spanning over all possible threshold values $c$. For a non-informative risk score, recall that $S_X(c) = P(X > c) = P(X > c\vert D = 1) = P(X > c\vert D=0)$. It follows that the points on the ROC curve are then given by $\lbrace\left(S_X(c), S_X(c)\right)\rbrace,$ for all possible values of $c$, i.e., the diagonal line between (0,0) and (1,1). It follows that the area under the curve (diagonal line) is 0.5.

AUPRC for non-informative classifier is the outcome prevalence

The vertical axis for the precision-recall curve represents the precision := PPV = $P(D=1|X>c)$. Using Bayes’ rule and $P(X > c\vert D = 1) = P(X > c\vert D=0)$ for a non-informative risk score:

\[\begin{align*} P(D=1|X>c) &= \displaystyle\frac{P(X>c|D=1)P(D=1)}{P(X>c|D=1)P(D=1) + P(X>c|D=0)P(D=0)} \\ &= \displaystyle\frac{P(X>c|D=1)P(D=1)}{P(X>c|D=1)P(D=1) + P(X>c|D=1)P(D=0)} \\ &= \displaystyle\frac{P(X>c|D=1)P(D=1)}{P(X>c|D=1)[P(D=1) + P(D=0)]} \\ &= \displaystyle\frac{P(X>c|D=1)P(D=1)}{P(X>c|D=1)} \\ &= P(D=1) \end{align*}\]

for any value $c$. So the “y” value is always $P(D)=\pi$ (i.e., a horizontal line), the outcome prevalence. The horizontal axis is the recall := sensitivity = $P(X > c \vert D=1)$, which is a probability ranging between 0 and 1. Therefore the area under the precision-recall curve is the area under the horizontal line with value $y=\pi$ over the range of x-values between 0 and 1, i.e., $\pi$.