Use one row per example. Required fields are label and score.
label must be 0 or 1. score should be between 0 and 1.
JSON input also accepts the aliases y_true or target for label, and prob or pred for score.
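A minimal sketch of that row format and alias handling, assuming a normalization step like the one below (the helper name and logic are illustrative, not the tool's actual loader):

```python
# Illustrative rows in the accepted format; the alias forms mirror the note above.
rows = [
    {"label": 1, "score": 0.92},   # canonical field names
    {"label": 0, "score": 0.13},
    {"y_true": 1, "prob": 0.71},   # alias forms the JSON input also accepts
]

def normalize(row):
    """Map alias field names onto the required label/score pair (hypothetical helper)."""
    label = row.get("label", row.get("y_true", row.get("target")))
    score = row.get("score", row.get("prob", row.get("pred")))
    assert label in (0, 1), "label must be 0 or 1"
    assert 0.0 <= score <= 1.0, "score should be between 0 and 1"
    return {"label": label, "score": score}

print([normalize(r) for r in rows])
```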
Live Precision Story
Prevalence changes the group sizes, not the rates
PPV = TPR · Prev / (TPR · Prev + FPR · (1−Prev))
TPR and FPR are rates inside each class. Changing prevalence does not change those rates. It changes how many people those rates are applied to.
Reference cohort of 10,000 — the mosaic above uses your N slider
1. Out of 10,000 people, split them into the actually positive group and the actually negative group.
2. TPR is measured only within the positive group — it is set by the model, not by how large that group is.
3. FPR is the share of actually negative cases wrongly flagged. FPR is measured only within the negative group — it also does not shift when prevalence changes.
4. Precision = TP / (TP + FP) — dividing true positives by all predicted positives.
What changes is precision, because the same rates are applied to bigger or smaller groups.
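A minimal sketch of that walkthrough, holding TPR and FPR fixed while prevalence varies; the specific rates (TPR = 0.90, FPR = 0.10) and the cohort size are illustrative assumptions, not values taken from the sliders:

```python
# Fixed class-conditional rates applied to groups of changing size.
def ppv(tpr, fpr, prev, n=10_000):
    pos = n * prev                 # actually positive group
    neg = n * (1 - prev)           # actually negative group
    tp = tpr * pos                 # rate applied inside the positive group
    fp = fpr * neg                 # rate applied inside the negative group
    return tp / (tp + fp)          # precision = TP / (TP + FP)

for prev in (0.01, 0.05, 0.20, 0.50):
    print(f"prevalence={prev:.2f}  PPV={ppv(tpr=0.90, fpr=0.10, prev=prev):.3f}")
```

The rates never change inside the loop; only the group sizes do, and precision moves with them.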
Bayes' Theorem
P(D|+) = P(+|D) · P(D) / P(+)
The foundation of probabilistic classification. Given a positive test result (+), what is the probability the patient actually has the disease (D)? The answer depends on three things: how good the test is at catching true cases (likelihood), how common the disease is (prior), and how often the test fires overall (evidence). This is why a 99% accurate test can still give mostly false positives for a rare disease.
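A quick numeric check of that last sentence, under assumed numbers (sensitivity = specificity = 0.99, prevalence 0.1%):

```python
# Bayes' theorem with an excellent test and a rare disease.
sens, spec, prior = 0.99, 0.99, 0.001

p_pos = sens * prior + (1 - spec) * (1 - prior)    # P(+), the evidence term
posterior = sens * prior / p_pos                    # P(D|+) = P(+|D) · P(D) / P(+)
print(f"P(D|+) = {posterior:.3f}")                  # ~0.090: most positives are false alarms
```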
NPV (Negative Predictive Value)
NPV = P(D̄|−) = TN / (TN + FN)
The mirror of PPV: "If the model says negative, how likely is that correct?" When prevalence is low, most negative predictions are correct (high NPV) because negatives dominate the population. As prevalence increases, the fraction of missed positives (FN) grows relative to true negatives, and NPV drops. NPV and PPV move in opposite directions as you change prevalence, which is a key Bayesian insight.
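A small sketch of that opposite movement, with TPR and TNR held fixed at illustrative values:

```python
# PPV rises and NPV falls as prevalence increases, for the same fixed rates.
def ppv_npv(tpr, tnr, prev):
    fpr, fnr = 1 - tnr, 1 - tpr
    ppv = tpr * prev / (tpr * prev + fpr * (1 - prev))
    npv = tnr * (1 - prev) / (tnr * (1 - prev) + fnr * prev)
    return ppv, npv

for prev in (0.01, 0.10, 0.50, 0.90):
    p, n = ppv_npv(tpr=0.95, tnr=0.90, prev=prev)
    print(f"prev={prev:.2f}  PPV={p:.3f}  NPV={n:.3f}")
```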
Sensitivity (Recall / TPR)
TPR = P(+|D) = TP / (TP + FN)
Of all the actual positives, how many did the model catch? A sensitivity of 0.95 means 95% of true cases are detected, but 5% are missed (false negatives). Critical in medical screening: missing a cancer diagnosis (FN) is usually worse than a false alarm (FP). Sensitivity is intrinsic to the model and threshold, not affected by prevalence.
Specificity (TNR)
TNR = P(−|D̄) = TN / (TN + FP)
Of all the actual negatives, how many did the model correctly identify? Specificity of 0.90 means 10% of healthy patients receive a false positive. Like sensitivity, specificity is independent of prevalence. High specificity is important when false positives are costly, such as unnecessary surgeries or wrongful convictions.
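A minimal sketch of why both rates ignore prevalence: they are computed from counts inside a single class, so scaling that class up or down leaves the ratio untouched. The counts below are illustrative assumptions.

```python
# Sensitivity and specificity from raw confusion-matrix counts.
def sensitivity(tp, fn):
    return tp / (tp + fn)          # TPR = TP / (TP + FN)

def specificity(tn, fp):
    return tn / (tn + fp)          # TNR = TN / (TN + FP)

print(sensitivity(tp=95, fn=5))    # 0.95
print(sensitivity(tp=950, fn=50))  # still 0.95 with 10x more positives
print(specificity(tn=900, fp=100)) # 0.90
```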
Accuracy
Acc = (TP + TN) / N
The most intuitive metric: what fraction of all predictions were correct? However, accuracy is misleading with imbalanced classes. A model that always predicts "negative" on a dataset with 95% negatives achieves 95% accuracy while catching zero true positives. Compare it with the Prevalence slider at extreme values to see this paradox in action.
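A tiny sketch of that paradox, assuming a 1,000-example cohort with 5% positives and a model that always predicts negative:

```python
# The "always negative" model on an imbalanced cohort.
n, prevalence = 1_000, 0.05
positives = int(n * prevalence)        # 50 actual positives
negatives = n - positives              # 950 actual negatives

tp, fp, tn, fn = 0, 0, negatives, positives   # every prediction is "negative"
accuracy = (tp + tn) / n
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")   # 0.95 accuracy, 0.00 recall
```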
F1 Score
F1 = 2 · PPV · TPR / (PPV + TPR)
The harmonic mean of precision and recall. The harmonic mean punishes extreme imbalance: if either precision or recall is near zero, F1 collapses. An F1 of 0.8 guarantees that both precision and recall are at least 0.67. It is prevalence-dependent (through precision) and is most useful when you care equally about false positives and false negatives.
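A short check of the bound quoted above, solving 2·P·R / (P + R) = 0.8 at the extremes:

```python
# Harmonic mean of precision and recall.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(p=2/3, r=1.0))    # 0.8: the extreme case, recall maxed out, precision at 2/3
print(f1(p=0.8, r=0.8))    # 0.8: the balanced case
print(f1(p=0.5, r=1.0))    # ~0.667: with precision below 2/3, F1 cannot reach 0.8
```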
AUC (Area Under ROC Curve)
AUC = ∫₀¹ TPR d(FPR)
AUC measures the model's ability to discriminate between classes across all possible thresholds. It equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. AUC = 0.5 means random guessing (diagonal ROC), AUC = 1.0 means perfect separation. AUC is threshold-independent and prevalence-independent, making it ideal for comparing models.
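A minimal sketch of that probabilistic reading, comparing every positive score against every negative score (ties count half); the scores are illustrative assumptions:

```python
# AUC as the probability that a random positive outscores a random negative.
def auc_pairwise(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2, 0.1]
print(auc_pairwise(pos, neg))   # 0.9375: 15 of the 16 pairs are ranked correctly
```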
MCC (Matthews Correlation Coefficient)
MCC = (TP · TN − FP · FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC is a correlation coefficient between the observed and predicted binary classifications, ranging from −1 (total disagreement) through 0 (random) to +1 (perfect). Unlike F1 and accuracy, MCC uses all four confusion matrix quadrants and remains reliable even with highly imbalanced datasets.
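A small sketch contrasting MCC with accuracy on an imbalanced cohort; the counts are illustrative assumptions, and MCC is set to 0 by convention when the denominator vanishes:

```python
from math import sqrt

# MCC from all four confusion-matrix cells.
def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Both models score ~0.94-0.95 accuracy on a 95%-negative cohort,
# but only the second one actually discriminates.
print(mcc(tp=0, tn=950, fp=0, fn=50))     # 0.0: the "always negative" model
print(mcc(tp=40, tn=900, fp=50, fn=10))   # ~0.57: a model with real signal
```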
Likelihood Ratios
LR+ = TPR / FPR | LR− = FNR / TNR
Likelihood ratios express how much a test result shifts the odds. LR+ tells you how many times more likely a positive result is in someone who actually has the condition than in someone who does not. LR+ > 10 is considered strong diagnostic evidence; LR− < 0.1 is strong evidence for ruling out. Unlike PPV/NPV, likelihood ratios are independent of prevalence.
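A minimal sketch of the odds form of Bayes' theorem (post-test odds = pre-test odds × LR); the test characteristics and the 2% prevalence are illustrative assumptions:

```python
# Likelihood ratios and a single odds update after a positive result.
def likelihood_ratios(tpr, fpr):
    lr_pos = tpr / fpr                  # LR+ = TPR / FPR
    lr_neg = (1 - tpr) / (1 - fpr)      # LR− = FNR / TNR
    return lr_pos, lr_neg

tpr, fpr, prev = 0.90, 0.05, 0.02
lr_pos, lr_neg = likelihood_ratios(tpr, fpr)

pre_odds = prev / (1 - prev)
post_odds = pre_odds * lr_pos                    # after a positive result
post_prob = post_odds / (1 + post_odds)          # back to a probability; this equals PPV
print(f"LR+={lr_pos:.1f}  LR−={lr_neg:.2f}  P(D|+)={post_prob:.3f}")
```

Note that the likelihood ratios themselves never depend on prev; only the pre-test odds bring prevalence into the update.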