Use one row per example. Required fields are label and score.
label must be 0 or 1. score should be between 0 and 1.
JSON input also accepts the aliases y_true or target for label, and prob or pred for score.
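A minimal sketch of that row format and alias handling, assuming a normalization step like the one below (the helper name and logic are illustrative, not the tool's actual loader):

```python
# Illustrative rows in the accepted format; the alias forms mirror the note above.
rows = [
    {"label": 1, "score": 0.92},   # canonical field names
    {"label": 0, "score": 0.13},
    {"y_true": 1, "prob": 0.71},   # alias forms the JSON input also accepts
]

def normalize(row):
    """Map alias field names onto the required label/score pair (hypothetical helper)."""
    label = row.get("label", row.get("y_true", row.get("target")))
    score = row.get("score", row.get("prob", row.get("pred")))
    assert label in (0, 1), "label must be 0 or 1"
    assert 0.0 <= score <= 1.0, "score should be between 0 and 1"
    return {"label": label, "score": score}

print([normalize(r) for r in rows])
```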
Live Precision Story
Prevalence changes the group sizes, not the rates
PPV = TPR · Prev / (TPR · Prev + FPR · (1−Prev))
TPR and FPR are rates inside each class. Changing prevalence does not change those rates. It changes how many people those rates are applied to.
Reference cohort of 10,000 — the mosaic above uses your N slider
1. Out of 10,000 people, split them into the actually positive group and the actually negative group.
2. TPR is measured only within the positive group — it is set by the model, not by how large that group is.
3. FPR is the share of actually negative cases wrongly flagged. FPR is measured only within the negative group — it also does not shift when prevalence changes.
4. Precision = TP / (TP + FP) — dividing true positives by all predicted positives.
What changes is precision, because the same rates are applied to bigger or smaller groups.
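A minimal sketch of that walkthrough, holding TPR and FPR fixed while prevalence varies; the specific rates (TPR = 0.90, FPR = 0.10) and the cohort size are illustrative assumptions, not values taken from the sliders:

```python
# Fixed class-conditional rates applied to groups of changing size.
def ppv(tpr, fpr, prev, n=10_000):
    pos = n * prev                 # actually positive group
    neg = n * (1 - prev)           # actually negative group
    tp = tpr * pos                 # rate applied inside the positive group
    fp = fpr * neg                 # rate applied inside the negative group
    return tp / (tp + fp)          # precision = TP / (TP + FP)

for prev in (0.01, 0.05, 0.20, 0.50):
    print(f"prevalence={prev:.2f}  PPV={ppv(tpr=0.90, fpr=0.10, prev=prev):.3f}")
```

The rates never change inside the loop; only the group sizes do, and precision moves with them.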
Bayes' Theorem
P(D|+) = P(+|D) · P(D) / P(+)
The foundation of probabilistic classification. Given a positive test result (+), what is the probability the patient actually has the disease (D)? The answer depends on three things: how good the test is at catching true cases (likelihood), how common the disease is (prior), and how often the test fires overall (evidence). This is why a 99% accurate test can still give mostly false positives for a rare disease.
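A quick numeric check of that last sentence, under assumed numbers (sensitivity = specificity = 0.99, prevalence 0.1%):

```python
# Bayes' theorem with an excellent test and a rare disease.
sens, spec, prior = 0.99, 0.99, 0.001

p_pos = sens * prior + (1 - spec) * (1 - prior)    # P(+), the evidence term
posterior = sens * prior / p_pos                    # P(D|+) = P(+|D) · P(D) / P(+)
print(f"P(D|+) = {posterior:.3f}")                  # ~0.090: most positives are false alarms
```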
NPV (Negative Predictive Value)
NPV = P(D̄|−) = TN / (TN + FN)
The mirror of PPV: "If the model says negative, how likely is that correct?" When prevalence is low, most negative predictions are correct (high NPV) because negatives dominate the population. As prevalence increases, the fraction of missed positives (FN) grows relative to true negatives, and NPV drops. NPV and PPV move in opposite directions as you change prevalence, which is a key Bayesian insight.
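A small sketch of that opposite movement, with TPR and TNR held fixed at illustrative values:

```python
# PPV rises and NPV falls as prevalence increases, for the same fixed rates.
def ppv_npv(tpr, tnr, prev):
    fpr, fnr = 1 - tnr, 1 - tpr
    ppv = tpr * prev / (tpr * prev + fpr * (1 - prev))
    npv = tnr * (1 - prev) / (tnr * (1 - prev) + fnr * prev)
    return ppv, npv

for prev in (0.01, 0.10, 0.50, 0.90):
    p, n = ppv_npv(tpr=0.95, tnr=0.90, prev=prev)
    print(f"prev={prev:.2f}  PPV={p:.3f}  NPV={n:.3f}")
```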
Sensitivity (Recall / TPR)
TPR = P(+|D) = TP / (TP + FN)
Of all the actual positives, how many did the model catch? A sensitivity of 0.95 means 95% of true cases are detected, but 5% are missed (false negatives). Critical in medical screening: missing a cancer diagnosis (FN) is usually worse than a false alarm (FP). Sensitivity is intrinsic to the model and threshold, not affected by prevalence.
Specificity (TNR)
TNR = P(−|D̄) = TN / (TN + FP)
Of all the actual negatives, how many did the model correctly identify? Specificity of 0.90 means 10% of healthy patients receive a false positive. Like sensitivity, specificity is independent of prevalence. High specificity is important when false positives are costly, such as unnecessary surgeries or wrongful convictions.
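A minimal sketch of why both rates ignore prevalence: they are computed from counts inside a single class, so scaling that class up or down leaves the ratio untouched. The counts below are illustrative assumptions.

```python
# Sensitivity and specificity from raw confusion-matrix counts.
def sensitivity(tp, fn):
    return tp / (tp + fn)          # TPR = TP / (TP + FN)

def specificity(tn, fp):
    return tn / (tn + fp)          # TNR = TN / (TN + FP)

print(sensitivity(tp=95, fn=5))    # 0.95
print(sensitivity(tp=950, fn=50))  # still 0.95 with 10x more positives
print(specificity(tn=900, fp=100)) # 0.90
```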
Accuracy
Acc = (TP + TN) / N
The most intuitive metric: what fraction of all predictions were correct? However, accuracy is misleading with imbalanced classes. A model that always predicts "negative" on a dataset with 95% negatives achieves 95% accuracy while catching zero true positives. Compare it with the Prevalence slider at extreme values to see this paradox in action.
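A tiny sketch of that paradox, assuming a 1,000-example cohort with 5% positives and a model that always predicts negative:

```python
# The "always negative" model on an imbalanced cohort.
n, prevalence = 1_000, 0.05
positives = int(n * prevalence)        # 50 actual positives
negatives = n - positives              # 950 actual negatives

tp, fp, tn, fn = 0, 0, negatives, positives   # every prediction is "negative"
accuracy = (tp + tn) / n
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")   # 0.95 accuracy, 0.00 recall
```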
F1 Score
F1 = 2 · PPV · TPR / (PPV + TPR)
The harmonic mean of precision and recall. The harmonic mean punishes extreme imbalance: if either precision or recall is near zero, F1 collapses. An F1 of 0.8 guarantees that both precision and recall are at least 0.67. It is prevalence-dependent (through precision) and is most useful when you care equally about false positives and false negatives.
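A short check of the bound quoted above, solving 2·P·R / (P + R) = 0.8 at the extremes:

```python
# Harmonic mean of precision and recall.
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(p=2/3, r=1.0))    # 0.8: the extreme case, recall maxed out, precision at 2/3
print(f1(p=0.8, r=0.8))    # 0.8: the balanced case
print(f1(p=0.5, r=1.0))    # ~0.667: with precision below 2/3, F1 cannot reach 0.8
```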
AUC (Area Under ROC Curve)
AUC = ∫₀¹ TPR d(FPR)
AUC measures the model's ability to discriminate between classes across all possible thresholds. It equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. AUC = 0.5 means random guessing (diagonal ROC), AUC = 1.0 means perfect separation. AUC is threshold-independent and prevalence-independent, making it ideal for comparing models.
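A minimal sketch of that probabilistic reading, comparing every positive score against every negative score (ties count half); the scores are illustrative assumptions:

```python
# AUC as the probability that a random positive outscores a random negative.
def auc_pairwise(pos_scores, neg_scores):
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2, 0.1]
print(auc_pairwise(pos, neg))   # 0.9375: 15 of the 16 pairs are ranked correctly
```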
MCC (Matthews Correlation Coefficient)
MCC = (TP · TN − FP · FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC is a correlation coefficient between the observed and predicted binary classifications, ranging from −1 (total disagreement) through 0 (random) to +1 (perfect). Unlike F1 and accuracy, MCC uses all four confusion matrix quadrants and remains reliable even with highly imbalanced datasets.
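A small sketch contrasting MCC with accuracy on an imbalanced cohort; the counts are illustrative assumptions, and MCC is set to 0 by convention when the denominator vanishes:

```python
from math import sqrt

# MCC from all four confusion-matrix cells.
def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Both models score ~0.94-0.95 accuracy on a 95%-negative cohort,
# but only the second one actually discriminates.
print(mcc(tp=0, tn=950, fp=0, fn=50))     # 0.0: the "always negative" model
print(mcc(tp=40, tn=900, fp=50, fn=10))   # ~0.57: a model with real signal
```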
Likelihood Ratios
LR+ = TPR / FPR | LR− = FNR / TNR
Likelihood ratios express how much a test result shifts the odds. LR+ tells you how many times more likely a positive result is in someone who actually has the condition than in someone who does not. LR+ > 10 is considered strong diagnostic evidence; LR− < 0.1 is strong evidence for ruling out. Unlike PPV/NPV, likelihood ratios are independent of prevalence.
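A minimal sketch of the odds form of Bayes' theorem (post-test odds = pre-test odds × LR); the test characteristics and the 2% prevalence are illustrative assumptions:

```python
# Likelihood ratios and a single odds update after a positive result.
def likelihood_ratios(tpr, fpr):
    lr_pos = tpr / fpr                  # LR+ = TPR / FPR
    lr_neg = (1 - tpr) / (1 - fpr)      # LR− = FNR / TNR
    return lr_pos, lr_neg

tpr, fpr, prev = 0.90, 0.05, 0.02
lr_pos, lr_neg = likelihood_ratios(tpr, fpr)

pre_odds = prev / (1 - prev)
post_odds = pre_odds * lr_pos                    # after a positive result
post_prob = post_odds / (1 + post_odds)          # back to a probability; this equals PPV
print(f"LR+={lr_pos:.1f}  LR−={lr_neg:.2f}  P(D|+)={post_prob:.3f}")
```

Note that the likelihood ratios themselves never depend on prev; only the pre-test odds bring prevalence into the update.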