Metrics
online_cp.metrics.Metric
Base class for online evaluation metrics.
Subclasses must implement _score(self, y, Gamma, **kw) which
returns a single scalar for one observation.
Source code in src/online_cp/metrics.py
values: NDArray[np.floating[Any]] (property)
Per-step history as a numpy array.
update(y: Any = None, Gamma: Any = None, **kw: Any) -> float
Record one observation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y` | scalar | True label / response. | `None` |
| `Gamma` | ConformalPredictionSet or ConformalPredictionInterval | Prediction output from a conformal predictor. | `None` |
| `**kw` | dict | Additional keyword arguments (p_values, cpd, epsilon, etc.). Each metric picks what it needs. | `{}` |
Returns:
| Type | Description |
|---|---|
| float | The metric value for this observation. |
get() -> float
cumulative_mean() -> NDArray[np.floating[Any]]
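The contract described for Metric (a subclass supplies _score, the base class records one scalar per observation) can be illustrated with a small stand-in. This is only a sketch and not the code in src/online_cp/metrics.py: ToyMetric, AbsoluteError, and the y_hat keyword are hypothetical names, plain lists stand in for numpy arrays, and treating get() as the mean of the history is an assumption.

```python
from itertools import accumulate


class ToyMetric:
    """Hypothetical stand-in for a Metric-like base class (not online_cp's)."""

    def __init__(self):
        self._history = []  # one scalar per observation

    def _score(self, y, Gamma, **kw):
        raise NotImplementedError

    def update(self, y=None, Gamma=None, **kw):
        # Score one observation, record it, and return it.
        score = float(self._score(y, Gamma, **kw))
        self._history.append(score)
        return score

    @property
    def values(self):
        # Per-step history (a list here; the real property returns an array).
        return list(self._history)

    def get(self):
        # Assumption: the summary value is the mean of the per-step scores.
        if not self._history:
            return float("nan")
        return sum(self._history) / len(self._history)

    def cumulative_mean(self):
        # Running mean after each step.
        return [s / n for n, s in enumerate(accumulate(self._history), start=1)]


class AbsoluteError(ToyMetric):
    """Toy subclass: reads a hypothetical `y_hat` keyword and ignores Gamma."""

    def _score(self, y, Gamma, **kw):
        return abs(y - kw["y_hat"])


m = AbsoluteError()
m.update(y=1.0, y_hat=0.5)
m.update(y=2.0, y_hat=3.5)
```

The subclass only decides how to score a single observation; bookkeeping stays in the base class.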
online_cp.metrics.Metrics
Composite of multiple metrics, created via the + operator.
Example:
    metric = ErrorRate() + IntervalWidth()
    metric.update(y=1.0, Gamma=interval)
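One way a + operator can produce a composite is through __add__. The sketch below is an assumption about the general pattern, not the library's implementation: Part, Composite, Miss, and Size are hypothetical classes, and returning a dict keyed by class name from update() is an illustrative choice.

```python
class Composite:
    """Hypothetical container that forwards update() to every member."""

    def __init__(self, members):
        self.members = list(members)

    def __add__(self, other):
        # Flatten so that (a + b) + c holds three members, not a nested pair.
        extra = other.members if isinstance(other, Composite) else [other]
        return Composite(self.members + extra)

    def update(self, **kw):
        # Every member receives the same keyword arguments.
        return {type(m).__name__: m.update(**kw) for m in self.members}


class Part:
    """Hypothetical base for single metrics; `+` produces a Composite."""

    def __add__(self, other):
        return Composite([self]) + other

    def update(self, **kw):
        raise NotImplementedError


class Miss(Part):
    def update(self, y=None, Gamma=None, **kw):
        return float(y not in Gamma)


class Size(Part):
    def update(self, y=None, Gamma=None, **kw):
        return float(len(Gamma))


combo = Miss() + Size()
result = combo.update(y=1, Gamma={0, 1})
triple = combo + Miss()
```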
online_cp.metrics.ErrorRate
Bases: Metric
Fraction of times the true label falls outside the prediction set.
Works for both classifiers (prediction sets) and regressors (intervals).
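A minimal sketch of the per-observation error indicator and its running average follows. It is not the library's code: is_error is a hypothetical helper, a (lower, upper) tuple stands in for an interval object, and treating the interval as closed at both ends is an assumption.

```python
def is_error(y, Gamma):
    """Return 1.0 if y is not covered by Gamma, else 0.0.

    Gamma is either a collection of labels (classification) or, by
    assumption, a (lower, upper) tuple standing in for an interval.
    """
    if isinstance(Gamma, tuple):
        lower, upper = Gamma
        return float(not (lower <= y <= upper))
    return float(y not in Gamma)


errors = [
    is_error(1, {0, 1}),          # label covered by the set
    is_error(2, {0, 1}),          # label missed by the set
    is_error(0.5, (0.0, 1.0)),    # response inside the interval
    is_error(1.5, (0.0, 1.0)),    # response outside the interval
]
error_rate = sum(errors) / len(errors)
```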
online_cp.metrics.ObservedExcess
Bases: Metric
Number of incorrect labels in the prediction set (OE).
For classifiers: |Gamma| - 1 if y in Gamma, else |Gamma|. A conditionally proper efficiency criterion.
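The rule above translates directly into a few lines. This is a sketch with a hypothetical function name, taking the prediction set as a plain Python collection of labels rather than a library object.

```python
def observed_excess(y, Gamma):
    """Count the labels in Gamma other than the true label y."""
    Gamma = set(Gamma)
    # |Gamma| - 1 if y is in Gamma, else |Gamma|.
    return len(Gamma) - 1 if y in Gamma else len(Gamma)
```

A singleton set containing only the true label scores 0, which is the best possible value.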
online_cp.metrics.ObservedFuzziness
Bases: Metric
Sum of p-values for incorrect labels (OF).
Requires p_values keyword argument (dict: label -> p-value).
A conditionally proper efficiency criterion independent of epsilon.
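As a sketch of the computation, assuming p_values is a plain dict mapping each label to its p-value as described above (the function name is hypothetical):

```python
def observed_fuzziness(y, p_values):
    """Sum the p-values of every label except the true label y."""
    return sum(p for label, p in p_values.items() if label != y)


of = observed_fuzziness("cat", {"cat": 0.75, "dog": 0.25, "bird": 0.125})
```

No significance level appears anywhere in the computation, which is why the criterion does not depend on epsilon.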
online_cp.metrics.SetSize
online_cp.metrics.IntervalWidth
online_cp.metrics.WinklerScore
Bases: Metric
Winkler interval score — a proper scoring rule for interval forecasts.
Requires the prediction interval to have .lower and .upper
attributes, and epsilon to be provided.
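For orientation, the textbook form of the Winkler score is the interval width plus a penalty of 2/epsilon times the distance by which the observation misses the interval. The sketch below implements that textbook form; whether the library uses exactly this scaling is an assumption, and Interval is a hypothetical stand-in for an object exposing .lower and .upper.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical stand-in for a prediction interval object."""
    lower: float
    upper: float


def winkler_score(y, interval, epsilon):
    """Textbook Winkler score: width plus a miss penalty scaled by 2/epsilon."""
    width = interval.upper - interval.lower
    if y < interval.lower:
        return width + (2.0 / epsilon) * (interval.lower - y)
    if y > interval.upper:
        return width + (2.0 / epsilon) * (y - interval.upper)
    return width  # covered: only the width is charged
```

Lower is better: narrow intervals are rewarded, and misses cost more at small epsilon.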
online_cp.metrics.CRPS
Bases: Metric
Continuous Ranked Probability Score for conformal predictive distributions.
Deprecated: this class delegates to TruncatedCRPS. Prefer using
TruncatedCRPS or ConformalCRPS explicitly.
Requires cpd keyword argument (a conformal predictive distribution object).
Venn Prediction Metrics
online_cp.metrics.BrierScore
Bases: Metric
Brier score for Venn predictor outputs.
Evaluates the aggregated point probability from a VennPrediction
using the standard Brier score: $(p_{\text{point}}[k] - \mathbf{1}\{y = k\})^2$
summed over all labels $k$.
Requires venn keyword argument (a VennPrediction object).
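A sketch of the per-observation Brier score, assuming the aggregated point probability is available as a plain dict from label to probability. That representation and the function name are assumptions for illustration; the actual VennPrediction object may expose it differently.

```python
def brier_score(y, p_point):
    """Sum over labels k of (p_point[k] - 1{y == k})^2."""
    return sum(
        (p - (1.0 if label == y else 0.0)) ** 2
        for label, p in p_point.items()
    )


score = brier_score(1, {0: 0.25, 1: 0.75})
```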
online_cp.metrics.LogLoss
Bases: Metric
Log loss for Venn predictor outputs.
Evaluates the aggregated point probability from a VennPrediction
using negative log-likelihood: $-\log(p_{\text{point}}[y])$.
Requires venn keyword argument (a VennPrediction object).
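A sketch of the negative log-likelihood for one observation, again assuming a plain dict from label to probability. The clipping constant is an assumption added to keep the logarithm finite; the library may handle zero probabilities differently.

```python
import math


def log_loss(y, p_point, eps=1e-15):
    """Return -log of the probability assigned to the true label y."""
    # Clip away from 0 so that log(0) never occurs (assumed safeguard).
    p = min(max(p_point[y], eps), 1.0)
    return -math.log(p)


loss = log_loss(1, {0: 0.5, 1: 0.5})
```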
online_cp.metrics.Width
Bases: Metric
Width (sharpness) of a Venn multiprobability prediction.
For binary predictions: $p_1 - p_0$.
For multiclass: mean over labels of (max − min) probability across
hypotheses.
Requires venn keyword argument (a VennPrediction object).
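The multiclass rule can be sketched on a small table of probabilities. The layout is an assumption for illustration: hypotheses[h][k] holds the probability of label k under hypothesis h, given as nested lists rather than a VennPrediction object, and venn_width is a hypothetical name.

```python
def venn_width(hypotheses):
    """Mean over labels of (max - min) probability across hypotheses.

    hypotheses[h][k] is the probability of label k under hypothesis h.
    """
    n_labels = len(hypotheses[0])
    spreads = [
        max(row[k] for row in hypotheses) - min(row[k] for row in hypotheses)
        for k in range(n_labels)
    ]
    return sum(spreads) / n_labels


# Binary case: P(label 1) is 0.25 under hypothesis 0 and 0.5 under hypothesis 1.
binary = venn_width([[0.75, 0.25], [0.5, 0.5]])
multi = venn_width([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
```

In the binary example both labels have the same spread, so the mean equals the difference between the two probabilities of label 1 (0.5 - 0.25).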
online_cp.metrics.CalibrationError
Bases: Metric
Expected Calibration Error (ECE) for Venn predictor outputs.
Accumulates (predicted probability, true indicator) pairs from a
stream of VennPrediction objects, enabling post-hoc ECE
computation via binning.
Two modes:
- use_hypothesis=False (default): evaluates the point estimate from venn.point. This is the aggregated probability and is typically well-calibrated empirically.
- use_hypothesis=True: evaluates the correct-hypothesis probability $P^y(y)$, which is theoretically calibrated by the Venn validity guarantee (ALRW2 Theorem 6.4).
The per-step _score() returns $|p - \mathbf{1}\{y = k\}|$
(absolute calibration gap), so metric.value gives the running mean
absolute error. Use ece() for the standard binned ECE.
For binary classification, the predicted probability is $P(y=1)$.
For multiclass, probabilities are stored per-class (one-vs-rest) and
ECE is computed as a weighted average across classes.
Requires venn keyword argument (a VennPrediction object).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `use_hypothesis` | bool | If True, use the correct-hypothesis probability $P^y(y)$. | `False` |
| `max_history` | int or None | Maximum number of (predicted, observed) pairs to store. If None, stores all. When exceeded, oldest pairs are discarded. | `None` |
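The max_history behaviour (keep at most N pairs, drop the oldest first) matches what a bounded deque provides. The sketch below shows that pattern with the standard library; whether the class stores its pairs this way is an assumption, and make_history is a hypothetical helper.

```python
from collections import deque


def make_history(max_history=None):
    """Bounded store: maxlen=None keeps everything, otherwise the oldest
    entries are dropped once the limit is exceeded."""
    return deque(maxlen=max_history)


pairs = make_history(max_history=3)
for step in range(5):
    pairs.append((step, 1))  # (predicted placeholder, observed indicator)

unbounded = make_history()
for step in range(5):
    unbounded.append((step, 1))
```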
predicted: NDArray (property)
Array of stored predicted probabilities.
observed: NDArray (property)
Array of stored true indicators (always 1 for correct-class prob).
ece(n_bins: int = 10, strategy: str = 'uniform') -> float
Compute binned Expected Calibration Error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_bins` | int | Number of bins. | `10` |
| `strategy` | str | Binning strategy. | `"uniform"` |
Returns:
| Type | Description |
|---|---|
| float | Weighted average of \|mean_predicted - fraction_positive\| across bins, weighted by bin count. |
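The returned quantity can be sketched from per-bin summaries. The function below is a hypothetical helper, not the library's ece(); it takes the three per-bin sequences directly, and returning 0.0 when there are no samples is an assumption.

```python
def ece_from_bins(mean_predicted, fraction_positive, bin_counts):
    """Count-weighted average of |mean_predicted - fraction_positive|."""
    total = sum(bin_counts)
    if total == 0:
        return 0.0  # assumed convention for an empty history
    weighted_gap = sum(
        n * abs(p - f)
        for p, f, n in zip(mean_predicted, fraction_positive, bin_counts)
    )
    return weighted_gap / total


value = ece_from_bins([0.25, 0.75], [0.5, 0.75], [2, 2])
```

Here the first bin is off by 0.25 and the second is perfectly calibrated, so the weighted average is 0.125.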
bin_data(n_bins: int = 10, strategy: str = 'uniform') -> tuple[NDArray, NDArray, NDArray]
Return binned calibration data for plotting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_bins` | int | Number of bins. | `10` |
| `strategy` | str | Binning strategy. | `"uniform"` |
Returns:
| Name | Type | Description |
|---|---|---|
| `mean_predicted` | ndarray | Mean predicted probability per bin. |
| `fraction_positive` | ndarray | Fraction of positive outcomes per bin. |
| `bin_counts` | ndarray | Number of samples per bin. |
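To make the three return values concrete, here is a sketch of equal-width binning on [0, 1]. It is an illustration under stated assumptions rather than the library's bin_data(): uniform_bin_data is a hypothetical name, plain lists replace numpy arrays, and dropping empty bins is an assumed convention.

```python
def uniform_bin_data(predicted, observed, n_bins=10):
    """Group (predicted, observed) pairs into equal-width bins on [0, 1]."""
    sums_p = [0.0] * n_bins
    sums_o = [0.0] * n_bins
    counts = [0] * n_bins
    for p, o in zip(predicted, observed):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums_p[b] += p
        sums_o[b] += o
        counts[b] += 1
    kept = [b for b in range(n_bins) if counts[b] > 0]  # drop empty bins (assumption)
    mean_predicted = [sums_p[b] / counts[b] for b in kept]
    fraction_positive = [sums_o[b] / counts[b] for b in kept]
    bin_counts = [counts[b] for b in kept]
    return mean_predicted, fraction_positive, bin_counts


mp, fp, bc = uniform_bin_data([0.125, 0.25, 0.75, 1.0], [0, 1, 1, 1], n_bins=2)
```

Plotting fraction_positive against mean_predicted gives a reliability diagram; points on the diagonal indicate good calibration.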