beta-kde: Beta Kernel Density Estimation in Python

A Scikit-learn compatible Python package for Beta Kernel Density Estimation (KDE), providing fast, finite-sample boundary correction for bounded data.

beta-kde fixes the Boundary Bias problem inherent in standard Gaussian KDEs, where probability mass "leaks" past the edges of the data (e.g., below 0 or above 1).

This package is the official implementation of the paper:

A Fast, Closed-Form Bandwidth Selector for the Beta Kernel Density Estimator Johan Hallberg Szabadváry (2026) Journal of Computational and Graphical Statistics. DOI: 10.1080/10618600.2026.2670662 | arXiv preprint

📊 The Problem vs. The Solution

Standard KDEs smooth data blindly, ignoring bounds. beta-kde uses asymmetric Beta kernels that naturally adapt their shape near boundaries to prevent leakage.

Boundary Bias Comparison

🚀 Key Features

Boundary Correction: Zero leakage. Probability mass stays strictly within the defined bounds.
Fast Bandwidth Selection: Implements the Beta Reference Rule, a closed-form \(\mathcal{O}(1)\) selector proposed in Hallberg Szabadváry (2026). It matches the accuracy of expensive Cross-Validation but is orders of magnitude faster.
Multivariate Support: Models multivariate bounded data using a Non-Parametric Beta Copula.
Scikit-learn API: Drop-in replacement for KernelDensity. Fully compatible with GridSearchCV, Pipeline, and cross_val_score.

📦 Installation

pip install beta-kde

⚡ Quick Start

💡 See the Tutorial Notebook for detailed examples, including visualization and classification.

1. Univariate Data (The Standard Case)

BetaKDE enforces Scikit-learn's 2D input standard (n_samples, n_features).

import numpy as np
from beta_kde import BetaKDE
import matplotlib.pyplot as plt

# 1. Generate bounded data (e.g., ratios or probabilities)
np.random.seed(42)
X = np.random.beta(2, 5, size=(100, 1))  # Must be 2D column vector

# 2. Fit the estimator
# 'beta-reference' is the fast, default rule-of-thumb from the paper
kde = BetaKDE(bandwidth='beta-reference', bounds=(0, 1))
kde.fit(X)

print(f"Selected Bandwidth: {kde.bandwidth_:.4f}")

# 3. Score samples
# Use normalized=True to get exact log-likelihoods (integrates to 1.0).
# Default is False for speed (returns raw kernel density values).
log_density = kde.score_samples(np.array([[0.1], [0.5], [0.9]]), normalized=True)

# 4. Plotting convenience
fig, ax = kde.plot()
plt.show()

2. Multivariate Data (Copula)

For multidimensional data, BetaKDE fits marginals independently and models dependence using a Copula.

2D Copula Plot

# Generate correlated 2D data
X_2d = np.random.rand(200, 2) 

# Fit (automatically uses Copula for n_features > 1)
kde_multi = BetaKDE(bandwidth='beta-reference')
kde_multi.fit(X_2d)

# Returns log-likelihood of the joint distribution
scores = kde_multi.score_samples(X_2d)

# Plotting convenience (in the multi-variate case, this plots the marginal densities)
fig, ax = kde.plot()
plt.show()

3. Scikit-learn Compatibility (e.g. Hyperparameter Tuning)

beta-kde is a fully compliant Scikit-learn estimator. You can use it in Pipelines or with GridSearchCV to find the optimal bandwidth.

Note: The estimator automatically handles normalization during scoring to ensure valid statistical comparisons.

from sklearn.model_selection import GridSearchCV

# Define a grid of bandwidths to test
param_grid = {
    'bandwidth': [0.01, 0.05, 0.1, 'beta-reference']
}

# Run Grid Search
# n_jobs=-1 is recommended to parallelize the numerical integration
grid = GridSearchCV(
    BetaKDE(),
    param_grid,
    cv=5,
    n_jobs=-1
)

grid.fit(X)

print(f"Best Bandwidth: {grid.best_params_['bandwidth']}")
print(f"Best Log-Likelihood: {grid.best_score_:.4f}")

⚡ Performance & Normalization

Unlike Gaussian KDEs, Beta KDEs do not integrate to 1.0 analytically. Normalization requires numerical integration, which can be computationally expensive. beta-kde handles this smartly: * Lazy Loading: fit(X) is fast and does not compute the normalization constant. * On-Demand: The integral is computed and cached only when you strictly need it (e.g., calling kde.score(X) or kde.pdf(X, normalized=True)). * Flexible Scoring: * score_samples(X, normalized=False) (Default): Fast. Best for clustering, relative density comparisons, or plotting shape. * score(X): Accurate. Always returns the normalized total log-likelihood. Safe for use in GridSearchCV.

🆚 Why use beta-kde?

If your data represents percentages, probabilities, or physical constraints (e.g., \(x \in [0, 1]\)), standard KDEs are mathematically incorrect at the edges.

Feature	`sklearn.neighbors.KernelDensity`	`beta-kde`
Kernel	Gaussian (Symmetric)	Beta (Asymmetric)
Boundary Handling	Biased (Leaks mass < 0)	Correct (Strictly \(\ge 0\))
Bandwidth Selection	Gaussian Reference Rule	Beta Reference Rule
Multivariate	Symmetric Gaussian Blob	Flexible Non-Parametric Copula
Speed (Prediction)	Fast (Tree-based)	Moderate (Exact summation)

⚠️ Important Usage Notes

Strict Input Shapes: Input X must be 2D. Use X.reshape(-1, 1) for 1D arrays. This constraint prevents accidental application of univariate estimators to multivariate data.
Computational Complexity: This is an exact kernel method.
- Raw density (normalized=False) is fast.
- Exact probabilities (normalized=True) require a one-time integration cost per fitted model.
- Recommended for datasets with \(N < 50,000\).
Bounds: You must specify bounds if your data is not in \([0, 1]\). The estimator handles scaling internally.

Other Implementations

The Hallberg-Szabadváry bandwidth selector is also natively available in: * R: Implemented as the default selector for beta kernels in the CRAN package kdensity (v1.2.0+). Use it by setting bw = "HS". * Julia: Available in the BetaKDE.jl package.

📚 References

Chen, S. X. (1999). Beta kernel estimators for density functions. Computational Statistics & Data Analysis, 31(2), 131-145.
Hallberg Szabadváry, J. (2026). A Fast, Closed-Form Bandwidth Selector for the Beta Kernel Density Estimator. Journal of Computational and Graphical Statistics. DOI: 10.1080/10618600.2026.2670662

Citation

If you use this package in your research, please cite the accompanying paper:

@article{szabadvary2026fast,
  author = {Hallberg Szabadv{\'a}ry, Johan},
  title = {A Fast, Closed-Form Bandwidth Selector for the Beta Kernel Density Estimator},
  journal = {Journal of Computational and Graphical Statistics},
  year = {2026},
  doi = {10.1080/10618600.2026.2670662}
}

License

BSD 3-Clause License