Talk "Sampling: The future of Bayesian Deep Learning?"

Felix Reuthlinger

Sampling-based Bayesian deep learning outperforms traditional approximations in both uncertainty quantification and predictive accuracy. Modern algorithms like microcanonical Langevin Monte Carlo enable efficient sampling of networks with millions of parameters.

This talk was presented at the Munich Datageeks November Edition 2025.

Abstract

This talk presents a compelling case for sampling-based approaches in Bayesian deep learning, demonstrating that modern sampling methods can outperform traditional approximation techniques while remaining computationally feasible. The speaker addresses common misconceptions about Bayesian neural networks, particularly the belief that sampling is too expensive or impractical for real-world applications. By combining novel sampling algorithms like microcanonical Langevin Monte Carlo with strategic computational allocation and leveraging overparameterization properties, the approach achieves state-of-the-art uncertainty quantification and improved predictive performance. The presentation covers the theoretical foundations of Bayesian deep learning, compares various approximation methods, and demonstrates scalability from small networks to large models with millions of parameters. The Bayesian deep ensemble approach, which combines sampling with ensembling, consistently outperforms standard deep ensembles, variational inference, and Laplace approximations across multiple benchmarks while maintaining comparable computational costs to traditional optimization methods.

About the speaker

Emanuel is a PhD student at the Munich Uncertainty Quantification AI Lab at LMU and a junior member of the Munich Center for Machine Learning. His research focuses on Bayesian neural networks and sampling-based inference methods. Before his academic career, he worked in industry building diverse applications, including learning-to-rank systems and financial risk models. His background is in mathematics, and his current work involves strong international collaborations with researchers and students aiming to make Bayesian deep learning more practical and scalable for real-world applications.

Transcript summary

Performance advantages of sampling-based methods

The talk opens with empirical evidence for two distinct types of performance improvement achieved through sampling-based Bayesian deep learning. The first metric is traditional predictive accuracy, measured by RMSE; the second is uncertainty quantification, measured by the log pointwise predictive density (LPPD). When comparing methods including the Laplace approximation, mean-field variational inference, deep ensembles, and sampling-based approaches, sampling methods and Bayesian deep ensembles outperform the other techniques by considerable margins. These performance gains motivate the exploration of why sampling should be considered viable despite historical skepticism.
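
For concreteness, here is a minimal sketch of the two metrics, assuming posterior predictive samples from a regression model with a Gaussian likelihood; the toy data, shapes, and function names (`rmse`, `lppd`) are purely illustrative, not code shown in the talk.

```python
# Illustrative sketch: RMSE and log pointwise predictive density (LPPD)
# computed from posterior predictive samples (S samples, N test points).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def rmse(mu, y):
    # Point prediction = Bayesian model average of the per-sample means.
    return np.sqrt(np.mean((mu.mean(axis=0) - y) ** 2))

def lppd(mu, sigma, y):
    # sum_i log( (1/S) * sum_s p(y_i | theta_s) ) for a Gaussian likelihood.
    log_p = norm.logpdf(y[None, :], loc=mu, scale=sigma)  # shape (S, N)
    return np.sum(logsumexp(log_p, axis=0) - np.log(mu.shape[0]))

# Toy usage with random numbers standing in for real posterior samples.
rng = np.random.default_rng(0)
S, N = 100, 50
y = rng.normal(size=N)
mu = y[None, :] + 0.1 * rng.normal(size=(S, N))   # hypothetical predictive means
sigma = 0.3 * np.ones((S, N))                     # hypothetical predictive scales
print(rmse(mu, y), lppd(mu, sigma, y))
```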

Fundamentals of Bayesian deep learning

Bayesian deep learning differs from standard neural networks by treating network parameters as random variables rather than fixed values. Instead of optimizing for single point estimates, the goal is to learn a distribution over parameters, which represents a distribution over possible models. The posterior distribution describes the probability of different weight configurations given the observed data. This posterior is computed using Bayes' rule, combining the likelihood function with prior beliefs about weight distributions and normalizing by the evidence. For regression tasks, the likelihood is typically Gaussian, while classification tasks use categorical distributions. The prior distribution represents initial beliefs about weights, commonly set as standard Gaussian distributions for each parameter. The evidence term, which normalizes the posterior, is intractable in nonlinear neural networks, creating the fundamental challenge of Bayesian deep learning.
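
Written out in standard notation (the usual formulas, not reproduced verbatim from the slides), the posterior over weights combines likelihood, prior, and evidence as follows; the isotropic Gaussian prior is the common choice described above.

```latex
% Bayes' rule for the weight posterior; the evidence p(D) is the
% intractable normalizer for nonlinear neural networks.
p(\theta \mid \mathcal{D})
  = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
\qquad
p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta,
\qquad
p(\theta) = \mathcal{N}(0, \sigma^{2} I).
```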

Posterior predictive density and practical benefits

The ultimate objective in Bayesian deep learning is not the posterior itself but the posterior predictive density. When making predictions on new data, having access to the posterior allows sampling multiple models to generate a full distribution over predictions rather than single point estimates. This distributional view provides several practical advantages. First, it naturally produces credible intervals that quantify uncertainty. Second, it enables better decision-making in applications requiring confidence estimates, such as allowing systems to abstain from actions when uncertainty is high. Third, it supports robust model averaging through Bayesian model averaging, where predictions are aggregated across samples from the posterior. This empirical averaging over likely models provides a powerful framework for uncertainty quantification and improves understanding of model behavior.
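
In standard notation, the posterior predictive density and its Monte Carlo estimate via Bayesian model averaging over S posterior samples read as follows (a generic formulation, not a slide transcription):

```latex
% Posterior predictive density and its Bayesian-model-averaging estimate
% from posterior samples theta_1, ..., theta_S.
p(y^{*} \mid x^{*}, \mathcal{D})
  = \int p(y^{*} \mid x^{*}, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^{*} \mid x^{*}, \theta_{s}),
\qquad \theta_{s} \sim p(\theta \mid \mathcal{D}).
```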

Approximation methods for intractable posteriors

Since exact posterior computation is intractable for neural networks, various approximation methods have been developed. Maximum a posteriori estimation represents the crudest approximation, finding a single optimal point and essentially reducing to standard optimization with regularization when using appropriate priors. Variational inference takes a more sophisticated approach by selecting a parametric family of distributions, such as factorized Gaussians, and learning distribution parameters instead of point estimates. Mean-field variational inference specifically assumes independence between parameters, requiring learning both location and scale parameters for each weight. Laplace approximation places a Gaussian distribution around an optimized MAP solution, providing a local approximation that is computationally stable and scalable. Deep ensembles train multiple networks independently and use agreement and disagreement between predictions as a proxy for model uncertainty, representing one of the most practically successful approaches despite not being truly Bayesian.
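
To make the first point concrete, here is a minimal sketch of how MAP estimation with a Gaussian prior reduces to regularized optimization; the toy quadratic "likelihood" and the function names are illustrative assumptions, not code from the talk.

```python
# Minimal sketch: with a Gaussian prior N(0, s^2 I), the negative log
# posterior is the usual loss plus an L2 (weight decay) penalty, so MAP
# estimation is standard optimization with regularization.
import numpy as np

def neg_log_posterior(theta, nll, prior_scale=1.0):
    # -log p(theta | D) = NLL(theta) + ||theta||^2 / (2 * prior_scale^2) + const
    return nll(theta) + np.sum(theta ** 2) / (2.0 * prior_scale ** 2)

# Toy check on a quadratic negative log likelihood centered at 2:
# the MAP point is shrunk toward the prior mean (zero).
nll = lambda th: 0.5 * np.sum((th - 2.0) ** 2)
grid = np.linspace(-1.0, 3.0, 401)
vals = [neg_log_posterior(np.array([t]), nll) for t in grid]
print("MAP estimate:", grid[int(np.argmin(vals))])  # ~1.0, versus the MLE at 2.0
```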

Markov chain Monte Carlo sampling

MCMC methods provide a fundamentally different approach by directly sampling from the posterior distribution without requiring the intractable normalization constant. These methods produce implicit approximations through collections of samples rather than explicit parametric distributions. The samples obtained from MCMC are exactly what is needed for computing the posterior predictive density, making them directly applicable for inference. While other methods like variational inference or Laplace approximations produce explicit distributions that must be sampled during inference, MCMC provides these samples directly. The challenge is that neural network loss landscapes are highly non-convex and complex, comparable to mountainous terrain like the Alps with many ridges and valleys. Characterizing this entire landscape through sampling seems daunting, contributing to the perception that MCMC is impractical for deep learning.
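
The "no normalizer needed" point can be seen in the simplest MCMC method, random-walk Metropolis, sketched below on a toy target; this is not the sampler advocated in the talk, and the target and step size are illustrative assumptions.

```python
# Minimal random-walk Metropolis sketch: only the *unnormalized* log
# posterior is ever evaluated, because the evidence term cancels in the
# acceptance ratio.
import numpy as np

def metropolis(log_unnorm_post, theta0, n_steps=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    logp = log_unnorm_post(theta)
    samples = []
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=theta.shape)
        logp_prop = log_unnorm_post(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # p(D) cancels here
            theta, logp = prop, logp_prop
        samples.append(theta.copy())
    return np.array(samples)

# Toy target: unnormalized log density of a standard Gaussian over two "weights".
samples = metropolis(lambda th: -0.5 * np.sum(th ** 2), np.zeros(2))
print(samples.mean(axis=0), samples.std(axis=0))  # roughly 0 and 1
```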

Common myths and misconceptions

Several persistent beliefs have discouraged adoption of sampling methods in deep learning. The first misconception concerns the meaningfulness of priors in high-dimensional weight spaces, questioning whether sensible prior beliefs can be specified for millions of parameters. The second myth holds that Markov chain Monte Carlo is inherently too expensive for practical use. The third concerns scalability, with the belief that sampling only works for small-scale problems. The fourth involves the intricate geometry of neural network loss landscapes, including singularities and symmetries that complicate navigation. These factors have led to viewing MCMC as an infeasible gold standard that, while theoretically providing correct asymptotic guarantees without restrictive assumptions, remains impractical when applied out-of-the-box. The remainder of the talk systematically addresses each misconception with recent research findings.

Meaningful priors in high dimensions

Recent research demonstrates that simple isotropic priors, particularly Gaussian priors, are more suitable for neural networks than previously thought. There exists a close correspondence between Gaussian priors and the L2 regularization commonly used in optimization. When combined with overparameterization theory, several beneficial phenomena emerge in high-dimensional networks. Empirical evidence from millions of posterior samples shows striking differences between underparameterized and overparameterized networks. In small, underparameterized toy networks, the posterior exhibits complex symmetries and multimodal structures that are difficult to characterize. However, in the realistic overparameterized networks used in practice, the posterior actually aligns well with the prior distribution. This alignment suggests that high dimensionality in deep learning works favorably with simple isotropic priors, making the posterior better suited for sampling than intuition might suggest. The overparameterization common in modern deep learning thus becomes an advantage rather than an obstacle.

Computational efficiency and runtime comparisons

Detailed runtime analysis reveals that sampling is not necessarily expensive compared to standard optimization. When measuring time in minutes, deep ensemble baselines represent the cost of classical optimization for training neural networks. State-of-the-art samplers like NUTS (No-U-Turn Sampler) based on Hamiltonian Monte Carlo show approximately ten-fold overhead compared to optimization. However, the proposed sampling approach achieves runtime in the same range as standard optimization and runs roughly ten times faster than NUTS. This means that if a network can be optimized, it can be sampled with similar computational resources. When comparing against other approximation methods like mean-field variational inference and Laplace approximations, these supposedly faster methods deliver poor performance despite their computational advantages. Even when investing equal or greater runtime into deep ensembles, they cannot match the uncertainty quantification performance achieved by sampling-based methods. For larger networks, variational inference often requires comparable or greater runtime than the proposed sampling approach due to high gradient variance.

Novel sampling algorithms: microcanonical Langevin Monte Carlo

The computational efficiency stems partly from a recently proposed sampling algorithm, microcanonical Langevin Monte Carlo (MCLMC). Hamiltonian Monte Carlo simulates a physical system in which the network weights act as a particle evolving through time; canonical HMC periodically resamples the momentum, introducing both drift and diffusion. Microcanonical methods instead restrict the dynamics to a single energy surface, which makes the sampler largely deterministic. This restriction provides crucial advantages: faster sampling and better scaling behavior. Under appropriate assumptions, MCLMC exhibits convergence rates that are independent of dimensionality, effectively bypassing the curse of dimensionality. This dimension-independence is critical for deep learning applications. Combining samplers that scale well to high dimensions with overparameterized networks whose posteriors are well behaved creates the synergy that makes practical sampling possible.
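
The dynamics can be caricatured as follows: a particle with unit-norm velocity moves across the loss landscape, the gradient only rotates that velocity (it is projected orthogonally to it), and a small stochastic refresh supplies the Langevin component. The sketch below is a deliberately simplified, Euler-style discretization on a toy Gaussian target; actual MCLMC implementations use more careful integrators and tuning, so treat the step size, refresh rate, and function names here as assumptions.

```python
# Heavily simplified sketch of the microcanonical Langevin idea (not the
# exact MCLMC integrator): unit-norm velocity, gradient-driven rotation of
# that velocity, and a small partial stochastic refresh.
import numpy as np

def mclmc_sketch(grad_energy, x0, n_steps=2000, eps=0.05, refresh=0.02, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                      # unit-norm velocity
    samples = []
    for _ in range(n_steps):
        x = x + eps * u                         # move along the current velocity
        g = grad_energy(x)
        g_perp = g - np.dot(g, u) * u           # project gradient orthogonal to u
        u = u - eps * g_perp / (d - 1)          # rotate velocity toward -grad E
        u /= np.linalg.norm(u)
        u = u + refresh * rng.normal(size=d)    # small Langevin-style refresh
        u /= np.linalg.norm(u)
        samples.append(x.copy())
    return np.array(samples)

# Toy energy E(x) = ||x||^2 / 2 in 10 dimensions (so grad E(x) = x).
samples = mclmc_sketch(lambda x: x, np.zeros(10))
print(np.round(samples.mean(axis=0), 2))        # fluctuates around zero (sanity check only)
```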

Computational budget allocation strategies

Beyond improved samplers, strategic allocation of computational resources significantly impacts performance. Sampling introduces Monte Carlo error from finite computational resources. Traditional Bayesian statistics emphasizes running very long sequential chains for good mixing. However, empirical analysis through performance heatmaps reveals that the number of parallel chains (exploration) and the number of samples per chain (exploitation) have different impacts. The heatmaps show that investing in parallel chains provides substantially more benefit than extending individual chains, with the relationship being linear in the number of chains but logarithmic in the number of samples per chain. The optimal strategy lies in balancing both factors, but favoring parallel exploration. The practical workflow begins with warm-starting from optimized solutions rather than random initialization, since samplers struggle with random starting points. Multiple networks are optimized in parallel through standard methods, then local exploration occurs through short sampling chains started from these optimized points. This approach combines deep ensemble optimization with local sampling-based exploration to create high-performing model ensembles with robust epistemic uncertainty estimates.
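
A schematic version of this recipe, with a toy quadratic log posterior standing in for a real network and a plain Langevin update standing in for the MCLMC-based samplers actually used, might look as follows; all function names and settings here are illustrative assumptions.

```python
# Schematic budget-allocation sketch: optimize several "networks" in parallel
# (deep ensemble), then run a short sampling chain from each warm start and
# pool the samples (many parallel chains, few samples per chain).
import numpy as np

rng = np.random.default_rng(0)
D = 20                                              # toy parameter dimension

def grad_log_post(theta):
    return -theta                                   # stand-in for backprop gradients

def optimize(theta, n_steps=200, lr=0.1):
    for _ in range(n_steps):                        # standard ensemble-member training
        theta = theta + lr * grad_log_post(theta)
    return theta

def short_chain(theta, n_samples=50, eps=0.05):
    # Local exploration around the warm start (unadjusted Langevin update,
    # used here purely for illustration).
    samples = []
    for _ in range(n_samples):
        theta = (theta + eps * grad_log_post(theta)
                 + np.sqrt(2 * eps) * rng.normal(size=theta.shape))
        samples.append(theta.copy())
    return samples

n_chains = 8                                        # exploration: parallel chains
ensemble = [optimize(rng.normal(size=D)) for _ in range(n_chains)]
posterior_samples = [s for theta in ensemble for s in short_chain(theta)]
print(len(posterior_samples), "pooled samples from", n_chains, "warm-started chains")
```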

Software tools and implementations

Several software packages facilitate practical implementation of these sampling methods. For PyTorch users, the posteriors package provides a plug-and-play solution that is well-tested and ready for production use, additionally featuring variational inference and Laplace approximations. While the speaker's personal research stack relies entirely on JAX for its speed advantages, the posteriors package enables baseline comparisons needed for research papers. Current development includes a scikit-learn compatible Bayesian sampling package specialized for tabular data, being finalized by students Angelos and Jonas. This package will scale to large models and is fully powered by JAX to ensure high performance. These tools aim to make sampling-based Bayesian deep learning accessible to practitioners without requiring deep expertise in MCMC methods or Bayesian statistics.

Scaling capabilities and historical progress

The misconception that sampling only works at small scale has been thoroughly debunked by recent results. In 2021, Izmailov et al. at NYU, in collaboration with Google, used 512 TPUs for weeks to sample a single ResNet-20 model, a massive computational investment. Current methods achieve comparable or superior performance sampling the same model in roughly two hours on a handful of GPUs, a dramatic efficiency improvement. Recent benchmarks include fine-tuning an 8 billion parameter Llama 3 model, showing that even large language models can benefit from sampling approaches. Other examples include a 22 million parameter vision transformer (currently under review), a 73 million parameter CNN presented by Paulin at AISTATS, and various architectures including nanoGPT and MLPs. This represents a significant departure from the scale limitations of classical Bayesian statistics. The progression is remarkable: when the speaker began PhD research, typical networks contained 100 to 1,000 parameters and required days or weeks of computation; today's methods handle networks with millions of parameters, comparable to those used in production systems.

Implementation complexity and hyperparameter tuning

An interesting practical observation is that configuring sampling methods can actually be simpler than properly tuning standard optimization or other approximation approaches. For the vision transformer work, correctly configuring standard optimization took longer than setting up the sampling procedure. The hyperparameters for optimization-based methods and optimization-based posterior approximations often prove more challenging to tune properly than sampling hyperparameters. This contradicts the expectation that sampling would introduce additional complexity. The approach follows a recipe that, once understood, transfers across different architectures and tasks. The plug-and-play nature extends across MLPs for regression and classification, vision transformers, CNNs, and even text generation applications. The sampling procedure applies generically after the standard training setup, requiring minimal architecture-specific adjustments.

Critical evaluation and practical considerations

Despite the advantages, sampling is not a universal solution for all problems. Several important questions guide appropriate application of these methods. Regarding sampling from the true posterior, the methods remain approximations despite asymptotic guarantees, so certainty about reaching the true posterior is limited. However, sampling is more flexible than other approximations and consistently delivers better performance. The common concern about needing thousands of chains for multimodal problems has been addressed empirically: for architectures including MLPs for distribution regression and ResNet-18 on CIFAR-10, ten chains visit all functionally relevant modes. Functionally relevant modes are those that matter for maximizing predictive uncertainty quantification metrics in downstream tasks. The resulting functional uncertainty appears valid, with credible intervals showing excellent calibration. Performance comparisons show superiority over restrictive optimization-based approximations including Monte Carlo dropout, which was discussed in previous Munich Datageeks talks. Even with equal computational budgets, sampling methods often significantly outperform vanilla ensembles in uncertainty quantification metrics while maintaining an edge in predictive performance.
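
The calibration claim can be checked with a generic coverage computation like the one below, which measures how often, for example, 90% credible intervals built from posterior predictive samples contain the held-out targets; the toy data are assumptions and no talk-specific results are reproduced.

```python
# Generic calibration check: empirical coverage of credible intervals built
# from posterior predictive samples.
import numpy as np

def coverage(pred_samples, y, level=0.9):
    # pred_samples: (S, N) posterior predictive draws; y: (N,) held-out targets.
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(pred_samples, [alpha, 1.0 - alpha], axis=0)
    return np.mean((y >= lo) & (y <= hi))

# Toy usage: a well-calibrated predictive distribution covers ~90% of targets.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
pred = rng.normal(size=(500, 1000))   # predictive draws from the same N(0, 1)
print(coverage(pred, y))              # close to 0.9
```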

Scalability trajectory and future directions

The scaling capabilities have improved dramatically over recent years and continue to advance. When the speaker began PhD work, experiments involved networks with 100 to 1,000 parameters requiring days or weeks of computation. Current work regularly handles networks with millions of parameters that are directly relevant to practical applications. Multiple research groups actively work on pushing these limits further. The trajectory shows consistent progress toward making sampling feasible for increasingly large and complex models encountered in real-world applications. This progression suggests that the gap between research-scale demonstrations and production-scale deployments continues to narrow.

Inference speed considerations

A practical challenge shared across many approximate Bayesian inference methods involves inference speed. Computing the posterior predictive distribution requires storing multiple samples in memory and performing multiple forward passes at prediction time, creating both storage and computational costs. Research papers in this domain often require multiple terabytes of storage due to extensive experimentation with many samples. At prediction time, generating the empirical distribution over predictions demands resources proportional to the number of samples. However, this cost applies equally to all methods that rely on sampling for uncertainty quantification, including deep ensembles. For applications requiring uncertainty quantification, this overhead is typically worthwhile, particularly since inference can be embarrassingly parallel across multiple cores. Current research addresses these concerns through distillation and compression techniques for posterior samples. Published work from related research groups and promising proof-of-concepts demonstrate potential speedup factors of 100 or more through these techniques. While latency matters significantly in industry applications, these optimizations can make sampling-based uncertainty quantification practical even in latency-sensitive scenarios.
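
A back-of-the-envelope calculation makes the storage side of this trade-off tangible; the parameter and sample counts below are hypothetical, chosen only to illustrate the scaling.

```python
# Rough storage estimate for retained posterior samples: P parameters per
# sample, S samples, float32 weights (4 bytes each). Prediction cost scales
# the same way (S forward passes per input), but is embarrassingly parallel.
def sample_storage_gb(n_params, n_samples, bytes_per_param=4):
    return n_params * n_samples * bytes_per_param / 1e9

# Hypothetical example: a 10M-parameter network with 1,000 retained samples.
print(sample_storage_gb(10_000_000, 1_000), "GB")   # 40.0 GB before any compression
```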

Key takeaways and practical recommendations

Following the presented recipes makes sampling feasible for many applications in practice. The approach is fast to run and relatively easy to implement, especially using functional APIs such as the posteriors package for the Bayesian deep ensemble approach. Several persistent myths about sampling in deep learning were addressed with empirical evidence. The methods provide state-of-the-art uncertainty quantification while also boosting predictive performance. Generalization improvements have been demonstrated across various tasks, with RMSE cut roughly in half on some problems. There is no free lunch, however: the value of the approach varies by task and application, so the decision to adopt sampling should weigh the specific needs of the problem, particularly how important robust uncertainty quantification is. For practitioners interested in these techniques, detailed recipes are available in the published papers, with more results forthcoming.