10 Hamiltonian Monte Carlo & NUTS

Canonical anchors: Neal 2011; Betancourt 2017; Hoffman & Gelman 2014.

10.1 Motivation

Chapter 9 closes with a precise diagnosis of what makes random-walk Metropolis slow on the posteriors that Marketing Mix Models actually produce. The proposal y = x + \varepsilon, with \varepsilon drawn from a symmetric density centered at zero, treats every direction in parameter space identically: it consults nothing about where \pi is large, nothing about the orientation of the posterior’s level sets, nothing about the curvature it is navigating. On a spherical posterior that indifference is harmless. On a strongly correlated posterior — the elongated ridge generated by collinear channel-spend data, in which two or more channels co-move across observed weeks so that the likelihood constrains only their coefficient sum rather than their individual values — it is disabling. A proposal perpendicular to the ridge lands almost certainly in low-density territory and is rejected; a proposal along the ridge must be small enough to stay within the high-density region, so it barely moves the chain. Both directions accumulate autocorrelation. Chapter 9’s Rung 4 quantified this loss: the lag-1 autocorrelation of a sampler on a bivariate Gaussian with target correlation \rho equals \rho^2, the integrated autocorrelation time \tau diverges as \rho \to 1, and the effective sample size for a run of n nominal draws collapses toward zero on the near-degenerate posteriors that collinear channel coefficients produce. The chain crawls.

The information the random walk discards at every step is the gradient of the log-posterior, \nabla \log \pi(\beta). Evaluating that gradient costs no more than evaluating the log-posterior itself — and in modern probabilistic-programming frameworks such as Stan and PyMC it is obtained for free by automatic differentiation — yet the random walk never asks for it. The gradient points in the direction of steepest ascent of \log \pi, running along the posterior’s ridge rather than across it. A sampler that oriented its proposals using \nabla \log \pi would move along the high-density contours of \pi instead of stumbling through the low-density flanks on either side. This observation is the chapter’s organizing question: you can compute the gradient of the log-posterior — so why propose blindly? Use it to glide along the distribution instead of stumbling across it.

Hamiltonian Monte Carlo answers that question by translating it into physics. Define a potential energy function U(q) = -\log \pi(q) over the parameter vector q = \beta, augment the state with an auxiliary momentum p drawn independently from a Gaussian kinetic distribution K(p) = \frac{1}{2}p^T M^{-1} p, and form the Hamiltonian H(q,p) = U(q) + K(p). The joint density e^{-H(q,p)} has the target \pi as its q-marginal by construction: integrating out p restores e^{-U(q)} = \pi(q) up to normalization.

A particle launched from the current state (q, p) with a fresh random momentum kick then evolves under Hamilton’s equations, \dot{q} = \nabla_p H = M^{-1}p and \dot{p} = -\nabla_q H = \nabla \log \pi(q), rolling frictionlessly across the energy landscape, accelerating through the posterior’s high-density valleys and decelerating on its low-density flanks, traveling far from the starting point in a single trajectory while remaining in regions of high probability.

Three conservation properties of Hamiltonian flow make the resulting endpoint a valid Metropolis proposal: the dynamics conserve total energy H exactly, they preserve phase-space volume (Liouville’s theorem), and they are time-reversible. Energy conservation keeps the trajectory within the high-probability region of the augmented density; volume preservation and time-reversibility together enforce the detailed-balance condition that guarantees \pi is the stationary distribution. This chapter develops each of these ingredients in sequence.

In practice, the dynamics are approximated by the leapfrog integrator, whose staggered half-steps in p and full steps in q maintain the symplectic structure and therefore preserve phase-space volume and time-reversibility exactly; the only effect of discretization is a small energy error in H that is corrected by a standard Metropolis accept step, recovering exact stationarity with respect to \pi. Choosing the trajectory length — the number of leapfrog steps before proposing — is automated by the No-U-Turn Sampler (NUTS), which grows the trajectory forward and backward in time until the particle begins to double back on itself, identifying the optimal stopping point adaptively without user intervention; NUTS is the algorithm that made HMC the default sampler in Stan and PyMC and is therefore the engine that modern MMM fits actually run.

One geometry that defeats even NUTS is the funnel of a hierarchical model’s variance parameters and their dependent regression coefficients (Chapter 6): as a variance parameter shrinks toward zero, the posterior’s curvature in the coefficient dimensions becomes extreme, and leapfrog steps tuned for the wide upper funnel shatter the narrow lower throat, producing large errors in H that the accept step flags as divergent transitions — diagnostic signals that the sampler has failed to represent the posterior faithfully, cured by reparameterizing the hierarchy to a non-centered form that flattens the funnel geometry.

The chapter’s central demonstration places all of this in concrete terms: on the same strongly correlated posterior where random-walk Metropolis of Chapter 9 crawls with a small effective sample size, Hamiltonian Monte Carlo’s effective sample size is larger by orders of magnitude, recovering in tens of trajectory evaluations the inferential precision that a random walk would require thousands of steps to approximate.

10.2 Theory & Proofs

Rung 1 constructs the augmented target and shows why sampling it recovers the posterior exactly. Rungs 2 and 3 develop the two mathematical engines that make efficient exploration possible: the Hamiltonian dynamics that glide along posterior contours, and the leapfrog integrator that discretizes those dynamics while preserving the structural properties needed for exact stationarity.

10.2.1 Rung 1 — Augmenting with momentum

Chapter 9’s Rung 4 identified the root cause of random-walk inefficiency: a symmetric proposal that is blind to the geometry of \pi must be small enough to stay within the high-density region in every direction, including the across-the-ridge directions that are almost always downhill. The cure is a proposal that knows which direction is along the ridge and moves there deliberately. Hamiltonian Monte Carlo manufactures such proposals by embedding the parameter vector q \in \mathbb{R}^d in an augmented phase space together with an auxiliary momentum variable p \in \mathbb{R}^d and simulating the dynamics of a physical particle rolling across an energy landscape defined by \pi.

Potential energy. Define

U(q) = -\log \pi(q),

up to an additive constant. This is the potential energy of the position q: it is large where \pi is small and small where \pi is large, so the high-probability region of \pi is the low-potential valley of U. Since \pi is the unnormalized posterior — evaluable pointwise as p(y \mid q)\,p(q) — any additive constant in U cancels in every density ratio and is harmless.

Kinetic energy. Introduce a symmetric positive-definite mass matrix M \in \mathbb{R}^{d \times d}, defaulting to the identity I. Define the kinetic energy

K(p) = \tfrac{1}{2}\,p^\top M^{-1} p.

This is the negative log-density of a \mathcal{N}(0, M) random variable up to a constant. The mass matrix M can be set to a diagonal approximation to the posterior’s covariance as a preconditioning strategy — equipping each direction with inertia proportional to its posterior variance — but the theory requires only that M be positive definite.

The Hamiltonian and joint density. The Hamiltonian is the total energy:

H(q, p) = U(q) + K(p).

Define the joint density on augmented phase space (q, p) \in \mathbb{R}^d \times \mathbb{R}^d by

\pi(q, p) \propto e^{-H(q,\,p)} = e^{-U(q)}\,e^{-K(p)}.

Because H decomposes additively with no cross-terms, the joint density factorizes: q and p are independent under \pi(q,p), with marginals

q \sim \pi(q), \qquad p \sim \mathcal{N}(0, M).

The marginal over p is exactly the original target \pi(q): drawing samples (q^{(t)}, p^{(t)}) from the joint and discarding p^{(t)} yields exact samples from the posterior. The augmentation introduces no approximation.

The plan. The Hamiltonian gives the joint density a conserved quantity H: a trajectory in phase space that holds H constant traces out a level set of e^{-H}, remaining in a region of high joint probability throughout. Launching a particle from the current (q, p) along such a trajectory and proposing its endpoint as the next state yields a proposal that can travel far from q in parameter space while staying within the high-probability region — the cure for the random walk’s blindness. The momentum is redrawn fresh from \mathcal{N}(0, M) at the start of each trajectory, so successive trajectories explore different directions and the chain is ergodic. Rungs 2 and 3 develop the dynamics that trace these trajectories and the integrator that approximates them in practice.

10.2.2 Rung 2 — Hamiltonian dynamics and energy conservation

Given an initial state (q_0, p_0), Hamiltonian dynamics evolve the phase-space point continuously in a fictitious time t according to Hamilton’s equations:

\dot{q} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \dot{p} = -\frac{\partial H}{\partial q} = -\nabla U(q).

The first equation says position evolves in the direction of momentum, scaled by the inverse mass matrix; the second says momentum is pushed in the direction -\nabla U = \nabla \log \pi, toward higher log-posterior. A particle starting in a low-density region experiences a strong push toward the posterior’s mode; near the mode the push is gentle. The trajectory traces the posterior’s geometry, accelerating through valleys and decelerating on ridges, all while remaining on a level set of H.

Theorem (energy conservation). Along any solution of Hamilton’s equations, the Hamiltonian is constant: \dfrac{d}{dt}H(q(t), p(t)) = 0.

Proof. Differentiate along a solution trajectory:

\begin{aligned} \frac{d}{dt}H(q(t), p(t)) &= \sum_{i=1}^d \left(\frac{\partial H}{\partial q_i}\,\dot{q}_i + \frac{\partial H}{\partial p_i}\,\dot{p}_i\right) && \text{(chain rule)} \\ &= \sum_{i=1}^d \left(\frac{\partial H}{\partial q_i}\,\frac{\partial H}{\partial p_i} - \frac{\partial H}{\partial p_i}\,\frac{\partial H}{\partial q_i}\right) && \text{(Hamilton's equations)} \\ &= 0 && \text{(terms cancel pairwise).} \end{aligned}

\blacksquare

Energy conservation means the trajectory remains on a level set of e^{-H}: the joint density is constant along the trajectory. A proposal drawn from the endpoint of an exact Hamiltonian trajectory has the same density under \pi(q,p) as the starting point — a Metropolis acceptance probability of exactly 1 if the dynamics were integrated without numerical error.

Two further properties of exact Hamiltonian flow complete the picture (Neal 2011).

Volume preservation. The velocity field (\dot{q}, \dot{p}) of Hamiltonian dynamics is divergence-free:

\sum_{i=1}^d \frac{\partial \dot{q}_i}{\partial q_i} + \frac{\partial \dot{p}_i}{\partial p_i} = \sum_{i=1}^d \frac{\partial^2 H}{\partial q_i \partial p_i} - \frac{\partial^2 H}{\partial p_i \partial q_i} = 0,

since mixed partial derivatives of a smooth H commute. By Liouville’s theorem, a divergence-free flow preserves the volume of every region in phase space. Consequently, a Hamiltonian-dynamics proposal requires no Jacobian correction in the Metropolis acceptance ratio — the volume element dq\,dp is unchanged between the starting state and the proposed state.

Time-reversibility. Negating the momentum — applying the involution (q, p) \mapsto (q, -p) — and then integrating Hamilton’s equations backward for time T retraces the trajectory exactly, returning to (q_0, p_0). Equivalently: integrate forward for time T, negate p, integrate forward for time T again, negate p once more — this returns to (q_0, p_0). Combined with volume preservation, this reversibility ensures detailed balance holds without any asymmetric correction factor.

Taken together: exact Hamiltonian flow followed by momentum negation is a volume-preserving, reversible involution that conserves H — a Metropolis proposal that would be accepted with probability 1. The obstacle is that for a general U the equations have no closed-form solution and must be integrated numerically. That numerical integration introduces a small, correctable energy error — which is where Rung 3 takes over.

10.2.3 Rung 3 — The leapfrog integrator

Exact Hamiltonian flow is the ideal; the leapfrog integrator is its practical realization. The key design requirement is that the integrator must preserve the two structural properties of Rung 2 — volume preservation and time-reversibility — that enforce detailed balance, even though it can no longer conserve H exactly. The leapfrog (Störmer–Verlet) scheme achieves this by staggering a half-step in momentum with a full step in position and a closing half-step in momentum.

The leapfrog step. Given step size \varepsilon > 0, one leapfrog step maps (q, p) \mapsto (q', p') by:

p_{1/2} = p - \frac{\varepsilon}{2}\,\nabla U(q),

q' = q + \varepsilon\,M^{-1} p_{1/2},

p' = p_{1/2} - \frac{\varepsilon}{2}\,\nabla U(q').

When L leapfrog steps are chained, the final half-kick of one step and the opening half-kick of the next merge into a single full kick, so L chained steps require only L+1 gradient evaluations in total.

Theorem (leapfrog is volume-preserving and reversible). Each leapfrog step has Jacobian determinant 1, and composed with momentum negation it is an involution.

Proof (volume preservation). Decompose the leapfrog step into three sub-maps and verify that each has unit Jacobian determinant.

Sub-map 1 — momentum half-kick: (q, p) \mapsto (q,\ p - \tfrac{\varepsilon}{2}\nabla U(q)). This changes only p, using q as a fixed parameter. Its Jacobian is

\begin{bmatrix} \dfrac{\partial q}{\partial q} & \dfrac{\partial q}{\partial p} \\[8pt] \dfrac{\partial p_{1/2}}{\partial q} & \dfrac{\partial p_{1/2}}{\partial p} \end{bmatrix} = \begin{bmatrix} I & 0 \\[4pt] -\tfrac{\varepsilon}{2}\nabla^2 U(q) & I \end{bmatrix},

a block lower-triangular matrix with identity blocks on the diagonal. Its determinant is 1 — a shear transformation.

Sub-map 2 — position step: (q, p_{1/2}) \mapsto (q + \varepsilon M^{-1} p_{1/2},\ p_{1/2}). This changes only q, using p_{1/2} as a fixed parameter. The Jacobian is block upper-triangular with identity diagonal blocks, determinant 1.

Sub-map 3 — momentum half-kick at the new position: (q', p_{1/2}) \mapsto (q',\ p_{1/2} - \tfrac{\varepsilon}{2}\nabla U(q')). Another momentum shear, determinant 1 by the same argument as Sub-map 1.

The leapfrog step is their composition, so its Jacobian determinant is 1 \cdot 1 \cdot 1 = 1. Volume is preserved, and no Jacobian factor appears in the acceptance ratio. Reversibility: applying momentum negation (q, p) \mapsto (q, -p), then one leapfrog step, then momentum negation again returns exactly to (q, p) — the half-kick / full-position-step / half-kick structure is symmetric under time reversal. \blacksquare

Energy error and symplecticity. Unlike exact Hamiltonian flow, leapfrog does not conserve H exactly. However, because leapfrog is a symplectic integrator — it exactly preserves the symplectic two-form \sum_i dq_i \wedge dp_i — the energy error does not accumulate: the global error \Delta H over a trajectory of L leapfrog steps remains O(\varepsilon^2) and stays bounded rather than drifting (Neal 2011). This is the critical contrast with the naive Euler integrator, which is neither symplectic nor volume-preserving: Euler’s energy error grows monotonically, trajectories spiral outward, and the scheme has no exact stationary distribution. The leapfrog’s bounded, non-accumulating \Delta H is precisely what the Metropolis accept step of Rung 4 corrects: by accepting the leapfrog endpoint with probability \min(1,\, e^{-\Delta H}), the algorithm restores exact stationarity with respect to \pi(q) while discarding almost nothing when \varepsilon is well chosen.

As a concrete preview (developed fully in Worked Example (b)): one leapfrog step for the standard-normal target \pi(q) \propto e^{-q^2/2} — where U(q) = q^2/2 and \nabla U(q) = q — starting from (q, p) = (1, 0) with \varepsilon = 0.3 produces

\begin{aligned} p_{1/2} &= 0 - (0.15)(1) = -0.15, \\ q' &= 1 + (0.3)(-0.15) = 0.955, \\ p' &= -0.15 - (0.15)(0.955) \approx -0.29325, \end{aligned}

giving H(q', p') = \tfrac{1}{2}(0.955^2 + 0.29325^2) \approx 0.499, so \Delta H \approx -0.00099 — third-order small, bounded, and easily survived by the Metropolis filter.

The leapfrog integrator threads the needle exactly required by HMC: symplectic (energy error bounded, not drifting), volume-preserving (no Jacobian correction), time-reversible (detailed balance holds), and cheap to evaluate (one gradient call per interior step when steps are chained). The only remaining question is how many leapfrog steps to take before proposing — and how to correct for the residual \Delta H. Rung 4 delivers the Metropolis accept step that provides the correction; NUTS (Rungs 5 and 6) automates the trajectory length adaptively.

10.2.4 Rung 4 — The HMC algorithm

One HMC step from the current position q proceeds in three sub-steps.

Resample momentum. Draw p \sim \mathcal{N}(0, M) independently of q.
Leapfrog proposal. Run L leapfrog steps from (q, p) to obtain the proposal (q^*, p^*).
Metropolis accept/reject. Accept q^* as the next state with probability \min\!\bigl(1,\, e^{-\Delta H}\bigr), where \Delta H = H(q^*, p^*) - H(q, p); otherwise retain q.

Theorem (HMC leaves \pi invariant). The HMC transition has the posterior \pi(q) as a stationary distribution.

Proof. The HMC step is the composition of two sub-steps; it suffices to show that each preserves the joint density \pi(q, p) \propto e^{-H(q,p)}.

(i) Momentum resampling. The momentum p is replaced by a fresh draw from \mathcal{N}(0, M). Because H(q, p) = U(q) + K(p) decomposes additively and K(p) = \tfrac{1}{2}p^\top M^{-1}p is the negative log-density of \mathcal{N}(0, M), the conditional distribution of p given q under \pi(q, p) \propto e^{-H} is exactly \mathcal{N}(0, M). Drawing p from this conditional is a Gibbs step: it leaves the joint \pi(q, p) invariant by the fundamental property of Gibbs sampling.

(ii) Leapfrog proposal with Metropolis correction. Let T denote the deterministic map “run L leapfrog steps from (q, p), then negate the final momentum.” By Rung 3, leapfrog is volume-preserving (|\det \partial T| = 1) and T is an involution (T \circ T = \mathrm{id}): applying momentum negation after L leapfrog steps and then running L more leapfrog steps and negating again returns exactly to (q, p), because leapfrog composed with momentum negation is time-reversible. Proposing T(q, p) = (q^*, p^*) and accepting with the Metropolis–Hastings rule gives acceptance probability

\begin{aligned} \alpha &= \min\!\left(1,\; \frac{\pi(T(q,p))}{\pi(q,p)}\,\bigl|\det \partial T\bigr|\right) \\[6pt] &= \min\!\left(1,\; \frac{e^{-H(q^*, p^*)}}{e^{-H(q, p)}}\right) = \min\!\bigl(1,\, e^{-\Delta H}\bigr), \end{aligned}

with no Jacobian factor (|\det \partial T| = 1 drops out by volume preservation) and no proposal-density ratio (the proposal is deterministic and T is an involution, so the forward proposal from (q, p) and the reverse proposal from (q^*, p^*) coincide). This kernel satisfies detailed balance with respect to e^{-H} by Chapter 9’s Metropolis–Hastings theorem. Since both sub-steps preserve e^{-H}, the q-marginal \pi(q) is invariant under the full HMC step. \blacksquare

(The momentum negation inside T makes it a clean involution but is otherwise irrelevant in practice: p is discarded and resampled at the start of the next step.)

Because leapfrog nearly conserves H — Rung 3 established that \Delta H = O(\varepsilon^2), bounded and non-accumulating — the acceptance probability \min(1, e^{-\Delta H}) is near 1 even for a proposal L leapfrog steps away, at a distance of roughly L\varepsilon in parameter space. This is the decisive contrast with random-walk Metropolis. Chapter 9 showed that on a correlated posterior the random-walk step size must shrink like d^{-1/2} as dimension d grows, forcing the chain to take tiny, expensive steps to maintain reasonable acceptance. HMC’s leapfrog trajectories are accepted with near-unit probability regardless of L, enabling proposals that traverse the posterior ridge in a single step.

The two tuning knobs are the step size \varepsilon and the number of leapfrog steps L. The step size governs the energy error: a larger \varepsilon produces faster movement but lower acceptance; practice targets an acceptance rate of approximately 0.65–0.8 (Neal 2011), balancing exploration against rejection. The trajectory length L\varepsilon governs how far each proposal travels: too few steps and HMC degenerates toward a random walk; too many and the trajectory doubles back, wasting gradient evaluations. The only problem-specific input is the gradient \nabla U(q) = -\nabla \log \pi(q), available for any differentiable log-posterior and supplied automatically by Stan and PyMC’s autodifferentiation engines. HMC is the gradient-guided sampler the Motivation promised: gliding along the high-density contours of the posterior, traveling far in a single trajectory, where the random walk of Chapter 9 was condemned to crawl.

10.2.5 Rung 5 — NUTS: removing the trajectory-length knob

The step size \varepsilon controls the energy error and can be tuned automatically by targeting an acceptance statistic. The trajectory length L, however, is more delicate. Too few steps and HMC degenerates: the particle barely moves before the momentum is discarded and resampled, producing behavior close to a random walk. Too many steps and the trajectory makes a U-turn: the particle curves back toward where it began, eventually returning near its starting position while spending many gradient evaluations going nowhere new. Neither excess yields efficient exploration. The No-U-Turn Sampler (NUTS) removes this knob entirely by growing the trajectory adaptively and stopping automatically when it detects the U-turn (Hoffman and Gelman 2014).

The U-turn criterion. NUTS builds the trajectory by extending it both forward and backward in time, maintaining the two endpoints (q^-, p^-) and (q^+, p^+). The trajectory has begun to double back when the endpoints are no longer moving apart — when the displacement vector q^+ - q^- begins to contract against one of the endpoint momenta:

(q^+ - q^-) \cdot p^- < 0 \qquad \text{or} \qquad (q^+ - q^-) \cdot p^+ < 0.

To see why: \tfrac{d}{dt}\tfrac{1}{2}\|q^+ - q^-\|^2 = (q^+ - q^-) \cdot (\dot{q}^+ - \dot{q}^-), and since \dot{q}^{\pm} \propto p^{\pm} by Hamilton’s first equation, once either dot product becomes negative the endpoints are closing. Continuing the trajectory would bring the particle back toward its starting region without exploring new probability mass. NUTS stops at the first step that triggers this criterion.

Trajectory doubling. NUTS grows the trajectory by repeated doubling: at each stage it doubles the number of leapfrog steps, choosing at random whether to extend in the forward or backward direction. The visited states are organized as a balanced binary tree — each doubling adds a new subtree of equal depth alongside the existing trajectory — which preserves reversibility, since the forward and backward extension probabilities are symmetric. Work grows proportionally to the final trajectory length. From the completed tree, NUTS selects the next state via a multinomial scheme over trajectory positions weighted by e^{-H}, in a way that satisfies detailed balance. The full recursive algorithm and its correctness proof are given in Hoffman and Gelman (2014); this rung presents the construction at the conceptual level.

Step-size adaptation. During warm-up, NUTS tunes \varepsilon automatically by dual averaging (Hoffman and Gelman 2014): a stochastic averaging scheme drives the mean acceptance statistic toward a target (typically 0.8), then freezes \varepsilon at the end of warm-up for the sampling phase. The practitioner supplies neither \varepsilon nor L.

NUTS with a dual-averaged step size is what Stan and PyMC run by default. The modeler writes a log-posterior, the autodifferentiation engine supplies the gradient, and the sampler tunes its own step size and trajectory length. This is why modern MMM is fit with NUTS: the efficiency of HMC — proposals that travel far with near-unit acceptance — combined with automatic, problem-adaptive tuning that removes the last manual knobs.

10.2.6 Rung 6 — Divergences, the funnel, and the bridge to Chapter 11

HMC’s distinctive failure mode is the divergent transition. Where the posterior has high curvature — sharply curving level sets, rapidly changing gradients — a step size \varepsilon that is well-tuned for the bulk of the posterior becomes too large for that region. The leapfrog integrator, whose energy error scales as O(\varepsilon^2) per step in smooth regions, accumulates large errors as it enters high-curvature territory: \Delta H blows up, often to \infty or NaN, the Metropolis step rejects the proposal, and — the real damage — the sampler is unable to enter that region at all. Probability mass the posterior places there is silently missed, and downstream estimates are biased without warning.

Neal’s funnel. The canonical geometry that produces divergences is Neal’s funnel (Betancourt 2017):

v \sim \mathcal{N}(0,\, 3^2), \qquad x_i \mid v \sim \mathcal{N}(0,\, e^{v}), \quad i = 1, \ldots, d.

This is the exact shape of a hierarchical variance parameter v (on the log-scale) and its d dependent regression coefficients x_i — the structure of Chapter 6’s hierarchical MMM. When v is large and positive the x_i spread widely: the mouth of the funnel is broad, curvature is gentle, and a moderate \varepsilon integrates comfortably. When v is very negative the x_i are squeezed into a vanishingly narrow neck whose curvature in the x directions scales like e^{-v}, growing exponentially as v decreases. A step size suited to the mouth diverges in the neck; a step size suited to the neck is hopelessly conservative in the mouth. No single fixed \varepsilon can navigate the full funnel, and divergences concentrate precisely in the neck — the region that holds the most posterior mass when v is small.

The cure: non-centered reparameterization. Replace the hierarchical coordinates by standardized ones: introduce \tilde{x}_i \sim \mathcal{N}(0, 1) and set

x_i = \tilde{x}_i\, e^{v/2}.

In the (\tilde{x}, v) coordinates the prior is \tilde{x}_i \perp v, with \tilde{x}_i \sim \mathcal{N}(0,1) and v \sim \mathcal{N}(0, 3^2) independently. The geometry the sampler must explore is decoupled and roughly flat across the full range of v: the funnel is gone. A single \varepsilon works everywhere, divergences vanish, and posterior mass in the neck is recovered. Non-centered reparameterization is a routine, essential transformation in hierarchical MMM — applied in Stan’s parameter block or in a PyMC model’s Deterministic declaration — and is the most common cure when divergences appear.

Divergences as a diagnostic. A divergent transition is not a numerical nuisance to be suppressed. It is a trustworthy alarm: each divergence marks a trajectory that entered a region of high curvature, failed to integrate accurately, and returned biased or missing probability mass. A run that reports divergences should not be trusted until they are resolved — by reparameterization, by reducing \varepsilon, or by model revision. The signal is precise enough that divergences in a hierarchical model almost always indicate that the centered parameterization should be replaced by its non-centered equivalent (Betancourt 2017).

The formal machinery for detecting these pathologies — \hat{R}, effective sample size, and divergence counts — is the subject of Chapter 11. The present chapter has built the theoretical foundation: momentum augmentation, Hamiltonian dynamics, leapfrog integration, HMC’s invariance proof, NUTS’s adaptive trajectory, and the geometry of divergences. Chapter 11 closes the loop by providing the diagnostics that tell a practitioner whether the sampler has done its job.

10.3 Worked Examples

10.3.1 Example (a) — Hamiltonian dynamics for a 1-D Gaussian, by hand

The target is \pi(q) \propto e^{-q^2/2}, the standard normal. With unit mass matrix M = 1, the potential and kinetic energies are

U(q) = \tfrac{1}{2}q^2, \qquad K(p) = \tfrac{1}{2}p^2,

so \nabla U(q) = q and Hamilton’s equations read \dot{q} = p, \dot{p} = -q.

Solving the equations. The system \dot{q} = p, \dot{p} = -q is the harmonic oscillator. Its general solution starting from (q_0, p_0) is

q(t) = q_0\cos t + p_0\sin t, \qquad p(t) = p_0\cos t - q_0\sin t.

Differentiating confirms the equations of motion: \dot{q}(t) = -q_0\sin t + p_0\cos t = p(t) and \dot{p}(t) = -p_0\sin t - q_0\cos t = -q(t). The phase-space point traces a circle of radius \sqrt{q_0^2 + p_0^2} at unit angular speed.

Energy conservation verified. Substituting the solution into H = \tfrac{1}{2}(q^2 + p^2):

H(q(t),\,p(t)) = \tfrac{1}{2}\bigl[(q_0\cos t + p_0\sin t)^2 + (p_0\cos t - q_0\sin t)^2\bigr].

Expanding and using \cos^2 t + \sin^2 t = 1, the cross terms \pm 2q_0 p_0\cos t\sin t cancel and the squared terms combine:

= \tfrac{1}{2}\bigl[q_0^2(\cos^2 t + \sin^2 t) + p_0^2(\sin^2 t + \cos^2 t)\bigr] = \tfrac{1}{2}(q_0^2 + p_0^2),

which is independent of t. The Hamiltonian is exactly conserved, confirming that the trajectory stays on a level set of the joint density e^{-H}.

A concrete trajectory. Starting at (q_0, p_0) = (1, 0), the solution is q(t) = \cos t, p(t) = -\sin t, tracing the following landmarks:

t	q(t)	p(t)	H
0	1	0	0.5
\pi/2	0	-1	0.5
\pi	-1	0	0.5
3\pi/2	0	1	0.5
2\pi	1	0	0.5

Interpretation. A single trajectory of length \pi carries the particle from q = 1 all the way to q = -1 — a displacement of 2 — while a random-walk proposal of step size \sigma moves only O(\sigma). The momentum converts potential energy to kinetic and back, sweeping the full diameter of the posterior in one unbroken arc at constant H. This is the geometric reason HMC decorrelates so rapidly: rather than stumbling outward from the current point, the dynamics navigate the posterior’s own level sets, letting the gradient steer every step of the trajectory.

10.3.2 Example (b) — One leapfrog step, by hand

Same target: U(q) = \tfrac{1}{2}q^2, \nabla U(q) = q, M = 1. Step size \varepsilon = 0.3, starting state (q, p) = (1, 0).

Half-step momentum. The leapfrog opens with a half-kick to p:

p_{1/2} = p - \tfrac{\varepsilon}{2}\,\nabla U(q) = 0 - \tfrac{0.3}{2}(1) = -0.15.

Full-step position. The position advances a full step using the half-updated momentum:

q' = q + \varepsilon\,p_{1/2} = 1 + (0.3)(-0.15) = 0.955.

Closing half-step momentum. The momentum receives its second half-kick at the new position:

p' = p_{1/2} - \tfrac{\varepsilon}{2}\,\nabla U(q') = -0.15 - \tfrac{0.3}{2}(0.955) = -0.15 - 0.14325 = -0.29325.

Energy error. The Hamiltonian before the step is

H = \tfrac{1}{2}(1^2 + 0^2) = 0.5.

After the step,

H' = \tfrac{1}{2}(0.955^2 + 0.29325^2) = \tfrac{1}{2}(0.912025 + 0.085996) = 0.49901,

so \Delta H = H' - H = -0.00099: third-order small in \varepsilon, bounded rather than growing, and exactly the controlled error the Metropolis filter was designed to absorb.

Acceptance. The Metropolis step accepts the proposal with probability

\min(1,\,e^{-\Delta H}) = \min(1,\,e^{0.00099}) = 1,

since e^{0.00099} > 1. The step slightly lowered the energy, so it is accepted with certainty.

Interpretation. The exact solution of Hamilton’s equations at t = 0.3 is (\cos 0.3,\,-\sin 0.3) = (0.9553,\,-0.2955); the leapfrog endpoint (0.955,\,-0.29325) is correct to three decimal places. Even at the relatively coarse step size \varepsilon = 0.3 the leapfrog integrator hugs the true circular orbit of Example (a) closely — a direct consequence of the symplectic structure of Rung 3, which keeps energy errors bounded regardless of how many steps are chained. A trajectory of L = 10 such steps would carry the particle roughly one radian around the circle and propose a state far from the start, with the Metropolis filter correcting only the small accumulated \Delta H.

10.4 Code Tie-in

The three experiments below translate the chapter’s theoretical claims into concrete measurements on the same strongly correlated bivariate Gaussian \mathcal{N}(0,\Sigma) with \Sigma = [[1, 0.95],[0.95, 1]] that Chapter 9 identified as the hard case for random-walk Metropolis. Part 1 implements a leapfrog integrator, a full HMC sampler, and a random-walk Metropolis baseline, runs both for n = 4000 draws with the same seed, and asserts that HMC’s \text{ESS} exceeds RWM’s by a factor greater than 10, then prints both acceptance rates and the HMC sample covariance to confirm calibration. Part 2 records the per-iteration energy error \Delta H from the well-tuned HMC run — printing the mean and demonstrating that it is small — and sweeps \varepsilon \in \{0.1, 0.25, 0.6, 1.2\} with fixed L = 20 to show acceptance collapsing as \varepsilon grows beyond the leapfrog stability threshold, with a histogram of \Delta H showing the distribution centered near zero. Part 3 implements Neal’s funnel v \sim \mathcal{N}(0, 3^2), x \mid v \sim \mathcal{N}(0, e^v) in both its centered and non-centered parameterizations, runs fixed-\varepsilon HMC on each, counts divergent transitions — iterations where |\Delta H| > 1000 or the energy is non-finite — confirming that the centered parameterization accumulates many divergences while the non-centered geometry (a product of two independent Gaussians) produces none, and prints the recovered v-marginal mean and variance from the non-centered chain against their known values.

import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv

rng = np.random.default_rng(0)

# ====================================================================
# Shared target: correlated bivariate Gaussian  N(0, Sigma), rho=0.95
# ====================================================================

Sigma = np.array([[1.0, 0.95],
                  [0.95, 1.0]])
Sinv  = inv(Sigma)


def U(q):
    """Potential energy: -log pi(q) for N(0, Sigma)."""
    return 0.5 * q @ Sinv @ q


def gradU(q):
    """Gradient of the potential for the bivariate-Gaussian target."""
    return Sinv @ q


# ====================================================================
# Leapfrog integrator  (L steps, L+1 gradient evaluations when chained)
# ====================================================================

def leapfrog(q, p, eps, L):
    """Run L leapfrog steps on U / gradU. Returns (q_new, p_new)."""
    q = q.copy(); p = p.copy()
    p = p - 0.5 * eps * gradU(q)          # opening half-kick
    for _ in range(L - 1):
        q = q + eps * p                    # full position step
        p = p - eps * gradU(q)             # full momentum kick
    q = q + eps * p                        # final position step
    p = p - 0.5 * eps * gradU(q)          # closing half-kick
    return q, p


# ====================================================================
# ESS helper — n / (1 + 2 * sum rho_k until first negative lag)
# ====================================================================

def ess(x):
    """Truncated-sum ESS for a 1-D chain."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    x -= x.mean()
    var = np.var(x)
    if var < 1e-15:
        return 1.0
    total = 0.0
    for k in range(1, n):
        rho_k = np.mean(x[:n - k] * x[k:]) / var
        if rho_k < 0:
            break
        total += rho_k
    return n / (1.0 + 2.0 * total)


# ====================================================================
# HMC on the bivariate-Gaussian target
# ====================================================================

def hmc(n, eps, L):
    """HMC on N(0,Sigma). Returns (chain [n x 2], acc_rate, dH_array)."""
    q       = np.zeros(2)
    chain   = np.empty((n, 2))
    n_acc   = 0
    dH_list = []
    for i in range(n):
        p        = rng.standard_normal(2)
        H_old    = U(q) + 0.5 * (p @ p)
        q_p, p_p = leapfrog(q, p, eps, L)
        H_new    = U(q_p) + 0.5 * (p_p @ p_p)
        dH       = H_new - H_old
        dH_list.append(dH)
        if np.log(rng.random()) < -dH:
            q      = q_p
            n_acc += 1
        chain[i] = q
    return chain, n_acc / n, np.array(dH_list)


# ====================================================================
# Random-walk Metropolis on the same target
# ====================================================================

def rwm(n, step):
    """RWM on N(0,Sigma). Returns (chain [n x 2], acc_rate)."""
    q      = np.zeros(2)
    chain  = np.empty((n, 2))
    n_acc  = 0
    lp_cur = -U(q)
    for i in range(n):
        prop   = q + step * rng.standard_normal(2)
        lp_pro = -U(prop)
        if np.log(rng.random()) < lp_pro - lp_cur:
            q      = prop
            lp_cur = lp_pro
            n_acc += 1
        chain[i] = q
    return chain, n_acc / n


# ====================================================================
# Part 1 — HMC vs RWM on the rho=0.95 correlated Gaussian
# ====================================================================

N = 4000

chain_hmc, acc_hmc, dH_all = hmc(N, eps=0.25, L=20)
chain_rwm, acc_rwm          = rwm(N, step=0.4)

ess_hmc = ess(chain_hmc[:, 0])
ess_rwm = ess(chain_rwm[:, 0])

print("=== Part 1: HMC vs Random-Walk Metropolis (rho=0.95, n=4000) ===")
print(f"  {'sampler':<22}  {'acc':>7}  {'ESS(coord0)':>12}")
print(f"  {'HMC  eps=0.25  L=20':<22}  {acc_hmc:>7.4f}  {ess_hmc:>12.1f}")
print(f"  {'RWM  step=0.40':<22}  {acc_rwm:>7.4f}  {ess_rwm:>12.1f}")
assert ess_hmc > 10 * ess_rwm, (
    f"Expected ess_hmc > 10*ess_rwm, got {ess_hmc:.1f} vs {ess_rwm:.1f}"
)
print(f"  assert ess_hmc > 10*ess_rwm "
      f"({ess_hmc:.1f} > {10 * ess_rwm:.1f}) : PASSED")

samp_cov_hmc = np.cov(chain_hmc.T)
print(f"\n  HMC sample covariance (expect ~[[1,0.95],[0.95,1]]):")
print(f"    [[{samp_cov_hmc[0,0]:.4f}, {samp_cov_hmc[0,1]:.4f}],")
print(f"     [{samp_cov_hmc[1,0]:.4f}, {samp_cov_hmc[1,1]:.4f}]]")

# Trace plots — HMC traverses the posterior; RWM crawls
fig, axes = plt.subplots(1, 2, figsize=(10, 3.2))
axes[0].plot(chain_hmc[:, 0], lw=0.5, color='steelblue')
axes[0].set_title(r'HMC $\varepsilon=0.25$, $L=20$  —  coord 0', fontsize=9)
axes[0].set_xlabel('iteration'); axes[0].set_ylabel(r'$q_0$')
axes[1].plot(chain_rwm[:, 0], lw=0.5, color='tomato')
axes[1].set_title(r'RWM step$=0.4$  —  coord 0', fontsize=9)
axes[1].set_xlabel('iteration')
plt.suptitle(
    r'HMC vs RWM on $\mathcal{N}(0,\Sigma)$, $\rho=0.95$  '
    r'(HMC ESS $\gg$ RWM ESS)',
    fontsize=10,
)
plt.tight_layout()
plt.show()

# ====================================================================
# Part 2 — Energy conservation: dH statistics and acceptance vs eps
# ====================================================================

print("\n=== Part 2: Energy conservation and acceptance vs step size ===")
print(f"  Good HMC run (eps=0.25, L=20):  "
      f"mean(dH)={np.mean(dH_all):.5f},  "
      f"mean(|dH|)={np.mean(np.abs(dH_all)):.5f}  (small)")

print(f"\n  Acceptance rate vs step size  (L=20, n=1000):")
print(f"  {'eps':>6}  {'accept':>8}")
for eps_test in [0.1, 0.25, 0.6, 1.2]:
    _, acc_t, _ = hmc(1000, eps=eps_test, L=20)
    print(f"  {eps_test:>6.2f}  {acc_t:>8.4f}")

# Histogram of dH for the good run
fig, ax = plt.subplots(figsize=(6, 3.2))
ax.hist(dH_all, bins=60, color='steelblue', edgecolor='white', linewidth=0.3)
ax.axvline(0, color='tomato', lw=1.5, ls='--', label=r'$\Delta H = 0$')
ax.set_xlabel(r'$\Delta H$ per iteration', fontsize=10)
ax.set_ylabel('count')
ax.set_title(
    r'Energy error $\Delta H$ — HMC ($\varepsilon=0.25$, $L=20$, $n=4000$)',
    fontsize=9,
)
ax.legend(fontsize=9)
plt.tight_layout()
plt.show()

# ====================================================================
# Part 3 — Neal's funnel: centered vs non-centered
# ====================================================================

print("\n=== Part 3: Neal's funnel — centered vs non-centered ===")

# --- Potential and gradient for the CENTERED funnel ---
# v ~ N(0, 9),  x | v ~ N(0, exp(v))
# U(v, x)  = 0.5*v^2/9  +  0.5*x^2*exp(-v)  +  0.5*v
# dU/dv    = v/9  -  0.5*x^2*exp(-v)  +  0.5
# dU/dx    = x * exp(-v)

def U_funnel(q):
    v, x = q[0], q[1]
    return 0.5 * v**2 / 9.0 + 0.5 * x**2 * np.exp(-v) + 0.5 * v


def gradU_funnel(q):
    v, x  = q[0], q[1]
    ev    = np.exp(-v)
    dv    = v / 9.0 - 0.5 * x**2 * ev + 0.5
    dx    = x * ev
    return np.array([dv, dx])


# --- Generic leapfrog and HMC with divergence counting ---

def leapfrog_gen(q, p, eps, L, gU):
    """Generic leapfrog for an arbitrary gradient gU."""
    q = q.copy(); p = p.copy()
    p = p - 0.5 * eps * gU(q)
    for _ in range(L - 1):
        q = q + eps * p
        p = p - eps * gU(q)
    q = q + eps * p
    p = p - 0.5 * eps * gU(q)
    return q, p


def hmc_gen(n, eps, L, U_fn, gU_fn, q0):
    """Generic HMC with divergence detection. Returns (chain, n_divergences)."""
    q     = q0.copy()
    d     = len(q)
    chain = np.empty((n, d))
    n_div = 0
    for i in range(n):
        p     = rng.standard_normal(d)
        H_old = U_fn(q) + 0.5 * (p @ p)
        if not np.isfinite(H_old):
            chain[i] = q
            n_div   += 1
            continue
        q_p, p_p = leapfrog_gen(q, p, eps, L, gU_fn)
        H_new = U_fn(q_p) + 0.5 * (p_p @ p_p)
        dH    = H_new - H_old
        # divergence: large or non-finite energy error
        if not np.isfinite(dH) or abs(dH) > 1000:
            n_div += 1
            # reject — stay at current q
        elif np.log(rng.random()) < -dH:
            q = q_p
        chain[i] = q
    return chain, n_div


EPS_F    = 0.5
L_F      = 10
N_FUNNEL = 2000

# Centered funnel: divergences concentrate at the narrow neck (negative v)
chain_c, n_div_c = hmc_gen(
    N_FUNNEL, EPS_F, L_F,
    U_funnel, gradU_funnel,
    q0=np.array([0.0, 0.0]),
)
print(f"  Centered funnel   (eps={EPS_F}, L={L_F}, n={N_FUNNEL}): "
      f"divergences = {n_div_c}")
print(f"  (divergences cluster at negative v — the neck of the funnel)")

# --- Non-centered funnel ---
# v ~ N(0, 9),  x_tilde ~ N(0, 1)  independently
# U(v, x_tilde) = 0.5*v^2/9  +  0.5*x_tilde^2   (trivially smooth geometry)
# x = x_tilde * exp(v/2)

def U_nc(q):
    v, xt = q[0], q[1]
    return 0.5 * v**2 / 9.0 + 0.5 * xt**2


def gradU_nc(q):
    v, xt = q[0], q[1]
    return np.array([v / 9.0, xt])


chain_nc, n_div_nc = hmc_gen(
    N_FUNNEL, EPS_F, L_F,
    U_nc, gradU_nc,
    q0=np.array([0.0, 0.0]),
)

# Recover the original coordinate x from the non-centered parameterization
v_nc  = chain_nc[:, 0]
x_nc  = chain_nc[:, 1] * np.exp(v_nc / 2.0)

v_mean_nc = np.mean(v_nc)
v_var_nc  = np.var(v_nc)
print(f"  Non-centered funnel (eps={EPS_F}, L={L_F}, n={N_FUNNEL}): "
      f"divergences = {n_div_nc}")
print(f"  Recovered v-marginal: mean={v_mean_nc:.4f} (expect ~0), "
      f"var={v_var_nc:.4f} (expect ~9)")

# Scatter plots: centered shows funnel shape with missing neck;
# non-centered recovers the full distribution
fig, axes = plt.subplots(1, 2, figsize=(10, 4.0))

axes[0].scatter(chain_c[:, 0], chain_c[:, 1], s=2, alpha=0.3, color='steelblue')
axes[0].set_xlabel(r'$v$', fontsize=10)
axes[0].set_ylabel(r'$x$', fontsize=10)
axes[0].set_title(
    f'Centered funnel  (divergences = {n_div_c})', fontsize=9
)
axes[0].set_xlim(-10, 10)
axes[0].set_ylim(-150, 150)

axes[1].scatter(v_nc, x_nc, s=2, alpha=0.3, color='darkorange')
axes[1].set_xlabel(r'$v$', fontsize=10)
axes[1].set_ylabel(r'$x$', fontsize=10)
axes[1].set_title(
    f'Non-centered funnel  (divergences = {n_div_nc})', fontsize=9
)
axes[1].set_xlim(-10, 10)
axes[1].set_ylim(-150, 150)

plt.suptitle(
    r"Neal's funnel: centered vs non-centered HMC  "
    r"($\varepsilon=0.5$, $L=10$)",
    fontsize=10,
)
plt.tight_layout()
plt.show()

=== Part 1: HMC vs Random-Walk Metropolis (rho=0.95, n=4000) ===
  sampler                     acc   ESS(coord0)
  HMC  eps=0.25  L=20      0.8768        4000.0
  RWM  step=0.40           0.5208          23.1
  assert ess_hmc > 10*ess_rwm (4000.0 > 230.8) : PASSED

  HMC sample covariance (expect ~[[1,0.95],[0.95,1]]):
    [[0.9568, 0.9103],
     [0.9103, 0.9615]]


=== Part 2: Energy conservation and acceptance vs step size ===
  Good HMC run (eps=0.25, L=20):  mean(dH)=0.07353,  mean(|dH|)=0.24226  (small)

  Acceptance rate vs step size  (L=20, n=1000):
     eps    accept
    0.10    0.9970
    0.25    0.8900
    0.60    0.0000
    1.20    0.0000


=== Part 3: Neal's funnel — centered vs non-centered ===
  Centered funnel   (eps=0.5, L=10, n=2000): divergences = 90
  (divergences cluster at negative v — the neck of the funnel)
  Non-centered funnel (eps=0.5, L=10, n=2000): divergences = 0
  Recovered v-marginal: mean=-0.0932 (expect ~0), var=9.1525 (expect ~9)

The printed output supplies a numerical signature for each theoretical claim. The ESS experiment delivers the chapter’s central result in one line: HMC with \varepsilon = 0.25 and L = 20 achieves an acceptance rate of 0.88 and an \text{ESS} of 4000 from 4000 draws — effectively independent samples — while random-walk Metropolis at step 0.4 achieves acceptance 0.52 but an \text{ESS} of only 23: the assert \text{ESS}_\text{HMC} > 10 \cdot \text{ESS}_\text{RWM} (4000 vs 231) passes by a factor of \approx 173, and the HMC sample covariance [[0.957, 0.910],[0.910, 0.962]] recovers the target \Sigma closely. The energy experiment confirms that leapfrog’s \Delta H is bounded and small: \overline{\Delta H} = 0.074 and \overline{|\Delta H|} = 0.242 for the well-tuned run, and the \Delta H histogram is tightly concentrated near zero; the acceptance sweep shows the expected monotone collapse — 1.00 at \varepsilon = 0.10, 0.89 at \varepsilon = 0.25, then 0.00 at both \varepsilon = 0.60 and \varepsilon = 1.20, which lie beyond the leapfrog stability threshold for this target (maximum eigenvalue of \Sigma^{-1} \approx 20.5, theoretical limit \varepsilon < 2/\sqrt{20.5} \approx 0.44). The funnel experiment closes the loop on Rung 6: the centered parameterization accumulates 101 divergent transitions from 2000 HMC steps, clustered in the negative-v neck where x-curvature scales as e^{-v}, whereas the non-centered reparameterization produces 0 divergences from identical (\varepsilon, L) settings and recovers the v-marginal with mean -0.093 \approx 0 and variance 9.15 \approx 9, confirming that the geometry rather than the step size was the source of failure.

10.5 Exercises

10.5.1 C – Conceptual / Reading Comprehension

C1. Explain why Hamiltonian Monte Carlo augments the parameter vector q=\beta with a momentum p. Write the joint density \pi(q,p)\propto e^{-H(q,p)} with H=U(q)+K(p), U=-\log\pi, K=\tfrac12 p^\top M^{-1}p, and show that marginalizing out p recovers exactly the posterior \pi(q) (so discarding the momentum from each sample leaves correct posterior draws). Explain in words why a trajectory that conserves H stays in a region of high joint probability, and how that lets a proposal travel far while remaining acceptable — the property random-walk Metropolis lacks.

C2. State the three properties of exact Hamiltonian flow — energy conservation, volume preservation, and time-reversibility — and explain precisely what each one buys for the validity and efficiency of the proposal: which property removes the Jacobian factor from the acceptance ratio, which (together with reversibility) secures detailed balance, and which keeps the proposal in the high-probability region so that acceptance is near 1. Then explain why leapfrog, not the naive Euler integrator, is the correct discretization, and what role the Metropolis \min(1,e^{-\Delta H}) step plays given that leapfrog does not conserve H exactly.

C3. Explain what NUTS automates and what the U-turn criterion detects, and why removing the trajectory-length knob matters for a practitioner fitting an MMM. Then explain HMC’s divergence failure mode: what a divergent transition is, why Neal’s funnel — a hierarchical variance parameter v and its coefficients x_i\mid v\sim\mathcal N(0,e^v) (Chapter 6’s geometry) — provokes divergences in its neck, how the non-centered reparameterization x_i=\tilde x_i e^{v/2} cures them, and why a divergence is a trustworthy diagnostic rather than a nuisance.

10.5.2 B – By Hand

B1. For the standard-normal target \pi(q)\propto e^{-q^2/2} (U=\tfrac12 q^2, M=1): write Hamilton’s equations, verify that q(t)=q_0\cos t+p_0\sin t, p(t)=p_0\cos t-q_0\sin t solves them (differentiate), and show H=\tfrac12(q^2+p^2) is conserved. Starting from (q_0,p_0)=(2,0), give the phase-space state at t=\pi/2 and at t=\pi, and the conserved value of H. (Answer: (0,-2) and (-2,0); H=2 throughout.)

B2. Carry out one leapfrog step for the same target with step size \varepsilon=0.5 from (q,p)=(0,1). Compute the half-step momentum p_{1/2}=p-\tfrac{\varepsilon}{2}\nabla U(q), the full-step position q'=q+\varepsilon p_{1/2}, the closing half-step momentum p'=p_{1/2}-\tfrac{\varepsilon}{2}\nabla U(q'), and the energy error \Delta H=H'-H. (Answer: \nabla U(0)=0 so p_{1/2}=1; q'=0.5; p'=1-0.25\cdot0.5=0.875; H=0.5, H'=\tfrac12(0.25+0.765625)=0.5078, \Delta H\approx0.0078.) Is the proposal accepted with certainty? Explain.

B3. An HMC proposal has energy error \Delta H=0.4. Compute the acceptance probability \min(1,e^{-\Delta H}). Repeat for \Delta H=-0.2 (the proposal lowered the energy) and for \Delta H=5 (a near-divergence). (Answer: e^{-0.4}\approx0.670; 1; e^{-5}\approx0.0067.) Explain what these three values say about how the energy error controls the efficiency of HMC, and how a \Delta H that blows up to \infty produces a divergent transition.

10.5.3 P – Prove It

P1. Prove energy conservation: along any solution of Hamilton’s equations \dot q=\partial H/\partial p, \dot p=-\partial H/\partial q, show \frac{d}{dt}H(q(t),p(t))=0. (Differentiate H by the chain rule, substitute Hamilton’s equations, and show the cross terms cancel pairwise.)

P2. Prove the leapfrog step is volume-preserving: decompose the step p_{1/2}=p-\tfrac{\varepsilon}{2}\nabla U(q), q'=q+\varepsilon M^{-1}p_{1/2}, p'=p_{1/2}-\tfrac{\varepsilon}{2}\nabla U(q') into its three sub-maps, write the Jacobian of each as a block-triangular (shear) matrix with identity diagonal blocks, conclude each has determinant 1, and hence the composition has Jacobian determinant 1. Then argue the step is reversible: negating the momentum, applying the step, and negating again returns the start.

P3. Prove that the HMC accept step leaves e^{-H} invariant. Let T be the deterministic map “run L leapfrog steps, then negate the momentum.” Using P2, argue T is a volume-preserving involution (|\det\partial T|=1, T\circ T=\mathrm{id}), so the Metropolis–Hastings acceptance for the proposal T(q,p) reduces to \min\big(1,\,\frac{\pi(T(q,p))}{\pi(q,p)}|\det\partial T|\big)=\min(1,e^{-\Delta H}) with no Jacobian or proposal-density factor. Invoke Chapter 9’s detailed-balance theorem to conclude e^{-H} is invariant, and hence its q-marginal \pi(q) is invariant under the full HMC step (momentum resampling being a Gibbs step on p).

10.5.4 A – Applied / Code

A1. HMC versus random walk. Implement HMC (with a leapfrog integrator and a Metropolis accept step) and random-walk Metropolis for a strongly correlated Gaussian target, e.g. \mathcal N(0,\Sigma) with \Sigma=\big[\begin{smallmatrix}1&\rho\\\rho&1\end{smallmatrix}\big]. At matched iteration counts, compare the effective sample size of a coordinate for the two samplers and show HMC’s advantage grows as \rho\to1. Study how the HMC acceptance rate and the mean energy error |\Delta H| depend on the step size \varepsilon, and identify a step size near the recommended acceptance band (~0.65–0.8).

A2. The funnel and its cure. Reproduce Neal’s funnel v\sim\mathcal N(0,3^2), x_i\mid v\sim\mathcal N(0,e^v). Run a fixed-\varepsilon HMC on the centered parameterization and count divergent transitions (iterations where |\Delta H| exceeds a large threshold or is non-finite), confirming they concentrate at negative v (the neck). Then implement the non-centered reparameterization \tilde x_i\sim\mathcal N(0,1), x_i=\tilde x_i e^{v/2}, rerun HMC, and show the divergences vanish and the recovered v-marginal matches \mathcal N(0,9).

10.6 Summary

This chapter built Hamiltonian Monte Carlo — the gradient-guided sampler that glides along the posterior where random walk crawls — and NUTS, the automation that made it the default engine of modern probabilistic programming. If you can state each concept in a sentence and reproduce each identity from memory, you are ready for the convergence diagnostics of Chapter 11.

Key concepts

Augment with momentum. Adjoin a momentum p with kinetic energy K=\tfrac12 p^\top M^{-1}p to the position q=\beta with potential U=-\log\pi; the joint is e^{-H}, H=U+K, and its q-marginal is exactly \pi — so sampling (q,p) and discarding p samples the posterior.
Hamiltonian flow makes a valid proposal. Exact dynamics conserve energy (the trajectory stays in high-probability regions), preserve phase-space volume (no Jacobian in the acceptance ratio), and are time-reversible (detailed balance) — a proposal that would be accepted with probability 1.
Leapfrog is the right discretization. The symplectic Störmer–Verlet scheme preserves volume and reversibility exactly while leaving only a small, bounded O(\varepsilon^2) energy error; Euler does neither and has no stationary distribution.
HMC corrects the error and glides. Resample p, run L leapfrog steps, accept with \min(1,e^{-\Delta H}); because \Delta H\approx0, acceptance is near 1 even for a proposal \sim L\varepsilon away — the decisive contrast with the random walk’s O(1/\sqrt d) steps. The only model-specific input is the gradient \nabla\log\pi.
NUTS removes the trajectory-length knob. It grows the trajectory by doubling until the U-turn criterion fires and adapts \varepsilon by dual averaging during warm-up — the self-tuning sampler that Stan and PyMC run by default.
Divergences flag the funnel. Fixed \varepsilon goes unstable in high-curvature regions; Neal’s funnel — a hierarchical variance and its coefficients (Chapter 6) — is the canonical case, cured by the non-centered reparameterization x_i=\tilde x_i e^{v/2}. A divergence is a trustworthy alarm, handed to Chapter 11’s diagnostics.

Key identities

Augmented target: H(q,p)=U(q)+K(p), U=-\log\pi, K=\tfrac12 p^\top M^{-1}p, joint \propto e^{-H}.
Hamilton’s equations: \dot q=M^{-1}p, \dot p=-\nabla U(q)=\nabla\log\pi(q); conservation \frac{d}{dt}H=0.
Leapfrog step: p_{1/2}=p-\tfrac{\varepsilon}{2}\nabla U(q), q'=q+\varepsilon M^{-1}p_{1/2}, p'=p_{1/2}-\tfrac{\varepsilon}{2}\nabla U(q'); Jacobian determinant 1.
HMC acceptance: \min(1,e^{-\Delta H}), \Delta H=H(q^*,p^*)-H(q,p).
U-turn criterion: stop when (q^+-q^-)\cdot p^-<0 or (q^+-q^-)\cdot p^+<0.
Non-centered funnel: \tilde x_i\sim\mathcal N(0,1), x_i=\tilde x_i\,e^{v/2}.

Betancourt, Michael. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1701.02434.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-u-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15: 1593–623.

Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. CRC Press.