TL;DR
Unimodal policies, such as deterministic policies, reduce the behavior to its center of mass and therefore lead to misleading regularization. We instead harness the flexibility of diffusion models and compute the regularization as the accumulated discrepancies between the reverse diffusion directions of the actor and the behavior diffusion. This motivates BDPO, our efficient algorithm for optimizing diffusion policies.
Abstract
Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.
🧗 Difficulties of Combining Diffusion Policies and Offline RL
The popular behavior-regularized RL framework optimizes a policy $\pi$ while constraining it to stay close to a behavior policy $\nu$: $$\max_\pi\ \mathbb{E}_{a\sim\pi(\cdot|s)}\left[Q(s, a)\right]-\eta D_\mathrm{KL}(\pi\|\nu).$$ By specifying $\nu$ as ...
- the uniform policy $U[\mathcal{A}]$: it encourages diversity and exploration of the policy, recovering Maximum Entropy RL;
- the dataset collection policy $\pi_{\mathcal{D}}$: it restricts the policy to center around the dataset $\mathcal{D}$, balancing optimization and safety;
- the pre-trained or supervised fine-tuned (SFT) policy $\pi_{\mathrm{ref}}$: it prevents mode collapse during post-training.
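To make the objective concrete, here is a minimal PyTorch sketch of the behavior-regularized actor loss for the Gaussian case, where the KL term is available in closed form. The names `q_net`, `actor`, and `behavior` are hypothetical placeholders, not part of BDPO; the sketch only illustrates the objective above.

```python
# Minimal sketch of the behavior-regularized objective for Gaussian policies,
# where KL(pi || nu) has a closed form. Names are hypothetical placeholders.
import torch
import torch.distributions as D

def gaussian_regularized_loss(q_net, actor, behavior, states, eta=1.0):
    """Actor loss: maximize Q(s, a) - eta * KL(pi(.|s) || nu(.|s))."""
    mu_pi, std_pi = actor(states)          # assumed to return mean / std
    mu_nu, std_nu = behavior(states)
    pi = D.Normal(mu_pi, std_pi)
    nu = D.Normal(mu_nu, std_nu)

    actions = pi.rsample()                              # reparameterized sample
    kl = D.kl_divergence(pi, nu).sum(dim=-1)            # closed-form Gaussian KL
    q_values = q_net(states, actions).squeeze(-1)

    # Gradient descent on the negated objective.
    return -(q_values - eta * kl).mean()
```

For diffusion policies, the closed-form KL used here is exactly what is missing, which leads to the two problems below.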
❌ Problem 1: Regularization Calculation
Traditional regularization, such as the KL divergence, requires action densities, which are not directly available for diffusion policies.
❌ Problem 2: Diffusion Policy Optimization
Traditional training objectives of diffusion models require samples from the target distribution, while in offline RL, we only have access to samples from the behavior policy $\nu$.
🧩 The Ingredients of BDPO
1. Pathwise KL Regularization
Instead of considering the KL divergence between action distributions, we consider the KL divergence defined on diffusion paths, which can be further decomposed into the accumulated discrepancies in reverse diffusion directions: $$ \begin{aligned} &D_{\rm KL}(p^{\pi, s}_{0:N} \| p^{\nu, s}_{0:N})\\ &=\mathbb{E}\left[\log \frac{p_N^{\pi,s}(a^N)\prod_{n=1}^Np_{n-1|n}^{\pi,s}(a^{n-1}|a^n)}{p_N^{\nu,s}(a^N)\prod_{n=1}^Np_{n-1|n}^{\nu,s}(a^{n-1}|a^n)}\right]\\ &=\mathbb{E}\left[\sum_{n=1}^{N}D_{\rm KL}(p^{\pi, s,a^n}_{n-1|n}\|p^{\nu, s,a^n}_{n-1|n})\right]\\ &=\mathbb{E}\left[\sum_{n=1}^{N}\frac{\|\mu^{\pi,s}_n(a^n)-\mu^{\nu,s}_n(a^n)\|^2}{2\sigma_n^2}\right]. \end{aligned} $$ We therefore consider the pathwise KL regularized objective: $$ \max_{p^\pi}\ \mathbb{E}_{a^0\sim p^{\pi,s}}\left[Q(s, a^0)\right]-\eta D_{\rm KL}(p^{\pi, s}_{0:N} \| p^{\nu, s}_{0:N}). $$
✔ We proved the equivalence between pathwise KL regularization and the original KL regularization.
✔ The pathwise KL is consistent with the KL between two diffusion SDEs in the limit of infinitesimal step sizes.
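To illustrate the decomposition, below is a hedged PyTorch sketch that estimates the pathwise KL by rolling out the actor's reverse diffusion and accumulating the per-step Gaussian discrepancies. The interfaces `actor_mean`, `behavior_mean`, and `sigmas` are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: Monte-Carlo estimate of the pathwise KL between the actor's and
# the behavior's reverse diffusion, using the per-step Gaussian KL
#   ||mu_pi_n(a^n) - mu_nu_n(a^n)||^2 / (2 * sigma_n^2).
# `actor_mean`, `behavior_mean`, and `sigmas` are hypothetical placeholders.
import torch

def pathwise_kl(actor_mean, behavior_mean, sigmas, state, action_dim, n_steps):
    """Sample one reverse path a^N -> a^0 from the actor and accumulate the KL."""
    a_n = torch.randn(state.shape[0], action_dim)      # a^N ~ N(0, I), shared prior
    kl = torch.zeros(state.shape[0])
    for n in range(n_steps, 0, -1):
        mu_pi = actor_mean(state, a_n, n)               # actor's reverse mean
        mu_nu = behavior_mean(state, a_n, n)            # behavior's reverse mean
        sigma = sigmas[n]                               # fixed reverse-step std
        kl = kl + ((mu_pi - mu_nu) ** 2).sum(-1) / (2 * sigma ** 2)
        # Step the actor's reverse kernel: a^{n-1} ~ N(mu_pi, sigma^2 I).
        a_n = mu_pi + sigma * torch.randn_like(mu_pi)
    return kl   # one-sample estimate of D_KL(p^{pi,s}_{0:N} || p^{nu,s}_{0:N})
```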
2. Two-time-scale Actor-Critic
Now we are dealing with a two-time-scale RL problem. The upper level operates over environment time steps, where the single-step reward is the environment reward minus the KL penalty;
the lower level operates entirely within the diffusion time steps, where the per-step KL acts as the reward.
Similar to how temporal difference (TD) learning amortizes the cost of trajectory optimization, we employ two-time-scale TD learning (a minimal code sketch follows this list):
1. (Upper Level): The action value functions $Q^\pi(s, a)$ are estimated by standard TD learning:
$$Q^\pi(s, a)\leftarrow R(s, a) + \gamma\,\mathbb{E}_{a'^{0:N}\sim p^{\pi,s'}_{0:N}}\left[Q^\pi(s', a'^0)-\eta\sum_{n=1}^N D_{\rm KL}(p^{\pi,s',a'^n}_{n-1|n}\|p^{\nu,s',a'^n}_{n-1|n})\right]$$
2. (Lower Level): The diffusion value functions $V^{\pi, s}_n$ are estimated by TD learning over single-step reverse diffusion transitions:
$$
\begin{aligned}
&V_0^{\pi,s}(a^0)\leftarrow Q^\pi(s,a^0),\\
&V_n^{\pi,s}(a^n)\leftarrow-\eta D_{\rm KL}(p^{\pi,s,a^n}_{n-1|n}\|p^{\nu,s,a^n}_{n-1|n})+\mathbb{E}_{a^{n-1}\sim p^{\pi,s,a^n}_{n-1|n}}\left[V_{n-1}^{\pi,s}(a^{n-1})\right].
\end{aligned}
$$
3. (Policy Improvement): The diffusion policy now requires only a single diffusion step to improve, where $\ell^{\pi,s}_n(a^n)$ denotes the single-step KL term $D_{\rm KL}(p^{\pi,s,a^n}_{n-1|n}\|p^{\nu,s,a^n}_{n-1|n})$:
$$
\begin{aligned}
&\max_{p^{\pi,s,a^n}_{n-1|n}}\ \ -\eta\ell^{\pi,s}_n(a^n) + \underset{p^{\pi,s,a^n}_{n-1|n}}{\mathbb{E}}\left[V^{\pi,s}_{n-1}(a^{n-1})\right].
\end{aligned}
$$
✔ We proved the convergence of the two-time-scale actor-critic algorithm.
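The sketch below spells out the two lower-level pieces in PyTorch: the TD target for the diffusion value function $V^{\pi,s}_n$ and the single-step policy improvement loss. It is an illustrative sketch under assumed interfaces (`v_net`, `q_net`, `actor_mean`, `behavior_mean`, `sigmas`), not the official BDPO code.

```python
# Hedged sketch of the two-time-scale targets (assumed interfaces): a diffusion
# value net V_n^{pi,s} trained on single-step reverse-diffusion transitions, and
# a policy step that only differentiates through one reverse-diffusion step.
import torch

def step_kl(mu_pi, mu_nu, sigma):
    # Per-step Gaussian KL between actor and behavior reverse kernels.
    return ((mu_pi - mu_nu) ** 2).sum(-1) / (2 * sigma ** 2)

def diffusion_value_target(v_net, q_net, actor_mean, behavior_mean, sigmas,
                           state, a_n, n, eta):
    """Lower-level TD target for V_n^{pi,s}(a^n)."""
    with torch.no_grad():
        mu_pi = actor_mean(state, a_n, n)
        mu_nu = behavior_mean(state, a_n, n)
        sigma = sigmas[n]
        a_prev = mu_pi + sigma * torch.randn_like(mu_pi)   # a^{n-1} ~ p^{pi}
        if n == 1:                                          # V_0 = Q(s, a^0)
            next_v = q_net(state, a_prev).squeeze(-1)
        else:
            next_v = v_net(state, a_prev, n - 1)
        return -eta * step_kl(mu_pi, mu_nu, sigma) + next_v

def policy_improvement_loss(v_net, q_net, actor_mean, behavior_mean, sigmas,
                            state, a_n, n, eta):
    """Single-step policy objective: maximize -eta * KL_n + E[V_{n-1}(a^{n-1})]."""
    mu_pi = actor_mean(state, a_n, n)                       # differentiable
    with torch.no_grad():
        mu_nu = behavior_mean(state, a_n, n)
    sigma = sigmas[n]
    a_prev = mu_pi + sigma * torch.randn_like(mu_pi)        # reparameterized step
    if n == 1:
        next_v = q_net(state, a_prev).squeeze(-1)
    else:
        next_v = v_net(state, a_prev, n - 1)
    return (eta * step_kl(mu_pi, mu_nu, sigma) - next_v).mean()
```

Because each actor update only backpropagates through one reverse-diffusion step, the cost per update does not grow with the number of diffusion steps $N$.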
🔍 Key Empirical Findings
Synthetic 2D Tasks
1. Dots and lines depict the diffusion generation path of the policy: BDPO accurately fits the target distribution (rightmost column).
2. The background color depicts the landscape of the diffusion value function over the entire space: in the initial steps ($n=50$ to $n=40$), the sample movement is subtle, while in the later steps ($n=30$ to $n=0$), the samples rapidly converge to the nearest modes of the data.
D4RL Benchmark Results

Diffusion-based methods, particularly those with a diffusion-based actor and regularization (including BDPO, DAC, and Diffusion-QL), substantially outperform their non-diffusion counterparts, especially on locomotion tasks. Meanwhile, BDPO consistently achieves superior performance across nearly all datasets, underscoring the effectiveness of combining diffusion policies with the behavior-regularized RL framework.
Runtime Analysis
BDPO runs efficiently.
BDPO consists of three key steps: pretraining behavior diffusion, training value functions ($Q^\pi$ and $V^{\pi,s}_n$), and optimizing the actor.
The pretraining phase takes around 8 minutes and is therefore negligible.
The $Q^\pi$ training follows the same approach as Diffusion-QL and DAC, while $V^{\pi,s}_n$ introduces acceptable overhead since it only requires single-step diffusion.
For actor training, both DAC and BDPO use single-step diffusion, while Diffusion-QL needs to back-propagate through the entire diffusion path, leading to a significantly higher runtime.
🏁 Closing Remarks
- Framing the reverse process of diffusion models as an MDP, we propose to compute the KL regularization with respect to the diffusion generation path rather than the clean action samples;
- Building upon this foundation, we propose a two-time-scale actor-critic method to optimize diffusion policies. Instead of differentiating the policy along the entire diffusion path, BDPO estimates the values at intermediate diffusion steps to amortize the optimization, offering efficient computation, a convergence guarantee, and state-of-the-art performance;
- Experiments conducted on synthetic 2D datasets reveal that our method effectively approximates the target distribution. Furthermore, when applied to continuous control tasks provided by D4RL, BDPO demonstrates superior performance compared to baseline offline RL algorithms.
${\bf B\kern-.05em{\small I\kern-.025em B}\kern-.08em T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}$
@article{gao2025behavior,
title={Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning},
author={Gao, Chen-Xiao and Wu, Chenyang and Cao, Mingjun and Xiao, Chenjun and Yu, Yang and Zhang, Zongzhang},
journal={arXiv preprint arXiv:2502.04778},
year={2025}
}
This website template is based on scaling-crl.github.io by Kevin Wang.