SDPO: Self-Distillation Policy Optimization
SDPO is the algorithm that turns the user's free-form feedback ℱ into a
dense, token-level training signal for the planner: no reward model, no
preference annotations. It's introduced by Buening et al. (2026) and used by
NanoResearch (Xu et al., 2026, §3.3.2) for per-user planner specialisation.
The setup
We have:
- A planner $\pi_\theta$, implemented here as Qwen2.5-7B with a per-user LoRA adapter.
- An input $x$: the orchestrator's prompt (system + retrieved skills/memories + topic).
- A response $y \sim \pi_\theta(\cdot \mid x)$: the planner's initial trajectory.
- Free-form user feedback ℱ, e.g. "prefer simpler methods, drop the MeSH-term arm."

We want $\pi_\theta$ to produce outputs as if it had seen ℱ, even when ℱ isn't in the prompt.
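For orientation, one buffered training example bundles exactly these pieces. A minimal sketch of such a record follows; the field names and `FeedbackRecord` class are hypothetical, not the actual schema in the repo:

```python
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One buffered SDPO training example (illustrative field names)."""
    prompt: str    # x: system prompt + retrieved skills/memories + topic
    response: str  # y ~ pi_theta(. | x): the planner's initial trajectory
    feedback: str  # F: the free-form user critique
    user_id: str   # selects the per-user LoRA adapter to update

record = FeedbackRecord(
    prompt="<orchestrator prompt for the literature-review stage>",
    response="<planner trajectory, including the MeSH-term arm>",
    feedback="Prefer simpler methods. Drop the MeSH-term arm.",
    user_id="u-042",
)
```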
The trick
Treat the feedback-conditioned model as a self-teacher and the unconditioned model as the student. Both are the same model, just run through two different forward passes.
```mermaid
flowchart LR
    x[("Input x")] --> S["Student forward<br/>π_θ(· | x, y_&lt;t)"]
    x --> Tplus["Teacher forward<br/>π_θ(· | x, ℱ, y_&lt;t)<br/>(no grad)"]
    S -->|"log p(y_t)"| Loss((Loss))
    Tplus -->|"log p(y_t)"| Loss
    Loss -->|"gradient through<br/>LoRA only"| W["Update LoRA"]
```
The math (Eq. 14–15)
Per-token advantage (with stop-gradient on the teacher):
\[
A_t^{\mathrm{SDPO}} = \log \pi_\theta(\hat{y}_t \mid x, \mathcal{F}, y_{<t}) \;-\; \log \pi_\theta(\hat{y}_t \mid x, y_{<t})
\]

Policy-gradient loss (minimise):

\[
\mathcal{L}_{\mathrm{SDPO}} = - \mathbb{E}_y \left[ \sum_t \mathrm{stop\_grad}\!\left(A_t^{\mathrm{SDPO}}\right) \cdot \log \pi_\theta(\hat{y}_t \mid x, y_{<t}) \right]
\]

Only the student log-probability carries a gradient. The teacher's
log-probability is .detach()ed. Together with the chain rule through the
LoRA layers, this nudges the student to produce, without ever seeing ℱ,
the same distribution that the teacher (which did see ℱ) would have
produced.
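As a quick sign check: suppose that for some token $\hat{y}_t$ the teacher assigns $\log \pi_\theta(\hat{y}_t \mid x, \mathcal{F}, y_{<t}) = -0.5$ while the student assigns $\log \pi_\theta(\hat{y}_t \mid x, y_{<t}) = -2.0$. Then $A_t^{\mathrm{SDPO}} = 1.5 > 0$, and minimising $\mathcal{L}_{\mathrm{SDPO}}$ raises the student's log-probability of that token. Conversely, a token the feedback-conditioned teacher likes less than the student gets a negative advantage and is pushed down.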
In code
```python
# src/nanoresearch/planner/sdpo.py (excerpt)
import torch
from torch.nn.functional import log_softmax

# R slices out the logit positions that predict each response token
# (shifted one step left of response_ids in the concatenated sequence).
student_input = torch.cat([student_prompt_ids, response_ids], dim=1)
teacher_input = torch.cat([teacher_prompt_ids, response_ids], dim=1)

# 1. Student: gradient-tracked forward pass, no feedback in the prompt.
student_logits = model(student_input).logits
student_logp_y = log_softmax(student_logits[..., R, :], dim=-1) \
    .gather(-1, response_ids[..., None]).squeeze(-1)

# 2. Teacher: same weights, feedback-conditioned prompt, stop-gradient.
with torch.no_grad():
    teacher_logits = model(teacher_input).logits
    teacher_logp_y = log_softmax(teacher_logits[..., R, :], dim=-1) \
        .gather(-1, response_ids[..., None]).squeeze(-1)

# 3. Per-token advantage (Eq. 15): teacher minus student log-prob, clamped.
advantage = (teacher_logp_y - student_logp_y).detach().clamp(-5, 5)

# 4. Policy-gradient loss (Eq. 14).
loss = -(advantage * student_logp_y).mean()
loss.backward()  # gradients flow only through the LoRA adapter
```
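Gradients stay confined to the adapter because only the LoRA weights are trainable in the first place, and the optimiser only sees trainable parameters. A minimal sketch of that setup and of one SDPO round, assuming a peft-wrapped base model; `base_model`, `feedback_batches`, and the LoRA hyperparameters here are illustrative, not the repo's actual code:

```python
# Illustrative LoRA + optimiser setup; values and names are assumptions.
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, lora_cfg)   # only adapter weights require grad

optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

for step, batch in enumerate(feedback_batches):  # buffered feedback records
    if step >= 50:                               # max steps per round
        break
    optimizer.zero_grad()
    # ... run the student/teacher forwards and loss.backward() from the excerpt ...
    optimizer.step()
```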
What "feedback-conditioned" actually means
We avoid the simplest construction (appending ℱ as a separate user turn)
because many chat templates reject consecutive user messages. Instead, we
merge the feedback into the most recent user turn:

```text
…
[USER FEEDBACK ON PRIOR ANSWER]
Prefer simpler methods. Drop the MeSH-term arm.
[END FEEDBACK]
Internalise this feedback when answering.
```
This keeps user/assistant alternation valid for both Qwen and Llama-family
chat templates, and is exercised by the tiny-Llama fixture in
tests/test_sdpo.py.
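A minimal sketch of that merge, assuming a Hugging Face tokenizer with a chat template; `build_teacher_prompt` and the message handling are illustrative, not the exact NanoResearch helper:

```python
def build_teacher_prompt(tokenizer, messages, feedback):
    """Merge free-form feedback into the most recent user turn (illustrative)."""
    teacher_messages = [dict(m) for m in messages]
    # Append the feedback block to the last user turn, so user/assistant
    # alternation stays valid for the chat template.
    for m in reversed(teacher_messages):
        if m["role"] == "user":
            m["content"] += (
                "\n\n[USER FEEDBACK ON PRIOR ANSWER]\n"
                f"{feedback}\n"
                "[END FEEDBACK]\n"
                "Internalise this feedback when answering."
            )
            break
    return tokenizer.apply_chat_template(
        teacher_messages, tokenize=False, add_generation_prompt=True
    )
```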
Hyperparameters
| Param | Default | Notes |
|---|---|---|
| Learning rate | 1e-4 | LoRA only, AdamW |
| Max steps per round | 50 | One round = one batch of buffered feedback records |
| Max response tokens trained | 512 | Truncate to bound memory on M1 Max |
| Advantage clip | ±5 | Stabilises early updates |
| Gradient accumulation | 1 | Increase if VRAM-tight |
| Sequence length cap | 2048 | Left-truncate longer prompts |
Hardware budget on Apple Silicon
On an M1 Max with 32 GB, a single LoRA forward+backward pass at sequence length ~512 and rank 16 uses roughly 10 GB of unified memory with fp16 weights and gradient checkpointing. Each SDPO step takes ~5–8 seconds, so a 50-step round finishes in 3–7 minutes, comfortable to run after every feedback batch.
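For reference, a typical way to get fp16 weights and gradient checkpointing on Apple Silicon with Hugging Face models looks roughly like the following; the checkpoint name and loading details are assumptions, and the repo's actual model setup may differ:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative loading path for MPS; the exact checkpoint and options used
# by the repo may differ (e.g. a quantised base model).
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.float16,
).to("mps")
model.gradient_checkpointing_enable()  # trade recompute for activation memory
```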
When does it fire?
Orchestrator.maybe_train_planner(user_id, min_examples=N) flushes the
feedback queue and dispatches one SDPO round. The chat interface exposes this
as /train (or the assistant suggests it once the buffered feedback reaches 5
or more records).
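In a chat session this boils down to something like the sketch below; only the maybe_train_planner(user_id, min_examples=...) call is named above, and the surrounding driver code, session object, and boolean return value are assumptions:

```python
# Hypothetical driver logic around the /train command.
if user_message.strip() == "/train":
    trained = orchestrator.maybe_train_planner(
        user_id=session.user_id, min_examples=5
    )
    reply = (
        "Planner adapter updated from your buffered feedback."
        if trained
        else "Not enough buffered feedback yet (need at least 5 records)."
    )
```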
Why not DPO?
DPO needs pairwise preferences: prefer A over B. SDPO needs only the
absolute response y and the feedback ℱ. For a research assistant where
each user gives one rich textual critique per stage, SDPO is a much
cleaner fit.