Low-Resource Guidance for Controllable Latent Audio Diffusion

Zachary Novack#♭¶   Zack Zukowski   CJ Carr   Julian Parker  
Zach Evans   Josiah Taylor   Taylor Berg-Kirkpatrick#   Julian McAuley#   Jordi Pons  

#University of California, San Diego
♭Stability AI
¶Work done while an intern at Stability AI

Abstract

Generative audio requires fine-grained controllable outputs, yet most existing methods either require model retraining on specific controls or rely on inference-time controls (e.g., guidance) that can be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost-per-step due to decoder backpropagation, we introduce a guidance-based approach through selective TFG and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step while requiring minimal training resources (7M parameters and ≈ 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations thereof) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance.
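To make the cost argument concrete, the sketch below illustrates the core idea of latent-space guidance under simplified assumptions: a hypothetical LatCH is stood in for by a linear map `W` from the latent `z` to a control feature, so the guidance gradient can be applied to `z` directly, with no decoder forward or backward pass. The names `W`, `latch_guidance_step`, and the quadratic control loss are illustrative stand-ins, not the paper's actual heads or objective.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, control_dim = 64, 8

# Stand-in for a small trained latent-control head (here: a linear map).
W = rng.standard_normal((control_dim, latent_dim)) * 0.1

def latch_guidance_step(z, c, step_size=0.2):
    """One guidance update computed entirely in latent space.

    Control loss L(z) = ||W z - c||^2 has gradient 2 W^T (W z - c),
    so the latent is nudged toward the target control c without ever
    decoding to audio (the expensive step in end-to-end guidance).
    """
    residual = W @ z - c
    grad = 2.0 * W.T @ residual
    return z - step_size * grad

# Toy run: guidance drives the control loss down over a few steps.
z = rng.standard_normal(latent_dim)
c = rng.standard_normal(control_dim)
loss_before = np.linalg.norm(W @ z - c)
for _ in range(50):
    z = latch_guidance_step(z, c)
loss_after = np.linalg.norm(W @ z - c)
```

In an actual diffusion sampler this update would be interleaved with the denoising steps; the point of the sketch is only that the gradient flows through a 7M-parameter head rather than the full decoder.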


Random Generations

This section presents random audio samples generated by the following methods: End-to-End (E2E), Readouts, Forward-Simulated LatCHs (LatCH-F), and Backward-Simulated LatCHs (LatCH-B). For each sample, we also show the prompt and control used. For beats / pitch controls, we show the reference audio from which the controls were extracted. For volume controls, we include a description of the target curve's shape in each section.