Model-Agnostic Adaptive Inference
I built a model-agnostic adaptive inference framework that cuts Stable Diffusion v1.5 latency without training or fine-tuning. It combines local LLM prompt complexity estimation with latent-convergence early stopping, reducing average runtime from 70.84s to 39.17s while preserving CLIP alignment.
Text-to-image systems are often bounded less by model quality than by runtime. Stable Diffusion v1.5 can produce strong outputs, but long denoising schedules push latency high enough to become a product constraint.
The challenge was to remove redundant inference work without retraining, fine-tuning, or tying the solution to a hack against one model's internals. The system had to stay practical, inspectable, and quality-aware at runtime.
I treated the problem as adaptive inference rather than static step pruning. Instead of assigning one denoising budget to every prompt, the framework estimates prompt complexity locally and allocates steps where the request is likely to need them.
That scheduling layer is paired with latent-convergence early stopping so the pipeline can terminate once additional denoising stops producing meaningful change. The result is a model-agnostic control loop that reduces latency without training or fine-tuning.
A local LLM scores prompt complexity in real time so the scheduler can distinguish simple generations from prompts that need a larger denoising budget.
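A minimal sketch of how such a score could be obtained and normalized. The instruction wording, the 0 to 10 scale, the fallback value, and the names `SCORING_TEMPLATE`, `parse_complexity`, and `score_prompt` are all assumptions for illustration; the write-up does not show the actual prompt or model interface. The local model is represented by any callable that takes a string and returns a string.

```python
import re

# Hypothetical instruction for the local LLM; the real wording is not shown here.
SCORING_TEMPLATE = (
    "Rate the visual complexity of this image prompt from 0 (trivial) "
    "to 10 (very complex). Reply with a single number.\n\nPrompt: {prompt}"
)


def parse_complexity(reply: str, default: float = 0.5) -> float:
    """Extract the first number in an LLM reply and normalize it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        # Assumed fallback: use a middle score when the reply has no number.
        return default
    # Divide by the assumed 0-10 scale, then clamp out-of-range replies.
    return min(max(float(match.group()) / 10.0, 0.0), 1.0)


def score_prompt(prompt: str, generate) -> float:
    """Score a prompt; `generate` is any callable str -> str wrapping the local model."""
    return parse_complexity(generate(SCORING_TEMPLATE.format(prompt=prompt)))
```

Clamping and the fallback keep a malformed reply from producing an out-of-range step budget downstream.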
The runtime controller maps prompt complexity to an inference schedule instead of applying one fixed step count to every request.
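One simple way to express that mapping is linear interpolation between a minimum and maximum step count. This is a sketch under assumptions: the function name `steps_for`, the [0, 1] score range, and the 20 and 50 step bounds are illustrative and are not taken from the project.

```python
def steps_for(complexity: float, min_steps: int = 20, max_steps: int = 50) -> int:
    """Linearly interpolate a denoising step budget from a score in [0, 1].

    The bounds are assumed values for illustration, not the project's settings.
    """
    c = min(max(complexity, 0.0), 1.0)  # clamp scores outside [0, 1]
    return min_steps + round(c * (max_steps - min_steps))
```

A linear map is the easiest schedule to inspect; a stepped or nonlinear map would slot into the same function signature.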
A convergence check monitors latent updates during sampling and exits early once additional steps stop producing meaningful progress.
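The check can be sketched as a relative-change test with a patience counter. Everything specific here is an assumption: the L2-based metric, the `threshold`, `patience`, and `min_steps` defaults, and the names `relative_change` and `ConvergenceMonitor`. A flat list of floats stands in for the latent tensor so the example runs without a deep learning library.

```python
import math


def relative_change(prev, curr, eps=1e-8):
    """L2 norm of the update divided by the L2 norm of the previous latent."""
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    base = math.sqrt(sum(p ** 2 for p in prev))
    return diff / (base + eps)


class ConvergenceMonitor:
    """Signals a stop once updates stay small for `patience` consecutive steps."""

    def __init__(self, threshold=0.01, patience=2, min_steps=10):
        self.threshold = threshold  # assumed cutoff for a "small" update
        self.patience = patience    # consecutive small updates required
        self.min_steps = min_steps  # never stop before this many steps
        self.prev = None
        self.quiet = 0
        self.seen = 0

    def should_stop(self, latent) -> bool:
        self.seen += 1
        if self.prev is not None:
            change = relative_change(self.prev, latent)
            # Reset the counter whenever an update is large again.
            self.quiet = self.quiet + 1 if change < self.threshold else 0
        self.prev = list(latent)
        return self.seen >= self.min_steps and self.quiet >= self.patience
```

In a sampling loop, `should_stop` would be called once per denoising step with the current latent, and the loop would break on the first `True`. The patience counter guards against exiting on a single quiet step.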
The evaluation loop compares adaptive runs against a fixed SD v1.5 baseline using runtime and CLIP alignment so latency claims stay tied to output quality.
An interactive interface exposes baseline and adaptive outputs side by side for fast inspection, prompt testing, and operator-facing demos.
Across the benchmark set, the adaptive pipeline reduced average generation time from 70.84 seconds to 39.17 seconds, a 44.7% reduction (roughly a 1.81x speedup).
CLIP alignment remained near-identical: 0.6556 for the adaptive pipeline versus 0.6565 for the fixed baseline, a difference of 0.0009.
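The reported averages reduce to a few lines of arithmetic. The sketch below uses only the four figures stated above; the helper names `latency_reduction` and `speedup` are illustrative.

```python
# Reported benchmark averages (seconds) and CLIP alignment scores.
baseline_s, adaptive_s = 70.84, 39.17
baseline_clip, adaptive_clip = 0.6565, 0.6556


def latency_reduction(baseline: float, adaptive: float) -> float:
    """Fraction of baseline runtime removed by the adaptive pipeline."""
    return (baseline - adaptive) / baseline


def speedup(baseline: float, adaptive: float) -> float:
    """How many times faster the adaptive pipeline is than the baseline."""
    return baseline / adaptive


clip_delta = adaptive_clip - baseline_clip

print(f"latency reduction: {latency_reduction(baseline_s, adaptive_s):.1%}")  # 44.7%
print(f"speedup: {speedup(baseline_s, adaptive_s):.2f}x")                     # 1.81x
print(f"CLIP delta: {clip_delta:+.4f}")                                       # -0.0009
print(f"relative CLIP change: {abs(clip_delta) / baseline_clip:.2%}")         # 0.14%
```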
The latency gains came from runtime control alone, which keeps the framework easy to port and cheaper to evaluate than training-heavy alternatives.
The Streamlit app made it easy to compare prompts, inspect outputs, and validate whether the scheduler was pruning compute in the right places.
The framework is model-agnostic by design, but the current validation is centered on Stable Diffusion v1.5. The next step is benchmarking the controller on newer diffusion backbones.
I want broader coverage across composition-heavy, style-sensitive, and edge-case prompts to measure whether adaptive scheduling stays conservative enough on them.
There is room to refine the latent-convergence signal so the controller exits earlier on easy prompts without clipping detail on harder generations.