[AAAI2026] Quality‑Aware Language‑Conditioned Local Auto‑Regressive Anomaly Synthesis and Detection
QARAD couples a language‑conditioned, mask‑local autoregressive editor with a quality‑aware re‑weighting scheme to synthesize realistic, precisely located anomalies and train stronger anomaly detectors.
QARAD is a two‑component framework for industrial anomaly detection:
- ARAS (Auto‑Regressive Anomaly Synthesis): a training‑free, language‑guided, mask‑local editor that injects fine‑grained defects only where you ask, while freezing the surrounding context to preserve micro‑structure and material continuity.
- QAW (Quality‑Aware Weighting): a simple, detector‑agnostic re‑weighting that amplifies high‑consistency synthetic samples (measured via image–text alignment) and down‑weights low‑consistency ones, stabilizing optimization and improving generalization.
Together, these form QARAD, a synthesis‑plus‑training pipeline that delivers controllable, realistic defects and robust, accurate detectors across standard benchmarks.
- Mask‑Local, Language‑Conditioned Editing (ARAS). We introduce a hard‑gated autoregressive operator over VQ latents that freezes all tokens outside a user‑provided mask and samples only within it, conditioned on a natural‑language prompt. This guarantees exact locality and context invariance, enabling precise, text‑guided defect placement with sub‑pixel fidelity.
- Quality‑Aware Re‑Weighting (QAW). We compute an image–text similarity for each synthetic sample and convert it into a continuous weight for the detector's loss. High‑consistency syntheses receive larger gradients; low‑consistency ones are softly attenuated, reducing gradient variance while preserving diversity.
- Decoupled, Plug‑and‑Play Design. ARAS is training‑free and can be dropped into existing AD pipelines; QAW is detector‑agnostic and only changes training weights, not model architectures.
- Strong Accuracy & Efficiency. Across MVTec AD, VisA, and BTAD, QARAD delivers consistent gains at both image and pixel level, while offering a significant speed advantage over diffusion‑based anomaly synthesis.
- Token‑anchored masked sampling. A hard gate keeps all context tokens intact; only masked tokens are resampled, conditioned on the prompt (see the sketch after this list).
- Language control. Prompts specify the type, shape, size, color, and position of the defect; small edits to the prompt yield smooth variations.
- Micro‑structure fidelity. Because context tokens are frozen, the synthesized region inherits high‑frequency material statistics (grain, weave, gloss) from its surroundings, with no seam artifacts.
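The hard gate can be summarized in a few lines. Below is a minimal, illustrative sketch, not the released implementation: `ar_model`, `vq_codec`, their call signatures, and the raster‑scan sampling order are all assumptions standing in for the actual autoregressive prior and VQ tokenizer.

```python
import torch
import torch.nn.functional as F

def downsample_mask(pixel_mask: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
    """Project a binary pixel-space mask onto the VQ token grid."""
    m = pixel_mask[None, None].float()            # (1, 1, H_img, W_img)
    return F.interpolate(m, size=grid_hw, mode="nearest")[0, 0].bool()

@torch.no_grad()
def aras_masked_sample(ar_model, vq_codec, image, pixel_mask, prompt_emb,
                       temperature: float = 1.0) -> torch.Tensor:
    """Hard-gated, mask-local autoregressive editing (illustrative sketch)."""
    tokens = vq_codec.encode(image)               # (H, W) grid of discrete token ids
    gate = downsample_mask(pixel_mask, tuple(tokens.shape))

    edited = tokens.clone()
    h, w = tokens.shape
    for i in range(h):                            # raster-scan autoregressive order
        for j in range(w):
            if not gate[i, j]:
                continue                          # hard gate: context token stays frozen
            logits = ar_model(edited, (i, j), cond=prompt_emb)   # (vocab_size,)
            probs = torch.softmax(logits / temperature, dim=-1)
            edited[i, j] = torch.multinomial(probs, 1).item()    # resample in-mask only

    # Unmasked token ids are bit-identical to the input, so the decoded context
    # keeps the original micro-structure and no seam can appear at the boundary.
    return vq_codec.decode(edited)
```

Because every position outside the gate is skipped outright, locality holds by construction rather than by a soft attention bias, and varying `prompt_emb` (e.g., from "a thin diagonal scratch near the top edge") steers the defect's attributes.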
- Per‑sample reliability. Compute an image–text similarity for each synthetic sample and map it through a monotone calibration (e.g., softmax) to obtain a weight (see the sketch after this list).
- Variance reduction. High‑quality syntheses dominate the gradient; low‑quality outliers are softly down‑weighted, stabilizing training without discarding data.
- Drop‑in upgrade. Works with standard detectors and training loops.
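A minimal sketch of the weighting step, assuming CLIP‑style image/text embeddings and a softmax calibration with a temperature; the exact similarity model and calibration used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def qaw_weights(image_emb: torch.Tensor, text_emb: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """Quality-aware weights from per-sample image-text alignment.

    image_emb, text_emb: (B, D) embeddings of each synthetic image and its
    prompt from any vision-language encoder (e.g., CLIP); the softmax
    calibration and temperature here are assumptions, not the paper's recipe.
    """
    sims = F.cosine_similarity(image_emb, text_emb, dim=-1)  # (B,) alignment scores
    w = torch.softmax(sims / temperature, dim=0)             # monotone calibration
    return w * w.numel()                                     # rescale: mean weight is 1

def weighted_detector_loss(per_sample_loss: torch.Tensor,
                           weights: torch.Tensor) -> torch.Tensor:
    """Drop-in replacement for a plain mean loss over synthetic samples."""
    return (weights.detach() * per_sample_loss).mean()
```

Since only the per‑sample loss weights change, this slots into any detector's training loop; `detach()` keeps the quality scores out of the gradient path.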
- Evaluated on MVTec AD, VisA, and BTAD.
- Demonstrates consistent improvements at image‑level and pixel‑level detection compared to augmentation‑based and diffusion‑based synthesis pipelines.
- Efficiency: ARAS avoids iterative denoising, delivering substantial speed gains in synthesis while keeping detector inference unchanged.
Please see the paper for full quantitative tables, ablations, and qualitative visualizations.
- Exact Locality + Context Preservation: By editing only masked tokens and freezing context, ARAS eliminates low‑res bottlenecks and boundary seams that often mislead detectors.
- Semantic Faithfulness: Language conditioning provides continuous control over defect attributes beyond coarse categories.
- Optimization with Signal, Not Noise: QAW focuses learning on prompt‑consistent synthetic samples, improving robustness and generalization.
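Putting the two sketches together, a single training step might look like the following. Here `detector`, `clip_image`, `clip_text`, and the batch field names are hypothetical; this merely composes the illustrative functions above, not the repository's actual entry point.

```python
import torch

def training_step(detector, optimizer, batch, ar_model, vq_codec,
                  clip_image, clip_text):
    # 1) ARAS: synthesize defects only inside each provided mask.
    fakes = torch.stack([
        aras_masked_sample(ar_model, vq_codec, img, msk, pe)
        for img, msk, pe in zip(batch["images"], batch["masks"],
                                batch["prompt_embs"])
    ])
    # 2) QAW: score prompt consistency per sample, turn it into loss weights.
    w = qaw_weights(clip_image(fakes), clip_text(batch["prompts"]))
    # 3) Standard detector update; only the loss weighting differs from a
    #    vanilla training loop.
    loss = weighted_detector_loss(
        detector.per_sample_loss(fakes, batch["masks"]), w)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```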
If you find QARAD useful for your research, please cite:
```bibtex
@misc{qian2025qualityawarelanguageconditionedlocalautoregressive,
      title={Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection},
      author={Long Qian and Bingke Zhu and Yingying Chen and Ming Tang and Jinqiao Wang},
      year={2025},
      eprint={2508.03539},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.03539},
}
```

This project builds upon the open-source codebases of the following works. We are grateful to their authors and communities:
- RealNet — code: cnulab/RealNet • paper: arXiv:2403.05897
- Infinity — code: FoundationVision/Infinity • project: foundationvision.github.io/infinity.project • paper: arXiv:2412.04431
We extended and adapted their implementations for our setting—many thanks to the original authors and the open-source community.
For questions or collaborations, please open an issue on the repository or contact me: qianlong2024@ia.ac.cn.