feat: integrate Marlin/AllSpark INT8 W8A16 quantization strategy #25
Merged
luozixin2 merged 1 commit into SJTU-DENG-Lab:feat/kv-cache-fp8-support (Jan 16, 2026)
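Main additions:
1. **Marlin/AllSpark INT8 W8A16 quantization strategy integration**:
- Added linear_marlin_int8_w8a16.py: implements a W8A16 quantization strategy based on the vLLM AllSpark kernel
- Added diffulex_kernel/csrc/marlin/: AllSpark CUDA kernels vendored from vLLM
* allspark_qgemm_w8a16.cu: W8A16 fused GEMM kernel
* allspark_repack.cu: N32K16 weight repacking kernel
* allspark_utils.cuh: utility functions and data structures
* torch_bindings_marlin.cpp: PyTorch C++ bindings
- Added diffulex_kernel/python/marlin_ops.py: Python interface for JIT-compiling and loading the Marlin/AllSpark kernels
2. **Quantization strategy registry updates**:
- Added support for the 'marlin' alias in registry.py (maps to marlin_int8)
- Imported the new strategy in strategies/__init__.py
3. **Performance improvements**:
- The Marlin W8A16 strategy significantly improves prefill throughput (from 4518.92 tok/s to 9520.91 tok/s, about 2.1x)
- Decode throughput is close to the BF16 baseline (23.16 tok/s vs 23.36 tok/s)
- Can be combined with the FP8 KV cache
4. **Other improvements**:
- Optimized the implementations of several quantization strategies
- Improved KV cache management
- Enhanced the profiler
- Added several benchmark configuration files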
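
For reviewers less familiar with W8A16 (INT8 weights, 16-bit activations), the sketch below is a pure-Python numerical reference of what such a linear layer computes. It is not the code in this PR: the function names are hypothetical, per-output-channel symmetric scaling is an assumption about the quantization scheme, and FP16 rounding stands in for whichever 16-bit activation type the kernel actually uses. The AllSpark kernel fuses these steps on the GPU and works on a repacked weight layout, which this sketch does not model.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per output channel (row).

    Assumption for illustration: the PR's actual scheme may differ.
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a16_linear(x, q_weight, scales):
    """Compute y[m][n] = scales[n] * sum_k x[m][k] * q_weight[n][k].

    Activations and outputs are rounded to 16-bit; weights stay INT8.
    """
    out = []
    for x_row in x:
        x16 = [to_fp16(v) for v in x_row]
        out.append([
            to_fp16(s * sum(a * q for a, q in zip(x16, q_row)))
            for q_row, s in zip(q_weight, scales)
        ])
    return out


def reference_linear(x, weight):
    """Unquantized baseline: y = x @ weight^T."""
    return [[sum(a * w for a, w in zip(x_row, w_row)) for w_row in weight]
            for x_row in x]


# Tiny example: 2 output channels, 3 input features, 1 token.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [[1.0, 2.0, 3.0]]

q_weight, scales = quantize_per_channel(weight)
y_q = w8a16_linear(x, q_weight, scales)
y_ref = reference_linear(x, weight)
max_err = max(abs(a - b) for a, b in zip(y_q[0], y_ref[0]))
```

Comparing `y_q` against `y_ref` this way gives a quick accuracy check that could sit alongside the throughput numbers above; the error is bounded by half a quantization step per weight plus the 16-bit rounding.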
Merged commit 55b8b4d into SJTU-DENG-Lab:feat/kv-cache-fp8-support
2 checks passed