taimur-10x commented on Dec 15, 2025

Summary

This PR adds repacking and GEMM/GEMV kernels for several quantization types on RVV (VLEN=256).

Key Changes

  • Added quantize_mat for 4x8 and 4x1 (scalar)
  • Added repacking RVV GEMM and GEMV kernels for:
    • Q2_K
    • Q4_0
    • Q4_K
    • IQ4_NL
    • Q8_0
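
Repacking rearranges the weight blocks of 4 consecutive rows into an interleaved layout so the GEMM/GEMV micro-kernels can stream all 4 rows with unit-stride vector loads. The sketch below is illustrative only: the block layout, interleave granularity, and names are assumptions for exposition, not the structures added by this PR.

```c
#include <stdint.h>
#include <string.h>

#define QK 32  /* weights per block (Q4_0-style) -- assumption for illustration */

/* One Q4_0-like block: an fp16 scale (stored as raw bits) plus 32 packed 4-bit weights. */
typedef struct { uint16_t d; uint8_t qs[QK / 2]; } block_q4;

/* Interleaved group holding the same block position from 4 consecutive rows,
 * so a 4xN GEMM tile can fetch all 4 rows' quants with contiguous loads. */
typedef struct { uint16_t d[4]; uint8_t qs[4 * QK / 2]; } block_q4x4;

/* Pack one block column from 4 rows into an interleaved group. The 8-byte
 * interleave step is an illustrative choice, not the PR's exact layout. */
static void repack_block_4rows(const block_q4 *rows[4], block_q4x4 *out) {
    const int chunk = 8;  /* bytes moved per row per step */
    for (int r = 0; r < 4; ++r) out->d[r] = rows[r]->d;
    for (int c = 0; c < (QK / 2) / chunk; ++c)
        for (int r = 0; r < 4; ++r)
            memcpy(out->qs + (c * 4 + r) * chunk, rows[r]->qs + c * chunk, chunk);
}
```

The quantize_mat additions presumably prepare the activation operand in a matching 4-row layout (4x8 and 4x1) so both sides of the matmul arrive in tile order.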

Tile Sizes

| VLEN | Tiling | LHS | RHS | OUT |
| --- | --- | --- | --- | --- |
| 128 | 4, 8, 1 | 4x1 | 8x1 | 4x8 |
| 256 | 4, 16, 1 | 4x1 | 16x1 | 4x16 |
| 512 | 4, 32, 1 | 4x1 | 32x1 | 4x32 |
| 1024 | 4, 64, 1 | 4x1 | 64x1 | 4x64 |
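
The tile width scales linearly with the vector length: the LHS micro-panel stays 4 rows tall while the RHS/OUT width doubles with each doubling of VLEN. A minimal sketch of that relationship (the function and the VLEN/16 divisor are inferred from the table above, not taken from the PR's code):

```c
/* Illustrative mapping from VLEN to the tile shape listed above. */
typedef struct { int lhs_rows, rhs_cols, depth; } tile_shape;

static tile_shape tile_for_vlen(int vlen_bits) {
    tile_shape t;
    t.lhs_rows = 4;               /* 4x1 LHS micro-panel at every VLEN          */
    t.rhs_cols = vlen_bits / 16;  /* 128 -> 8, 256 -> 16, 512 -> 32, 1024 -> 64 */
    t.depth    = 1;               /* one block along K per inner iteration      */
    return t;
}
```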

Testing

The kernels were functionally tested on QEMU at VLEN = 128 and 256 bits across a range of input sizes.
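
For reference, a functional run of this kind can be reproduced with user-mode QEMU along these lines; the CPU flags, binary path, and op filter are illustrative, not a command taken from this PR:

```sh
# Illustrative only: exercise the matmul paths under QEMU with a 256-bit vector unit.
qemu-riscv64 -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 \
    ./build/bin/test-backend-ops test -o MUL_MAT
```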

Benchmarking Results

End-to-end benchmarking was performed on a Banana Pi BPI-F3 (VLEN=256).
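
Numbers of this shape are typically gathered with llama-bench, along these lines; the model filename and thread count are placeholders, and this is not necessarily the exact command used for the tables below:

```sh
# Illustrative only: sweep the prompt sizes and generation lengths reported below.
./build/bin/llama-bench -m tinyllama-1.1b.Q4_0.gguf \
    -p 32,64,128,256,512 -n 10,16,32,64,100 -t 4
```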

Q2_K

Prompt Processing

| Model | Prompt Size | Repack GEMM 4x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama Q2_K 1.1B | 32 | 19.69 | 9.69 |
| Tinyllama Q2_K 1.1B | 64 | 18.64 | 8.56 |
| Tinyllama Q2_K 1.1B | 128 | 16.05 | 7.88 |
| Tinyllama Q2_K 1.1B | 256 | 16 | 7.58 |
| Tinyllama Q2_K 1.1B | 512 | 16.01 | 7.47 |

Token Generation

| Model | Tokens Generated | Repack GEMV 1x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama Q2_K 1.1B | 10 | 10.29 | 6.54 |
| Tinyllama Q2_K 1.1B | 16 | 10.58 | 6.16 |
| Tinyllama Q2_K 1.1B | 32 | 9.09 | 5.66 |
| Tinyllama Q2_K 1.1B | 64 | 7.46 | 5.5 |
| Tinyllama Q2_K 1.1B | 100 | 7.01 | 5.17 |

Q4_0

Prompt Processing

| Model | Prompt Size | Repack GEMM 4x8x8 (Tok/s) | Repack GEMM 4x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- | --- |
| Tinyllama Q4_0 1.1B | 32 | 27.80 | 34.98 | 9.62 |
| Tinyllama Q4_0 1.1B | 64 | 27.84 | 33.76 | 9.28 |
| Tinyllama Q4_0 1.1B | 128 | 28.21 | 34.99 | 9.49 |
| Tinyllama Q4_0 1.1B | 256 | 27.57 | 34.59 | 9.54 |
| Tinyllama Q4_0 1.1B | 512 | 27.43 | 32.34 | 9.51 |

Token Generation

| Model | Tokens Generated | Repack GEMV 1x8x8 (Tok/s) | Repack GEMV 1x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- | --- |
| Tinyllama Q4_0 1.1B | 10 | 10.39 | 9.86 | 6.93 |
| Tinyllama Q4_0 1.1B | 16 | 10.44 | 10.05 | 6.87 |
| Tinyllama Q4_0 1.1B | 32 | 10.71 | 9.33 | 6.94 |
| Tinyllama Q4_0 1.1B | 64 | 9.95 | 9.83 | 6.94 |
| Tinyllama Q4_0 1.1B | 100 | 10.66 | 9.85 | 6.97 |

Q4_K

Prompt Processing

| Model | Prompt Size | Repack GEMM 4x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama Q4_K 1.1B | 32 | 30.03 | 12.89 |
| Tinyllama Q4_K 1.1B | 64 | 30.05 | 13.09 |
| Tinyllama Q4_K 1.1B | 128 | 30.60 | 12.98 |
| Tinyllama Q4_K 1.1B | 256 | 30.53 | 12.94 |
| Tinyllama Q4_K 1.1B | 512 | 28.86 | 12.55 |

Token Generation

| Model | Tokens Generated | Repack GEMV 1x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama Q4_K 1.1B | 10 | 9.49 | 9.05 |
| Tinyllama Q4_K 1.1B | 16 | 9.54 | 9.17 |
| Tinyllama Q4_K 1.1B | 32 | 9.61 | 9.07 |
| Tinyllama Q4_K 1.1B | 64 | 9.44 | 8.84 |
| Tinyllama Q4_K 1.1B | 100 | 9.18 | 9.16 |

IQ4_NL

Prompt Processing

| Model | Prompt Size | Repack GEMM 4x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama IQ4 NL 1.1B | 32 | 22.72 | 9.72 |
| Tinyllama IQ4 NL 1.1B | 64 | 21.30 | 9.76 |
| Tinyllama IQ4 NL 1.1B | 128 | 22.19 | 9.02 |
| Tinyllama IQ4 NL 1.1B | 256 | 22.49 | 8.99 |
| Tinyllama IQ4 NL 1.1B | 512 | 21.02 | 8.86 |

Token Generation

| Model | Tokens Generated | Repack GEMV 1x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama IQ4 NL 1.1B | 10 | 8.96 | 7.94 |
| Tinyllama IQ4 NL 1.1B | 16 | 9.01 | 7.68 |
| Tinyllama IQ4 NL 1.1B | 32 | 8.94 | 7.45 |
| Tinyllama IQ4 NL 1.1B | 64 | 8.82 | 7.36 |
| Tinyllama IQ4 NL 1.1B | 100 | 8.94 | 7.24 |

Q8_0

Prompt Processing

| Model | Prompt Size | Repack GEMM 4x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama Q8_0 1.1B | 32 | 20.53 | 7.41 |
| Tinyllama Q8_0 1.1B | 64 | 19.11 | 7.53 |
| Tinyllama Q8_0 1.1B | 128 | 20.09 | 7.86 |
| Tinyllama Q8_0 1.1B | 256 | 20.23 | 7.50 |
| Tinyllama Q8_0 1.1B | 512 | 19.46 | 7.04 |

Token Generation

| Model | Tokens Generated | Repack GEMV 1x16x1 (Tok/s) | Vec Dot (Tok/s) |
| --- | --- | --- | --- |
| Tinyllama Q8_0 1.1B | 10 | 5.70 | 5.72 |
| Tinyllama Q8_0 1.1B | 16 | 5.78 | 5.75 |
| Tinyllama Q8_0 1.1B | 32 | 5.73 | 5.74 |
| Tinyllama Q8_0 1.1B | 64 | 5.37 | 5.59 |
| Tinyllama Q8_0 1.1B | 100 | 5.52 | 5.71 |

Future Work

Subsequent PRs will extend these kernels to other VLENs.

taimur-10x marked this pull request as draft on December 15, 2025
taimur-10x self-assigned this on Dec 15, 2025
github-actions bot added the ggml label on Dec 15, 2025
taimur-10x force-pushed the 10x/riscv-quant-repack branch 8 times, most recently from 0627813 to c17aec6 on December 30, 2025
taimur-10x force-pushed the 10x/riscv-quant-repack branch 4 times, most recently from 7cd1506 to bc5df04 on January 7, 2026
taimur-10x force-pushed the 10x/riscv-quant-repack branch 2 times, most recently from e81a7e4 to c4ab03e on January 12, 2026
taimur-10x marked this pull request as ready for review on January 12, 2026
taimur-10x force-pushed the 10x/riscv-quant-repack branch from c4ab03e to 5bc5184 on January 14, 2026
taimur-10x commented:

@luhenry, @xctan, could this be reviewed please? Thank you.

taimur-10x force-pushed the 10x/riscv-quant-repack branch 2 times, most recently from 65c4fce to 30ba1cf on January 26, 2026
taimur-10x force-pushed the 10x/riscv-quant-repack branch from 30ba1cf to fbfe444 on January 26, 2026
Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>
taimur-10x force-pushed the 10x/riscv-quant-repack branch from fbfe444 to eba0839 on January 26, 2026
taimur-10x merged commit c413339 into master on Jan 26, 2026
53 of 75 checks passed