GPU-driven rendering: merge GPU buffers to eliminate per-VItem CPU submission overhead

## Background

The current rendering pipeline maintains separate GPU buffers and bind groups for each VItem. Every frame requires per-VItem calls to `write_buffer`, `set_bind_group`, and `draw`.

Benchmarks from [`benches/benches/gpu_render.rs`](https://github.com/AzurIce/ranim/blob/main/benches/benches/gpu_render.rs) on the main branch show that **CPU submission is the dominant bottleneck**, not the GPU:

| VItem count | GPU total (submit+wait) | CPU submit (no GPU wait) | GPU pure (diff) |
|---|---|---|---|
| 25 | 5.6 ms | 1.6 ms | ~4.0 ms |
| 100 | 8.9 ms | 4.1 ms | ~4.8 ms |
| 400 | 22.3 ms | 25.2 ms | — (CPU already exceeds GPU) |
| 1600 | 88.8 ms | 92.7 ms | — |
| 3600 | 256 ms | 220 ms | ~36 ms |

CPU cost scales roughly linearly at **~55 μs per VItem**. At 1600 VItems, CPU accounts for ~89% of total frame time. Root cause: at 3600 VItems, 7 `write_buffer` calls per VItem add up to 25,200 API calls per frame, each with fixed overhead (locking, validation, staging allocation).
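The arithmetic behind these figures can be reproduced with a small sketch. The function and constant names here are hypothetical and not part of ranim; only the 7 writes per VItem and the table values come from this issue. Dividing the "CPU submit" column by the VItem count gives roughly 41 to 64 μs per item across the five rows.

```rust
/// Number of `write_buffer` calls issued per VItem per frame (figure from this issue).
pub const WRITES_PER_ITEM: u64 = 7;

/// Total `write_buffer` calls per frame for `item_count` VItems.
/// Hypothetical helper, for illustration only.
pub fn write_buffer_calls(item_count: u64) -> u64 {
    item_count * WRITES_PER_ITEM
}

/// Average CPU submission cost per VItem in microseconds,
/// given a measured CPU submit time in milliseconds.
pub fn cpu_us_per_item(cpu_submit_ms: f64, item_count: u64) -> f64 {
    cpu_submit_ms * 1000.0 / item_count as f64
}
```

For example, `write_buffer_calls(3600)` gives the 25,200 calls quoted above, and `cpu_us_per_item(220.0, 3600)` gives the per-item cost for the last table row.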

## Approach

Adopt a **GPU-driven rendering** strategy that merges all VItem data into a single set of contiguous GPU buffers:

1. **Per-frame rebuild**: each frame, pack every VItem's points, `fill_rgbas`, `stroke_rgbas`, and `stroke_widths` into large contiguous buffers, and build an `ItemInfo` index table that records each item's offset and count
2. **Instanced drawing**: use `draw(0..4, 0..N)` to render all N VItems in a single draw call; the vertex shader uses `instance_index` to look up per-item clip box and plane data
3. **Binary search in compute shader**: each compute thread binary-searches `item_infos` to determine which item it belongs to, then performs 3D→2D projection and atomic clip box updates

This reduces the number of draw calls, bind group switches, and `write_buffer` calls per frame from O(N) in the VItem count to O(1).
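Steps 1 and 3 can be sketched on the CPU side as follows. This is a minimal illustration, not the implementation in the PR: the field names `point_offset` and `point_count`, the `VItemData` type, and the `[f32; 4]` point layout are assumptions. Only the points buffer is packed here, and `find_item` mirrors in Rust the lookup a compute thread would perform in the shader.

```rust
/// Hypothetical index-table entry: where one item's points live in the packed buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ItemInfo {
    pub point_offset: u32,
    pub point_count: u32,
}

/// Hypothetical CPU-side item holding only its points (layout assumed).
pub struct VItemData {
    pub points: Vec<[f32; 4]>,
}

/// Packs all items' points into one contiguous vector and records
/// each item's offset and count in an index table.
pub fn pack_items(items: &[VItemData]) -> (Vec<[f32; 4]>, Vec<ItemInfo>) {
    let total: usize = items.iter().map(|item| item.points.len()).sum();
    let mut points = Vec::with_capacity(total);
    let mut infos = Vec::with_capacity(items.len());
    for item in items {
        infos.push(ItemInfo {
            point_offset: points.len() as u32,
            point_count: item.points.len() as u32,
        });
        points.extend_from_slice(&item.points);
    }
    (points, infos)
}

/// Binary-searches the index table for the item that owns a global point index.
/// Returns `None` if the index lies past the end of the packed data.
pub fn find_item(infos: &[ItemInfo], point_index: u32) -> Option<usize> {
    // Offsets are non-decreasing, so the predicate is true for a prefix of the table.
    let after = infos.partition_point(|info| info.point_offset <= point_index);
    let idx = after.checked_sub(1)?;
    let info = infos[idx];
    (point_index < info.point_offset + info.point_count).then_some(idx)
}
```

With two items of 3 and 2 points, the second item gets `point_offset: 3`, and global point indices 0 to 2 resolve to item 0 while 3 and 4 resolve to item 1. Each lookup costs O(log N) in the item count, which is what lets a single dispatch cover all points without any per-item bindings.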

## Related

- PR: #138
