Incorrect Decoder Compilation for DEIMv2 on neuronx #116

@cricket1

Description

Summary

The AWS Neuron compiler (torch-neuronx) produces incorrect outputs when compiling a transformer-based object detection decoder that uses a discrete attention mechanism. The compiled model produces vastly different logits and bounding boxes compared to PyTorch, resulting in zero detections on real images.

Environment

# System info
OS: Ubuntu (Linux 6.8.0-1040-aws)
Instance: inf2.xlarge (compilation also tested on a CPU instance)

# Python packages (confirmed versions)
Python: 3.10.12
torch: 2.8.0
torch-neuronx: 2.8.0.2.10.16998+e9bf8a50
torch-xla: 2.8.1
neuronx-cc: 2.21.33363.0+82129205
torchvision: 0.23.0

Problem Description

Expected Behavior

When compiling a transformer decoder with discrete attention to Neuron format, the compiled model should produce outputs numerically close to PyTorch (within typical BF16 precision tolerance of ~1e-2).

Actual Behavior

The Neuron-compiled decoder produces completely incorrect outputs:

  • Logits max difference: 7.84 (compared to ~1e-2 expected)
  • Logits mean difference: 1.25 (compared to ~1e-4 expected)
  • Relative difference: 160,087% (!)
  • Detection count: 0 detections (PyTorch: 60 detections)
  • Top-5 confidence scores: [0.25, 0.24, 0.23, 0.21, 0.21] (PyTorch: [0.93, 0.90, 0.89, 0.84, 0.84])
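
The figures above come from a direct tensor comparison between the PyTorch and Neuron outputs. A minimal sketch of such a check (the function name and tolerance are illustrative, not taken from the repro scripts):

```python
import torch

def compare_outputs(ref_logits, neuron_logits, atol=1e-2):
    # Element-wise absolute difference between reference and compiled outputs.
    diff = (ref_logits - neuron_logits).abs()
    return {
        "max_diff": diff.max().item(),
        "mean_diff": diff.mean().item(),
        # Relative difference, guarding against division by zero.
        "rel_diff_pct": 100.0 * (diff / ref_logits.abs().clamp(min=1e-8)).mean().item(),
        "within_tolerance": diff.max().item() <= atol,
    }
```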

Components Affected

  • Backbone (HGNetv2): ✓ Works correctly (max diff ~0.7)
  • Encoder (HybridEncoder): ✓ Works correctly (max diff ~0.18)
  • Decoder (DEIMTransformer with discrete attention): ✗ Completely broken

Reproduction

1. Model Architecture

The decoder is a DETR-style transformer decoder with:

  • Multi-scale deformable attention using discrete indexing (not F.grid_sample)
  • 3 decoder layers
  • 300 object queries
  • 80 classes (COCO)
  • 2 feature levels

Key detail: The model uses cross_attn_method: 'discrete' which implements deformable attention via discrete tensor indexing instead of F.grid_sample() (which is not supported by Neuron).
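
A rough sketch of what "discrete" sampling means here: round each sampling location to the nearest feature-map cell and gather it, instead of bilinear interpolation via `F.grid_sample`. This illustrates the technique only; the shapes and the function name are assumptions, not the DEIMTransformer implementation:

```python
import torch

def discrete_sample(value, sampling_locations):
    """Nearest-cell gather as a stand-in for F.grid_sample.

    value: (N, C, H, W) feature map
    sampling_locations: (N, P, 2) normalized (x, y) coords in [0, 1]
    returns: (N, C, P) sampled features
    """
    N, C, H, W = value.shape
    # Map normalized coords to integer pixel indices (nearest cell, clamped).
    x = (sampling_locations[..., 0] * W).floor().long().clamp(0, W - 1)
    y = (sampling_locations[..., 1] * H).floor().long().clamp(0, H - 1)
    # Flatten spatial dims and gather one feature vector per sampling point.
    flat = value.flatten(2)                            # (N, C, H*W)
    idx = (y * W + x).unsqueeze(1).expand(-1, C, -1)   # (N, C, P)
    return flat.gather(2, idx)                         # (N, C, P)
```

The gather-based formulation avoids `F.grid_sample`, which, as noted above, Neuron does not support.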

2. Full Reproduction

Complete reproduction code available at: /home/ubuntu/sh_deimv2/

To reproduce:

# Convert decoder to Neuron
python convert_components_neuronx.py \
    --checkpoint models/deimv2_hgnetv2_n_coco.pth \
    --config configs/deimv2/deimv2_hgnetv2_n_coco.yml \
    --component all

# Verify (shows the bug)
python verify_neuronx_on_image.py --image example.jpg

Investigation Results

What We've Tried

  1. Different compiler optimizations: Tested -O0, -O1, -O2
  2. FP32 precision: --fp32-cast=all, --fp32-cast=matmult
  3. Disabled auto-casting: --auto-cast=none
  4. Model type hint: --model-type=transformer
  5. Various combinations: See try_neuronx_compiler_settings.py

None of the above produced correct outputs.
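
For reference, the flag combinations above were passed at trace time via `compiler_args`. A minimal sketch (the decoder module and input shapes here are placeholders, not the actual DEIMTransformer signature), requiring the Neuron toolchain to run:

```python
import torch
import torch_neuronx

# Placeholder module standing in for the DEIMTransformer decoder.
decoder = torch.nn.Linear(256, 80).eval()
example_input = torch.randn(1, 300, 256)

# One of the attempted flag combinations, passed via compiler_args.
neuron_decoder = torch_neuronx.trace(
    decoder,
    example_input,
    compiler_args=["--model-type=transformer", "--auto-cast=none", "-O1"],
)
```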

Key Observations

  1. Backbone and encoder compile correctly - The issue is isolated to the decoder
  2. PyTorch discrete attention works perfectly - The implementation is correct in PyTorch
  3. Error is not precision-related - Differences are 1000x larger than BF16 tolerance
  4. Consistent failure - Fails on both random inputs and real images
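
Observation 3 can be sanity-checked numerically: a BF16 round-trip on unit-scale values loses only on the order of 1e-2, far below the observed 7.84 max difference (a sketch, assuming roughly unit-scale logits):

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000)

# Worst-case error introduced by a BF16 round-trip (~7 mantissa bits).
bf16_err = (x - x.to(torch.bfloat16).to(torch.float32)).abs().max().item()

# The observed max logit difference dwarfs any plausible precision loss.
observed_max_diff = 7.84
```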

Labels: Inf2, bug, compilation