The Sketch pipeline is designed to verify the correctness of LLM-generated assembly code by comparing it against source assembly at both semantic and symbolic levels.
- Custom Grammar (UniAssembly):
We use a hand-crafted ANTLR grammar (UniAssembly) to parse real-world ARM64 and RISC-V assembly into a structured abstract syntax tree (AST). This allows us to reason about instructions, labels, directives, and operands in a uniform way.
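The grammar itself is an ANTLR `.g4` file and is not reproduced here. As a rough, simplified stand-in for the structured output it yields, the sketch below classifies a single line and splits out the mnemonic and operands; the function name `parse_line` and the dictionary shape are illustrative, not the actual UniAssembly API.

```python
import re

def parse_line(line: str) -> dict:
    """Simplified stand-in for the UniAssembly parser: classify one assembly
    line and split it into mnemonic/operands. The real pipeline walks the
    ANTLR-generated AST instead of using string heuristics."""
    text = line.split("//")[0].strip()                 # drop ARM64-style comments
    if not text:
        return {"kind": "empty"}
    if text.endswith(":"):                             # e.g. "main:" or ".L3:"
        return {"kind": "label", "name": text[:-1]}
    if text.startswith("."):                           # e.g. ".word 4", ".align 2"
        return {"kind": "directive", "name": text.split()[0]}
    mnemonic, _, rest = text.partition(" ")
    operands = [op.strip() for op in rest.split(",")] if rest else []
    return {"kind": "instruction", "mnemonic": mnemonic.lower(),
            "operands": operands}

# Example: parse_line("addw a5, a5, a4")
# -> {'kind': 'instruction', 'mnemonic': 'addw', 'operands': ['a5', 'a5', 'a4']}
```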
- Attention-Based Mapping:
We start with attention maps from the model that relate predicted tokens to source tokens. These are used to align each predicted line with its most likely source line.
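A minimal sketch of this alignment step, assuming the token-level attention has already been aggregated into a line-level matrix of shape (predicted lines × source lines); how that aggregation is done is outside this sketch.

```python
import numpy as np

def align_lines(attn: np.ndarray) -> list[tuple[int, float]]:
    """For each predicted line, pick the source line receiving the most
    attention and report the normalized attention mass placed on it."""
    rows = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)   # normalize per predicted line
    best = rows.argmax(axis=1)                               # most likely source line
    weight = rows[np.arange(len(rows)), best]                # confidence of that choice
    return list(zip(best.tolist(), weight.tolist()))

# Example with 3 predicted lines and 4 source lines:
# attn = np.array([[0.70, 0.10, 0.10, 0.10],
#                  [0.05, 0.80, 0.10, 0.05],
#                  [0.10, 0.10, 0.10, 0.70]])
# align_lines(attn) -> [(0, 0.7), (1, 0.8), (3, 0.7)]
```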
- Line Comparison:
- For each aligned pair, we parse both lines using our grammar.
- We compare the type (e.g., instruction, label, directive).
- If both are instructions, we check semantic equivalence using a normalized instruction class (e.g., add, move, jump).
- Operands are compared with:
- Exact matching for immediates
- Heuristic/distance-based comparison for registers
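A minimal sketch of this comparison, operating on parsed-line dictionaries like the illustrative ones above; the instruction-class table, the register heuristic, and the 0.5/0.5 weighting are placeholders, not the pipeline's actual tables.

```python
# Illustrative instruction-class table; the real normalization covers far more
# mnemonics across ARM64 and RISC-V.
INSTR_CLASS = {
    "add": "add", "addw": "add", "addi": "add",
    "mov": "move", "mv": "move", "li": "move",
    "b": "jump", "j": "jump", "beq": "branch", "bne": "branch",
}

def same_class(m1: str, m2: str) -> bool:
    return INSTR_CLASS.get(m1) is not None and INSTR_CLASS.get(m1) == INSTR_CLASS.get(m2)

def operand_score(a: str, b: str) -> float:
    """Exact matching for immediates, a loose heuristic for registers."""
    ai, bi = a.lstrip("#"), b.lstrip("#")
    if ai.lstrip("-").isdigit() or bi.lstrip("-").isdigit():
        return 1.0 if ai == bi else 0.0        # immediates must match exactly
    if a == b:
        return 1.0                             # identical register
    return 0.5 if a[:1] == b[:1] else 0.2      # crude register "distance"

def compare_lines(pred: dict, src: dict) -> float:
    """Structural/semantic similarity of one aligned line pair, in [0, 1]."""
    if pred["kind"] != src["kind"]:
        return 0.0
    if pred["kind"] != "instruction":
        return 1.0 if pred == src else 0.5     # labels/directives: name match
    if not same_class(pred["mnemonic"], src["mnemonic"]):
        return 0.0
    n = max(len(pred["operands"]), len(src["operands"]), 1)
    total = sum(operand_score(a, b) for a, b in zip(pred["operands"], src["operands"]))
    return 0.5 + 0.5 * (total / n)             # class match is worth half, operands the rest
```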
- Line Scoring:
This gives us a confidence score for each predicted-to-source line match. Higher scores indicate stronger structural and semantic alignment.
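As a rough illustration only, the snippet below blends the structural comparison score with the attention weight of the matched source line; the 0.7/0.3 weighting is an assumption for the sketch, not the pipeline's actual formula.

```python
def line_confidence(structural: float, attention: float,
                    w_struct: float = 0.7, w_attn: float = 0.3) -> float:
    """Combine structural similarity and attention weight into one score;
    the weights here are illustrative, not tuned values from the pipeline."""
    return w_struct * structural + w_attn * attention

# e.g. line_confidence(compare_lines(pred, src), attn_weight_for_that_pair)
```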
- Finding Pure Blocks:
- From each matched instruction line, we try to extract a pure arithmetic block.
- A line is considered pure if it holds a single arithmetic instruction (no memory accesses, branches, or directives).
- We grow the block upward and downward until we hit a non-pure instruction.
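A minimal sketch of the block-growing step, assuming parsed-line dictionaries and an illustrative whitelist of pure mnemonics (the real check is driven by the grammar's instruction classes):

```python
# Illustrative whitelist of pure arithmetic mnemonics; the actual set is larger.
PURE = {"add", "addw", "addi", "sub", "subw", "mul", "and", "or", "xor",
        "sll", "srl", "sra", "slli", "srli", "srai", "sext.w", "mov", "mv", "li"}

def is_pure(parsed: dict) -> bool:
    return parsed["kind"] == "instruction" and parsed["mnemonic"] in PURE

def grow_block(lines: list[dict], seed: int) -> tuple[int, int]:
    """Grow a pure arithmetic block upward and downward from a matched line.
    Returns the half-open [start, end) range, empty if the seed is not pure."""
    if not is_pure(lines[seed]):
        return seed, seed
    start, end = seed, seed + 1
    while start > 0 and is_pure(lines[start - 1]):      # grow upward
        start -= 1
    while end < len(lines) and is_pure(lines[end]):     # grow downward
        end += 1
    return start, end
```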
- Block Verification (Z3):
- For each extracted block:
- We identify input registers (used before defined)
- The last defined register is treated as the block’s output
- Blocks are validated only if both predicted and source blocks have the same number of inputs.
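A minimal sketch of the input/output identification, assuming the usual `dst, src1, src2` operand order for pure arithmetic instructions; the helper name `block_signature` is illustrative.

```python
def block_signature(block: list[dict]) -> tuple[list[str], str]:
    """Return (input registers, output register) for a pure block:
    inputs are registers read before they are written, the output is the
    last register defined. Immediates are skipped; memory and branch forms
    never appear because blocks are pure arithmetic."""
    defined: set[str] = set()
    inputs: list[str] = []
    output = None
    for ins in block:
        dst, *srcs = ins["operands"]
        for reg in srcs:
            if reg.lstrip("#-").isdigit():
                continue                               # immediate, not an input register
            if reg not in defined and reg not in inputs:
                inputs.append(reg)                     # used before defined
        defined.add(dst)
        output = dst                                   # last definition wins
    return inputs, output

# Two blocks are only compared if len(inputs_pred) == len(inputs_src).
```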
- Symbolic Execution in Z3:
- Each block is converted into a symbolic function using Z3Py.
- Registers are modeled as BitVec variables.
- Execution is simulated using semantic mappings (e.g., add, xor, sext.w) translated to Z3 expressions.
- We use z3.ForAll(...) to assert that for all possible inputs, both blocks produce identical outputs.
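A minimal sketch of the symbolic check with Z3Py, using a deliberately tiny semantic table and 64-bit registers; the helper names and the parsed-block shape follow the illustrative sketches above and are not the pipeline's actual API.

```python
from z3 import BitVec, BitVecVal, Extract, ForAll, SignExt, Solver, sat

# Deliberately tiny semantic table; the real mapping covers many more mnemonics.
SEM = {
    "add":    lambda a, b: a + b,
    "xor":    lambda a, b: a ^ b,
    "addw":   lambda a, b: SignExt(32, Extract(31, 0, a + b)),  # RV64 32-bit add
    "sext.w": lambda a: SignExt(32, Extract(31, 0, a)),         # sign-extend low 32 bits
}

def symbolize(block, input_regs, sym_inputs):
    """Simulate a pure block over 64-bit BitVecs; return the expression
    computed for the block's output register (its last definition)."""
    env = dict(zip(input_regs, sym_inputs))        # i-th input register <- i-th symbol
    out = None
    for ins in block:
        dst, *srcs = ins["operands"]
        args = [BitVecVal(int(s.lstrip("#")), 64) if s.lstrip("#-").isdigit() else env[s]
                for s in srcs]                     # immediates become constants
        env[dst] = SEM[ins["mnemonic"]](*args)
        out = env[dst]
    return out

def blocks_equivalent(pred, pred_in, src, src_in) -> bool:
    """Assert with z3.ForAll that both blocks produce identical outputs for all
    inputs (inputs are paired positionally; assumes at least one input)."""
    xs = [BitVec(f"x{i}", 64) for i in range(len(pred_in))]
    solver = Solver()
    solver.add(ForAll(xs, symbolize(pred, pred_in, xs) == symbolize(src, src_in, xs)))
    return solver.check() == sat
```

If the check comes back unsatisfiable, the blocks disagree on at least one input; solving the negated, quantifier-free query (`symbolize(pred, ...) != symbolize(src, ...)`) and reading `solver.model()` yields a concrete counterexample, which is the signal used for flagging and, eventually, correction.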
This enables us to:
- Automatically flag semantically incorrect or suspicious generated blocks
- Prepare the groundwork for automated correction, guided by Z3 counterexamples and model attention scores
The end goal is a verification-aware generation loop, where symbolic mismatch acts as a correctness signal to further train, fine-tune, or patch the model’s outputs.
