This repository contains the code, methods, and scripts for implementing Superfloat Quantization and Lottery Ticket Hypothesis (LTH) techniques for optimizing neural networks. The repository focuses on various quantization algorithms, model evaluations, and fine-tuning techniques to minimize perplexity and stabilize activations.
Superfloat is a custom quantization algorithm that operates with a scalable precision format. Unlike traditional floating-point systems (IEEE-754), Superfloat removes the mantissa entirely and focuses solely on the exponent for precision representation.
- Sign-Exponent Representation:
  - Superfloat (SFx) uses 1 bit for the sign and allocates the remaining x-1 bits to the exponent.
  - For instance, in SF16:
    - 1 bit → Sign
    - 15 bits → Exponent
- Clamping Range:
  - All values are clamped within the range [-1, 1]. This ensures activation and parameter stability, reducing the likelihood of exploding or vanishing gradients.
- Bit-width Flexibility:
  - Superfloat supports variable precision formats, scaling between 3-bit and 16-bit:
    - Lower precision (e.g., SF4) → faster computation, reduced model size.
    - Higher precision (e.g., SF16) → improved accuracy while maintaining efficient quantization.
- Gradient and Activation Capping:
  - To stabilize the training process, gradients and activations are capped at -1 and +1 (a minimal sketch follows this list).
  - Saves precision without a significant drop in accuracy.
  - Reduces computational complexity compared to traditional floating-point representations.
  - Allows adaptive scaling for diverse quantization requirements.
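A minimal sketch of this capping, assuming a PyTorch model (the hook-based gradient clamp and the helper names are illustrative, not the repository's training code):

```python
import torch

def cap_gradients(model: torch.nn.Module) -> None:
    """Clamp every parameter's gradient to [-1, 1] during backpropagation."""
    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(lambda grad: grad.clamp(-1.0, 1.0))

def cap_activations(x: torch.Tensor) -> torch.Tensor:
    """Clamp a layer's output to [-1, 1]; call this on activations in forward()."""
    return x.clamp(-1.0, 1.0)
```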
Conversion: FP32 → SF(4-16)
A standard 32-bit floating-point number is converted into a custom Superfloat representation with a variable-sized mantissa in five steps (a code sketch follows the list):
1. Clamp Input Range – The input value is restricted to the range (-1, 1). If the value exceeds this, it is set to a predefined maximum value.
2. Extract Sign Bit – The sign bit is determined and stored separately, while the value is converted to its absolute form.
3. Compute Mantissa – The fractional value is scaled by 2^mantissa_bits to convert it into an integer representation.
4. Bit Packing – The sign bit and mantissa are arranged into a custom format, with the mantissa shifted to fit within a float-sized bit structure.
5. Bitwise Reinterpretation – The constructed bit pattern is reinterpreted as a floating-point number and returned.
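A minimal Python sketch of these five steps. The 11-bit default mantissa width and the left-aligned bit layout are assumptions for illustration; the exact packing lives in Quant_Dequant.ipynb:

```python
import struct

def fp32_to_sf(value: float, mantissa_bits: int = 11) -> float:
    """Encode an FP32 value into an SF(x)-style bit pattern (illustrative layout)."""
    max_val = 1.0 - 2.0 ** -mantissa_bits          # largest representable magnitude

    # 1. Clamp the input to (-1, 1); out-of-range values saturate at max_val.
    clamped = max(-max_val, min(max_val, value))

    # 2. Extract the sign bit and continue with the absolute value.
    sign = 1 if clamped < 0 else 0
    magnitude = abs(clamped)

    # 3. Scale the fraction by 2^mantissa_bits to get an integer mantissa.
    mantissa = int(magnitude * (1 << mantissa_bits))

    # 4. Pack sign and mantissa into a 32-bit word (left-aligned here).
    bits = (sign << 31) | (mantissa << (31 - mantissa_bits))

    # 5. Reinterpret the bit pattern as a float and return it.
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```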
WASQ stands for Weight and Activation Superfloat Quantization. It is a hybrid quantization framework that leverages Superfloat precision to optimize both model weights and activations.
- Weight Quantization:
  - Model weights are converted to Superfloat precision (SFx) without requiring complex computations such as mantissa adjustments (see the sketch after this list).
- Activation Quantization:
  - Activations are clamped and quantized within a stable range to prevent issues such as exploding activations.
- Optimization Algorithms:
  - WASQ includes customized algorithms such as WASQ OPT and the Full Parameter Method (FPM) to balance accuracy and convergence speed.
  - New: the Simulated Annealing Multi-Prize Lottery Ticket Hypothesis (SA-MPLTH) algorithm for healing quantized models.
- Scalability:
  - WASQ supports multi-bit quantization (from 4-bit to 16-bit), making it adaptable to different deployment environments, such as:
    - Edge devices → lower precision for speed and memory savings.
    - Servers → higher precision for accuracy-sensitive tasks.
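A hedged PyTorch sketch of SFx weight quantization as fake-quantization onto a uniform grid in [-1, 1]; the mantissa width and the in-place rounding are assumptions, not the exact WASQ implementation:

```python
import torch

def quantize_weights_sf(model: torch.nn.Module, mantissa_bits: int = 7) -> None:
    """Clamp each weight tensor to [-1, 1] and snap it to the nearest SF(x) grid point."""
    scale = float(1 << mantissa_bits)
    with torch.no_grad():
        for param in model.parameters():
            param.clamp_(-1.0, 1.0)
            param.copy_(torch.round(param * scale) / scale)
```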
WASQ integrates LTH to identify specific weights that are critical for maintaining model performance after quantization. By fine-tuning only the essential weights, WASQ reduces computational overhead while achieving high accuracy.
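For example, restricting updates to the essential ("winning ticket") weights can be approximated by masking gradients; the mask dictionary and hook-based approach below are illustrative assumptions, not the code in lth_trainer.py:

```python
import torch

def restrict_updates_to_ticket(model: torch.nn.Module, masks: dict) -> None:
    """Zero the gradients of weights outside the ticket mask so that only
    the critical weights change during fine-tuning."""
    for name, param in model.named_parameters():
        if name in masks:
            mask = masks[name].to(param.device)
            param.register_hook(lambda grad, m=mask: grad * m)
```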
The repository contains the following files:

- `Quant_Dequant.ipynb` – Contains the implementation of the basic Superfloat quantization and dequantization functions.
- `sf16quant.ipynb` – Builds on the Superfloat quantization functions, specifically for SF16 precision.
- `lth_analysis.py` – Analyzes the activation magnitude distribution for LTH; compares the activation patterns of the original and quantized models.
- `lth_trainer.py` – The LTH trainer script for fine-tuning models with the Lottery Ticket Hypothesis technique.
- `wasq_eval.py` – Calculates perplexity for a series of models, grouped by context length, epochs, or model species.
- `wasq_inference.py` – Provides inference capabilities for individual or multiple WASQ-quantized models.
- `wasq_fasteropt.py` – An optimized version of the OPT algorithm implemented in `wasq_opt.py`.
- `wasq_opt.py` – Core implementation of the WASQ OPT algorithm.
- `wasq_fpm.py` – Implements the Full Parameter Method (FPM) for WASQ quantization.
- `wasq_vanilla.py` – Baseline implementation of the Vanilla algorithm for WASQ.
- `sa_mplth.py` – New: implements the Simulated Annealing Multi-Prize Lottery Ticket Hypothesis for healing quantized models.
- `assets/results` – Contains outputs of model tests, perplexity scores, and supplementary studies.
For a model with n parameters, a calibration dataset of maximum input length c, three-shot quantization fine-tuning, and Superfloat precision of x bits (where 4 ≤ x ≤ 16), the perplexity P is modeled as:

P = f(n, c, 3, x)
- Lower P indicates better model understanding and calibration performance.
This scaling law uses the Lottery Ticket Hypothesis for WASQ quantization to stabilize activations (a code sketch follows these steps):
- Perform a forward pass using the original model and record the average magnitudes of activations across all layers.
- Perform the same for the vanilla quantized model to observe how quantization impacts activation magnitudes.
- Rank layers based on the difference in activation magnitudes between the original and quantized models.
- Identify and cluster layers with significant deviations to address issues like exploding/vanishing activations.
- Fine-tune or analyze these clusters to ensure stable activations and minimal performance degradation.
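A simplified sketch of steps 1–3 using forward hooks. The module-level granularity and the mean-absolute-activation statistic are assumptions; `lth_analysis.py` may differ:

```python
import torch

def activation_magnitudes(model: torch.nn.Module, batch) -> dict:
    """Run one forward pass and record each module's mean absolute activation."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                stats[name] = output.detach().abs().mean().item()
        return hook

    for name, module in model.named_modules():
        if name:                                    # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(batch)
    for handle in handles:
        handle.remove()
    return stats

def rank_by_deviation(original: dict, quantized: dict) -> list:
    """Rank layers by how much quantization shifted their activation magnitude."""
    deltas = {name: abs(value - quantized.get(name, 0.0))
              for name, value in original.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```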
The law establishes that the maximum neuron spread (region targeted for fine-tuning/updating) is a function of:
- Activation magnitude
- Activation fracture (how widely a weight's influence spreads to neighboring weights during backpropagation)
The repository explores four quantization approaches:
- Superfloat Precision: Custom precision without a mantissa, clamped within [-1, 1] for stability.
- WASQ OPT: Optimized quantization with faster convergence.
- Full Parameter Method (FPM): Retrains all parameters for higher accuracy.
- SA-MPLTH: New simulated annealing approach for healing quantized models (sketched below).
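A generic simulated-annealing loop in the spirit of SA-MPLTH. This is a hedged sketch, not the algorithm in `sa_mplth.py`; `loss_fn` is assumed to return a Python float (e.g., loss on a calibration batch):

```python
import math
import random
import torch

def sa_heal(model, loss_fn, steps: int = 200, t0: float = 1.0,
            cooling: float = 0.95, step_size: float = 1e-3) -> float:
    """Perturb quantized weights, keep improvements, and occasionally accept
    worse moves with a probability that decays as the temperature cools."""
    params = dict(model.named_parameters())
    current_loss = loss_fn(model)
    temperature = t0
    for _ in range(steps):
        name = random.choice(list(params))
        param = params[name]
        backup = param.detach().clone()
        with torch.no_grad():
            # Propose a small perturbation, staying inside the SF range [-1, 1].
            param.add_(torch.randn_like(param) * step_size).clamp_(-1.0, 1.0)
        new_loss = loss_fn(model)
        accept = (new_loss < current_loss or
                  random.random() < math.exp((current_loss - new_loss) / temperature))
        if accept:
            current_loss = new_loss
        else:
            with torch.no_grad():
                param.copy_(backup)                 # reject: roll back the proposal
        temperature *= cooling
    return current_loss
```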
Clone the repository and install dependencies:
```bash
git clone https://github.com/aloshdenny/superfloat
cd superfloat
pip install -r requirements.txt
```
- Train with LTH: `python lth_trainer.py`
- Evaluate Perplexity: `python wasq_eval.py`
- Perform Inference: `python wasq_inference.py`
- Run SA-MPLTH: `python sa_mplth.py`
The assets/results folder contains:
- Perplexity scores for different model configurations.
- Activation magnitude comparisons before and after quantization.
- Supplementary studies showcasing model performance.
Atreides is an ASIC accelerator designed specifically for Superfloat-based inference. We redesigned the systolic array to support SFx operations, adopting a modded RV32 ISA and faster Fused-Multiply-Adder (FMA) units. The end goal is not convention: it is breaking the rules of computing and physics to achieve faster inference, lower memory consumption, and the same accuracy.
Below is an image showing the FMA in Atreides:
An expanded view of Chip-1's architecture includes non-unified memory blocks (subject to unification), cache, control store (modded RV32 ISA), and an array of FMAs:
The current instruction set for the FPGA architecture is shown below:
| Instruction | Opcode (4 bits) | Op 1 (4 bits) | Op 2 (4 bits) | Op 3 (4 bits) | Description |
|---|---|---|---|---|---|
| STR | 0001 | addr | row | col | Stores the matrix data from activation unit buffer into specified address in memory |
| LDR | 0010 | addr | row | col | Loads the matrix at addr into the Row Shift Buffer |
| LDC | 0011 | addr | row | col | Loads the matrix at addr into the Column Shift Buffer |
| MATMUL | 0100 | - | - | - | Performs matrix multiplication using data in Row Shift Buffer and Column Shift Buffer |
| RELU | 0101 | - | - | - | Performs ReLU activation function on Systolic Array output |
| LIN | 0110 | - | - | - | Performs Linear activation function on Systolic Array output |
| NOP | 0000 | - | - | - | No Operation |
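To illustrate how these instructions compose, here is a hypothetical sequence for a single matrix multiply followed by ReLU. The addresses, matrix dimensions, and operand ordering are assumptions inferred from the table, not a verified program for the FPGA:

```
LDR    0x1  4  4   # load the 4x4 matrix at address 0x1 into the Row Shift Buffer
LDC    0x2  4  4   # load the 4x4 matrix at address 0x2 into the Column Shift Buffer
MATMUL             # multiply the buffered matrices on the systolic array
RELU               # apply ReLU to the systolic-array output
STR    0x3  4  4   # store the activation unit buffer to memory at address 0x3
```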
The FPGA floorplan integrated with instruction set is shown below:
Contributions are welcome! Feel free to open issues or submit pull requests.
We would like to thank our sponsors for their support.
This project is licensed under the MIT License.