This repository accompanies the AIED 2023 workshop paper “Does ChatGPT Comprehend Place Value in Numbers When Solving Math Word Problems?”. It contains all code, data, and cached results needed to reproduce the experiments exploring how GPT‑class models handle place‑value information in numerals.
| Aspect | Summary |
|---|---|
| Problem | Even strong language models often stumble on arithmetic with large numbers. We investigate whether this is due to incomplete understanding of place value – the idea that the digit “5” means different amounts in 5, 50, 500… |
| Approach | Two controlled experiments using gpt‑3.5‑turbo: (1) Multiplet Ordering (Experiment 1) – order sets of digits and their permutations to test explicit vs. implicit place‑value reasoning (a toy sketch appears below the table); (2) English ↔ Numeric Expressions (Experiment 2) – swap numerals for English words inside Chain‑of‑Thought (CoT) and Program‑of‑Thought (PoT) prompts when solving SVAMP math word problems. |
| Key Findings | • Ordering accuracy drops sharply from 96.8 % on single‑token numbers to 60.5 % when place value must be inferred across multiple tokens. • Re‑expressing numbers in English improves math problem solving (CoT +3.1 pp, PoT +2 pp). |
| Implication | Current tokenisation fragments place‑value structure. Better prompts or fine‑tuning that restore this information could boost LLM numeracy. |
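To make the multiplet‑ordering idea concrete, here is a toy sketch of how such a prompt can be assembled. It is illustrative only: the function name is hypothetical, the exact task format in the paper may differ, and the actual prompt templates live in `core.py`.

```python
# Toy illustration of an Experiment 1-style ordering prompt; not the paper's exact format.
import itertools
import random

def multiplet_prompt(digits="358", seed=0):
    """Form every number that permutes `digits`, shuffle them, and ask for a sort."""
    numbers = [int("".join(p)) for p in itertools.permutations(digits)]
    random.Random(seed).shuffle(numbers)
    return "Sort the following numbers in ascending order: " + ", ".join(map(str, numbers))

print(multiplet_prompt())
# e.g. "Sort the following numbers in ascending order: 538, 853, 385, ..."
```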
```
├── data/                 # generated test sets & SVAMP subset
│   ├── SVAMP.json        # base dataset
│   └── SVAMP_scaled_*    # scaled numbers for stress tests
├── datasets/             # dataset loader module
├── results/              # cached model outputs and metrics
├── core.py               # prompt templates and evaluation logic
├── main.py               # example script to reproduce results
├── util.py               # helper functions
└── test_util.py          # unit tests for util
```
Each entry in `data/SVAMP.json` is a small math word problem with fields like:

```json
{
  "ID": "chal-1",
  "Body": "Each pack of dvds costs 76 dollars. If there is a discount of 25 dollars on each pack",
  "Question": "How much do you have to pay to buy each pack?",
  "Equation": "( 76.0 - 25.0 )",
  "Answer": 51.0,
  "Type": "Subtraction"
}
```
Scaled versions (`SVAMP_scaled_*.json`) contain the same questions with larger numbers to stress‑test the models.
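As a quick sanity check, the entries can be loaded and their gold equations evaluated directly. This is only a minimal sketch: it assumes nothing beyond the field names shown above, the `data/SVAMP.json` path, and `sympy` (already a dependency).

```python
# Minimal sketch: load the SVAMP subset and verify each gold equation
# evaluates to the stored answer.
import json
from sympy import sympify

with open("data/SVAMP.json") as f:
    problems = json.load(f)

for p in problems[:3]:
    predicted = float(sympify(p["Equation"]))  # e.g. "( 76.0 - 25.0 )" -> 51.0
    assert abs(predicted - p["Answer"]) < 1e-6
    print(p["ID"], p["Question"], "->", predicted)
```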
- Install the dependencies (Python ≥ 3.9):

  ```bash
  pip install -r requirements.txt
  ```

- Export your OpenAI API key:

  ```bash
  export OPENAI_API_KEY=<your-key>
  ```

- Run the example script to evaluate a model:

  ```bash
  python main.py
  ```

  Results appear in the `results/` directory.
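For reference, the kind of request `main.py` sends looks roughly like the sketch below. It is illustrative only: it assumes the `openai>=1.0` Python client, whereas the version pinned in `requirements.txt` may use the older `openai.ChatCompletion` interface, and the real prompt templates live in `core.py`.

```python
# Rough sketch of a single chat-completion call; not the repository's exact code.
# Assumes the openai>=1.0 client and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{
        "role": "user",
        "content": "Each pack of dvds costs 76 dollars. If there is a discount of "
                   "25 dollars on each pack, how much do you have to pay to buy "
                   "each pack? Let's think step by step.",
    }],
)
print(response.choices[0].message.content)
```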
Experiment 2 (English ↔ numeric expressions) on SVAMP with gpt‑3.5‑turbo:

| Setting | Accuracy | Accuracy (Self‑Consistency) |
|---|---|---|
| CoT – Numeric | 76.8 % | 81.2 % |
| CoT – English | 79.9 % | 82.5 % |
| PoT – Numeric | 80.0 % | 82.6 % |
| PoT – English | 82.0 % | 83.7 % |
Multiplet Ordering (Exp 1): accuracy falls from 96.8 % (source) → 60.5 % (permutation) when place value must be inferred.
- Python ≥ 3.9
- `openai`, `tiktoken`, `sympy`, `pandas`, `numpy`, `matplotlib`, `tqdm`

See `requirements.txt` for exact versions. No GPU is required because all model calls go to the OpenAI API.
- Try a new model – set `--model` in `main.py` to any endpoint such as `gpt-4o`.
- Stress‑test bigger numbers – use one of the `SVAMP_scaled_*` files or create your own via `scale_up_dataset.py` (a toy sketch of the scaling idea follows this list).
- Add your own dataset – drop a JSON file with the same format into `data/` and modify `datasets/dataset.py` accordingly.
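The snippet below is only an illustrative sketch of what "scaling up" an entry could mean – multiply every number in the text and the equation, then recompute the gold answer – and is not taken from `scale_up_dataset.py`, whose actual logic may differ.

```python
# Toy sketch of scaling a SVAMP entry; scale_up_dataset.py may do this differently.
import re
from sympy import sympify

NUM = re.compile(r"\d+(?:\.\d+)?")

def scale_entry(entry, factor=100):
    """Multiply every number in an entry by `factor` and recompute the gold answer."""
    def bump(match):
        value = float(match.group()) * factor
        return str(int(value)) if value.is_integer() else str(value)

    scaled = dict(entry)
    for field in ("Body", "Question", "Equation"):
        scaled[field] = NUM.sub(bump, entry[field])
    scaled["Answer"] = float(sympify(scaled["Equation"]))  # keep the entry self-consistent
    return scaled
```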
Pull requests are welcome!
This project is released under the MIT License. See LICENSE for details.