KIT-2920 is a reading-first transliteration infrastructure for the Chinese Buddhist Canon.
It converts large Hanzi corpora (CBETA / SAT / Taishō / other editions) into:
- Clean plain text
- A fully Latin-readable Pinyin layer
- A machine-friendly ZIP corpus
The output has no XML nesting and requires no TEI parsing.
This project does NOT compete with CBETA or SAT.
It builds a machine-friendly reading layer on top of them.
CBETA XML-P5 is excellent for scholarship.
But it comes with:
- Deep TEI nesting
- Editorial markup
- Complex structural tags
For most developers and general readers, parsing becomes the barrier.
KIT-2920 removes that barrier.
The canon does not need a UI. It needs a clear passage.
The SAT Daizōkyō Text Database (SAT 大正新脩大藏經テキストデータベース)
is one of the most important digital editions of the Taishō Tripiṭaka.
SAT provides:
- High-quality digital transcription of Taishō volumes
- Careful editorial structure
- Stable academic references
- Page, column, and line alignment with the printed Taishō edition
- Long-term institutional maintenance (University of Tokyo)
SAT is widely used in:
- Japanese Buddhist studies
- East Asian textual research
- Citation-standard academic publications
- Digital humanities projects
KIT-2920 does NOT modify SAT content.
It:
- Extracts readable Hanzi text
- Preserves editorial separation
- Does not merge or overwrite SAT decisions
- Handles Volume 85 separately
Where CBETA and SAT differ, KIT-2920 keeps them separate.
No forced harmonization.
Taishō volumes 56–84 rely heavily on SAT digital sources.
Without SAT:
- Many later Taishō texts would not be easily accessible
- Cross-reference alignment would be difficult
- Machine-scale processing would be unstable
SAT provides structural integrity.
KIT-2920 provides reading accessibility.
They serve different but complementary purposes.
This project acknowledges the immense scholarly labor behind:
- The SAT Daizōkyō Database
- CBETA editorial teams
- The original Taishō compilers
KIT-2920 stands on their foundation.
It only builds a bridge for broader access.
1-cbeta-sat-taisho-dzk-85vol-text-Hanzi.zip
2-cbeta-sat-taisho-dzk-85vol-text-Hanzi.zip
3-cbeta-sat-taisho-dzk-85vol-text-Hanzi.zip
4-cbeta-sat-taisho-dzk-2920n-text-Hanzi.zip
5-cbeta-sat-taisho-dzk-2920n-text-Hanzi.zip
6-cbeta-sat-taisho-dzk-2920n-text-Hanzi.zip
Characteristics:
- Flat UTF-8 text
- No forced unification of CBETA/SAT differences
- Volume 85 separated honestly
- Grep/search friendly
- Archive-ready
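Because the archives hold flat UTF-8 text, they can be searched without unpacking them first. The following is a minimal sketch, not part of the project's code: `grep_zip` is a hypothetical helper, and it assumes every archive member is a UTF-8 text file.

```python
import io
import zipfile
from typing import Iterator, Tuple


def grep_zip(zip_source, needle: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (member name, line number, line) for every line containing needle."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as raw:
                # Decode lazily so large members are never fully loaded.
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                for lineno, line in enumerate(text, start=1):
                    if needle in line:
                        yield name, lineno, line.rstrip("\n")
```

`zip_source` can be a path to one of the ZIP files listed above or any file-like object, for example `for hit in grep_zip("1-cbeta-sat-taisho-dzk-85vol-text-Hanzi.zip", "阿彌陀"): print(hit)`.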
Currently stable:
Hanzi → Pinyin (Tone)
Output example:
佛說阿彌陀經
fó shuō ā mí tuó jīng
Properties:
- Space-separated syllables
- Tone marks preserved
- No blank output
- No raw Hanzi leakage
- Dictionary override supported
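The properties above can be checked mechanically on any converted line. This is a rough sketch rather than the project's own test code: `check_pinyin_line` is a hypothetical name, and the Hanzi range it scans for is an assumption that covers the common CJK blocks only.

```python
import re

# CJK Unified Ideographs, Extension A, and the compatibility block.
# Assumption: rarer extension blocks outside the BMP are not covered here.
HANZI = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def check_pinyin_line(source: str, output: str) -> list:
    """Return a list of problems found in one converted line (empty if clean)."""
    problems = []
    if source.strip() and not output.strip():
        problems.append("blank output")
    if HANZI.search(output):
        problems.append("raw Hanzi leakage")
    if "  " in output.strip():
        problems.append("syllables not single-space separated")
    return problems
```

For the example above, `check_pinyin_line("佛說阿彌陀經", "fó shuō ā mí tuó jīng")` returns an empty list.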
Processing flow:
ZIP corpus
→ extract
→ phrase dictionary override
→ word dictionary override
→ pypinyin fallback
→ clean UTF-8 text
→ repackage ZIP
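The extract, convert, and repackage steps of this flow could look roughly like the sketch below. It is not the contents of `convert.py`: `convert_zip` and its `convert_line` parameter are hypothetical names, and the dictionary override and pypinyin fallback are assumed to live inside whatever function is passed as `convert_line`.

```python
import io
import zipfile
from typing import Callable


def convert_zip(src, dst, convert_line: Callable[[str], str]) -> int:
    """Read every text member of src, convert it line by line, write to dst.

    Returns the number of files converted.
    """
    count = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.is_dir():
                continue
            text = zin.read(info).decode("utf-8")
            lines = [convert_line(line) for line in text.splitlines()]
            # Keep the member name so the output mirrors the input layout.
            zout.writestr(info.filename, "\n".join(lines) + "\n")
            count += 1
    return count
```

With pypinyin installed, a fallback-only converter could be passed as `lambda line: " ".join(lazy_pinyin(line, style=Style.TONE))`, using `lazy_pinyin` and `Style` from the `pypinyin` package.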
Dictionary priority (highest first):
- dict-viet-phrase.json
- dict-viet.json
- dict-budda.json
- pypinyin fallback
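One way to read this priority list is as a layered lookup: try the longest possible match at each position, consult the dictionaries in order, and hand anything unknown to the fallback. The sketch below illustrates that idea only. `transliterate` is a hypothetical function, the dictionaries are assumed to be flat Hanzi-to-reading mappings, and preferring the longest match over layer order is a design assumption, not documented behavior.

```python
from typing import Callable, Dict, List


def transliterate(text: str,
                  layers: List[Dict[str, str]],
                  fallback: Callable[[str], str]) -> str:
    """Convert text using the first dictionary layer that matches.

    At each position the longest chunk is tried first; for a given chunk the
    layers are consulted in priority order. Unknown characters go to fallback.
    """
    max_len = max((len(k) for layer in layers for k in layer), default=1)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            hit = next((layer[chunk] for layer in layers if chunk in layer), None)
            if hit is not None:
                out.append(hit)
                i += size
                break
        else:
            # No dictionary entry. Assumption: if the fallback also returns
            # nothing, keep the raw character so the output is never empty.
            out.append(fallback(text[i]) or text[i])
            i += 1
    return " ".join(out)
```

In practice `layers` would be loaded with `json.load` from the files listed above, in that order, and `fallback` would wrap pypinyin.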
Example (Taishō full corpus):
Files: 2457
Size: 82 MB → 249 MB
Time: ~33 seconds
Example (Qianlong):
Files: 1778
Time: ~19 seconds
Heavy mode supported.
github/
│
├── convert.py
├── requirements.txt
├── dict/
│ ├── dict-budda.json
│ ├── dict-fgs.json
│ ├── dict-viet.json
│ └── dict-viet-phrase.json
│
├── output/
│
└── corpus *.zip
The corpus must be readable immediately.
Flat text > XML nesting.
Volume 85 is not forcibly merged.
If a dictionary lookup fails, fall back to pinyin. Never output empty text. Never block reading.
KIT-2920:
- Does NOT replace CBETA
- Does NOT edit canonical text
- Does NOT claim textual authority
It is a script bridge and reading infrastructure layer.
Near-term:
- Multi-script expansion (CJKV)
- Vietnamese smoothing layer
- Archive publishing (IA / Zenodo)
- SHA256 checksum release
Mid-term:
- Multi-layer output (Hanzi + Pinyin)
- Cross-edition alignment mode
- Phrase refinement system
Long-term:
- Full script-bridge platform
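The SHA256 checksum release listed under near-term could be produced with the standard library alone. This is a sketch under assumptions: `sha256_of`, `write_manifest`, and the `SHA256SUMS` file name are hypothetical, and the manifest uses the two-space format that the `sha256sum` tool reads.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder: Path, manifest_name: str = "SHA256SUMS") -> Path:
    """Write one '<hex digest>  <file name>' line per ZIP in folder."""
    lines = [f"{sha256_of(p)}  {p.name}" for p in sorted(folder.glob("*.zip"))]
    manifest = folder / manifest_name
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Calling `write_manifest(Path("."))` in the folder that holds the corpus ZIP files would write the manifest next to them.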
If CBETA XML reaches scholars, KIT-2920 aims to reach:
- Developers
- General readers
- Overseas Chinese communities
- Latin-script users
- AI systems
- Archive platforms
Removing the XML parsing step lowers the barrier to entry for all of these audiences.
Source corpus:
- CBETA
- SAT Daizōkyō
- Taishō Shinshū Daizōkyō
This project redistributes processed text. It does not modify canonical content.
Respect original editorial sources.
The canon does not need protection from readers.
It needs protection from barriers.
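The SAT Daizōkyō Text Database (SAT 大正新脩大藏經テキストデータベース)
is one of the most important digital editions of the Taishō canon available today.
SAT provides:
- High-quality digital text of the Taishō canon
- A rigorous scholarly editorial structure
- A stable baseline for academic citation
- Page, column, and line numbers that match the printed Taishō edition
- Long-term maintenance by institutions including the University of Tokyo
SAT is widely used in:
- Japanese Buddhist studies
- East Asian Buddhist textual research
- Academic citation and standard publications
- Digital humanities projects
KIT-2920 does not modify SAT's original content.
This project:
- Extracts only the readable Hanzi text
- Preserves the editorial differences between editions
- Does not force a merge of CBETA and SAT
- Handles Volume 85 independently
Where CBETA and SAT differ,
KIT-2920 follows the principle of "coexistence without mixing".
No forced unification is performed.
Taishō volumes 56–84 rely mainly on SAT digital data.
Without SAT:
- Digital access to the later volumes would be harder
- Page alignment and academic citation would be unstable
- Large-scale machine processing would lack a structural basis
SAT provides the stability of scholarly structure.
KIT-2920 provides a reading layer and a machine-friendly layer.
Their roles differ, but they complement each other.
This project expresses its deep respect for the following scholarly teams:
- The SAT Daizōkyō database team
- The CBETA editorial team
- The original compilers of the Taishō canon
KIT-2920 is simply a reading and transliteration bridge
built on these foundations.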
Users should verify upstream licensing conditions before redistribution.
The entire Taishō Tripiṭaka has been processed and archived through the KIT-2920 infrastructure. The resulting reading layers are available in the following scripts and systems:
| Script / System | Link (Internet Archive) | Description |
|---|---|---|
| Hanzi | Access | Original Chinese Characters (Clean Text) |
| Pinyin | Access | Standard Latin with Tone Marks |
| IPA | Access | International Phonetic Alphabet (Linguistic) |
| Zhuyin | Access | Bopomofo Phonetic Symbols |

| Script / System | Link (Internet Archive) | Description |
|---|---|---|
| Brahmi | Access | The Ancestral Script of Ashoka |
| Siddham | Access | Sacred Script of Esoteric Buddhism |
| Devanagari | Access | Modern Standard Indic Script |
| Sinhala | Access | Sri Lankan Buddhist Tradition |
| Tibetan | Access | Himalayan Vajrayana Representation |
| Tamil | Access | Classical Southern Indic Script |

| Script / System | Link (Internet Archive) | Description |
|---|---|---|
| Thai | Access | Central Thai Orthography |
| Laos | Access | Laotian Script Reading Layer |
| Khmer | Access | Cambodian Canonical Script |
| Burmese | Access | Myanmar Script Tradition |

| Script / System | Link (Internet Archive) | Description |
|---|---|---|
| Hirakata | Access | Japanese Hiragana/Katakana Hybrid |
| Katakana | Access | Japanese Full Katakana Layer |
| Hangul | Access | Korean Phonetic Script |
| Quoc Ngu | Access | Vietnamese Latinized Script |
| Chu Nom | Access | Archaic Vietnamese Character System |
| Mongolian | Access | Central Asian Steppe Representation |

| Script / System | Link (Internet Archive) | Description |
|---|---|---|
| Braille | Access | Tactile Reading for Visually Impaired |
| Cyrillic | Access | Slavic/Central Asian Phonetic Layer |
| Greek | Access | Hellenic Script Transcription |
| Morse | Access | Universal Signal Transmission Layer |

Full Collection: Explore the Universal Tripitaka Project on Archive.org