HDC Markdown Encoder & Reconstructor

Transforms Markdown documents into hyperdimensional vectors and reconstructs them using dual HDC encoding and an optional LLM reconstruction step.

How It Works

Dual HDC Pipeline: Documents are split into semantic content and positional structure, each encoded as a 10,000-dimensional vector, then reconstructed via HDC unbinding.

Content Vector: Encodes which words exist
Position Vector: Encodes which positions are used
Pair Vectors: HDC binding creates word-position associations
Reconstruction: HDC unbinding, optionally followed by an LLM, recovers the original text

Key Innovation: A mathematical "handshake" (binding) between words and positions enables order recovery, which is exact when no word is repeated (see Limitations).
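The round trip described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the project's code: the five-word `vocab`, the helper names `random_hv` and `decode`, and the use of element-wise multiplication for binding are all hypothetical choices made for the example.

```python
import numpy as np

DIM = 10_000  # dimensionality quoted in the text
rng = np.random.default_rng(0)

def random_hv():
    """Random bipolar (+1/-1) hypervector stored as int8."""
    return rng.choice(np.array([-1, 1], dtype=np.int8), size=DIM)

# Hypothetical item memory: one random vector per word and per position.
vocab = ["hdc", "encodes", "markdown", "documents", "vectors"]
word_hv = {w: random_hv() for w in vocab}
tokens = ["hdc", "encodes", "markdown", "documents"]
pos_hv = [random_hv() for _ in tokens]

# Encode: bind each word to its position, then bundle the pairs by summing.
doc = np.zeros(DIM, dtype=np.int32)
for tok, p in zip(tokens, pos_hv):
    doc += word_hv[tok].astype(np.int32) * p

def decode(doc_vector, positions):
    """Unbind each position and pick the most similar dictionary word."""
    words = list(word_hv)
    matrix = np.stack([word_hv[w] for w in words]).astype(np.int32)
    out = []
    for p in positions:
        noisy = doc_vector * p       # unbinding: p * p == 1 element-wise
        scores = matrix @ noisy      # dot-product similarity to every word
        out.append(words[int(np.argmax(scores))])
    return out

recovered = decode(doc, pos_hv)
print(recovered)
```

Unbinding returns the target word plus noise from the other pairs, so the final dictionary lookup (often called a cleanup memory) is what turns the noisy vector back into a token.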

Algorithm

HDC Binding: pair_vector = word_vector * position_vector
HDC Unbinding: recovered_word = document_vector * position_vector
Vector Bundling: content_vector = sum(word_vectors)
Storage: int8 (±1) for maximum compression. More details in architecture_en.md.
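The properties these operations rely on can be checked directly. The sketch below assumes bipolar (±1) vectors, reads `*` as element-wise multiplication, and thresholds the bundled sum with `np.sign` to bring it back to ±1 for int8 storage. That thresholding step is a common HDC convention and an assumption here, as is the helper name `bipolar`.

```python
import numpy as np

DIM = 10_000
rng = np.random.default_rng(42)

def bipolar(n=DIM):
    """Random +1/-1 vector stored as int8 (hypothetical helper)."""
    return (2 * rng.integers(0, 2, size=n) - 1).astype(np.int8)

word, position, other = bipolar(), bipolar(), bipolar()

# Binding: element-wise product of two bipolar vectors is again bipolar.
pair = word * position

# Unbinding: multiplying by the same position vector undoes the binding,
# because every element of position * position equals 1.
unbound = pair * position
assert np.array_equal(unbound, word)

# Bundling: sum in a wider dtype, then take the sign to return to +1/-1.
# With an odd number of inputs the sum is never zero.
bundle_sum = word.astype(np.int32) + position + other
bundle = np.sign(bundle_sum).astype(np.int8)

# The bundle stays similar to each of its inputs (about 0.5 for three inputs).
similarity = float(bundle.astype(np.int32) @ word.astype(np.int32)) / DIM
print(bundle.nbytes, round(similarity, 2))
```

At one byte per dimension, a single 10,000-dimensional int8 vector occupies 10,000 bytes before any file header.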

Key Features

Universal Dictionary: 20,000-word shared vocabulary
Scalable: Linear O(n) performance

Installation

git clone https://github.com/Garletz/HDC-Markdown-Encoder-Reconstructor.git
cd HDC-Markdown-Encoder-Reconstructor

pip install -r requirements.txt && pip install -e .

CLI Usage

# Encode document
python cli.py --encode-dual "document.md" --config config.yaml

# Reconstruct from vectors
python cli.py --reconstruct-dual \
  --content-vector "encoded_vectors/encoded_X_content.npy" \
  --position-vector "encoded_vectors/encoded_X_position.npy" \
  -o "output.md"

Performance (documents without repeated words)

Tokens	Encoding time	Storage	Reconstruction time	Accuracy
8	0.8s	240KB	1.2s	100%
16	1.1s	480KB	1.8s	100%
50+	2.3s	1.5MB	3.1s	100%
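The storage column grows linearly with the token count, which a quick calculation makes visible. The snippet below only does arithmetic on the figures in the table, treating the "50+" row as exactly 50 tokens and 1.5MB as 1,500KB (both assumptions). How the project actually lays out its files is not stated here, so the per-token reading in the comments is an observation, not a description of the implementation.

```python
# (tokens, storage in KB) taken from the table above.
rows = [(8, 240), (16, 480), (50, 1500)]

per_token_kb = [kb / n for n, kb in rows]
print(per_token_kb)

# One 10,000-dimensional int8 vector is 10,000 bytes, i.e. about 10 KB.
DIM = 10_000
bytes_per_vector = DIM * 1  # int8 uses one byte per dimension

# Under the assumption that 1 KB = 1,000 bytes, the per-token figure
# corresponds to this many int8 vectors of that size.
vectors_per_token = per_token_kb[0] * 1000 / bytes_per_vector
print(vectors_per_token)
```

Every row works out to the same per-token cost, which is consistent with the O(n) scaling listed under Key Features.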

Output Files

encoded_vectors/
├── encoded_N_content.npy    # Semantic information
├── encoded_N_position.npy   # Structural information
└── encoded_N_pairs.npy      # Word-position bindings
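Since the outputs are ordinary `.npy` files, they can be inspected with NumPy alone. The helper below is a hypothetical convenience, not part of the project's CLI; it assumes only that the files were written in NumPy's standard format.

```python
from pathlib import Path

import numpy as np

def describe_vectors(directory):
    """Return {filename: (shape, dtype name, size in bytes)} for each .npy file."""
    summary = {}
    for path in sorted(Path(directory).glob("*.npy")):
        arr = np.load(path)
        summary[path.name] = (arr.shape, str(arr.dtype), arr.nbytes)
    return summary

# Example call, assuming the default output folder shown above exists:
# for name, info in describe_vectors("encoded_vectors").items():
#     print(name, *info)
```

Checking shape and dtype this way is a quick sanity test that a file really holds an int8 vector of the expected dimensionality before attempting reconstruction.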

Limitations

Repeated words may cause position confusion (5-10% of cases or more)
Out-of-vocabulary tokens are skipped
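Both limitations can be detected before encoding. The sketch below is a hypothetical pre-check, not something the project provides: the regex tokenizer, the function name `precheck`, and the six-word vocabulary are assumptions chosen for illustration.

```python
import re
from collections import Counter

def precheck(text, vocabulary):
    """Split text into kept tokens, skipped (out-of-vocabulary) tokens,
    and kept tokens that occur more than once."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [t for t in tokens if t in vocabulary]
    skipped = [t for t in tokens if t not in vocabulary]
    repeated = [t for t, count in Counter(kept).items() if count > 1]
    return kept, skipped, repeated

vocabulary = {"the", "encoder", "maps", "words", "to", "vectors"}
kept, skipped, repeated = precheck(
    "The encoder maps the words to hypervectors.", vocabulary
)
print(kept)      # tokens that would be encoded
print(skipped)   # tokens that would be dropped
print(repeated)  # tokens at risk of position confusion
```

A document with an empty `skipped` list and an empty `repeated` list falls into the case the performance table measures.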

Project Goal

Enable ultra-light document transfer via semantic vector compression.
If sender and receiver share the same item memory (dictionary), the original text can be reconstructed from compact .npy vectors, exactly so when no word is repeated and every token is in the vocabulary.
This approach aims to enable a wide range of future use cases...
Poneglyph ...

Currently experimental: the concept is still in development.
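One way for a sender and a receiver to hold the same item memory without exchanging it is to derive each vector deterministically from the token itself. The sketch below illustrates that idea only; it is an assumption and not how the project builds its dictionary, and `item_vector` and its `namespace` parameter are hypothetical names.

```python
import hashlib

import numpy as np

DIM = 10_000

def item_vector(token, namespace="word", dim=DIM):
    """Derive a bipolar int8 vector from a token via a hash-based seed."""
    digest = hashlib.sha256(f"{namespace}:{token}".encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = np.random.default_rng(seed)
    # Note: NumPy does not guarantee identical Generator streams across
    # versions, so both sides would need matching NumPy releases.
    return (2 * rng.integers(0, 2, size=dim) - 1).astype(np.int8)

sender = item_vector("markdown")
receiver = item_vector("markdown")
different = item_vector("vector")
print(np.array_equal(sender, receiver))
```

The alternative is simply to ship the dictionary vectors once and reuse them, which avoids any dependence on the random number generator.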

📞 Contact

Get in touch with the OpenDataHive team:
