A Go utility that crawls documentation websites and packages them into an offline bundle with full-text search. It is built on Go's HTTP client and HTML parsing, and features deterministic builds and content deduplication.
- 🕷️ Web Crawler: Recursively crawl documentation websites with configurable depth
- 🔍 Full-Text Search: Built-in search index for fast offline searching
- 🎯 Content Deduplication: SHA-256 hash-based deduplication to avoid storing duplicate content
- 🌐 Offline Web Server: Minimal web server for browsing bundled documentation offline
- 🔄 Deterministic Builds: Sorted output and consistent hash generation for reproducible builds
- 📦 URL Rewriting: Automatically rewrites URLs for offline use
- 🎨 Modern UI: Clean, responsive web interface for searching and browsing
git clone https://github.com/BaseMax/go-doc-bundler.git
cd go-doc-bundler
go build -o go-doc-bundler .
- Go 1.20 or higher
- Internet connection for crawling
Crawl a documentation website and create an offline bundle:
./go-doc-bundler crawl -url https://pkg.go.dev/fmt -output ./bundle -depth 2
- `-url`: Starting URL to crawl (required)
- `-output`: Output directory for bundled documentation (default: `./bundle`)
- `-depth`: Maximum crawl depth (default: `3`)
- `-domain`: Restrict crawling to this domain (optional, inferred from the URL if not specified)
# Crawl Go's fmt package documentation with depth 2
./go-doc-bundler crawl -url https://pkg.go.dev/fmt -depth 2
# Crawl with custom output directory
./go-doc-bundler crawl -url https://docs.example.com -output ./my-docs
# Restrict to specific domain
./go-doc-bundler crawl -url https://example.com/docs -domain example.com -depth 3
Start a local web server to browse the bundled documentation:
./go-doc-bundler serve -dir ./bundle -port 8080
Then open http://localhost:8080 in your browser.
- `-dir`: Directory containing bundled documentation (default: `./bundle`)
- `-port`: Port to serve on (default: `8080`)
After crawling, the bundle directory contains:
bundle/
├── content/              # Downloaded HTML pages organized by domain and path
│   └── pkg.go.dev/
│       ├── fmt.html
│       └── ...
├── manifest.json         # Metadata about the crawl (URLs, timestamps)
└── search-index.json     # Full-text search index
The manifest.json file contains:
{
  "version": "1.0",
  "domain": "pkg.go.dev",
  "crawled_at": "2025-12-19T20:19:32Z",
  "page_count": 163,
  "pages": ["url1", "url2", ...]
}
The search-index.json contains tokenized documents with an inverted index for fast searching.
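The manifest layout shown above can be read from Go with a small struct. This is a minimal sketch: the `Manifest` type and `parseManifest` helper are hypothetical names, not the tool's own API, and the check that `page_count` equals the number of listed pages is an assumption about how the two fields relate.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// Manifest mirrors the fields in the manifest.json example.
// Hypothetical type: the tool's real struct may differ.
type Manifest struct {
	Version   string    `json:"version"`
	Domain    string    `json:"domain"`
	CrawledAt time.Time `json:"crawled_at"` // RFC 3339 timestamp
	PageCount int       `json:"page_count"`
	Pages     []string  `json:"pages"`
}

// parseManifest decodes manifest JSON and verifies (as an assumption)
// that page_count matches the number of listed pages.
func parseManifest(data []byte) (Manifest, error) {
	var m Manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return Manifest{}, fmt.Errorf("decode manifest: %w", err)
	}
	if m.PageCount != len(m.Pages) {
		return Manifest{}, fmt.Errorf("page_count is %d but %d pages are listed",
			m.PageCount, len(m.Pages))
	}
	return m, nil
}

func main() {
	sample := []byte(`{
		"version": "1.0",
		"domain": "pkg.go.dev",
		"crawled_at": "2025-12-19T20:19:32Z",
		"page_count": 2,
		"pages": ["https://pkg.go.dev/fmt", "https://pkg.go.dev/io"]
	}`)
	m, err := parseManifest(sample)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: %d pages, crawled %s\n",
		m.Domain, m.PageCount, m.CrawledAt.Format(time.RFC3339))
}
```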
- Bundler (pkg/bundler): Core crawling and bundling logic
  - HTTP client with timeout configuration
  - HTML parsing and link extraction
  - URL normalization and rewriting
  - Content deduplication using SHA-256 hashing
  - Deterministic output with sorted URLs
- Search Index (pkg/index): Full-text search capabilities
  - Document tokenization
  - Inverted index generation
  - Search query processing
  - JSON serialization with sorted keys for deterministic output
- Server (pkg/server): Offline web server
  - Static file serving
  - Search API endpoint
  - Modern web UI with search interface
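To illustrate how a depth-limited, domain-restricted crawl can work, here is a minimal sketch of a breadth-first traversal. It is not the code in `pkg/bundler`: the `crawl` function and `linkFunc` type are hypothetical, link extraction is injected instead of being done over HTTP, and treating the start URL as depth 0 is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// linkFunc returns the outgoing links of a page. A real crawler would fetch
// the page and parse its HTML; it is injected here to isolate the traversal.
type linkFunc func(pageURL string) []string

// crawl visits pages breadth-first from start, following links at most
// maxDepth hops away and only on the given host. The result is sorted.
func crawl(start, host string, maxDepth int, links linkFunc) []string {
	type item struct {
		url   string
		depth int
	}
	visited := map[string]bool{start: true}
	queue := []item{{url: start, depth: 0}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur.depth == maxDepth {
			continue // do not follow links beyond the depth limit
		}
		for _, l := range links(cur.url) {
			u, err := url.Parse(l)
			if err != nil || u.Hostname() != host || visited[l] {
				continue // unparsable, off-domain, or already queued
			}
			visited[l] = true
			queue = append(queue, item{url: l, depth: cur.depth + 1})
		}
	}
	out := make([]string, 0, len(visited))
	for u := range visited {
		out = append(out, u)
	}
	sort.Strings(out)
	return out
}

func main() {
	// A tiny in-memory site standing in for real HTTP fetches.
	site := map[string][]string{
		"https://docs.example.com/":  {"https://docs.example.com/a", "https://other.example.org/"},
		"https://docs.example.com/a": {"https://docs.example.com/b"},
	}
	links := func(u string) []string { return site[u] }
	for _, u := range crawl("https://docs.example.com/", "docs.example.com", 1, links) {
		fmt.Println(u)
	}
}
```

With a depth of 1 the sketch visits the start page and `/a` but not `/b`, and it never queues the off-domain link.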
The tool ensures deterministic builds through:
- Sorted URL lists in manifest
- Sorted index keys in search index
- Consistent hash generation for content deduplication
- Reproducible output structure
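The sorting rules above can be sketched as follows. This is an illustration under assumptions, not the tool's serializer: `encodeIndex` is a hypothetical helper, and the index is modelled as a plain map from term to document IDs. It relies on the documented behavior of `encoding/json`, which writes map keys in sorted order.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"sort"
)

// encodeIndex serializes an inverted index (term -> document IDs) so that
// the same logical input always yields the same bytes.
func encodeIndex(index map[string][]string) ([]byte, error) {
	// Copy and sort each posting list so the caller's slices stay untouched.
	sorted := make(map[string][]string, len(index))
	for term, docs := range index {
		ids := append([]string(nil), docs...)
		sort.Strings(ids)
		sorted[term] = ids
	}
	// encoding/json emits map keys in sorted order, so the terms themselves
	// need no extra handling.
	return json.Marshal(sorted)
}

func main() {
	// Two logically equal indexes built in different orders.
	a := map[string][]string{"fmt": {"b.html", "a.html"}, "io": {"a.html"}}
	b := map[string][]string{"io": {"a.html"}, "fmt": {"a.html", "b.html"}}

	ja, err := encodeIndex(a)
	if err != nil {
		log.Fatal(err)
	}
	jb, err := encodeIndex(b)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(ja))          // {"fmt":["a.html","b.html"],"io":["a.html"]}
	fmt.Println(bytes.Equal(ja, jb)) // true
}
```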
- Crawl documentation:
  ./go-doc-bundler crawl -url https://pkg.go.dev/fmt -depth 1
- View the bundled content:
  ls -lh bundle/ # Shows: content/, manifest.json, search-index.json
- Start the local server:
  ./go-doc-bundler serve -dir ./bundle -port 8080
- Access the documentation:
  - Open http://localhost:8080 in your browser
  - Use the search box to find specific content
  - Browse the cached pages offline
The bundler uses SHA-256 hashing to detect and skip duplicate content:
- Each page's content is hashed before saving
- Duplicate hashes are detected and logged
- Only unique content is stored
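The steps above can be sketched with the standard `crypto/sha256` package. The `Deduper` type and its `Check` method are hypothetical names chosen for illustration; the bundler's actual implementation may be organized differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Deduper remembers the SHA-256 digest of every page body it has seen.
type Deduper struct {
	seen map[string]string // hex digest -> URL that first produced it
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]string)}
}

// Check hashes body and reports whether identical content was seen before.
// For a duplicate, firstURL is the URL that originally produced the content.
func (d *Deduper) Check(pageURL string, body []byte) (firstURL string, duplicate bool) {
	sum := sha256.Sum256(body)
	key := hex.EncodeToString(sum[:])
	if first, ok := d.seen[key]; ok {
		return first, true
	}
	d.seen[key] = pageURL
	return "", false
}

func main() {
	d := NewDeduper()
	pages := []struct{ url, body string }{
		{"https://docs.example.com/a", "<html>same</html>"},
		{"https://docs.example.com/a/", "<html>same</html>"},
		{"https://docs.example.com/b", "<html>different</html>"},
	}
	for _, p := range pages {
		if first, dup := d.Check(p.url, []byte(p.body)); dup {
			fmt.Printf("skip %s (duplicate of %s)\n", p.url, first)
			continue
		}
		fmt.Printf("save %s\n", p.url)
	}
}
```

Hashing the raw bytes means two pages count as duplicates only when they are byte-for-byte identical; pages that differ in whitespace or embedded timestamps are stored separately.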
URLs are automatically rewritten for offline use:
- Absolute URLs from the same domain are converted to relative paths
- Links point to locally stored content
- External links are preserved as-is
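One way to implement these rewriting rules is sketched below using `net/url`. The helper names (`localPath`, `rewriteLink`) and the exact mapping are assumptions: the sketch places each page under `<host>/<path>.html`, mirroring the bundle layout, emits root-relative links, and drops query strings. The tool's real scheme may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// localPath maps a page URL to a file path under the content directory,
// for example https://pkg.go.dev/fmt -> pkg.go.dev/fmt.html.
func localPath(u *url.URL) string {
	p := strings.Trim(u.Path, "/")
	if p == "" {
		p = "index" // site root
	}
	if !strings.HasSuffix(p, ".html") {
		p += ".html"
	}
	return path.Join(u.Hostname(), p)
}

// rewriteLink converts an absolute same-domain link to a local path and
// leaves every other link (external, relative, unparsable) unchanged.
func rewriteLink(href, domain string) string {
	u, err := url.Parse(href)
	if err != nil || !u.IsAbs() || u.Hostname() != domain {
		return href
	}
	local := "/" + localPath(u)
	if u.Fragment != "" {
		local += "#" + u.Fragment // keep in-page anchors working
	}
	return local
}

func main() {
	for _, href := range []string{
		"https://pkg.go.dev/fmt#Println",
		"https://pkg.go.dev/",
		"https://example.org/other",
		"../relative/link",
	} {
		fmt.Printf("%s -> %s\n", href, rewriteLink(href, "pkg.go.dev"))
	}
}
```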
The search index uses an inverted index structure:
- Text is tokenized into lowercase words
- Words shorter than 3 characters are filtered out
- Documents are ranked by how frequently the query words occur in them
- Results are sorted deterministically by document ID
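The following sketch shows one possible shape for such an index. It is not the code in `pkg/index`: the `Index` and `Result` types are hypothetical, the score is assumed to be the summed occurrence count of the query words, and document ID is assumed to act as the tie-breaker between equal scores.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases text, splits it on anything that is not a letter or
// digit, and drops words shorter than 3 characters.
func tokenize(text string) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := make([]string, 0, len(words))
	for _, w := range words {
		if len([]rune(w)) >= 3 {
			tokens = append(tokens, w)
		}
	}
	return tokens
}

// Index is an inverted index: term -> document ID -> occurrence count.
type Index map[string]map[string]int

// Add tokenizes text and records each token against docID.
func (idx Index) Add(docID, text string) {
	for _, tok := range tokenize(text) {
		if idx[tok] == nil {
			idx[tok] = make(map[string]int)
		}
		idx[tok][docID]++
	}
}

// Result is one search hit with its score.
type Result struct {
	DocID string
	Score int
}

// Search sums the occurrence counts of every query token per document and
// returns hits by descending score, with ties broken by document ID.
func (idx Index) Search(query string) []Result {
	scores := make(map[string]int)
	for _, tok := range tokenize(query) {
		for docID, n := range idx[tok] {
			scores[docID] += n
		}
	}
	results := make([]Result, 0, len(scores))
	for id, s := range scores {
		results = append(results, Result{DocID: id, Score: s})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].DocID < results[j].DocID
	})
	return results
}

func main() {
	idx := Index{}
	idx.Add("fmt.html", "Package fmt implements formatted I/O. Printf formats output.")
	idx.Add("io.html", "Package io provides basic I/O primitives.")
	for _, r := range idx.Search("package fmt") {
		fmt.Printf("%s (score %d)\n", r.DocID, r.Score)
	}
}
```

Because of the 3-character minimum, short names such as `io` never enter the index, so a query for `io` alone returns nothing in this sketch.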
MIT License - See LICENSE file for details
Contributions are welcome! Please feel free to submit a Pull Request.