go-doc-bundler

A Go utility that crawls documentation websites and packages them into an offline bundle with full-text search capabilities. Built with Go's HTTP client and HTML parsing, featuring deterministic builds and content deduplication.

Features

  • 🕷️ Web Crawler: Recursively crawl documentation websites with configurable depth
  • 🔍 Full-Text Search: Built-in search index for fast offline searching
  • 🎯 Content Deduplication: SHA-256 hash-based deduplication to avoid storing duplicate content
  • 🌐 Offline Web Server: Minimal web server for browsing bundled documentation offline
  • 🔄 Deterministic Builds: Sorted output and consistent hash generation for reproducible builds
  • 📦 URL Rewriting: Automatically rewrites URLs for offline use
  • 🎨 Modern UI: Clean, responsive web interface for searching and browsing

Installation

From Source

git clone https://github.com/BaseMax/go-doc-bundler.git
cd go-doc-bundler
go build -o go-doc-bundler .

Requirements

  • Go 1.20 or higher
  • Internet connection for crawling

Usage

Crawling Documentation

Crawl a documentation website and create an offline bundle:

./go-doc-bundler crawl -url https://pkg.go.dev/fmt -output ./bundle -depth 2

Crawl Options

  • -url: Starting URL to crawl (required)
  • -output: Output directory for bundled documentation (default: ./bundle)
  • -depth: Maximum crawl depth (default: 3)
  • -domain: Restrict crawling to this domain (optional, inferred from URL if not specified)

Examples

# Crawl Go's fmt package documentation with depth 2
./go-doc-bundler crawl -url https://pkg.go.dev/fmt -depth 2

# Crawl with custom output directory
./go-doc-bundler crawl -url https://docs.example.com -output ./my-docs

# Restrict to specific domain
./go-doc-bundler crawl -url https://example.com/docs -domain example.com -depth 3

Serving Documentation Offline

Start a local web server to browse the bundled documentation:

./go-doc-bundler serve -dir ./bundle -port 8080

Then open http://localhost:8080 in your browser.

Serve Options

  • -dir: Directory containing bundled documentation (default: ./bundle)
  • -port: Port to serve on (default: 8080)

Output Structure

After crawling, the bundle directory contains:

bundle/
├── content/              # Downloaded HTML pages organized by domain and path
│   └── pkg.go.dev/
│       ├── fmt.html
│       └── ...
├── manifest.json         # Metadata about the crawl (URLs, timestamps)
└── search-index.json     # Full-text search index

Manifest Format

The manifest.json file contains:

{
  "version": "1.0",
  "domain": "pkg.go.dev",
  "crawled_at": "2025-12-19T20:19:32Z",
  "page_count": 163,
  "pages": ["url1", "url2", ...]
}

Search Index Format

The search-index.json file stores tokenized documents together with an inverted index, enabling fast full-text search without a network connection.
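The core idea of an inverted index is a map from each token to the documents containing it. A minimal sketch under the rules described in this README (lowercased tokens, short tokens dropped, sorted output); the real `pkg/index` layout may differ:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildIndex maps each token to the sorted list of document IDs
// containing it, skipping tokens shorter than 3 characters.
func buildIndex(docs map[string]string) map[string][]string {
	idx := make(map[string]map[string]bool)
	for id, text := range docs {
		for _, tok := range strings.Fields(strings.ToLower(text)) {
			if len(tok) < 3 { // short tokens are filtered out
				continue
			}
			if idx[tok] == nil {
				idx[tok] = make(map[string]bool)
			}
			idx[tok][id] = true
		}
	}
	out := make(map[string][]string, len(idx))
	for tok, ids := range idx {
		for id := range ids {
			out[tok] = append(out[tok], id)
		}
		sort.Strings(out[tok]) // deterministic posting lists
	}
	return out
}

func main() {
	idx := buildIndex(map[string]string{
		"doc1": "Printf formats according to a format specifier",
		"doc2": "Sprintf returns the resulting string",
	})
	fmt.Println(idx["format"]) // [doc1]
}
```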

Architecture

Components

  1. Bundler (pkg/bundler): Core crawling and bundling logic

    • HTTP client with timeout configuration
    • HTML parsing and link extraction
    • URL normalization and rewriting
    • Content deduplication using SHA-256 hashing
    • Deterministic output with sorted URLs
  2. Search Index (pkg/index): Full-text search capabilities

    • Document tokenization
    • Inverted index generation
    • Search query processing
    • JSON serialization with sorted keys for deterministic output
  3. Server (pkg/server): Offline web server

    • Static file serving
    • Search API endpoint
    • Modern web UI with search interface

Deterministic Builds

The tool ensures deterministic builds through:

  • Sorted URL lists in manifest
  • Sorted index keys in search index
  • Consistent hash generation for content deduplication
  • Reproducible output structure
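Two standard-library behaviors make this cheap in Go: `encoding/json` already emits map keys in sorted order, and slices can be sorted before marshaling. A sketch (the helper name is invented for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// marshalManifestPages sorts the page list before marshaling,
// so the same set of URLs always produces the same bytes.
func marshalManifestPages(pages []string) ([]byte, error) {
	sorted := append([]string(nil), pages...)
	sort.Strings(sorted)
	return json.Marshal(map[string][]string{"pages": sorted})
}

func main() {
	a, _ := marshalManifestPages([]string{"b", "a"})
	b, _ := marshalManifestPages([]string{"a", "b"})
	fmt.Println(string(a) == string(b)) // true: order of discovery doesn't matter
}
```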

Example Workflow

  1. Crawl documentation:

    ./go-doc-bundler crawl -url https://pkg.go.dev/fmt -depth 1
  2. View the bundled content:

    ls -lh bundle/
    # Shows: content/, manifest.json, search-index.json
  3. Start the local server:

    ./go-doc-bundler serve -dir ./bundle -port 8080
  4. Access the documentation:

    • Open http://localhost:8080 in your browser
    • Use the search box to find specific content
    • Browse the cached pages offline

Technical Details

Content Deduplication

The bundler uses SHA-256 hashing to detect and skip duplicate content:

  • Each page's content is hashed before saving
  • Duplicate hashes are detected and logged
  • Only unique content is stored
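The dedup check reduces to hashing each body and consulting a set of hashes already seen. A sketch with `crypto/sha256` (type and method names are invented here; the real bundler also logs the duplicates):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type deduper struct{ seen map[string]bool }

func newDeduper() *deduper { return &deduper{seen: make(map[string]bool)} }

// isDuplicate reports whether body's SHA-256 was seen before,
// recording it as seen on first sight.
func (d *deduper) isDuplicate(body []byte) bool {
	h := sha256.Sum256(body)
	key := hex.EncodeToString(h[:])
	if d.seen[key] {
		return true
	}
	d.seen[key] = true
	return false
}

func main() {
	d := newDeduper()
	fmt.Println(d.isDuplicate([]byte("<html>a</html>"))) // false: first sight
	fmt.Println(d.isDuplicate([]byte("<html>a</html>"))) // true: same bytes
	fmt.Println(d.isDuplicate([]byte("<html>b</html>"))) // false: new content
}
```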

URL Rewriting

URLs are automatically rewritten for offline use:

  • Absolute URLs from the same domain are converted to relative paths
  • Links point to locally stored content
  • External links are preserved as-is
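The rewrite rule can be sketched with `net/url`. The `.html` path scheme below is an assumption for illustration; the bundler's actual on-disk naming may differ, and relative links (empty host) are left untouched here:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// rewriteForOffline converts same-domain absolute URLs into local
// paths under the bundle; external links pass through unchanged.
func rewriteForOffline(href, domain string) string {
	u, err := url.Parse(href)
	if err != nil || u.Host != domain {
		return href // external, relative, or unparseable: keep as-is
	}
	p := strings.Trim(u.Path, "/")
	if p == "" {
		p = "index"
	}
	return "/" + domain + "/" + p + ".html"
}

func main() {
	fmt.Println(rewriteForOffline("https://pkg.go.dev/fmt", "pkg.go.dev"))
	fmt.Println(rewriteForOffline("https://example.com/x", "pkg.go.dev"))
}
```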

Search Index

The search index uses an inverted index structure:

  • Text is tokenized into lowercase words
  • Words shorter than 3 characters are filtered out
  • Documents are ranked by word frequency match
  • Results are sorted deterministically by document ID
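The query path can be sketched as tokenization plus scoring. The scoring below counts matching query tokens and breaks ties by document ID; the real `pkg/index` may weight frequency differently:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// tokenize lowercases text and drops tokens shorter than
// 3 characters, matching the indexing rules above.
func tokenize(text string) []string {
	var out []string
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if len(tok) >= 3 {
			out = append(out, tok)
		}
	}
	return out
}

// search ranks documents by how many query tokens they contain,
// with document ID as a deterministic tiebreaker.
func search(docs map[string]string, query string) []string {
	scores := make(map[string]int)
	for id, text := range docs {
		have := make(map[string]bool)
		for _, t := range tokenize(text) {
			have[t] = true
		}
		for _, q := range tokenize(query) {
			if have[q] {
				scores[id]++
			}
		}
	}
	var ids []string
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	docs := map[string]string{
		"fmt":    "printf formats values using a format verb",
		"errors": "errors wraps and unwraps error values",
	}
	fmt.Println(search(docs, "format values")) // [fmt errors]
}
```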

License

MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
