A lightweight web crawler designed to run on low-memory devices (1 GB RAM) while respectfully searching the web for given keywords, honoring robots.txt along the way.
This Python implementation faces several memory-related challenges:
- Python's runtime overhead
- Tracking visited URLs in memory grows without bound as the crawl expands

To address these problems, I rewrote the spider in Go, so this is an "old" repository. It still works if you avoid backlink hells such as Wikipedia and the Internet Archive.
# Install dependencies
pip install -r requirements.txt
# Make Gollum executable
chmod +x Gollum.py

# Basic usage with a single keyword
python Gollum.py "Ring"

# Multiple keywords
python Gollum.py "Ring" "Precious" "Sméagol"

# Custom delay range (min, max seconds)
python Gollum.py "Ring" --delay 2 5
# Custom database file
python Gollum.py "Ring" --db my_precious.db
# Starting URL
python Gollum.py "Ring" --start-url "https://example.com"
# Multi-threading
python Gollum.py "Ring" --threads 2keyword: The keyword that was foundurl: Where it was founddate_found: When it was discovered
Each find is stored in the database with:
- keyword: The keyword that was found
- url: Where it was found
- date_found: When it was discovered

Crawl history is tracked with:
- url: URLs that have been crawled
- last_crawled: When they were last visited
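Given the .db file extension, the storage is presumably SQLite. A minimal sketch of what the schema might look like (table and column layout are assumptions drawn from the fields above, not Gollum's actual DDL):

```python
import sqlite3

conn = sqlite3.connect("my_precious.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS precious_finds (
    keyword    TEXT NOT NULL,  -- the keyword that was found
    url        TEXT NOT NULL,  -- where it was found
    date_found TEXT NOT NULL   -- when it was discovered
);
CREATE TABLE IF NOT EXISTS crawled_urls (
    url          TEXT PRIMARY KEY,  -- URL that has been crawled
    last_crawled TEXT NOT NULL      -- when it was last visited
);
""")
conn.commit()
```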
Gollum periodically dumps statistics to ElveLogs.json containing:
- Timestamp and runtime
- URLs visited count
- Precious finds count
- Crawling rate (URLs per hour)
- Current queue size
- Memory usage stats
- Gollum's current status
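A sketch of what one periodic dump might look like (the exact key names in ElveLogs.json are assumptions; only the fields listed above are confirmed):

```python
import json
import time

def dump_stats(stats, path="ElveLogs.json"):
    """Write a snapshot of crawler statistics to disk."""
    snapshot = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "runtime_seconds": stats["runtime"],
        "urls_visited": stats["visited"],
        "precious_finds": stats["finds"],
        "urls_per_hour": stats["visited"] / max(stats["runtime"] / 3600, 1e-9),
        "queue_size": stats["queue_size"],
        "memory_mb": stats["memory_mb"],
        "status": stats["status"],  # Gollum's current status
    }
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
```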
To stay within its memory budget, Gollum applies the following caps (one possible implementation is sketched after this list):
- Limits the in-memory URL queue to 1,000 items
- Caps the visited-URL set at 10,000 items
- Automatically cleans memory every 100 crawls
- Uses memory-efficient built-in data structures (sets, deques)
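One plausible way to implement those caps, assuming a deque-backed queue and a plain set for visited URLs (Gollum's actual bookkeeping may differ):

```python
from collections import deque

MAX_QUEUE = 1000      # in-memory URL queue cap
MAX_VISITED = 10_000  # visited-URL memory cap
CLEAN_EVERY = 100     # crawls between cleanup passes

url_queue = deque(maxlen=MAX_QUEUE)  # oldest URLs fall off when full
visited = set()
crawl_count = 0

def mark_visited(url):
    """Record a crawled URL, trimming the set once it exceeds the cap."""
    global crawl_count
    visited.add(url)
    crawl_count += 1
    if crawl_count % CLEAN_EVERY == 0 and len(visited) > MAX_VISITED:
        # Drop arbitrary old entries; re-crawling a few forgotten URLs
        # is cheaper than running out of RAM on a 1 GB device.
        for _ in range(len(visited) - MAX_VISITED):
            visited.pop()
```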
To crawl politely, Gollum:
- Respects robots.txt files
- Inserts random delays between requests
- Sends a custom, self-announcing User-Agent string
- Avoids re-crawling recently visited URLs
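The first three points combine roughly as follows (a sketch; the User-Agent string is a placeholder, and a real crawler would cache one robots.txt parser per host):

```python
import random
import time
from urllib import robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "GollumBot/1.0 (+https://example.com/gollum)"  # placeholder self-announcing UA

def polite_fetch(url, delay_range=(2, 5)):
    """Fetch a page only if robots.txt allows it, after a random pause."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # in practice, cache one parser per host
    if not rp.can_fetch(USER_AGENT, url):
        return None
    time.sleep(random.uniform(*delay_range))
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```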
When Gollum runs low on URLs, it automatically adds:
- Reddit front page
- Hacker News
- Stack Overflow
- GitHub trending
- Wikipedia random articles
- Medium topics
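The refill step itself is simple; a sketch with representative seed URLs and an assumed low-water mark (the exact threshold and URLs live in the source):

```python
FALLBACK_SEEDS = [
    "https://www.reddit.com/",                       # Reddit front page
    "https://news.ycombinator.com/",                 # Hacker News
    "https://stackoverflow.com/",                    # Stack Overflow
    "https://github.com/trending",                   # GitHub trending
    "https://en.wikipedia.org/wiki/Special:Random",  # Wikipedia random article
    "https://medium.com/topics",                     # Medium topics
]

LOW_WATER_MARK = 10  # assumed threshold; the real trigger level may differ

def refill_if_starving(url_queue):
    """Top the queue back up with seed pages when it runs low."""
    if len(url_queue) < LOW_WATER_MARK:
        url_queue.extend(FALLBACK_SEEDS)
```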