Gollum - The Web Spider

A lightweight web crawler designed to run on low-memory devices (1 GB RAM) while respectfully searching the web for given keywords, honoring robots.txt along the way.

Warning

The current Python implementation faces several memory-related challenges:

  • Python's runtime overhead
  • The in-memory set of visited URLs grows without bound as the crawl expands

To address these problems, I rewrote the spider in Go, so this is an "old" repository. It still works if you avoid backlink hells such as Wikipedia and the Internet Archive.

Installation

# Install dependencies
pip install -r requirements.txt

# Make Gollum executable
chmod +x Gollum.py

Usage

Basic Usage (Single Keyword)

python Gollum.py "Ring"

Multiple Keywords

python Gollum.py "Ring" "Precious" "Sméagol"

Advanced Options

# Custom delay range (min, max seconds)
python Gollum.py "Ring" --delay 2 5

# Custom database file
python Gollum.py "Ring" --db my_precious.db

# Starting URL
python Gollum.py "Ring" --start-url "https://example.com"

# Multi-threading
python Gollum.py "Ring" --threads 2

Database Structure

precious_finds (Findings)

  • keyword: The keyword that was found
  • url: Where it was found
  • date_found: When it was discovered

crawled_urls

  • url: URLs that have been crawled
  • last_crawled: When they were last visited
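Both tables live in an ordinary sqlite file, so the findings can be inspected with Python's built-in sqlite3 module. A minimal sketch, assuming the column names above and whatever filename you passed with --db:

import sqlite3

conn = sqlite3.connect("my_precious.db")  # use your --db filename here

# List every page on which a given keyword was found, newest first
rows = conn.execute(
    "SELECT keyword, url, date_found FROM precious_finds "
    "WHERE keyword = ? ORDER BY date_found DESC",
    ("Ring",),
)
for keyword, url, date_found in rows:
    print(f"{date_found}  {keyword}  {url}")

conn.close()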

ElveLogs Statistics

Gollum periodically dumps statistics to ElveLogs.json containing:

  • Timestamp and runtime
  • URLs visited count
  • Precious finds count
  • Crawling rate (URLs per hour)
  • Current queue size
  • Memory usage stats
  • Gollum's current status
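Since the dump is plain JSON, the statistics can be read back with a few lines of Python. This is a sketch only: it assumes ElveLogs.json holds either a single snapshot object or a list of snapshots, and it does not rely on specific key names.

import json

with open("ElveLogs.json") as f:
    data = json.load(f)

# Print the most recent snapshot, whatever fields Gollum wrote
latest = data[-1] if isinstance(data, list) else data
for key, value in latest.items():
    print(f"{key}: {value}")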

Memory Optimization

  • Limits in-memory URL queue to 1000 items
  • Limits visited URLs memory to 10,000 items
  • Automatically cleans memory every 100 crawls
  • Uses lightweight built-in containers (sets and deques); see the sketch below
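The two caps and the periodic cleanup (every 100 crawls) can be approximated with standard-library containers. The sketch below illustrates the approach, not the exact code in Gollum.py:

from collections import deque

URL_QUEUE_LIMIT = 1_000   # cap on URLs waiting to be crawled
VISITED_LIMIT = 10_000    # cap on visited URLs kept in memory

url_queue = deque(maxlen=URL_QUEUE_LIMIT)  # oldest entries fall off automatically
visited = set()

def mark_visited(url):
    visited.add(url)
    while len(visited) > VISITED_LIMIT:
        # Crude trim: drop arbitrary entries to get back under the cap;
        # the crawled_urls table in sqlite keeps the full history anyway.
        visited.pop()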

Ban Prevention

  • Respects robots.txt files
  • Random delays between requests
  • Custom, self-announcing User-Agent string
  • Avoids re-crawling recently visited URLs
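A minimal polite-fetch routine covering those points might look like the following. It is a sketch, assuming requests is among the installed dependencies; the actual User-Agent string and delay handling are whatever Gollum.py defines.

import random
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "GollumSpider (+https://github.com/benar-m/Keyword-Web-Crawler)"  # illustrative

def allowed_by_robots(url):
    # Consult the site's robots.txt before fetching
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_get(url, min_delay=2.0, max_delay=5.0):
    if not allowed_by_robots(url):
        return None
    time.sleep(random.uniform(min_delay, max_delay))  # random delay between requests
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)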

Emergency URL System

When Gollum runs low on URLs, it automatically adds:

  • Reddit front page
  • Hacker News
  • Stack Overflow
  • GitHub trending
  • Wikipedia random articles
  • Medium topics
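Conceptually this is just a fallback seed list that gets re-injected whenever the queue runs dry. A sketch (the threshold and exact URLs here are illustrative):

EMERGENCY_SEEDS = [
    "https://www.reddit.com/",
    "https://news.ycombinator.com/",
    "https://stackoverflow.com/",
    "https://github.com/trending",
    "https://en.wikipedia.org/wiki/Special:Random",
    "https://medium.com/",
]

def top_up(url_queue, minimum=10):
    # Refill the queue with known-good starting points when it runs low
    if len(url_queue) < minimum:
        url_queue.extend(EMERGENCY_SEEDS)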
