Gollum - The Web Spider

A lightweight web crawler designed to run on low-memory devices (1 GB RAM) while respectfully searching the web for given keywords, honoring robots.txt along the way.

Warning

The current Python implementation faces several memory-related challenges:

  • Python's runtime overhead
  • The in-memory set of visited URLs grows without bound as the crawl expands

To address these problems, I rewrote the spider in Go, so this is an "old" repository. It still works if you avoid backlink hells such as Wikipedia and the Internet Archive.

Installation

# Install dependencies
pip install -r requirements.txt

# Make Gollum executable
chmod +x Gollum.py

Usage

Basic Usage (Single Keyword)

python Gollum.py "Ring"

Multiple Keywords

python Gollum.py "Ring" "Precious" "Sméagol"

Advanced Options

# Custom delay range (min, max seconds)
python Gollum.py "Ring" --delay 2 5

# Custom database file
python Gollum.py "Ring" --db my_precious.db

# Starting URL
python Gollum.py "Ring" --start-url "https://example.com"

# Multi-threading
python Gollum.py "Ring" --threads 2

Database Structure

precious_finds (Findings)

  • keyword: The keyword that was found
  • url: Where it was found
  • date_found: When it was discovered

crawled_urls

  • url: URLs that have been crawled
  • last_crawled: When they were last visited
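Both tables live in an ordinary sqlite file, so the findings can be inspected with Python's built-in sqlite3 module. A minimal sketch, assuming the column names above and whatever filename you passed with --db:

import sqlite3

conn = sqlite3.connect("my_precious.db")  # use your --db filename here

# List every page on which a given keyword was found, newest first
rows = conn.execute(
    "SELECT keyword, url, date_found FROM precious_finds "
    "WHERE keyword = ? ORDER BY date_found DESC",
    ("Ring",),
)
for keyword, url, date_found in rows:
    print(f"{date_found}  {keyword}  {url}")

conn.close()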

ElveLogs Statistics

Gollum periodically dumps statistics to ElveLogs.json containing:

  • Timestamp and runtime
  • URLs visited count
  • Precious finds count
  • Crawling rate (URLs per hour)
  • Current queue size
  • Memory usage stats
  • Gollum's current status
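Since the dump is plain JSON, the statistics can be read back with a few lines of Python. This is a sketch only: it assumes ElveLogs.json holds either a single snapshot object or a list of snapshots, and it does not rely on specific key names.

import json

with open("ElveLogs.json") as f:
    data = json.load(f)

# Print the most recent snapshot, whatever fields Gollum wrote
latest = data[-1] if isinstance(data, list) else data
for key, value in latest.items():
    print(f"{key}: {value}")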

Memory Optimization

  • Limits in-memory URL queue to 1000 items
  • Limits visited URLs memory to 10,000 items
  • Automatically cleans memory every 100 crawls
  • Uses lightweight built-in containers (sets and deques); see the sketch below
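The two caps and the periodic cleanup (every 100 crawls) can be approximated with standard-library containers. The sketch below illustrates the approach, not the exact code in Gollum.py:

from collections import deque

URL_QUEUE_LIMIT = 1_000   # cap on URLs waiting to be crawled
VISITED_LIMIT = 10_000    # cap on visited URLs kept in memory

url_queue = deque(maxlen=URL_QUEUE_LIMIT)  # oldest entries fall off automatically
visited = set()

def mark_visited(url):
    visited.add(url)
    while len(visited) > VISITED_LIMIT:
        # Crude trim: drop arbitrary entries to get back under the cap;
        # the crawled_urls table in sqlite keeps the full history anyway.
        visited.pop()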

Ban Prevention

  • Respects robots.txt files
  • Random delays between requests
  • Custom, self-announcing User-Agent string
  • Avoids re-crawling recently visited URLs
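A minimal polite-fetch routine covering those points might look like the following. It is a sketch, assuming requests is among the installed dependencies; the actual User-Agent string and delay handling are whatever Gollum.py defines.

import random
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "GollumSpider (+https://github.com/benar-m/Keyword-Web-Crawler)"  # illustrative

def allowed_by_robots(url):
    # Consult the site's robots.txt before fetching
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_get(url, min_delay=2.0, max_delay=5.0):
    if not allowed_by_robots(url):
        return None
    time.sleep(random.uniform(min_delay, max_delay))  # random delay between requests
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)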

Emergency URL System

When Gollum runs low on URLs, it automatically adds:

  • Reddit front page
  • Hacker News
  • Stack Overflow
  • GitHub trending
  • Wikipedia random articles
  • Medium topics
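Conceptually this is just a fallback seed list that gets re-injected whenever the queue runs dry. A sketch (the threshold and exact URLs here are illustrative):

EMERGENCY_SEEDS = [
    "https://www.reddit.com/",
    "https://news.ycombinator.com/",
    "https://stackoverflow.com/",
    "https://github.com/trending",
    "https://en.wikipedia.org/wiki/Special:Random",
    "https://medium.com/",
]

def top_up(url_queue, minimum=10):
    # Refill the queue with known-good starting points when it runs low
    if len(url_queue) < minimum:
        url_queue.extend(EMERGENCY_SEEDS)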
