Skip to content

brightertiger/pygarble

Repository files navigation

pygarble

Detect gibberish, garbled text, and nonsense with high precision.

A zero-dependency Python library for identifying random character sequences, keyboard mashing, encoding errors, and other forms of text corruption. Uses statistical analysis, phonotactic rules, and pattern matching to distinguish meaningful text from gibberish.

Installation

pip install pygarble

Quick Start

from pygarble import GarbleDetector, EnsembleDetector, Strategy

# Recommended: Use the default ensemble (99.5% precision)
detector = EnsembleDetector()
detector.predict("Hello world")      # False - valid text
detector.predict("asdfghjkl")        # True - keyboard mashing
detector.predict("qxzjkwp")          # True - impossible letter combinations

# Get probability scores (0.0 = valid, 1.0 = gibberish)
detector.predict_proba("Hello world")  # ~0.1
detector.predict_proba("xkqzjwp")      # ~0.9

# Batch processing
texts = ["Hello world", "asdfghjkl", "Normal sentence here"]
results = detector.predict(texts)      # [False, True, False]

Performance

Tested on 1,644 samples (dictionary words, sentences, random strings, keyboard mashing):

Detector Precision Recall F1 Score
EnsembleDetector() 99.5% 78.5% 87.8%
MARKOV_CHAIN 98.8% 86.4% 92.2%
BIGRAM_PROBABILITY 100% 33.6% 50.3%

The default ensemble prioritizes precision (minimizing false positives) over recall.

Detection Strategies

Recommended Strategies

Strategy Description Precision
MARKOV_CHAIN Character transition probabilities trained on English 98.8%
NGRAM_FREQUENCY Common English trigram analysis 96.3%
WORD_LOOKUP Dictionary of 50K English words 92.7%
BIGRAM_PROBABILITY Impossible letter pairs (qx, jj, zx) 100%
LETTER_POSITION Letters in impossible positions 99.0%

All Available Strategies

High Precision (v0.5.0)

  • BIGRAM_PROBABILITY - Impossible letter pairs
  • LETTER_POSITION - Invalid letter positions
  • CONSONANT_SEQUENCE - Too many consecutive consonants
  • VOWEL_PATTERN - Invalid vowel sequences
  • LETTER_FREQUENCY - Abnormal letter distribution
  • RARE_TRIGRAM - Impossible trigrams

Core Strategies

  • MARKOV_CHAIN - Character-level Markov chain (best overall)
  • NGRAM_FREQUENCY - Trigram frequency analysis
  • WORD_LOOKUP - English dictionary lookup
  • PRONOUNCEABILITY - English phonotactic rules
  • KEYBOARD_PATTERN - Keyboard row sequences
  • ENTROPY_BASED - Shannon entropy analysis
  • VOWEL_RATIO - Vowel to consonant ratio

Specialized Detectors

  • MOJIBAKE - Encoding corruption (UTF-8 as Latin-1)
  • UNICODE_SCRIPT - Homoglyph/script mixing attacks
  • HEX_STRING - Hash strings and UUIDs
  • SYMBOL_RATIO - Excessive symbols/numbers
  • REPETITION - Repeated patterns (ababab)
  • COMPRESSION_RATIO - Compression-based detection

Legacy Strategies

  • CHARACTER_FREQUENCY, WORD_LENGTH, PATTERN_MATCHING, STATISTICAL_ANALYSIS
  • ENGLISH_WORD_VALIDATION (requires pip install pygarble[spellchecker])

Using Individual Strategies

from pygarble import GarbleDetector, Strategy

# Markov chain - best overall performance
detector = GarbleDetector(Strategy.MARKOV_CHAIN)
detector.predict("the quick brown fox")  # False
detector.predict("xkqzjwpmv")            # True

# High precision - zero false positives
detector = GarbleDetector(Strategy.BIGRAM_PROBABILITY)
detector.predict("hello world")          # False
detector.predict("qxjjxz")               # True (impossible: qx, jj, xz)

# Encoding corruption detection
detector = GarbleDetector(Strategy.MOJIBAKE)
detector.predict("Café")                 # False - valid UTF-8
detector.predict("Café")                # True - mojibake

# Homoglyph attack detection
detector = GarbleDetector(Strategy.UNICODE_SCRIPT)
detector.predict("paypal")               # False - all Latin
detector.predict("pаypal")               # True - Cyrillic 'а'

Ensemble Detector

Combine multiple strategies for better accuracy:

from pygarble import EnsembleDetector, Strategy

# Default ensemble (recommended)
# Uses: MARKOV_CHAIN, WORD_LOOKUP, NGRAM_FREQUENCY, BIGRAM_PROBABILITY, LETTER_POSITION
# Voting: majority
detector = EnsembleDetector()

# Custom strategies
detector = EnsembleDetector(
    strategies=[
        Strategy.MARKOV_CHAIN,
        Strategy.BIGRAM_PROBABILITY,
        Strategy.KEYBOARD_PATTERN,
    ]
)

# Different voting modes
detector = EnsembleDetector(voting="any")       # High recall - flag if ANY strategy detects
detector = EnsembleDetector(voting="all")       # High precision - flag only if ALL agree
detector = EnsembleDetector(voting="majority")  # Balanced (default)
detector = EnsembleDetector(voting="average")   # Average probabilities

# Weighted voting
detector = EnsembleDetector(
    strategies=[Strategy.MARKOV_CHAIN, Strategy.WORD_LOOKUP],
    voting="weighted",
    weights=[0.7, 0.3]
)

API Reference

GarbleDetector

GarbleDetector(
    strategy: Strategy,
    threshold: float = 0.5,    # Probability threshold for predict()
    **kwargs                   # Strategy-specific parameters
)

# Methods
detector.predict(text)         # Returns bool or List[bool]
detector.predict_proba(text)   # Returns float or List[float] (0.0-1.0)

EnsembleDetector

EnsembleDetector(
    strategies: List[Strategy] = None,  # Default: high-precision mix
    threshold: float = 0.5,
    voting: str = "majority",           # "majority", "any", "all", "average", "weighted"
    weights: List[float] = None,        # Required if voting="weighted"
)

# Methods (same as GarbleDetector)
detector.predict(text)
detector.predict_proba(text)

Common Use Cases

Filter User Input

detector = EnsembleDetector()

def validate_input(text):
    if detector.predict(text):
        return "Please enter valid text"
    return None

Clean Data Pipeline

detector = GarbleDetector(Strategy.MARKOV_CHAIN)

clean_data = [text for text in raw_data if not detector.predict(text)]

Detect Encoding Issues

detector = GarbleDetector(Strategy.MOJIBAKE)

for text in documents:
    if detector.predict(text):
        print(f"Encoding issue detected: {text[:50]}...")

Detect Phishing/Homoglyphs

detector = GarbleDetector(Strategy.UNICODE_SCRIPT)

if detector.predict(domain_name):
    print("Warning: Possible homoglyph attack")

Requirements

  • Python 3.8+
  • Zero dependencies (core library)
  • Optional: pyspellchecker for ENGLISH_WORD_VALIDATION strategy

Development

git clone https://github.com/brightertiger/pygarble.git
cd pygarble
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Changelog

0.5.0

  • 6 new high-precision strategies (BIGRAM_PROBABILITY, LETTER_POSITION, CONSONANT_SEQUENCE, VOWEL_PATTERN, LETTER_FREQUENCY, RARE_TRIGRAM)
  • Redesigned default ensemble for 99.5% precision
  • External validation benchmark (1,644 test cases)

0.4.0

  • Added COMPRESSION_RATIO, MOJIBAKE, PRONOUNCEABILITY, UNICODE_SCRIPT strategies

0.3.0

  • Zero-dependency core with embedded training data
  • Added MARKOV_CHAIN, NGRAM_FREQUENCY, WORD_LOOKUP strategies

About

Python Package to detect garbled, gibberish text for EN

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •