pygarble

Detect gibberish, garbled text, and nonsense with high precision.

A zero-dependency Python library for identifying random character sequences, keyboard mashing, encoding errors, and other forms of text corruption. Uses statistical analysis, phonotactic rules, and pattern matching to distinguish meaningful text from gibberish.

Installation

pip install pygarble

Quick Start

from pygarble import GarbleDetector, EnsembleDetector, Strategy

# Recommended: Use the default ensemble (99.5% precision)
detector = EnsembleDetector()
detector.predict("Hello world")      # False - valid text
detector.predict("asdfghjkl")        # True - keyboard mashing
detector.predict("qxzjkwp")          # True - impossible letter combinations

# Get probability scores (0.0 = valid, 1.0 = gibberish)
detector.predict_proba("Hello world")  # ~0.1
detector.predict_proba("xkqzjwp")      # ~0.9

# Batch processing
texts = ["Hello world", "asdfghjkl", "Normal sentence here"]
results = detector.predict(texts)      # [False, True, False]

Performance

Tested on 1,644 samples (dictionary words, sentences, random strings, keyboard mashing):

Detector	Precision	Recall	F1 Score
EnsembleDetector()	99.5%	78.5%	87.8%
MARKOV_CHAIN	98.8%	86.4%	92.2%
BIGRAM_PROBABILITY	100%	33.6%	50.3%

The default ensemble prioritizes precision (minimizing false positives) over recall.

Detection Strategies

Recommended Strategies

Strategy	Description	Precision
`MARKOV_CHAIN`	Character transition probabilities trained on English	98.8%
`NGRAM_FREQUENCY`	Common English trigram analysis	96.3%
`WORD_LOOKUP`	Dictionary of 50K English words	92.7%
`BIGRAM_PROBABILITY`	Impossible letter pairs (qx, jj, zx)	100%
`LETTER_POSITION`	Letters in impossible positions	99.0%

All Available Strategies

High Precision (v0.5.0)

BIGRAM_PROBABILITY - Impossible letter pairs
LETTER_POSITION - Invalid letter positions
CONSONANT_SEQUENCE - Too many consecutive consonants
VOWEL_PATTERN - Invalid vowel sequences
LETTER_FREQUENCY - Abnormal letter distribution
RARE_TRIGRAM - Impossible trigrams

Core Strategies

MARKOV_CHAIN - Character-level Markov chain (best overall)
NGRAM_FREQUENCY - Trigram frequency analysis
WORD_LOOKUP - English dictionary lookup
PRONOUNCEABILITY - English phonotactic rules
KEYBOARD_PATTERN - Keyboard row sequences
ENTROPY_BASED - Shannon entropy analysis
VOWEL_RATIO - Vowel to consonant ratio

Specialized Detectors

MOJIBAKE - Encoding corruption (UTF-8 as Latin-1)
UNICODE_SCRIPT - Homoglyph/script mixing attacks
HEX_STRING - Hash strings and UUIDs
SYMBOL_RATIO - Excessive symbols/numbers
REPETITION - Repeated patterns (ababab)
COMPRESSION_RATIO - Compression-based detection

Legacy Strategies

CHARACTER_FREQUENCY, WORD_LENGTH, PATTERN_MATCHING, STATISTICAL_ANALYSIS
ENGLISH_WORD_VALIDATION (requires pip install pygarble[spellchecker])

Using Individual Strategies

from pygarble import GarbleDetector, Strategy

# Markov chain - best overall performance
detector = GarbleDetector(Strategy.MARKOV_CHAIN)
detector.predict("the quick brown fox")  # False
detector.predict("xkqzjwpmv")            # True

# High precision - zero false positives
detector = GarbleDetector(Strategy.BIGRAM_PROBABILITY)
detector.predict("hello world")          # False
detector.predict("qxjjxz")               # True (impossible: qx, jj, xz)

# Encoding corruption detection
detector = GarbleDetector(Strategy.MOJIBAKE)
detector.predict("Café")                 # False - valid UTF-8
detector.predict("CafÃ©")                # True - mojibake

# Homoglyph attack detection
detector = GarbleDetector(Strategy.UNICODE_SCRIPT)
detector.predict("paypal")               # False - all Latin
detector.predict("pаypal")               # True - Cyrillic 'а'

Ensemble Detector

Combine multiple strategies for better accuracy:

from pygarble import EnsembleDetector, Strategy

# Default ensemble (recommended)
# Uses: MARKOV_CHAIN, WORD_LOOKUP, NGRAM_FREQUENCY, BIGRAM_PROBABILITY, LETTER_POSITION
# Voting: majority
detector = EnsembleDetector()

# Custom strategies
detector = EnsembleDetector(
    strategies=[
        Strategy.MARKOV_CHAIN,
        Strategy.BIGRAM_PROBABILITY,
        Strategy.KEYBOARD_PATTERN,
    ]
)

# Different voting modes
detector = EnsembleDetector(voting="any")       # High recall - flag if ANY strategy detects
detector = EnsembleDetector(voting="all")       # High precision - flag only if ALL agree
detector = EnsembleDetector(voting="majority")  # Balanced (default)
detector = EnsembleDetector(voting="average")   # Average probabilities

# Weighted voting
detector = EnsembleDetector(
    strategies=[Strategy.MARKOV_CHAIN, Strategy.WORD_LOOKUP],
    voting="weighted",
    weights=[0.7, 0.3]
)

API Reference

GarbleDetector

GarbleDetector(
    strategy: Strategy,
    threshold: float = 0.5,    # Probability threshold for predict()
    **kwargs                   # Strategy-specific parameters
)

# Methods
detector.predict(text)         # Returns bool or List[bool]
detector.predict_proba(text)   # Returns float or List[float] (0.0-1.0)

EnsembleDetector

EnsembleDetector(
    strategies: List[Strategy] = None,  # Default: high-precision mix
    threshold: float = 0.5,
    voting: str = "majority",           # "majority", "any", "all", "average", "weighted"
    weights: List[float] = None,        # Required if voting="weighted"
)

# Methods (same as GarbleDetector)
detector.predict(text)
detector.predict_proba(text)

Common Use Cases

Filter User Input

detector = EnsembleDetector()

def validate_input(text):
    if detector.predict(text):
        return "Please enter valid text"
    return None

Clean Data Pipeline

detector = GarbleDetector(Strategy.MARKOV_CHAIN)

clean_data = [text for text in raw_data if not detector.predict(text)]

Detect Encoding Issues

detector = GarbleDetector(Strategy.MOJIBAKE)

for text in documents:
    if detector.predict(text):
        print(f"Encoding issue detected: {text[:50]}...")

Detect Phishing/Homoglyphs

detector = GarbleDetector(Strategy.UNICODE_SCRIPT)

if detector.predict(domain_name):
    print("Warning: Possible homoglyph attack")

Requirements

Python 3.8+
Zero dependencies (core library)
Optional: pyspellchecker for ENGLISH_WORD_VALIDATION strategy

Development

git clone https://github.com/brightertiger/pygarble.git
cd pygarble
pip install -e ".[dev]"
pytest tests/ -v

License

MIT License

Changelog

0.5.0

6 new high-precision strategies (BIGRAM_PROBABILITY, LETTER_POSITION, CONSONANT_SEQUENCE, VOWEL_PATTERN, LETTER_FREQUENCY, RARE_TRIGRAM)
Redesigned default ensemble for 99.5% precision
External validation benchmark (1,644 test cases)

0.4.0

Added COMPRESSION_RATIO, MOJIBAKE, PRONOUNCEABILITY, UNICODE_SCRIPT strategies

0.3.0

Zero-dependency core with embedded training data
Added MARKOV_CHAIN, NGRAM_FREQUENCY, WORD_LOOKUP strategies

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github		.github
docs		docs
pygarble		pygarble
regression		regression
scripts		scripts
tests		tests
.bumpversion.cfg		.bumpversion.cfg
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pygarble

Installation

Quick Start

Performance

Detection Strategies

Recommended Strategies

All Available Strategies

Using Individual Strategies

Ensemble Detector

API Reference

GarbleDetector

EnsembleDetector

Common Use Cases

Filter User Input

Clean Data Pipeline

Detect Encoding Issues

Detect Phishing/Homoglyphs

Requirements

Development

License

Changelog

0.5.0

0.4.0

0.3.0

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

License

brightertiger/pygarble

Folders and files

Latest commit

History

Repository files navigation

pygarble

Installation

Quick Start

Performance

Detection Strategies

Recommended Strategies

All Available Strategies

Using Individual Strategies

Ensemble Detector

API Reference

GarbleDetector

EnsembleDetector

Common Use Cases

Filter User Input

Clean Data Pipeline

Detect Encoding Issues

Detect Phishing/Homoglyphs

Requirements

Development

License

Changelog

0.5.0

0.4.0

0.3.0

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages