coregx/coregex

coregex

High-performance regex engine for Go. Drop-in replacement for regexp with 3-3000x speedup.

Why coregex?

Go's stdlib regexp is intentionally simple: it favors a predictable NFA-based design over aggressive optimization. This guarantees O(n) time but leaves performance on the table.

coregex brings Rust regex-crate architecture to Go:

  • Multi-engine: Lazy DFA, PikeVM, OnePass, BoundedBacktracker
  • SIMD prefilters: AVX2/SSSE3 for fast candidate rejection
  • Reverse search: Suffix/inner literal patterns run 1000x+ faster
  • O(n) guarantee: No unbounded backtracking, no ReDoS vulnerabilities

Installation

go get github.com/coregx/coregex

Requires Go 1.25+. Zero external dependencies.

Quick Start

package main

import (
    "fmt"
    "github.com/coregx/coregex"
)

func main() {
    re := coregex.MustCompile(`\w+@\w+\.\w+`)

    text := []byte("Contact support@example.com for help")

    // Find first match
    fmt.Printf("Found: %s\n", re.Find(text))

    // Check if matches (zero allocation)
    if re.MatchString("test@email.com") {
        fmt.Println("Valid email format")
    }
}

Performance

Cross-language benchmarks on 6MB input (source):

Pattern             Go stdlib   coregex   Rust regex   vs stdlib
Email validation    259 ms      1.5 ms    1.5 ms       172x
URL extraction      257 ms      1.3 ms    0.8 ms       192x
Suffix .*\.txt      240 ms      1.5 ms    1.3 ms       166x
Inner .*keyword.*   232 ms      1.5 ms    0.6 ms       153x
Char class [\w]+    550 ms      26 ms     52 ms        21x
Alternation a|b|c   473 ms      31 ms     0.8 ms       15x

Where coregex excels:

  • Suffix patterns (.*\.log, .*\.txt) — reverse search optimization
  • Inner literals (.*error.*, .*@example\.com) — bidirectional DFA
  • Character classes ([\w]+, \d+) — 256-byte lookup table
  • Multi-pattern (foo|bar|baz) — Teddy SIMD algorithm

Known gaps vs Rust:

  • literal_alt — Rust uses Aho-Corasick (planned for coregex)
  • Complex alternations — architectural differences

Features

Engine Selection

coregex automatically selects the optimal engine:

Strategy             Pattern Type            Speedup
ReverseInner         .*keyword.*             1000-3000x
ReverseSuffix        .*\.txt                 100-400x
CharClassSearcher    [\w]+, \d+              20-25x
Teddy                foo|bar|baz             15-240x
LazyDFA              Complex with literals   10-50x
OnePass              Anchored captures       10x
BoundedBacktracker   Small patterns          2-5x
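
To make the idea of pattern-driven strategy selection concrete, here is a rough sketch that inspects the parsed syntax tree from the standard `regexp/syntax` package and recognizes only the two shapes `.*literal` and `.*literal.*`. The function `classify` and its rules are assumptions made for illustration; coregex's real selection logic is not described in this README and is certainly more involved.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// isDotStar reports whether re is `.*` (with or without newline matching).
func isDotStar(re *syntax.Regexp) bool {
	return re.Op == syntax.OpStar && len(re.Sub) == 1 &&
		(re.Sub[0].Op == syntax.OpAnyCharNotNL || re.Sub[0].Op == syntax.OpAnyChar)
}

// classify is a hypothetical classifier that maps two pattern shapes onto the
// strategy names used in the table; everything else is reported as "Other".
func classify(pattern string) (string, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return "", err
	}
	if re.Op != syntax.OpConcat || len(re.Sub) < 2 || !isDotStar(re.Sub[0]) {
		return "Other", nil
	}
	switch n := len(re.Sub); {
	case n == 2 && re.Sub[1].Op == syntax.OpLiteral:
		return "ReverseSuffix", nil // .*literal
	case n == 3 && re.Sub[1].Op == syntax.OpLiteral && isDotStar(re.Sub[2]):
		return "ReverseInner", nil // .*literal.*
	}
	return "Other", nil
}

func main() {
	for _, p := range []string{`.*\.txt`, `.*keyword.*`, `\d+`} {
		s, err := classify(p)
		fmt.Println(p, s, err)
	}
}
```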

API Compatibility

Drop-in replacement for regexp.Regexp:

// stdlib
re := regexp.MustCompile(pattern)

// coregex — same API
re := coregex.MustCompile(pattern)

Supported methods:

  • Match, MatchString, MatchReader
  • Find, FindString, FindAll, FindAllString
  • FindIndex, FindStringIndex, FindAllIndex
  • FindSubmatch, FindStringSubmatch, FindAllSubmatch
  • ReplaceAll, ReplaceAllString, ReplaceAllFunc
  • Split, SubexpNames, NumSubexp
  • Longest, Copy, String
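
A short sketch of several of the listed methods in use. It is written against the standard `regexp` package so that it runs without extra dependencies; given the drop-in claim above, replacing the import with `github.com/coregx/coregex` and the `regexp.` qualifier with `coregex.` is expected to behave the same, though that substitution is an assumption rather than something verified here.

```go
package main

import (
	"fmt"
	"regexp" // assumption: coregex can be substituted here per the drop-in claim
)

// emailRe captures user, domain and TLD in three groups.
var emailRe = regexp.MustCompile(`(\w+)@(\w+)\.(\w+)`)

func main() {
	text := "Contact support@example.com or sales@example.org"

	// All matches as strings.
	fmt.Println(emailRe.FindAllString(text, -1))

	// First match plus its capture groups.
	fmt.Println(emailRe.FindStringSubmatch(text))

	// Rewrite each match using capture-group references.
	fmt.Println(emailRe.ReplaceAllString(text, "<$1 at $2>"))

	// Number of capture groups in the pattern.
	fmt.Println(emailRe.NumSubexp())
}
```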

Zero-Allocation APIs

// Zero allocations — returns bool
matched := re.IsMatch(text)

// Zero allocations — returns (start, end, found)
start, end, found := re.FindIndices(text)

Configuration

config := coregex.DefaultConfig()
config.DFAMaxStates = 10000      // Limit DFA cache
config.EnablePrefilter = true    // SIMD acceleration

re, err := coregex.CompileWithConfig(pattern, config)

Syntax Support

Uses Go's regexp/syntax parser:

Feature             Support
Character classes   [a-z], \d, \w, \s
Quantifiers         *, +, ?, {n,m}
Anchors             ^, $, \b, \B
Groups              (...), (?:...), (?P<name>...)
Unicode             \p{L}, \P{N}
Flags               (?i), (?m), (?s)
Backreferences      Not supported (O(n) guarantee)

Architecture

Pattern → Parse → NFA → Literal Extract → Strategy Select
                                               ↓
                         ┌─────────────────────────────────┐
                         │ Engines:                        │
                         │  LazyDFA, PikeVM, OnePass,      │
                         │  BoundedBacktracker,            │
                         │  ReverseInner, ReverseSuffix,   │
                         │  CharClassSearcher, Teddy       │
                         └─────────────────────────────────┘
                                               ↓
Input → Prefilter (SIMD) → Engine → Match Result

SIMD Primitives (AMD64):

  • memchr — single byte search (AVX2)
  • memmem — substring search (SSSE3)
  • teddy — multi-pattern search (SSSE3)

Pure Go fallback on other architectures.

Battle-Tested

coregex is integrated in GoAWK by Ben Hoyt. This real-world testing uncovered 15+ edge cases that synthetic benchmarks missed.

We need more testers! If you have a project using regexp, try coregex and report issues.

Documentation

Comparison

                 coregex          stdlib     regexp2
Performance      3-3000x faster   Baseline   Slower
SIMD             AVX2/SSSE3       No         No
O(n) guarantee   Yes              Yes        No
Backreferences   No               No         Yes
API              Drop-in                     Different

Use coregex for performance-critical code with O(n) guarantee. Use stdlib for simple cases where performance doesn't matter. Use regexp2 if you need backreferences (accept exponential worst-case).
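
The O(n) guarantee can be observed with a classic ReDoS pattern. The sketch below uses the standard `regexp` package, which the table lists as linear-time; coregex is documented as offering the same guarantee, but it is not exercised here. The helper `matchRepeated` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// nested is a pattern that triggers exponential blow-up in backtracking
// engines when the input almost matches.
var nested = regexp.MustCompile(`^(a+)+$`)

// matchRepeated matches n copies of "a" followed by tail against nested.
func matchRepeated(n int, tail string) bool {
	return nested.MatchString(strings.Repeat("a", n) + tail)
}

func main() {
	start := time.Now()
	ok := matchRepeated(5000, "b") // the trailing "b" forces a failed match
	fmt.Println(ok, time.Since(start))
}
```

On a linear-time engine the run time grows in proportion to the input length, so increasing `n` should not cause the sudden stalls seen with backtracking implementations.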

Related

Inspired by:

License

MIT — see LICENSE.


Status: Pre-1.0 (API may change). Ready for testing and feedback.

Releases · Issues · Discussions
