The creg application is a POSIX/GNU regular expression command-line tool for searching text strings or text files with patterns. It implements the functions of the compact-regex.h extensions library (https://github.com/nowca/compact-regex).
- fast regex testing
- text replacement function
- reads large text files (8 MB by default, adjustable with `-s`) passed as a parameter or as a redirected text stream
- structured and colored display output with filters
- file write export
- different output formats and layouts (table, list, plain ASCII, CSV, JSON)
- options of the `regex.h` library with extended functionalities
- can be run on Linux, Windows, Mac and all GNU C compatible platforms
- How to use
- Examples
- Installation
- Compilation
- Commandline options
- Supported Regular Expression operations
- POSIX Standard
- Character classes
`user@pc:~$ creg "abc DEF xyz ABC 123" "\d+"`
- find the digit string `\d+` in the text `abc DEF xyz ABC 123`

`user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "abc" -f i`
- find the string `-r "abc"` in the text `-t "abc DEF xyz ABC 123"`
- `-f i`: flag (case-insensitive)

`user@pc:~$ creg --text "abc DEF xyz ABC 123" --regex "abc" --option-flags i`
- find the string `--regex "abc"` in the text `--text "abc DEF xyz ABC 123"`
- `--option-flags i`: flag (case-insensitive)

`user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "[\w ]+[^0-9]+" -p plain -d r`
- find the string of words without numbers `-r "[\w ]+[^0-9]+"` in the text `-t "abc DEF xyz ABC 123"`
- `-d r`: display just the results
- `-p plain`: just as text

`user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "[\w]+" -p json -d r`
- find all words `-r "[\w]+"` in the text `-t "abc DEF xyz ABC 123"`
- `-d r`: display just the results
- `-p json`: in JSON format

`user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "[a-z0-9]+" -x "###"`
- replace all words made of lowercase letters or digits `-r "[a-z0-9]+"` in the text `-t "abc DEF xyz ABC 123"` with the string `-x "###"`

`user@pc:~$ creg -t "abc DEF xyz ABC 123" -r "(a)(b)(c)" -x "\3\2\1" -f gi`
- replace each letter of "abc" `-r "(a)(b)(c)"` with the reversed letters "cba" `-x "\3\2\1"`
- `-f gi`: flags (global, case-insensitive)
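
The replacement examples above can be approximated directly on top of POSIX `regex.h`. The following is a minimal sketch, not code taken from creg or compact-regex.h: the names `regex_replace`, `strbuf` and `sb_append` and the `icase`/`global` parameters are hypothetical, chosen only to mirror the `-x` and `-f gi` options. It assumes Extended Regular Expressions and supports `\1` to `\9` in the replacement text.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 10 /* whole match plus \1 .. \9 */

/* Growable output buffer (hypothetical helper, not part of creg). */
struct strbuf { char *data; size_t len, cap; int failed; };

static void sb_append(struct strbuf *sb, const char *src, size_t n)
{
    if (sb->failed)
        return;
    if (sb->len + n + 1 > sb->cap) {
        size_t new_cap = (sb->len + n + 1) * 2;
        char *tmp = realloc(sb->data, new_cap);
        if (tmp == NULL) { sb->failed = 1; return; }
        sb->data = tmp;
        sb->cap = new_cap;
    }
    memcpy(sb->data + sb->len, src, n);
    sb->len += n;
    sb->data[sb->len] = '\0';
}

/*
 * Replace matches of the ERE `pattern` in `text` with `repl`.
 * "\1".."\9" in `repl` insert the text of the matching capture group.
 * Returns a malloc'd string (caller frees) or NULL on error.
 */
char *regex_replace(const char *pattern, const char *text,
                    const char *repl, int icase, int global)
{
    struct strbuf sb = {NULL, 0, 0, 0};
    regmatch_t m[MAX_GROUPS];
    regex_t re;
    const char *cur = text;

    assert(pattern && text && repl);
    if (regcomp(&re, pattern, REG_EXTENDED | (icase ? REG_ICASE : 0)) != 0)
        return NULL;
    sb_append(&sb, "", 0); /* an empty result is still a valid string */

    while (regexec(&re, cur, MAX_GROUPS, m, cur == text ? 0 : REG_NOTBOL) == 0) {
        sb_append(&sb, cur, (size_t)m[0].rm_so); /* text before the match */
        for (const char *p = repl; *p; p++) {
            if (p[0] == '\\' && p[1] >= '1' && p[1] <= '9') {
                const regmatch_t *g = &m[p[1] - '0'];
                if (g->rm_so != -1) /* group took part in the match */
                    sb_append(&sb, cur + g->rm_so, (size_t)(g->rm_eo - g->rm_so));
                p++;
            } else {
                sb_append(&sb, p, 1);
            }
        }
        if (m[0].rm_eo == m[0].rm_so) { /* empty match: step one byte forward */
            if (cur[m[0].rm_eo] == '\0') { cur += m[0].rm_eo; break; }
            sb_append(&sb, cur + m[0].rm_eo, 1);
            cur += m[0].rm_eo + 1;
        } else {
            cur += m[0].rm_eo;
        }
        if (!global)
            break;
    }
    sb_append(&sb, cur, strlen(cur)); /* unmatched tail */
    regfree(&re);
    if (sb.failed) { free(sb.data); return NULL; }
    return sb.data;
}
```

Under these assumptions, `regex_replace("(a)(b)(c)", "abc DEF xyz ABC 123", "\\3\\2\\1", 1, 1)` returns `cba DEF xyz CBA 123`, and `regex_replace("[a-z0-9]+", "abc DEF xyz ABC 123", "###", 0, 1)` returns `### DEF ### ABC ###`. How creg itself formats and prints the result is not covered by this sketch.
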

`user@pc:~$ cat service-names-port-numbers.csv | ./creg -r "(\\d+);(.*UDP.*);(.*mail.*);" -c -f gein -d srp`
- display the file contents of `service-names-port-numbers.csv` with `cat` and pipe its STDOUT into `creg`
- `-r`: match all UDP-based protocols that contain the word "mail", with the options:
  - `-c`: colored output
  - `-f gein`: flags (global, extended, case-insensitive, newline)
  - `-d srp`: display statistics, results and index positions

`user@pc:~$ ./creg -i ./example-files/oxford-word-list.txt -r "^(Ae.*ion) (.+\.) (.*)$" -p list -f gei -c -d sr`
- `-i`: read in the file `./example-files/oxford-word-list.txt`
- `-r`: match all lines (from `^` to `$`) with words that start with `Ae` and end with `ion`, with the options:
  - `-c`: colored output
  - `-p list`: list format
  - `-f gei`: flags (global, extended, case-insensitive)
  - `-d sr`: display statistics and results, without the index positions

`Z:\>creg.exe /I "example-files\windows-formatted-regfile.reg" /R ".*HKEY-CLASSES_ROOT.*" /D TSR`
- `/I`: read in the file `example-files\windows-formatted-regfile.reg`
- `/R`: match all lines that contain the phrase "HKEY-CLASSES_ROOT", with the options:
  - `/D TSR`: display text, statistics and results, without the index positions

On Windows, the input file can also be piped in with the cmd pipeline operator `|`:

`Z:\>more port-numbers.csv | creg.exe /R "^.*mail.*$" /D sr /F gein /P list`
- `more port-numbers.csv |`: show the contents of the file and redirect it with `|`
- `/R`: match all lines that contain the phrase "mail", with the options:
  - `/D sr`: display statistics and results, without the index positions
  - `/F gein`: flags (global, extended, case-insensitive, newline)
  - `/P list`: short list format

The program can be compiled and copied to the /opt/ folder.
Just run:

`user@pc:~$ make`

and

`user@pc:~$ sudo make install`

Build the example program by typing in:

`user@pc:~$ make`

...or compile it directly with the GNU C compiler:

`user@pc:~$ gcc -Wall -static creg.c -o creg`

- The GNU Extensions with the regex.h library are needed for successful compilation. Make sure the necessary header and library files are included.
- Use the `-m32` flag to compile the program for 32-bit systems.
- Important note: the program is compiled with the `-static` flag to link the libraries into the executable. With static linking, valgrind reports some memory leaks; with dynamic linking these errors are suppressed by default. (https://stackoverflow.com/questions/7506134/valgrind-errors-when-linked-with-static-why)

To compile the program on Windows, you need a compiler distribution that includes the regex.h library from the GNU extensions:

`C:\Users\pcuser>gcc.exe -static -IC:\MinGW-W64\mingw32\opt\include creg.c -o creg.exe -LC:\MinGW-W64\mingw32\opt\lib -lregex`

- MinGW-W64 includes the regex.h library in the `\opt\include` and `\opt\lib` folders.
- The paths of the header and library must be passed with `-I` and `-L`, with an additional `-lregex` parameter at the end of the command.
- `-static` can be used to make the executable independent of shared libraries.
- The path of gcc.exe must be added to the Windows PATH user variable.

To compile the program on macOS (OS X), you need a compiler distribution that includes the regex.h library from the GNU extensions:

- There are several ways to install the GCC development tools on your Mac:
  - Xcode
  - Homebrew
  - MacPorts
  - source code compilation
  - graphical package installer like Bower or MacUpdate
- You need a GCC installation with the `regex.h` library (GNU Extensions).
- For compiler options see Linux.

- see `-h` or `--help` to read all the options

`creg [Commands] [Options]`

| Command: | Meaning: |
|---|---|
| `-t <input-text>`, `--text <input-text>` | text input string |
| `-r <expression>`, `--regex <expression>` | regular expression pattern |
| `-x <replace-text>`, `--replace <replace-text>` | replacement text substring |
| `-i <filename>`, `--input <filename>` | filepath to read in file |
| `-o <filename>`, `--output <filename>` | filepath to write out file |
| `-h`, `--help` | show help for commands |


| Command: | Meaning: |
|---|---|
| `-d <data>`, `--data <data>` | show output elements |

`<data>`:

| Argument: | Meaning: |
|---|---|
| `t` | input text |
| `s` | statistics |
| `r` | results |
| `p` | match index positions |

usage example:
`-d tsrp` or `--data sr`


| Command: | Meaning: |
|---|---|
| `-p <print-layout>`, `--print <print-layout>` | printing or file writing layout |

`<print-layout>`:

| Argument: | Meaning: |
|---|---|
| `table` | table |
| `list` | short list |
| `list-full` | full list |
| `plain` | plain result data |
| `csv` | comma-separated values |
| `json` | JavaScript Object Notation |


| Command: | Meaning: |
|---|---|
| `-c`, `--color` | display with ANSI colors |


| Command: | Meaning: |
|---|---|
| `-f <options>`, `--option-flags <options>` | option flags for regex compilation |

`<options>`:

| Argument: | Meaning: |
|---|---|
| `g`: global | search for all matches in a text |
| `e`: extended | use Extended Regular Expressions (ERE) |
| `i`: icase | use case-insensitive matching |
| `m`: multiline | search in multiple lines |
| `n`: newline | ignore the newline character |
| `p`: nosubexp | ignore group matching with subexpressions |
| `q`: subexp | match only subexpressions |

usage example:
`-f ge` or `--option-flags geinq`

default options:
- global, extended, newline (the default options are deactivated if an option is set with the `-f` command)
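
To make the flag letters more concrete, here is a minimal sketch of how such a string could be translated into `regcomp()` compile flags. It is an assumption, not creg's actual code: the struct `flag_set` and the function `parse_option_flags` are hypothetical, and pairing `e`, `i`, `n` and `p` with the POSIX constants `REG_EXTENDED`, `REG_ICASE`, `REG_NEWLINE` and `REG_NOSUB` is inferred from the flag names only. `g`, `m` and `q` have no POSIX compile flag, so the sketch keeps them as tool-level booleans.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the parsed -f letters. */
struct flag_set {
    int  cflags;      /* bits for regcomp() */
    bool global;      /* g: keep searching after the first match */
    bool multiline;   /* m: handled by the tool, not by regcomp() */
    bool only_subexp; /* q: handled by the tool, not by regcomp() */
};

/*
 * Parse a flag string such as "gei". A NULL string selects the
 * documented defaults (global, extended, newline).
 * Returns 0 on success, -1 on an unknown letter.
 */
int parse_option_flags(const char *letters, struct flag_set *out)
{
    assert(out != NULL);
    if (letters == NULL)
        letters = "gen"; /* defaults apply only when -f is not given */

    out->cflags = 0;
    out->global = false;
    out->multiline = false;
    out->only_subexp = false;

    for (; *letters != '\0'; letters++) {
        switch (*letters) {
        case 'g': out->global = true;          break;
        case 'e': out->cflags |= REG_EXTENDED; break;
        case 'i': out->cflags |= REG_ICASE;    break;
        case 'n': out->cflags |= REG_NEWLINE;  break;
        case 'p': out->cflags |= REG_NOSUB;    break;
        case 'm': out->multiline = true;       break;
        case 'q': out->only_subexp = true;     break;
        default:  return -1;
        }
    }
    return 0;
}
```

Because the defaults are only substituted when no flag string is given, `-f i` in this sketch yields a non-global, basic-syntax, case-insensitive search, which matches the note that setting any flag deactivates the defaults.
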

| Command: | Meaning: |
|---|---|
| `-s <length>`, `--max-text-size <length>` | max input-text length in bytes, default: 8388608 bytes (8 MB) |
| `-n <count>`, `--max-num-matches <count>` | max number of matches, default: 8192 matches |

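
The interplay of a global search, match index positions and a match limit can be sketched with plain `regex.h` as follows. This is an illustration under assumptions, not creg's implementation: `find_all`, `struct span` and `DEFAULT_MAX_MATCHES` are hypothetical names, and the sketch always compiles the pattern as an Extended Regular Expression.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define DEFAULT_MAX_MATCHES 8192 /* documented default of -n */

/* Byte offsets of one match; `end` is exclusive. */
struct span { size_t start, end; };

/*
 * Collect up to max_matches non-overlapping matches of the ERE `pattern`
 * in `text`. Returns the number of matches stored in `spans`,
 * or -1 if the pattern does not compile.
 */
int find_all(const char *pattern, const char *text,
             struct span *spans, int max_matches)
{
    regex_t re;
    regmatch_t m;
    size_t off = 0;
    int count = 0;

    assert(pattern && text && spans);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* REG_NOTBOL keeps ^ from matching in the middle of the text. */
    while (count < max_matches &&
           regexec(&re, text + off, 1, &m, off ? REG_NOTBOL : 0) == 0) {
        spans[count].start = off + (size_t)m.rm_so;
        spans[count].end   = off + (size_t)m.rm_eo;
        count++;
        if (m.rm_eo == m.rm_so) { /* empty match: avoid an endless loop */
            if (text[off + (size_t)m.rm_eo] == '\0')
                break;
            off += (size_t)m.rm_eo + 1;
        } else {
            off += (size_t)m.rm_eo;
        }
    }
    regfree(&re);
    return count;
}
```

With the text `abc DEF xyz ABC 123`, the pattern `[0-9]+` gives one span from offset 16 to 19, and `[A-Za-z]+` with `max_matches` set to 2 stops after `abc` and `DEF`.
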
The program supports POSIX compatible Regular Expressions from regex.h with some extended functionalities, like single character classes.

| Supported: | Not supported: |
|---|---|
| Wildcard `.` | Lazy `*?` `+?` `??` |
| Character classes `\d` `\D` `\w` `\W` | Negative Lookahead `(?!)` |
| POSIX character classes `[:digit:]` | Negative Lookbehind `(?<!)` |
| Whitespace `\s` `\S` | Positive Lookahead `(?=)` |
| Character Sets `[abc]` | Positive Lookbehind `(?<=)` |
| Escaping `\` | |
| The Asterisk `*` | |
| The Plus `+` | |
| The Question Mark `?` | |
| Numeric Quantifier `{n}` | |
| Range Quantifier `{n,m}` | |
| Alternation `\|` | |
| Anchors `^` `$` | |
| Capturing Groups `a(b)c` | |
| Backreferences `\1` | |
| ASCII and Unicode sequences | |
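
Since lazy quantifiers are not available, POSIX matching always returns the longest match starting at the leftmost position. A common workaround is a negated character set in place of `.*?`. The sketch below shows the difference with plain `regex.h`; the helper `first_match` is a hypothetical name used only for this illustration and is not part of creg.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Copy the first match of the ERE `pattern` in `text` into `out`
 * (truncated to out_size - 1 bytes). Returns the copied length,
 * or -1 if there is no match or the pattern does not compile.
 */
int first_match(const char *pattern, const char *text,
                char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    assert(pattern && text && out && out_size > 0);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0) {
        size_t n = (size_t)(m.rm_eo - m.rm_so);
        if (n >= out_size)
            n = out_size - 1;
        memcpy(out, text + m.rm_so, n);
        out[n] = '\0';
        len = (int)n;
    }
    regfree(&re);
    return len;
}
```

On the text `<a><b>`, the greedy pattern `<.*>` matches the whole string `<a><b>`, while `<[^>]*>` stops at the first closing bracket and matches only `<a>`.
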
| Metacharacter | Description |
|---|---|
| ^ | Matches the starting position within the string. In line-based tools, it matches the starting position of any line. |
| . | Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c". |
| [ ] | A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc], [^-abc]. Backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^, if present) character: []abc], [^]abc]. |
| [^ ] | Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed. |
| $ | Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line. |
| ( ) | Defines a marked subexpression, also called a capturing group, which is essential for extracting the desired part of the text (see also the next entry, \n). BRE mode requires `\( \)`. |
| \n | Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is defined in the POSIX standard. Some tools allow referencing more than nine capturing groups. Also known as a back-reference, this feature is supported in BRE mode. |
| * | Matches the preceding element zero or more times. For example, `ab*c` matches "ac", "abc", "abbbc", etc. `[xyz]*` matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. `(ab)*` matches "", "ab", "abab", "ababab", and so on. |
| {m,n} | Matches the preceding element at least m and not more than n times. For example, `a{3,5}` matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regexes. BRE mode requires `\{m,n\}`. |

POSIX extended (ERE) metacharacters:

| Metacharacter | Description |
|---|---|
| ? | Matches the preceding element zero or one time. For example, ab?c matches only "ac" or "abc". |
| + | Matches the preceding element one or more times. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
| \| | The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, `abc\|def` matches "abc" or "def". |

Source: https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended

| Description | POSIX | Shortcode | ASCII |
|---|---|---|---|
| ASCII characters | | `\x[Bytecode]` | |
| Alphanumeric characters | `[:alnum:]` | | `[A-Za-z0-9]` |
| Alphanumeric characters plus "_" | | `\w` | `[A-Za-z0-9_]` |
| Non-word characters | | `\W` | `[^A-Za-z0-9_]` |
| Alphabetic characters | `[:alpha:]` | `\a` | `[A-Za-z]` |
| Space and tab | `[:space:]` | `\s` | |
| | `[:blank:]` | `\t` | |
| Non-whitespace characters | | `\S` | `[^ ]` |
| Word boundaries | | `\b` | |
| Non-word boundaries | | `\B` | |
| Digits | `[:digit:]` | `\d` | `[0-9]` |
| Non-digits | | `\D` | `[^0-9]` |
| Lowercase letters | `[:lower:]` | `\l` | `[a-z]` |
| Uppercase letters | `[:upper:]` | `\u` | `[A-Z]` |
| Visible characters | `[:print:]` | `\p` | `[\x20-\x7E]` |
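
Plain POSIX `regcomp()` does not define shortcodes such as `\d` or `\w`, so an extension layer has to translate them first. The following is one possible way to do that, written as a hedged sketch: `expand_shortcodes` and its lookup table are hypothetical and do not show how compact-regex.h actually works. It covers only `\d \D \w \W \s \S`, using the equivalents from the table above, and leaves negated shortcodes inside a bracket expression untouched because they cannot be merged into an existing set.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Shortcode letter, set contents without brackets, negation flag. */
static const struct { char code; const char *set; int negated; } shortcodes[] = {
    {'d', "0-9", 0},        {'D', "0-9", 1},
    {'w', "A-Za-z0-9_", 0}, {'W', "A-Za-z0-9_", 1},
    {'s', "[:space:]", 0},  {'S', "[:space:]", 1},
};

/*
 * Return a malloc'd copy of `pattern` with shortcodes rewritten as POSIX
 * bracket expressions, e.g. "\d+" becomes "[0-9]+". Caller frees the result.
 */
char *expand_shortcodes(const char *pattern)
{
    assert(pattern != NULL);
    /* Two input bytes grow to at most 13 output bytes ("[^A-Za-z0-9_]"). */
    char *out = malloc(strlen(pattern) * 7 + 1);
    char *w = out;
    int in_bracket = 0;

    if (out == NULL)
        return NULL;
    for (const char *p = pattern; *p != '\0'; p++) {
        if (p[0] == '\\' && p[1] != '\0') {
            const char *set = NULL;
            int neg = 0;
            for (size_t i = 0; i < sizeof shortcodes / sizeof shortcodes[0]; i++)
                if (shortcodes[i].code == p[1]) {
                    set = shortcodes[i].set;
                    neg = shortcodes[i].negated;
                }
            if (set != NULL && !in_bracket) {
                w += sprintf(w, "[%s%s]", neg ? "^" : "", set);
            } else if (set != NULL && !neg) {
                w += sprintf(w, "%s", set); /* merge into the open set */
            } else {
                *w++ = p[0];                /* other escapes: copy as-is */
                *w++ = p[1];
            }
            p++;
            continue;
        }
        if (!in_bracket && *p == '[') {
            in_bracket = 1;
            *w++ = *p;
            if (p[1] == '^') *w++ = *++p;
            if (p[1] == ']') *w++ = *++p;   /* literal ] as first member */
            continue;
        }
        if (in_bracket && p[0] == '[' && p[1] == ':') {
            const char *end = strstr(p + 2, ":]");
            if (end != NULL) {              /* copy [:class:] verbatim */
                size_t k = (size_t)(end + 2 - p);
                memcpy(w, p, k);
                w += k;
                p = end + 1;
                continue;
            }
        }
        if (in_bracket && *p == ']')
            in_bracket = 0;
        *w++ = *p;
    }
    *w = '\0';
    return out;
}
```

With this sketch, `\d+` expands to `[0-9]+` and `[\w ]+[^0-9]+` expands to `[A-Za-z0-9_ ]+[^0-9]+`, both of which can be passed to `regcomp()` with `REG_EXTENDED`.
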


