Description
Summary
Implement comprehensive IPv4 and IPv6 address pattern detection and validation within the semantic classification system to identify and tag IP addresses found in binary strings.
Context
The semantic classifier currently supports URL and domain detection. IP addresses are a critical type of network indicator that appear frequently in binaries (C&C addresses, configuration endpoints, telemetry servers, hardcoded network targets). Adding IPv4 and IPv6 detection will enable security analysts and reverse engineers to quickly identify potential network indicators of compromise (IOCs).
Current State
- Tag::IPv4 and Tag::IPv6 enum variants are already defined in src/types.rs (lines 18-19)
- src/classification/mod.rs exists but is currently empty
- The semantic tagging infrastructure is in place but needs pattern matching implementation
- Architecture supports regex-based classification per concept.md
Dependencies
- Blocked by: URL and Domain Classification (Implement URL and Domain Pattern Matching in Semantic Classification System #15 or related)
- The implementation pattern established for URL/Domain detection should be followed for consistency
- Shared validation utilities may be needed
Proposed Solution
Implementation Approach
Implement IP address detection in src/classification/mod.rs (or a dedicated submodule) with the following components:
1. IPv4 Pattern Matching
Pattern: XXX.XXX.XXX.XXX where each octet is 0-255
Validation Rules:
- Four octets separated by dots
- Each octet must be 0-255 (no leading zeros except for "0" itself)
- No leading/trailing dots
- Special-purpose addresses (e.g., 0.0.0.0, 255.255.255.255) are syntactically valid but may be flagged with lower scores
Example Valid: 192.168.1.1, 10.0.0.1, 172.16.0.1, 8.8.8.8
Example Invalid: 256.1.1.1, 192.168.1, 192.168.1.1.1, 192.168.01.1 (leading zero)
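Regex Pattern (starting point; rejects leading zeros, but \b still matches the first four octets of a longer run such as 192.168.1.1.1, so the surrounding characters need a separate check):
\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b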
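The validation rules above can also be checked without a regex. The following is a minimal sketch, assuming the whole string is the candidate address; it uses only the standard library rather than the regex crate, and `is_valid_octet` is a hypothetical helper name, not something already in the codebase.

```rust
/// Returns true if `text` is exactly four dot-separated decimal octets,
/// each in 0-255 and with no leading zeros (other than "0" itself).
pub fn is_ipv4_address(text: &str) -> bool {
    let mut count = 0;
    for part in text.split('.') {
        count += 1;
        // More than four parts, or any malformed part, is a rejection.
        if count > 4 || !is_valid_octet(part) {
            return false;
        }
    }
    count == 4
}

/// Hypothetical helper: validates a single octet string.
fn is_valid_octet(part: &str) -> bool {
    // Empty parts come from leading, trailing, or doubled dots.
    if part.is_empty() || part.len() > 3 {
        return false;
    }
    if !part.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    // "01" or "001" are rejected; a lone "0" is fine.
    if part.len() > 1 && part.starts_with('0') {
        return false;
    }
    part.parse::<u16>().map_or(false, |n| n <= 255)
}
```

Because this version validates the full string, it rejects 192.168.1.1.1 outright instead of matching a four-octet prefix.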
2. IPv6 Pattern Matching
Format Support:
- Full notation: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
- Compressed notation: 2001:db8:85a3::8a2e:370:7334
- Mixed notation (IPv4-mapped): ::ffff:192.0.2.1
- Loopback: ::1
- Link-local: fe80::1
Validation Rules:
- Eight groups of 1-4 hexadecimal digits separated by colons (full form)
- Double colon :: allowed once to represent one or more consecutive all-zero groups
- Trailing/embedded IPv4 addresses in mixed notation
- Case-insensitive hex digits (a-f, A-F)
Example Valid: 2001:db8::1, fe80::1, ::1, 2001:0db8:85a3::8a2e:0370:7334, ::ffff:192.0.2.1
Implementation Note: IPv6 regex validation is complex. Consider using the ipnetwork crate or std::net::IpAddr for validation after initial pattern matching.
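One way to follow that note is a cheap character prefilter followed by the standard library parser. The sketch below assumes the candidate string has already been isolated (no brackets, ports, or zone IDs); the 45-character cap is the length of the longest textual form, a fully expanded address with an embedded IPv4 tail.

```rust
use std::net::Ipv6Addr;

/// Sketch: prefilter on characters and length, then let std do the
/// real validation (group count, single "::", embedded IPv4 tail).
pub fn is_ipv6_address(text: &str) -> bool {
    let plausible = (2..=45).contains(&text.len())
        && text.contains(':')
        && text
            .bytes()
            .all(|b| b.is_ascii_hexdigit() || b == b':' || b == b'.');
    plausible && text.parse::<Ipv6Addr>().is_ok()
}
```

The prefilter exists only to skip the parser for strings that cannot possibly be addresses; correctness rests on `Ipv6Addr`'s `FromStr` implementation.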
3. Classification Function Structure
    pub fn classify_string(text: &str) -> Vec<Tag> {
        let mut tags = Vec::new();
        if is_ipv4_address(text) {
            tags.push(Tag::IPv4);
        }
        if is_ipv6_address(text) {
            tags.push(Tag::IPv6);
        }
        // Add other classification logic (URL, Domain, etc.)
        tags
    }

    fn is_ipv4_address(text: &str) -> bool {
        // Implementation with regex + validation
        todo!()
    }

    fn is_ipv6_address(text: &str) -> bool {
        // Implementation with regex + std::net::Ipv6Addr parsing
        todo!()
    }

4. Integration with Scoring System
IP addresses should receive a semantic boost per the concept.md ranking algorithm:
- IPv4/IPv6 in private ranges: +2 score (internal network indicators)
- IPv4/IPv6 in public ranges: +3 to +5 score (potential C&C, external endpoints)
- IPv4 in special ranges (loopback, multicast): +1 score (informational)
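The boost tiers above could be derived from the parsed address roughly as follows. This is a sketch under several assumptions: `IpScope` and `score_boost` are hypothetical names, unspecified and broadcast addresses are grouped with the special ranges, link-local and IPv6 unique-local addresses are treated as private, and public addresses get a flat +3 (the low end of the +3 to +5 range), since the issue does not say how the upper values are chosen.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical coarse classification used only for scoring.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IpScope {
    Special,
    Private,
    Public,
}

pub fn ipv4_scope(addr: Ipv4Addr) -> IpScope {
    if addr.is_loopback() || addr.is_multicast() || addr.is_unspecified() || addr.is_broadcast() {
        IpScope::Special
    } else if addr.is_private() || addr.is_link_local() {
        IpScope::Private
    } else {
        IpScope::Public
    }
}

pub fn ipv6_scope(addr: Ipv6Addr) -> IpScope {
    let first = addr.segments()[0];
    if addr.is_loopback() || addr.is_multicast() || addr.is_unspecified() {
        IpScope::Special
    } else if (first & 0xfe00) == 0xfc00 || (first & 0xffc0) == 0xfe80 {
        // fc00::/7 (unique local) or fe80::/10 (link-local)
        IpScope::Private
    } else {
        IpScope::Public
    }
}

/// Assumed mapping: special +1, private +2, public +3.
pub fn score_boost(addr: IpAddr) -> i32 {
    let scope = match addr {
        IpAddr::V4(v4) => ipv4_scope(v4),
        IpAddr::V6(v6) => ipv6_scope(v6),
    };
    match scope {
        IpScope::Special => 1,
        IpScope::Private => 2,
        IpScope::Public => 3,
    }
}
```

Keeping the scope as a separate enum means the numeric weights can be tuned in one place once the ranking algorithm in concept.md settles on exact values.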
Technical Considerations
- False Positives:
  - Version numbers may look like IPs: 1.2.3.4
  - Add context checks: IPs in networking sections get higher confidence
  - Consider excluding common version patterns (all octets < 20)
- Performance:
  - Use compiled regex with the regex crate
  - Consider aho-corasick for multi-pattern matching if combined with URL/Domain
  - Lazy static initialization for regex patterns
- Dependencies:
  - regex = "1.10" (already in use per architecture)
  - Optional: ipnetwork = "0.20" or use std::net for validation
- Port Handling:
  - Decide if 192.168.1.1:8080 should be tagged as IPv4
  - Suggest: Strip port suffix before validation, still tag as IPv4
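Two of the considerations above, port handling and the version-number heuristic, can be sketched with the standard library alone. `extract_ip` and `looks_like_version` are hypothetical names; the port handling relies on `std::net::SocketAddr` parsing, which accepts both `a.b.c.d:port` and `[v6]:port`.

```rust
use std::net::{IpAddr, SocketAddr};

/// Sketch: accept a bare address, or an address with a port suffix,
/// and return just the IP part.
pub fn extract_ip(text: &str) -> Option<IpAddr> {
    if let Ok(ip) = text.parse::<IpAddr>() {
        return Some(ip);
    }
    // Handles "192.168.1.1:8080" and "[2001:db8::1]:8080".
    text.parse::<SocketAddr>().ok().map(|sock| sock.ip())
}

/// Sketch of the "all octets < 20" heuristic from the list above.
/// IPv6 addresses are never treated as version strings.
pub fn looks_like_version(ip: &IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => v4.octets().iter().all(|&octet| octet < 20),
        IpAddr::V6(_) => false,
    }
}
```

Note that the heuristic as stated also fires on legitimate addresses such as 10.0.0.1, 8.8.8.8, and 1.1.1.1, so it is probably better used to lower confidence than to reject a match outright.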
Testing Requirements
Unit Tests
Create src/classification/tests.rs or inline tests with coverage for:
IPv4 Tests:
- ✅ Valid addresses: 192.168.1.1, 10.0.0.1, 8.8.8.8, 1.1.1.1
- ✅ Edge cases: 0.0.0.0, 255.255.255.255, 127.0.0.1
- ❌ Invalid: 256.1.1.1, 192.168.1, 192.168.1.1.1, 999.999.999.999
- ❌ Leading zeros: 192.168.01.1
- ❌ Version numbers: 1.2.3.4 (context-dependent)
- ✅ With ports: 192.168.1.1:8080 (should extract IP)
IPv6 Tests:
- ✅ Full notation: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
- ✅ Compressed: 2001:db8::1, ::1, fe80::1
- ✅ Mixed notation: ::ffff:192.0.2.1, 64:ff9b::192.0.2.1
- ✅ All zeros: ::
- ❌ Invalid: gggg::1, 2001:db8::1::2 (double ::), 2001:db8:1
- ✅ With ports/brackets: [2001:db8::1]:8080
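The context-free cases from the two lists above lend themselves to a table-driven test. The sketch below is one possible shape, not the project's test layout: `IPV4_CASES`, `IPV6_CASES`, and `failing_cases` are hypothetical names, and the context-dependent entries (version numbers, port suffixes) are left out because they need the extraction step.

```rust
/// (input, expected validity) pairs taken from the IPv4 list above.
pub const IPV4_CASES: &[(&str, bool)] = &[
    ("192.168.1.1", true),
    ("10.0.0.1", true),
    ("8.8.8.8", true),
    ("0.0.0.0", true),
    ("255.255.255.255", true),
    ("127.0.0.1", true),
    ("256.1.1.1", false),
    ("192.168.1", false),
    ("192.168.1.1.1", false),
    ("999.999.999.999", false),
    ("192.168.01.1", false),
];

/// (input, expected validity) pairs taken from the IPv6 list above.
pub const IPV6_CASES: &[(&str, bool)] = &[
    ("2001:0db8:85a3:0000:0000:8a2e:0370:7334", true),
    ("2001:db8::1", true),
    ("::1", true),
    ("fe80::1", true),
    ("::ffff:192.0.2.1", true),
    ("64:ff9b::192.0.2.1", true),
    ("::", true),
    ("gggg::1", false),
    ("2001:db8::1::2", false),
    ("2001:db8:1", false),
];

/// Runs `classifier` over the table and returns every input whose
/// result differs from the expected value.
pub fn failing_cases<'a>(
    classifier: impl Fn(&str) -> bool,
    cases: &[(&'a str, bool)],
) -> Vec<&'a str> {
    cases
        .iter()
        .filter(|(input, expected)| classifier(input) != *expected)
        .map(|(input, _)| *input)
        .collect()
}
```

Returning the list of failing inputs, rather than asserting inside the loop, lets a single test report every mismatch at once.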
Integration Tests:
- Extract IPs from sample binary strings (mix of text, URLs with IPs, config strings)
- Verify tagging applied correctly to FoundString objects
- Test scoring boosts are applied
Documentation
- Add rustdoc comments to classification functions
- Update concept.md with IP classification details
- Add examples to README.md showing IP detection
Acceptance Criteria
- IPv4 pattern matching implemented with validation
- IPv6 pattern matching implemented with validation
- Unit tests cover valid, invalid, and edge cases for both formats
- Integration with semantic tagging system (Tag::IPv4, Tag::IPv6)
- False positive mitigation strategies implemented
- Scoring boost integrated into ranking algorithm
- Documentation updated (rustdoc, README, concept.md)
- No performance regression in string extraction pipeline
References
- RFC 791 - Internet Protocol (IPv4)
- RFC 4291 - IPv6 Addressing Architecture
- RFC 5952 - IPv6 Address Text Representation
- Rust std::net::IpAddr documentation
Related Issues
- Implement URL and Domain Pattern Matching in Semantic Classification System #15 (or relevant): URL and Domain Classification (blocking)
- Future: Network endpoint clustering/analysis
Task-ID: stringy-analyzer/ip-address-classification
Requirements: 3.3
Estimated Effort: 2-3 days (implementation + comprehensive testing)