Implement IPv4 and IPv6 Address Pattern Detection in Semantic Classifier #16

@unclesp1d3r

Summary

Implement comprehensive IPv4 and IPv6 address pattern detection and validation within the semantic classification system to identify and tag IP addresses found in binary strings.

Context

The semantic classifier currently supports URL and domain detection. IP addresses are a critical type of network indicator that appear frequently in binaries (C&C addresses, configuration endpoints, telemetry servers, hardcoded network targets). Adding IPv4 and IPv6 detection will enable security analysts and reverse engineers to quickly identify potential network indicators of compromise (IOCs).

Current State

  • Tag::IPv4 and Tag::IPv6 enum variants are already defined in src/types.rs (lines 18-19)
  • src/classification/mod.rs exists but is currently empty
  • The semantic tagging infrastructure is in place but needs pattern matching implementation
  • Architecture supports regex-based classification per concept.md

Dependencies

Proposed Solution

Implementation Approach

Implement IP address detection in src/classification/mod.rs (or a dedicated submodule) with the following components:

1. IPv4 Pattern Matching

Pattern: XXX.XXX.XXX.XXX where each octet is 0-255

Validation Rules:

  • Four octets separated by dots
  • Each octet must be 0-255 (no leading zeros except for "0" itself)
  • No leading/trailing dots
  • Special-purpose addresses (e.g., 0.0.0.0, 255.255.255.255) are syntactically valid; tag them, but consider giving them lower scores

Example Valid: 192.168.1.1, 10.0.0.1, 172.16.0.1, 8.8.8.8

Example Invalid: 256.1.1.1, 192.168.1, 192.168.1.1.1, 192.168.01.1 (leading zero)

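Regex Pattern (starting point; rejects leading zeros, but \b still lets it match a prefix of longer dotted runs such as 192.168.1.1.1, so validate the full candidate afterwards):

\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b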
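The validation rules above can also be checked without a regex, which is useful as a strict second pass on whatever the pattern finds. The following is a minimal sketch using only the standard library; `is_strict_ipv4` is a hypothetical helper name, not something that exists in the codebase.

```rust
/// Returns true if `text` is exactly a dotted-quad IPv4 address:
/// four decimal octets, each 0-255, no leading zeros.
/// Hypothetical helper for illustration only.
fn is_strict_ipv4(text: &str) -> bool {
    let mut count = 0;
    for part in text.split('.') {
        count += 1;
        if count > 4 {
            return false; // too many octets, e.g. 192.168.1.1.1
        }
        // 1 to 3 ASCII digits only; empty parts come from leading,
        // trailing, or doubled dots.
        if part.is_empty() || part.len() > 3 || !part.bytes().all(|b| b.is_ascii_digit()) {
            return false;
        }
        // No leading zeros except for "0" itself.
        if part.len() > 1 && part.starts_with('0') {
            return false;
        }
        // Range check: values above 255 do not fit in a u8.
        if part.parse::<u8>().is_err() {
            return false;
        }
    }
    count == 4
}
```

Because it checks the whole string, this sketch rejects 192.168.1.1.1 outright, where a word-boundary regex would still match a prefix of it.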

2. IPv6 Pattern Matching

Format Support:

  • Full notation: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
  • Compressed notation: 2001:db8:85a3::8a2e:370:7334
  • Mixed notation (IPv4-mapped): ::ffff:192.0.2.1
  • Loopback: ::1
  • Link-local: fe80::1

Validation Rules:

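  • Eight groups of 1-4 hexadecimal digits separated by colons (leading zeros in a group may be omitted)
  • Double colon :: allowed at most once, standing in for one or more consecutive all-zero groups
  • In mixed notation, a dotted-quad IPv4 address may replace the final two groups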
  • Case-insensitive hex digits (a-f, A-F)

Example Valid:

  • 2001:db8::1
  • fe80::1
  • ::1
  • 2001:0db8:85a3::8a2e:0370:7334
  • ::ffff:192.0.2.1

Implementation Note: IPv6 regex validation is complex. Consider using the ipnetwork crate or the standard library's std::net::Ipv6Addr (or std::net::IpAddr) parser for validation after initial pattern matching.
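As a rough illustration of the note above, the standard-library parser already accepts full, compressed, and mixed notation, so a loose pattern can locate candidates and the parser can make the final decision. The helper names `is_valid_ipv6` and `is_ipv4_mapped` are hypothetical.

```rust
use std::net::Ipv6Addr;

/// Hypothetical helper: validates a candidate string with the
/// standard-library IPv6 parser (no brackets, ports, or zone IDs).
fn is_valid_ipv6(candidate: &str) -> bool {
    candidate.parse::<Ipv6Addr>().is_ok()
}

/// Hypothetical helper: true for IPv4-mapped addresses (::ffff:a.b.c.d),
/// detected by inspecting the eight 16-bit groups.
fn is_ipv4_mapped(addr: &Ipv6Addr) -> bool {
    matches!(addr.segments(), [0, 0, 0, 0, 0, 0xffff, _, _])
}
```

Bracketed forms such as [2001:db8::1]:8080 are not accepted by this parser and would need the brackets and port removed first.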

3. Classification Function Structure

pub fn classify_string(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    
    if is_ipv4_address(text) {
        tags.push(Tag::IPv4);
    }
    
    if is_ipv6_address(text) {
        tags.push(Tag::IPv6);
    }
    
    // Add other classification logic (URL, Domain, etc.)
    
    tags
}

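fn is_ipv4_address(text: &str) -> bool {
    // Strict dotted-quad parse: current Rust releases reject out-of-range
    // octets and leading zeros. A regex pre-filter can run first to locate
    // candidates inside longer strings.
    text.parse::<std::net::Ipv4Addr>().is_ok()
}

fn is_ipv6_address(text: &str) -> bool {
    // std::net::Ipv6Addr accepts full, compressed (::), and mixed notation.
    text.parse::<std::net::Ipv6Addr>().is_ok()
}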

4. Integration with Scoring System

IP addresses should receive semantic boost per concept.md ranking algorithm:

  • IPv4/IPv6 in private ranges: +2 score (internal network indicators)
  • IPv4/IPv6 in public ranges: +3 to +5 score (potential C&C, external endpoints)
  • IPv4 in special ranges (loopback, multicast): +1 score (informational)
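One possible shape for these tiers, sketched with standard-library range checks. The function names are hypothetical, public addresses get the lower bound (+3) of the suggested range, and two choices are assumptions that go beyond the list above: link-local addresses are grouped with private ranges, and the "special" tier is applied to IPv6 as well.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical score boost for an already-parsed address.
fn ip_score_boost(addr: IpAddr) -> i32 {
    match addr {
        IpAddr::V4(v4) => ipv4_boost(v4),
        IpAddr::V6(v6) => ipv6_boost(v6),
    }
}

fn ipv4_boost(addr: Ipv4Addr) -> i32 {
    if addr.is_loopback() || addr.is_multicast() || addr.is_unspecified() || addr.is_broadcast() {
        1 // special ranges: informational
    } else if addr.is_private() || addr.is_link_local() {
        2 // RFC 1918 ranges; 169.254.0.0/16 included by assumption
    } else {
        3 // public: lower bound of the +3 to +5 range
    }
}

fn ipv6_boost(addr: Ipv6Addr) -> i32 {
    let first = addr.segments()[0];
    if addr.is_loopback() || addr.is_multicast() || addr.is_unspecified() {
        1 // ::1, ff00::/8, ::
    } else if (first & 0xfe00) == 0xfc00 || (first & 0xffc0) == 0xfe80 {
        2 // unique local fc00::/7 or link-local fe80::/10
    } else {
        3 // everything else is treated as public in this sketch
    }
}
```

The prefix checks for IPv6 are done by hand on the first group so the sketch does not depend on newer standard-library helpers.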

Technical Considerations

  1. False Positives:

    • Version numbers may look like IPs: 1.2.3.4
    • Add context checks: IPs in networking sections get higher confidence
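    • Consider down-scoring common version patterns (all octets < 20) rather than excluding them outright, since real addresses such as 10.0.0.1, 8.8.8.8, and 1.1.1.1 also match that pattern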
  2. Performance:

    • Use compiled regex with regex crate
    • Consider aho-corasick for multi-pattern matching if combined with URL/Domain
    • Lazy static initialization for regex patterns
  3. Dependencies:

    • regex = "1.10" (already in use per architecture)
    • Optional: ipnetwork = "0.20" or use std::net for validation
  4. Port Handling:

    • Decide if 192.168.1.1:8080 should be tagged as IPv4
    • Suggest: Strip port suffix before validation, still tag as IPv4
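The port-handling suggestion and the version-number heuristic can be sketched with the standard library alone. This assumes that accepting exactly the ip:port and [ip]:port forms understood by std::net::SocketAddr is enough; `extract_ip` and `looks_like_version` are hypothetical names.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Hypothetical helper: parses `text` as a bare IP address, or as
/// `ip:port` / `[ipv6]:port`, and returns the address without the port.
fn extract_ip(text: &str) -> Option<IpAddr> {
    if let Ok(ip) = text.parse::<IpAddr>() {
        return Some(ip);
    }
    // SocketAddr parsing also validates that the port fits in 16 bits.
    text.parse::<SocketAddr>().ok().map(|sock| sock.ip())
}

/// Hypothetical heuristic from the false-positive note: every octet < 20.
fn looks_like_version(addr: Ipv4Addr) -> bool {
    addr.octets().iter().all(|&octet| octet < 20)
}
```

Note that `looks_like_version` returns true for 10.0.0.1 as well as 1.2.3.4, so it is probably better suited to lowering a score than to rejecting a match.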

Testing Requirements

Unit Tests

Create src/classification/tests.rs or inline tests with coverage for:

IPv4 Tests:

  • ✅ Valid addresses: 192.168.1.1, 10.0.0.1, 8.8.8.8, 1.1.1.1
  • ✅ Edge cases: 0.0.0.0, 255.255.255.255, 127.0.0.1
  • ❌ Invalid: 256.1.1.1, 192.168.1, 192.168.1.1.1, 999.999.999.999
  • ❌ Leading zeros: 192.168.01.1
  • ❌ Version numbers: 1.2.3.4 (context-dependent)
  • ✅ With ports: 192.168.1.1:8080 (should extract IP)

IPv6 Tests:

  • ✅ Full notation: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
  • ✅ Compressed: 2001:db8::1, ::1, fe80::1
  • ✅ Mixed notation: ::ffff:192.0.2.1, 64:ff9b::192.0.2.1
  • ✅ All zeros: ::
  • ❌ Invalid: gggg::1, 2001:db8::1::2 (double ::), 2001:db8:1
  • ✅ With ports/brackets: [2001:db8::1]:8080

Integration Tests:

  • Extract IPs from sample binary strings (mix of text, URLs with IPs, config strings)
  • Verify tagging applied correctly to FoundString objects
  • Test scoring boosts are applied

Documentation

  • Add rustdoc comments to classification functions
  • Update concept.md with IP classification details
  • Add examples to README.md showing IP detection

Acceptance Criteria

  • IPv4 pattern matching implemented with validation
  • IPv6 pattern matching implemented with validation
  • Unit tests cover valid, invalid, and edge cases for both formats
  • Integration with semantic tagging system (Tag::IPv4, Tag::IPv6)
  • False positive mitigation strategies implemented
  • Scoring boost integrated into ranking algorithm
  • Documentation updated (rustdoc, README, concept.md)
  • No performance regression in string extraction pipeline

References

Related Issues


Task-ID: stringy-analyzer/ip-address-classification
Requirements: 3.3
Estimated Effort: 2-3 days (implementation + comprehensive testing)
