Implement String Deduplication with Metadata Preservation #13

@unclesp1d3r

Description


Problem Statement

Currently, the string extraction pipeline generates multiple FoundString entries for identical string text that appears in different locations within a binary. This creates significant noise in the output and makes analysis difficult.

Example scenario:

  • The string "https://example.com" appears in:
    • .rodata section at offset 0x1000
    • .data section at offset 0x3500
    • PE resource section at offset 0x8000

Without deduplication, three separate FoundString entries are emitted, even though they represent the same semantic content.

Goals

Implement intelligent string deduplication that:

  1. Reduces noise by consolidating duplicate string text into canonical entries
  2. Preserves metadata from all occurrences (offsets, sections, sources)
  3. Merges tags intelligently to combine semantic classifications
  4. Calculates combined relevance scores based on occurrence patterns
  5. Maintains traceability so analysts can see all locations where a string appears

Proposed Solution

Implementation Plan

Create src/extraction/dedup.rs with the following components:

1. Canonical String Structure

/// One deduplicated string, canonical over every place it was found.
pub struct CanonicalString {
    pub text: String,
    pub encoding: Encoding,
    /// Every location where this (text, encoding) pair occurs.
    pub occurrences: Vec<StringOccurrence>,
    /// Union of tags across all occurrences, duplicates removed.
    pub merged_tags: Vec<Tag>,
    /// Maximum original score plus occurrence/section/source bonuses.
    pub combined_score: i32,
}

/// Metadata for a single location where the string was observed.
pub struct StringOccurrence {
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub source: StringSource,
    pub original_tags: Vec<Tag>,
    pub original_score: i32,
}

2. Deduplication Algorithm

pub fn deduplicate(strings: Vec<FoundString>) -> Vec<CanonicalString> {
    // 1. Group strings by (text, encoding) key
    // 2. For each group, create a CanonicalString with:
    //    - All occurrences preserved
    //    - Union of all tags across occurrences
    //    - Combined score per the scoring strategy below
    // 3. Sort by combined_score descending
}
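
A minimal sketch of this algorithm, assuming FoundString exposes text, encoding, offset, rva, section, source, tags, and score fields (the real definition lives in src/types.rs and may differ), and that Encoding derives Eq + Hash + Clone while Tag derives PartialEq + Clone. The combined_score() helper is sketched under Scoring Strategy below:

use std::collections::HashMap;

pub fn deduplicate(strings: Vec<FoundString>) -> Vec<CanonicalString> {
    // Group by (text, encoding) so UTF-8 and UTF-16 copies of the same
    // text remain separate canonical entries.
    let mut groups: HashMap<(String, Encoding), Vec<FoundString>> = HashMap::new();
    for s in strings {
        groups.entry((s.text.clone(), s.encoding.clone())).or_default().push(s);
    }

    let mut canonical: Vec<CanonicalString> = groups
        .into_iter()
        .map(|((text, encoding), group)| {
            // Union of tags across all occurrences, duplicates removed.
            let mut merged_tags: Vec<Tag> = Vec::new();
            for tag in group.iter().flat_map(|s| s.tags.iter()) {
                if !merged_tags.contains(tag) {
                    merged_tags.push(tag.clone());
                }
            }
            let combined_score = combined_score(&group);
            let occurrences: Vec<StringOccurrence> = group
                .into_iter()
                .map(|s| StringOccurrence {
                    offset: s.offset,
                    rva: s.rva,
                    section: s.section,
                    source: s.source,
                    original_tags: s.tags,
                    original_score: s.score,
                })
                .collect();
            CanonicalString { text, encoding, occurrences, merged_tags, combined_score }
        })
        .collect();

    // Highest-scoring canonical strings first.
    canonical.sort_by(|a, b| b.combined_score.cmp(&a.combined_score));
    canonical
}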

3. Scoring Strategy

  • Base score: Take the maximum score from all occurrences
  • Occurrence bonus: +5 points for each additional occurrence (indicates importance)
  • Cross-section bonus: +10 if string appears in multiple section types
  • Multi-source bonus: +15 if string appears from different sources (e.g., both section data and imports)
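
A sketch of the bonus arithmetic, under the same assumptions about FoundString's fields as above:

use std::collections::HashSet;
use std::mem::discriminant;

/// Combined score for a group of occurrences sharing (text, encoding).
fn combined_score(group: &[FoundString]) -> i32 {
    // Base score: the strongest individual occurrence.
    let base = group.iter().map(|s| s.score).max().unwrap_or(0);

    // +5 for each occurrence beyond the first.
    let occurrence_bonus = 5 * (group.len() as i32 - 1).max(0);

    // +10 if the string spans more than one section.
    let sections: HashSet<&str> = group.iter().filter_map(|s| s.section.as_deref()).collect();
    let cross_section_bonus = if sections.len() > 1 { 10 } else { 0 };

    // +15 if the string was found by more than one source kind; comparing
    // enum discriminants avoids requiring Eq + Hash on StringSource.
    let sources: HashSet<_> = group.iter().map(|s| discriminant(&s.source)).collect();
    let multi_source_bonus = if sources.len() > 1 { 15 } else { 0 };

    base + occurrence_bonus + cross_section_bonus + multi_source_bonus
}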

4. Integration Points

Insert the deduplication stage into the extraction pipeline described in the architecture documentation:

String Extraction → **Deduplication** → Classification → Ranking → Output
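
For illustration only, the wiring could look like the sketch below; extract_strings, classify, and rank are hypothetical stand-ins for the existing stages, not real APIs in this repo:

// Hypothetical wiring; only deduplicate() is proposed by this issue.
pub fn run_pipeline(binary: &[u8]) -> Vec<CanonicalString> {
    let raw = extract_strings(binary);    // String Extraction
    let mut strings = deduplicate(raw);   // Deduplication (this issue)
    classify(&mut strings);               // Classification
    rank(&mut strings);                   // Ranking
    strings
}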

Output Format Considerations

The deduplicated output should show:

  • Primary metadata from the "best" occurrence (highest original score)
  • Count of total occurrences
  • Optional verbose mode to list all locations

Example JSONL record:

{
  "text": "https://example.com",
  "encoding": "Utf8",
  "occurrences": 3,
  "locations": [
    {"offset": 4096, "section": ".rodata", "source": "SectionData"},
    {"offset": 13568, "section": ".data", "source": "SectionData"},
    {"offset": 32768, "section": ".rsrc", "source": "ResourceString"}
  ],
  "tags": ["Url"],
  "score": 95
}
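
One way to emit that record, assuming the crate uses serde/serde_json and that Encoding, Tag, and StringSource derive Serialize (all assumptions; the field names mirror the example above):

use serde::Serialize;

#[derive(Serialize)]
struct DedupRecord<'a> {
    text: &'a str,
    encoding: &'a Encoding,
    occurrences: usize,
    locations: Vec<LocationRecord<'a>>,
    tags: &'a [Tag],
    score: i32,
}

#[derive(Serialize)]
struct LocationRecord<'a> {
    offset: u64,
    #[serde(skip_serializing_if = "Option::is_none")]
    section: Option<&'a str>,
    source: &'a StringSource,
}

/// Serialize one CanonicalString as a single JSONL line.
fn emit_jsonl(c: &CanonicalString) -> serde_json::Result<String> {
    serde_json::to_string(&DedupRecord {
        text: &c.text,
        encoding: &c.encoding,
        occurrences: c.occurrences.len(),
        locations: c
            .occurrences
            .iter()
            .map(|o| LocationRecord {
                offset: o.offset,
                section: o.section.as_deref(),
                source: &o.source,
            })
            .collect(),
        tags: &c.merged_tags,
        score: c.combined_score,
    })
}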

Acceptance Criteria

  • src/extraction/dedup.rs module created with deduplication logic
  • deduplicate() function handles identical text with different encodings separately
  • All occurrence metadata is preserved in the canonical structure
  • Tags are merged intelligently (union, with duplicates removed)
  • Combined scoring algorithm implemented with occurrence bonuses
  • Unit tests covering (a sample is sketched after this list):
    • Basic deduplication of identical strings
    • Metadata preservation from multiple occurrences
    • Tag merging logic
    • Score calculation with bonuses
    • Edge cases (empty strings, single occurrence, etc.)
  • Integration test with real binary containing duplicate strings
  • Documentation updated in docs/src/architecture.md
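
A sketch of one such test; the found() helper and the FoundString field names are hypothetical and should be adapted to the real definition in src/types.rs:

#[cfg(test)]
mod tests {
    use super::*;

    // Hypothetical constructor; adjust to the real FoundString shape.
    fn found(text: &str, section: &str, offset: u64, score: i32) -> FoundString {
        FoundString {
            text: text.into(),
            encoding: Encoding::Utf8,
            offset,
            rva: None,
            section: Some(section.into()),
            source: StringSource::SectionData,
            tags: vec![],
            score,
        }
    }

    #[test]
    fn merges_identical_text_and_keeps_all_occurrences() {
        let input = vec![
            found("https://example.com", ".rodata", 0x1000, 80),
            found("https://example.com", ".data", 0x3500, 70),
        ];
        let result = deduplicate(input);
        assert_eq!(result.len(), 1);
        assert_eq!(result[0].occurrences.len(), 2);
        // max(80, 70) + 5 (extra occurrence) + 10 (cross-section) = 95
        assert_eq!(result[0].combined_score, 95);
    }
}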

Technical Considerations

Edge Cases

  • Empty strings: Should they be deduplicated or filtered?
  • Whitespace variations: Are " test" and "test " the same?
  • Encoding normalization: How to handle UTF-8 vs UTF-16 of same text?

Performance

  • Use HashMap<(String, Encoding), Vec<FoundString>> for efficient grouping
  • Consider memory implications for binaries with thousands of duplicate strings
  • Benchmark against large binaries (>100MB) to ensure acceptable performance

Configuration

Consider adding configuration options:

  • --no-dedup: Skip deduplication for raw output
  • --dedup-threshold: Only deduplicate strings appearing N+ times
  • --dedup-mode: [strict|normalized] for handling whitespace
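
If the CLI uses clap's derive API (an assumption), the flags could be declared roughly like this:

use clap::{Parser, ValueEnum};

#[derive(Parser)]
struct DedupOpts {
    /// Skip deduplication and emit raw FoundString output.
    #[arg(long)]
    no_dedup: bool,

    /// Only deduplicate strings that appear at least this many times.
    #[arg(long, default_value_t = 2)]
    dedup_threshold: usize,

    /// How string text is compared when grouping.
    #[arg(long, value_enum, default_value = "strict")]
    dedup_mode: DedupMode,
}

#[derive(Clone, Copy, ValueEnum)]
enum DedupMode {
    /// Byte-for-byte comparison.
    Strict,
    /// Trim surrounding whitespace before comparing.
    Normalized,
}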

Related Files

  • src/types.rs: Contains FoundString structure definition
  • src/extraction/mod.rs: Main extraction module (currently placeholder)
  • docs/src/architecture.md: Architecture documentation to update

Task Context

Requirements: 2.5
Task-ID: stringy-analyzer/string-deduplication
