Labels
area:analyzer, lang:rust, needs:tests, status:backlog, story-points: 5, type:enhancement
Description
Problem Statement
Currently, the string extraction pipeline generates multiple `FoundString` entries for identical string text that appears in different locations within a binary. This creates significant noise in the output and makes analysis difficult.
Example scenario:
- The string "https://example.com" appears in:
  - the `.rodata` section at offset 0x1000
  - the `.data` section at offset 0x3500
  - the PE resource section at offset 0x8000
Without deduplication, three separate `FoundString` entries are emitted, even though they represent the same semantic content.
Goals
Implement intelligent string deduplication that:
- Reduces noise by consolidating duplicate string text into canonical entries
- Preserves metadata from all occurrences (offsets, sections, sources)
- Merges tags intelligently to combine semantic classifications
- Calculates combined relevance scores based on occurrence patterns
- Maintains traceability so analysts can see all locations where a string appears
Proposed Solution
Implementation Plan
Create `src/extraction/dedup.rs` with the following components:
1. Canonical String Structure
```rust
/// One deduplicated string: the text plus every location it was seen at.
pub struct CanonicalString {
    pub text: String,
    pub encoding: Encoding,
    pub occurrences: Vec<StringOccurrence>,
    pub merged_tags: Vec<Tag>,
    pub combined_score: i32,
}

/// A single location where the string text occurred.
pub struct StringOccurrence {
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub source: StringSource,
    pub original_tags: Vec<Tag>,
    pub original_score: i32,
}
```
2. Deduplication Algorithm
```rust
pub fn deduplicate(strings: Vec<FoundString>) -> Vec<CanonicalString> {
    // 1. Group strings by (text, encoding) key.
    // 2. For each group, create a CanonicalString with:
    //    - all occurrences preserved
    //    - the union of all tags across occurrences
    //    - combined score = max(scores) + occurrence bonus
    // 3. Sort by combined_score descending.
    todo!() // a fuller sketch follows below
}
```
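A minimal sketch of this pass, assuming `FoundString` (defined in `src/types.rs`) exposes `text`, `encoding`, `offset`, `rva`, `section`, `source`, `tags`, and `score` fields, that `Encoding` is `Clone + Eq + Hash`, and that `Tag` is `Clone + PartialEq`; the `combined_score` helper is sketched under Scoring Strategy below:

```rust
use std::collections::HashMap;

pub fn deduplicate(strings: Vec<FoundString>) -> Vec<CanonicalString> {
    // 1. Group by (text, encoding) so UTF-8 and UTF-16 copies of the
    //    same text remain separate canonical entries.
    let mut groups: HashMap<(String, Encoding), Vec<FoundString>> = HashMap::new();
    for s in strings {
        groups
            .entry((s.text.clone(), s.encoding.clone()))
            .or_default()
            .push(s);
    }

    // 2. Collapse each group into one CanonicalString.
    let mut canonical: Vec<CanonicalString> = groups
        .into_iter()
        .map(|((text, encoding), found)| {
            // Union of all tags across occurrences, duplicates removed.
            let mut merged_tags: Vec<Tag> = Vec::new();
            for tag in found.iter().flat_map(|f| f.tags.iter()) {
                if !merged_tags.contains(tag) {
                    merged_tags.push(tag.clone());
                }
            }
            // Preserve every occurrence's metadata.
            let occurrences: Vec<StringOccurrence> = found
                .into_iter()
                .map(|f| StringOccurrence {
                    offset: f.offset,
                    rva: f.rva,
                    section: f.section,
                    source: f.source,
                    original_tags: f.tags,
                    original_score: f.score,
                })
                .collect();
            let combined_score = combined_score(&occurrences);
            CanonicalString { text, encoding, occurrences, merged_tags, combined_score }
        })
        .collect();

    // 3. Highest combined score first.
    canonical.sort_by(|a, b| b.combined_score.cmp(&a.combined_score));
    canonical
}
```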
3. Scoring Strategy
- Base score: Take the maximum score from all occurrences
- Occurrence bonus: +5 points for each additional occurrence (indicates importance)
- Cross-section bonus: +10 if string appears in multiple section types
- Multi-source bonus: +15 if string appears from different sources (e.g., both section data and imports)
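Translated literally into a helper over the `StringOccurrence` structure above (assuming `StringSource` is `Eq + Hash`; the function name is illustrative):

```rust
use std::collections::HashSet;

fn combined_score(occurrences: &[StringOccurrence]) -> i32 {
    // Base score: maximum original score across all occurrences.
    let base = occurrences.iter().map(|o| o.original_score).max().unwrap_or(0);

    // Occurrence bonus: +5 for each occurrence beyond the first.
    let occurrence_bonus = 5 * occurrences.len().saturating_sub(1) as i32;

    // Cross-section bonus: +10 if the string spans multiple section types.
    let sections: HashSet<&str> = occurrences
        .iter()
        .filter_map(|o| o.section.as_deref())
        .collect();
    let cross_section_bonus = if sections.len() > 1 { 10 } else { 0 };

    // Multi-source bonus: +15 if the string comes from multiple sources.
    let sources: HashSet<&StringSource> = occurrences.iter().map(|o| &o.source).collect();
    let multi_source_bonus = if sources.len() > 1 { 15 } else { 0 };

    base + occurrence_bonus + cross_section_bonus + multi_source_bonus
}
```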
4. Integration Points
Update the extraction pipeline in the architecture:
String Extraction → **Deduplication** → Classification → Ranking → Output
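In code, the new stage is a single extra call; `classify` and `rank` below are hypothetical stand-ins for the existing Classification and Ranking stages, shown only to mark where `deduplicate` slots in:

```rust
// Hypothetical stand-ins for the existing downstream stages.
fn classify(strings: Vec<CanonicalString>) -> Vec<CanonicalString> { strings }
fn rank(strings: Vec<CanonicalString>) -> Vec<CanonicalString> { strings }

/// Extraction output now passes through deduplication before the
/// existing Classification → Ranking → Output stages.
fn run_pipeline(extracted: Vec<FoundString>) -> Vec<CanonicalString> {
    let canonical = deduplicate(extracted); // the new stage
    rank(classify(canonical))
}
```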
Output Format Considerations
The deduplicated output should show:
- Primary metadata from the "best" occurrence (highest original score)
- Count of total occurrences
- Optional verbose mode to list all locations
JSONL format:
```json
{
  "text": "https://example.com",
  "encoding": "Utf8",
  "occurrences": 3,
  "locations": [
    {"offset": 4096, "section": ".rodata", "source": "SectionData"},
    {"offset": 13568, "section": ".data", "source": "SectionData"},
    {"offset": 32768, "section": ".rsrc", "source": "ResourceString"}
  ],
  "tags": ["Url"],
  "score": 95
}
```
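A possible mapping from `CanonicalString` to that record shape, assuming serde and serde_json are available and that `Encoding`, `Tag`, and `StringSource` derive `Serialize` (the record structs here are illustrative, not existing types):

```rust
use serde::Serialize;

// Illustrative output records mirroring the JSONL example above.
#[derive(Serialize)]
struct DedupRecord<'a> {
    text: &'a str,
    encoding: &'a Encoding,
    occurrences: usize,
    locations: Vec<Location<'a>>,
    tags: &'a [Tag],
    score: i32,
}

#[derive(Serialize)]
struct Location<'a> {
    offset: u64,
    #[serde(skip_serializing_if = "Option::is_none")]
    section: Option<&'a str>,
    source: &'a StringSource,
}

// One JSON object per output line.
fn to_jsonl(c: &CanonicalString) -> serde_json::Result<String> {
    serde_json::to_string(&DedupRecord {
        text: &c.text,
        encoding: &c.encoding,
        occurrences: c.occurrences.len(),
        locations: c
            .occurrences
            .iter()
            .map(|o| Location {
                offset: o.offset,
                section: o.section.as_deref(),
                source: &o.source,
            })
            .collect(),
        tags: &c.merged_tags,
        score: c.combined_score,
    })
}
```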
Acceptance Criteria
- `src/extraction/dedup.rs` module created with deduplication logic
- `deduplicate()` function handles identical text with different encodings separately
- All occurrence metadata is preserved in the canonical structure
- Tags are merged intelligently (union, with duplicates removed)
- Combined scoring algorithm implemented with occurrence bonuses
- Unit tests covering (a sketch of a few of these follows this list):
  - Basic deduplication of identical strings
  - Metadata preservation from multiple occurrences
  - Tag merging logic
  - Score calculation with bonuses
  - Edge cases (empty strings, single occurrence, etc.)
- Integration test with real binary containing duplicate strings
- Documentation updated in `docs/src/architecture.md`
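A sketch of a few of these tests, reusing the field-layout assumptions from the deduplication sketch above (expected values follow the scoring constants listed earlier):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Test helper using the assumed FoundString field layout.
    fn found(text: &str, offset: u64, section: &str, score: i32) -> FoundString {
        FoundString {
            text: text.to_string(),
            encoding: Encoding::Utf8,
            offset,
            rva: None,
            section: Some(section.to_string()),
            source: StringSource::SectionData,
            tags: vec![],
            score,
        }
    }

    #[test]
    fn merges_identical_text_and_encoding() {
        let out = deduplicate(vec![
            found("https://example.com", 0x1000, ".rodata", 80),
            found("https://example.com", 0x3500, ".data", 70),
        ]);
        assert_eq!(out.len(), 1);
        assert_eq!(out[0].occurrences.len(), 2);
        // max(80, 70) + 5 (one extra occurrence) + 10 (two sections) = 95
        assert_eq!(out[0].combined_score, 95);
    }

    #[test]
    fn single_occurrence_gets_no_bonus() {
        let out = deduplicate(vec![found("lonely", 0x10, ".rodata", 40)]);
        assert_eq!(out.len(), 1);
        assert_eq!(out[0].combined_score, 40);
    }
}
```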
Technical Considerations
Edge Cases
- Empty strings: Should they be deduplicated or filtered?
- Whitespace variations: Are " test" and "test " the same?
- Encoding normalization: How to handle UTF-8 vs UTF-16 of same text?
Performance
- Use `HashMap<(String, Encoding), Vec<FoundString>>` for efficient grouping
- Consider memory implications for binaries with thousands of duplicate strings
- Benchmark against large binaries (>100MB) to ensure acceptable performance
Configuration
Consider adding configuration options (one possible wiring is sketched after this list):
- `--no-dedup`: Skip deduplication for raw output
- `--dedup-threshold`: Only deduplicate strings appearing N+ times
- `--dedup-mode`: [strict|normalized] for handling whitespace
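If the CLI uses clap's derive API (with the `derive` feature enabled), the flags could be wired roughly as below; the struct and enum names, and the defaults, are assumptions rather than decisions made by this issue:

```rust
use clap::{Parser, ValueEnum};

#[derive(Parser)]
struct DedupOpts {
    /// Skip deduplication for raw output.
    #[arg(long)]
    no_dedup: bool,

    /// Only deduplicate strings appearing at least this many times.
    #[arg(long, default_value_t = 2)]
    dedup_threshold: usize,

    /// strict: byte-for-byte equality; normalized: trim whitespace first.
    #[arg(long, value_enum, default_value = "strict")]
    dedup_mode: DedupMode,
}

#[derive(Clone, Copy, Debug, ValueEnum)]
enum DedupMode {
    Strict,
    Normalized,
}
```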
Related Files
- `src/types.rs`: Contains the `FoundString` structure definition
- `src/extraction/mod.rs`: Main extraction module (currently a placeholder)
- `docs/src/architecture.md`: Architecture documentation to update
Task Context
Requirements: 2.5
Task-ID: stringy-analyzer/string-deduplication