Fix UTF-8 truncation for CJK/emoji characters #444

Open
georg wants to merge 1 commit into main from fix/utf8-safe-string-truncation
@georg georg commented Feb 20, 2026

Summary

  • Fix byte-based string slicing that produced garbled text (mojibake) when truncating prompts containing CJK characters or emoji
  • Replace byte slicing in generateContextFromPrompts and TruncateDescription with rune-based truncation using the existing stringutil.TruncateRunes helper
  • Add tests for CJK, emoji, and ASCII truncation scenarios
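
To illustrate the failure mode described above, here is a minimal sketch contrasting byte slicing with rune-based truncation. `truncateBytes`, `truncateRunes`, and `bothValid` are hypothetical names written for this example; `truncateRunes` only approximates what `stringutil.TruncateRunes` might do, and the assumption that the suffix counts toward the rune limit is not confirmed by this PR.

```go
package main

import "unicode/utf8"

// truncateBytes reproduces the buggy pattern: it cuts at a byte offset,
// which can land in the middle of a multi-byte UTF-8 sequence.
func truncateBytes(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	return s[:maxBytes]
}

// truncateRunes is a hypothetical stand-in for stringutil.TruncateRunes.
// Assumption: the result holds at most maxRunes runes, suffix included.
func truncateRunes(s string, maxRunes int, suffix string) string {
	if maxRunes <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= maxRunes {
		return s
	}
	suffixRunes := []rune(suffix)
	if len(suffixRunes) > maxRunes {
		// No room for the suffix: keep only whole runes.
		return string(runes[:maxRunes])
	}
	return string(runes[:maxRunes-len(suffixRunes)]) + suffix
}

// bothValid reports whether each truncation strategy yields valid UTF-8
// for the same input and limit.
func bothValid(s string, n int) (bytesOK, runesOK bool) {
	return utf8.ValidString(truncateBytes(s, n)),
		utf8.ValidString(truncateRunes(s, n, "..."))
}
```

Each character in a string such as "日本語のテキスト" occupies three bytes, so a byte limit of 4 cuts the second character after its first byte, while the rune-based version always cuts on a character boundary.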

Closes #419

Test plan

  • New tests verify CJK, emoji, and ASCII truncation all produce valid UTF-8
  • Existing TestTruncateDescription and TestFormatIncrementalMessage tests pass unchanged
  • mise run fmt && mise run lint — no new issues
  • go test -race ./cmd/entire/cli/strategy/ — all tests pass

🤖 Generated with Claude Code

Byte-based string slicing in generateContextFromPrompts and
TruncateDescription split multi-byte UTF-8 sequences at arbitrary
byte boundaries, producing mojibake in context.md and commit messages.

Replace byte slicing with rune-based truncation using the existing
stringutil.TruncateRunes helper throughout.

Closes #419

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 3d5fd4d46a16
@georg georg requested a review from a team as a code owner February 20, 2026 11:43
Copilot AI review requested due to automatic review settings February 20, 2026 11:43
Copilot AI left a comment

Pull request overview

Fixes UTF-8 corruption (mojibake) caused by byte-based truncation when generating human-readable prompt/context summaries, by switching truncation to rune-safe logic and adding regression tests (CJK/emoji/ASCII). This aligns with the CLI’s goal of producing readable, searchable session metadata without mangling user content.

Changes:

  • Replace byte slicing with rune-based truncation in generateContextFromPrompts.
  • Update TruncateDescription to truncate by runes (UTF-8 safe) via stringutil.TruncateRunes.
  • Add tests covering CJK, emoji, ASCII truncation and UTF-8 validity.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cmd/entire/cli/strategy/messages.go | Switch description truncation to rune-safe truncation helper to prevent UTF-8 splitting. |
| cmd/entire/cli/strategy/manual_commit_condensation.go | Use rune-based truncation for prompt rendering in generated context.md output. |
| cmd/entire/cli/strategy/manual_commit_condensation_test.go | Add regression tests ensuring truncation remains valid UTF-8 for CJK/emoji/ASCII. |

Comment on lines +18 to 22
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	if maxLen < 3 {
Copilot AI Feb 20, 2026

TruncateDescription converts s to []rune up front, but the truncated-path then calls stringutil.TruncateRunes, which converts to []rune again. This adds extra allocations and work on every call (including the non-truncation case). Consider delegating to stringutil.TruncateRunes directly (e.g., use an empty suffix when maxLen < 3) and/or using utf8.RuneCountInString to avoid allocating unless truncation is required.
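
One way the suggestion above could look, as a sketch rather than the PR's actual code: `truncateDescription` delegates entirely to a helper and passes an empty suffix when `maxLen < 3`, and the helper counts runes with `utf8.RuneCountInString` so the no-truncation path allocates nothing. Both function names are hypothetical, and the helper's semantics (suffix counted toward the limit) are an assumption about `stringutil.TruncateRunes`, not something this review confirms.

```go
package main

import "unicode/utf8"

// truncateRunes is a hypothetical, allocation-light stand-in for
// stringutil.TruncateRunes. Assumption: the result holds at most
// maxRunes runes, suffix included.
func truncateRunes(s string, maxRunes int, suffix string) string {
	if maxRunes <= 0 {
		return ""
	}
	// Fast path: count runes without building a []rune.
	if utf8.RuneCountInString(s) <= maxRunes {
		return s
	}
	keep := maxRunes - utf8.RuneCountInString(suffix)
	if keep < 0 {
		// The suffix alone exceeds the limit, so drop it.
		keep, suffix = maxRunes, ""
	}
	seen := 0
	for i := range s { // i is the byte offset where each rune starts
		if seen == keep {
			return s[:i] + suffix
		}
		seen++
	}
	return s
}

// truncateDescription sketches the reviewer's idea: choose the suffix,
// then delegate, so the string is never converted to []rune twice.
func truncateDescription(s string, maxLen int) string {
	suffix := "..."
	if maxLen < 3 {
		suffix = "" // too short to fit an ellipsis
	}
	return truncateRunes(s, maxLen, suffix)
}
```

Because a `for range` loop over a string yields the starting byte offset of each rune, the cut point `s[:i]` always falls on a character boundary, and only the final concatenation allocates.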

Comment on lines +683 to +685
// Truncate very long prompts for readability.
// Use rune-based truncation to avoid splitting multi-byte UTF-8 characters (e.g. CJK).
displayPrompt := stringutil.TruncateRunes(prompt, 500, "...")
Copilot AI Feb 20, 2026

stringutil.TruncateRunes always allocates a full []rune for prompt, even when it doesn't actually need truncation. If prompts can be large, consider first checking utf8.RuneCountInString(prompt) > 500 (or similar) and only calling TruncateRunes for long prompts to avoid unnecessary allocations.
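
If prompts can be very large, a variation on this idea is to stop scanning as soon as the limit is exceeded, instead of counting every rune first. The sketch below is an assumption-laden illustration, not the PR's code: `truncateForDisplay` is a hypothetical helper, and it assumes the suffix counts toward the rune limit.

```go
package main

import "unicode/utf8"

// truncateForDisplay is a hypothetical helper that truncates prompt to
// at most maxRunes runes (suffix included) in a single bounded pass.
// It reads at most maxRunes+1 runes, however long the prompt is.
func truncateForDisplay(prompt string, maxRunes int, suffix string) string {
	if maxRunes <= 0 {
		return ""
	}
	keep := maxRunes - utf8.RuneCountInString(suffix)
	if keep < 0 {
		// The suffix alone exceeds the limit, so drop it.
		keep, suffix = maxRunes, ""
	}
	cut := 0 // byte offset at which the kept prefix ends
	n := 0   // number of runes seen before the current one
	for i := range prompt {
		if n == keep {
			cut = i
		}
		if n == maxRunes {
			// A rune beyond the limit exists, so truncation is needed.
			return prompt[:cut] + suffix
		}
		n++
	}
	// The loop ended without exceeding the limit: return unchanged.
	return prompt
}
```

A call such as `truncateForDisplay(prompt, 500, "...")` would mirror the call site quoted above; whether the extra complexity is worthwhile depends on how large prompts get in practice.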

Development

Successfully merging this pull request may close these issues.

context.md contains garbled text (mojibake) when prompts include CJK or other multi-byte characters

2 participants
