Pull Request Overview
This PR enhances the isBlankChar method in the Utils class to detect a broader range of format control characters and adds comprehensive test coverage for the method.
Key Changes:
- Generalizes blank-character detection from a single hardcoded character (U+202A) to all FORMAT-category characters, using `Character.getType(c) == Character.FORMAT`
- Adds a new test file, `UtilsTest.java`, with test cases covering various whitespace and format control characters, including Unicode characters commonly found in Word documents
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| common/src/main/java/io/mybatis/common/util/Utils.java | Replaces hardcoded check for U+202A with a generalized check for all FORMAT characters to improve detection of format control characters |
| common/src/test/java/io/mybatis/common/util/UtilsTest.java | Adds new test class with comprehensive test coverage for isBlankChar method, including common whitespace characters and special Unicode format characters |
| @@ -0,0 +1,36 @@ | |||
| package io.mybatis.common.util; | |||
This file is missing the Apache License header that is present in other test files in this project (e.g., I18nTest.java). For consistency, please add the standard license header at the beginning of this file.
| || Character.isSpaceChar(c) | ||
| || c == '\ufeff' | ||
| - || c == '\u202a'; | ||
| + || Character.getType(c) == Character.FORMAT; |
The change from checking a specific character (c == '\u202a') to checking all FORMAT characters (Character.getType(c) == Character.FORMAT) significantly broadens the scope. The Character.FORMAT category includes directional formatting marks (U+202A-U+202E, U+061C) and other format control characters that control text rendering but don't represent whitespace or blank space.
According to the Javadoc comment above this method (lines 83-84), blank characters should include "空格、制表符、全角空格和不间断空格" (space, tab, full-width space, and non-breaking space). Directional formatting marks like U+202A (LEFT-TO-RIGHT EMBEDDING) don't semantically represent blank space - they control text direction.
Consider whether all FORMAT characters should truly be treated as "blank," or if only specific zero-width space characters (like U+200B) should be added to the existing check. If the intent is to detect all invisible characters, the documentation should be updated to reflect this broader definition.
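To make the scope question concrete, here is a minimal sketch that prints how a few code points are classified by the three JDK predicates involved. The class name `FormatCategoryDemo` and the helper `isZeroWidthBlank` are hypothetical and not part of the PR. The narrower helper only illustrates the alternative suggested above. Results depend on the Unicode version the running JDK supports, so a recently added character such as U+061C may be reported as unassigned on older JDKs.

```java
public class FormatCategoryDemo {

    /** True when the code point is in Unicode general category Cf (FORMAT). */
    public static boolean isFormat(int c) {
        return Character.getType(c) == Character.FORMAT;
    }

    /**
     * Hypothetical narrower alternative: treat only explicitly listed
     * zero-width characters as blank instead of the whole FORMAT category.
     */
    public static boolean isZeroWidthBlank(int c) {
        return c == 0x200B   // ZERO WIDTH SPACE
            || c == 0xFEFF;  // ZERO WIDTH NO-BREAK SPACE (BOM)
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0020, // SPACE
            0x00A0, // NO-BREAK SPACE
            0x3000, // IDEOGRAPHIC (full-width) SPACE
            0x200B, // ZERO WIDTH SPACE
            0x202A, // LEFT-TO-RIGHT EMBEDDING
            0x00AD, // SOFT HYPHEN, also in the FORMAT category
            0xFEFF  // ZERO WIDTH NO-BREAK SPACE (BOM)
        };
        for (int c : samples) {
            System.out.printf("U+%04X whitespace=%b spaceChar=%b format=%b%n",
                c,
                Character.isWhitespace(c),
                Character.isSpaceChar(c),
                isFormat(c));
        }
    }
}
```

The soft hyphen (U+00AD) is included as an example of a FORMAT character that the generalized check would also treat as blank, which may or may not be intended.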
| @@ -93,7 +93,7 @@ public static boolean isBlankChar(int c) { | |||
| return Character.isWhitespace(c) | |||
| || Character.isSpaceChar(c) | |||
| || c == '\ufeff' | |||
The explicit check for '\ufeff' on line 95 is now redundant since U+FEFF (Zero Width No-Break Space / BOM) is a FORMAT character and will be matched by Character.getType(c) == Character.FORMAT on line 96. Consider removing the explicit check on line 95 to avoid duplication.
| || c == '\ufeff' |
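One way to check the redundancy claim is to compare the predicate with and without the explicit U+FEFF test over every code point. This is a sketch only: the class `RedundantBomCheck` and its method names are hypothetical, and both bodies are reconstructed from the diff fragment shown in this review rather than copied from `Utils.java`.

```java
public class RedundantBomCheck {

    /** Predicate as shown in the diff, with the explicit U+FEFF check. */
    public static boolean withExplicitBom(int c) {
        return Character.isWhitespace(c)
            || Character.isSpaceChar(c)
            || c == '\ufeff'
            || Character.getType(c) == Character.FORMAT;
    }

    /** Same predicate with the explicit U+FEFF check removed. */
    public static boolean withoutExplicitBom(int c) {
        return Character.isWhitespace(c)
            || Character.isSpaceChar(c)
            || Character.getType(c) == Character.FORMAT;
    }

    /** Counts code points on which the two predicates disagree. */
    public static int countDisagreements() {
        int disagreements = 0;
        for (int c = Character.MIN_CODE_POINT; c <= Character.MAX_CODE_POINT; c++) {
            if (withExplicitBom(c) != withoutExplicitBom(c)) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements());
    }
}
```

If the count is zero, dropping the explicit check does not change behavior. Keeping it would then be purely a readability choice.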
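Strings from external data sources (such as Word documents) occasionally contain special characters such as \u200B, which should be treated as blank characters. This PR therefore refines the logic of the isBlankChar method in Utils. See the source code and the unit test cases for details.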
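The scenario described here can be sketched as follows. The `isBlankChar` body is reconstructed from the diff in this PR. The class `BlankStringSketch`, the helper `isBlank`, and the sample strings are assumptions made for illustration, not code from the project or its `UtilsTest.java`.

```java
public class BlankStringSketch {

    /** Reconstructed from the PR diff: whitespace, space separators, BOM, FORMAT. */
    public static boolean isBlankChar(int c) {
        return Character.isWhitespace(c)
            || Character.isSpaceChar(c)
            || c == '\ufeff'
            || Character.getType(c) == Character.FORMAT;
    }

    /** Hypothetical helper: true when every code point in s is blank. */
    public static boolean isBlank(CharSequence s) {
        return s.codePoints().allMatch(BlankStringSketch::isBlankChar);
    }

    public static void main(String[] args) {
        // A value that looks empty but carries invisible characters,
        // as might be pasted from a Word document.
        String pasted = "\u200B \u00A0\t\u3000";
        System.out.println("pasted is blank: " + isBlank(pasted));

        // Visible text with a trailing zero-width space is not blank.
        String text = "a\u200B";
        System.out.println("text is blank: " + isBlank(text));
    }
}
```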