-
Notifications
You must be signed in to change notification settings - Fork 0
OCR Support #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCR Support #4
Conversation
Summary of ChangesHello @Lortunate, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request integrates Optical Character Recognition (OCR) functionality into the project by adding a new 'ocr' plugin. The plugin leverages ONNX Runtime for efficient and accurate text detection and recognition from images. It includes robust model management for downloading and caching necessary ONNX models and character dictionaries, ensuring a streamlined setup for users. The core logic encompasses image preprocessing, inference with specialized detection and recognition models, and post-processing to extract recognized text along with their bounding box locations and confidence scores. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces OCR support as a new plugin. While the implementation is generally well-structured, a security audit identified two significant vulnerabilities: a high-severity Server-Side Request Forgery (SSRF) vulnerability due to improper validation of user-provided URLs for model downloads, and a medium-severity Time-of-check Time-of-use (TOCTOU) vulnerability in the model downloading logic that could allow local attackers to overwrite arbitrary files. Additionally, there are suggestions for improving API design, performance, and robustness.
- Fix crash by removing static Clipboard instance in src/core/io/clipboard.rs - Centralize clipboard logic and remove direct arboard dependency in bridge - Apply cargo fmt and clippy fixes (collapsible ifs, casts, string stripping) - Refactor hotkey registration return type for clarity - Remove all comments as requested
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces Optical Character Recognition (OCR) support, which is a significant new feature. The implementation is well-structured, leveraging Rust's concurrency features with tokio and rayon for efficient model downloading and text processing. The CXX-Qt bridge seamlessly integrates the OCR functionality into the QML frontend, providing a user-friendly interface for enabling OCR, managing models, and interacting with recognized text. The addition of an OCR settings page and toolbar button enhances usability. Overall, this is a robust and well-thought-out addition to the project.
| }) | ||
| .collect(); | ||
|
|
||
| let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The div_ceil method is currently a nightly-only experimental API. Using it might prevent the code from compiling on stable Rust. Consider using an alternative implementation for ceiling division, such as (a + b - 1) / b for positive integers a and b.
| let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32; | |
| let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0); | |
| let max_width = (max_width + 31) / 32 * 32; | |
| let max_width = max_width.max(1); |
| pub fn copy_image_to_clipboard(image: &RgbaImage) -> bool { | ||
| info!("Copying image ({}x{}) to clipboard...", image.width(), image.height()); | ||
|
|
||
| let mut guard = CLIPBOARD.lock().unwrap(); | ||
|
|
||
| if guard.is_none() { | ||
| match Clipboard::new() { | ||
| Ok(c) => *guard = Some(c), | ||
| Err(e) => { | ||
| error!("Failed to initialize clipboard: {}", e); | ||
| return false; | ||
| } | ||
| let mut clipboard = match Clipboard::new() { | ||
| Ok(c) => c, | ||
| Err(e) => { | ||
| error!("Failed to initialize clipboard: {}", e); | ||
| return false; | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous implementation used a LazyLock<Mutex<Option<Clipboard>>> to manage the clipboard instance, which could potentially avoid repeated initialization overhead. Creating a new Clipboard instance for every copy operation might introduce a slight performance penalty, especially if these functions are called frequently. Was the change to remove the LazyLock intentional to address specific issues, or was it an oversight?
| let mut min_x = i32::MAX; | ||
| let mut min_y = i32::MAX; | ||
| let mut max_x = i32::MIN; | ||
| let mut max_y = i32::MIN; | ||
| for (x, y) in pts { | ||
| if *x < min_x { | ||
| min_x = *x; | ||
| } | ||
| if *x > max_x { | ||
| max_x = *x; | ||
| } | ||
| if *y < min_y { | ||
| min_y = *y; | ||
| } | ||
| if *y > max_y { | ||
| max_y = *y; | ||
| } | ||
| } | ||
| let x = min_x as f64; | ||
| let y = min_y as f64; | ||
| let w = (max_x - min_x) as f64; | ||
| let h = (max_y - min_y) as f64; | ||
|
|
||
| return OcrBlock { | ||
| text: res.text, | ||
| cx: (x + w / 2.0) / img_w, | ||
| cy: (y + h / 2.0) / img_h, | ||
| width: w / img_w, | ||
| height: h / img_h, | ||
| angle: 0.0, | ||
| percentage_coordinates: true, | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current fallback logic for OcrBlock creation when pts.len() != 4 assumes an axis-aligned bounding box by calculating min_x, min_y, max_x, max_y. This approach might discard rotational information if the OCR engine provides it in a different format (e.g., a rotated rectangle with 4 points, but not necessarily in a specific order that pts[0], pts[1] etc. would imply a horizontal line). If the OCR engine can return rotated boxes, it would be more accurate to handle those cases to preserve the original text orientation in the overlay.
| let angle_rad = dy.atan2(dx); | ||
| let angle_deg = angle_rad * 180.0 / PI; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The angle_rad calculation using atan2(dy, dx) correctly determines the angle of the line segment p0 to p1. However, for text recognition, the 'angle' typically refers to the baseline angle of the text. While p0 and p1 often define the top-left and top-right corners of a text box, this might not always accurately represent the text's baseline angle, especially for highly skewed text or if the points are not ordered consistently. Consider if this angle calculation is robust enough for all expected text orientations from the OCR engine.
No description provided.