OCR Support #4

Lortunate · 2026-01-17T12:39:19Z

No description provided.

gemini-code-assist · 2026-01-17T12:39:43Z

Summary of Changes

Hello @Lortunate, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Optical Character Recognition (OCR) functionality into the project by adding a new 'ocr' plugin. The plugin leverages ONNX Runtime for efficient and accurate text detection and recognition from images. It includes robust model management for downloading and caching necessary ONNX models and character dictionaries, ensuring a streamlined setup for users. The core logic encompasses image preprocessing, inference with specialized detection and recognition models, and post-processing to extract recognized text along with their bounding box locations and confidence scores.

Highlights

New OCR Plugin: Added a new 'ocr' plugin to the project workspace, introducing Optical Character Recognition capabilities.
Text Detection Implementation: Implemented text detection using an ONNX model, including image preprocessing and bounding box extraction.
Text Recognition Implementation: Developed text recognition functionality with an ONNX model, utilizing CTC Greedy Decode for converting model outputs to text.
Model Management: Introduced a 'ModelManager' to handle automatic downloading and caching of OCR models (detection, recognition, and character keys).
Example Usage: Provided an example demonstrating how to initialize the OCR context, download models, and perform text recognition on an image.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces OCR support as a new plugin. While the implementation is generally well-structured, a security audit identified two significant vulnerabilities: a high-severity Server-Side Request Forgery (SSRF) vulnerability due to improper validation of user-provided URLs for model downloads, and a medium-severity Time-of-check Time-of-use (TOCTOU) vulnerability in the model downloading logic that could allow local attackers to overwrite arbitrary files. Additionally, there are suggestions for improving API design, performance, and robustness.

plugins/ocr/src/lib.rs

plugins/ocr/src/utils.rs

plugins/ocr/src/model_manager.rs

plugins/ocr/Cargo.toml

plugins/ocr/examples/simple_ocr.rs

plugins/ocr/src/engine.rs

plugins/ocr/src/lib.rs

plugins/ocr/src/recognizer.rs

plugins/ocr/src/utils.rs

- Fix crash by removing static Clipboard instance in src/core/io/clipboard.rs - Centralize clipboard logic and remove direct arboard dependency in bridge - Apply cargo fmt and clippy fixes (collapsible ifs, casts, string stripping) - Refactor hotkey registration return type for clarity - Remove all comments as requested

Lortunate · 2026-01-27T07:49:42Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces Optical Character Recognition (OCR) support, which is a significant new feature. The implementation is well-structured, leveraging Rust's concurrency features with tokio and rayon for efficient model downloading and text processing. The CXX-Qt bridge seamlessly integrates the OCR functionality into the QML frontend, providing a user-friendly interface for enabling OCR, managing models, and interacting with recognized text. The addition of an OCR settings page and toolbar button enhances usability. Overall, this is a robust and well-thought-out addition to the project.

gemini-code-assist · 2026-01-27T07:55:52Z

plugins/ocr/src/utils.rs

+        })
+        .collect();
+
+    let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32;


The div_ceil method is currently a nightly-only experimental API. Using it might prevent the code from compiling on stable Rust. Consider using an alternative implementation for ceiling division, such as (a + b - 1) / b for positive integers a and b.

Suggested change

let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32;

let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0);

let max_width = (max_width + 31) / 32 * 32;

let max_width = max_width.max(1);

gemini-code-assist · 2026-01-27T07:55:52Z

src/core/io/clipboard.rs

 pub fn copy_image_to_clipboard(image: &RgbaImage) -> bool {
    info!("Copying image ({}x{}) to clipboard...", image.width(), image.height());

-    let mut guard = CLIPBOARD.lock().unwrap();
-
-    if guard.is_none() {
-        match Clipboard::new() {
-            Ok(c) => *guard = Some(c),
-            Err(e) => {
-                error!("Failed to initialize clipboard: {}", e);
-                return false;
-            }
+    let mut clipboard = match Clipboard::new() {
+        Ok(c) => c,
+        Err(e) => {
+            error!("Failed to initialize clipboard: {}", e);
+            return false;
        }
+    };


The previous implementation used a LazyLock<Mutex<Option<Clipboard>>> to manage the clipboard instance, which could potentially avoid repeated initialization overhead. Creating a new Clipboard instance for every copy operation might introduce a slight performance penalty, especially if these functions are called frequently. Was the change to remove the LazyLock intentional to address specific issues, or was it an oversight?

gemini-code-assist · 2026-01-27T07:55:52Z

src/bridge/ocr_overlay.rs

+                            let mut min_x = i32::MAX;
+                            let mut min_y = i32::MAX;
+                            let mut max_x = i32::MIN;
+                            let mut max_y = i32::MIN;
+                            for (x, y) in pts {
+                                if *x < min_x {
+                                    min_x = *x;
+                                }
+                                if *x > max_x {
+                                    max_x = *x;
+                                }
+                                if *y < min_y {
+                                    min_y = *y;
+                                }
+                                if *y > max_y {
+                                    max_y = *y;
+                                }
+                            }
+                            let x = min_x as f64;
+                            let y = min_y as f64;
+                            let w = (max_x - min_x) as f64;
+                            let h = (max_y - min_y) as f64;
+
+                            return OcrBlock {
+                                text: res.text,
+                                cx: (x + w / 2.0) / img_w,
+                                cy: (y + h / 2.0) / img_h,
+                                width: w / img_w,
+                                height: h / img_h,
+                                angle: 0.0,
+                                percentage_coordinates: true,
+                            };


The current fallback logic for OcrBlock creation when pts.len() != 4 assumes an axis-aligned bounding box by calculating min_x, min_y, max_x, max_y. This approach might discard rotational information if the OCR engine provides it in a different format (e.g., a rotated rectangle with 4 points, but not necessarily in a specific order that pts[0], pts[1] etc. would imply a horizontal line). If the OCR engine can return rotated boxes, it would be more accurate to handle those cases to preserve the original text orientation in the overlay.

gemini-code-assist · 2026-01-27T07:55:52Z

src/bridge/ocr_overlay.rs

+                        let angle_rad = dy.atan2(dx);
+                        let angle_deg = angle_rad * 180.0 / PI;


The angle_rad calculation using atan2(dy, dx) correctly determines the angle of the line segment p0 to p1. However, for text recognition, the 'angle' typically refers to the baseline angle of the text. While p0 and p1 often define the top-left and top-right corners of a text box, this might not always accurately represent the text's baseline angle, especially for highly skewed text or if the points are not ordered consistently. Consider if this angle calculation is robust enough for all expected text orientations from the OCR engine.

Add OCR plugin with model handling and recognition pipeline

0ee0c0f

gemini-code-assist bot reviewed Jan 17, 2026

View reviewed changes

Lortunate added 7 commits January 23, 2026 00:20

feat: implement OCR settings and model management

ec825fe

Merge branch 'master' into feat/ocr

c4bc938

feat: add OCR visualization module and enhance bounding box handling

116e6f1

feat: integrate OCR overlay with precise selection

6e09955

Merge remote-tracking branch 'origin/feat/ocr' into feat/ocr

082cb84

fix: optimize OCR overlay interaction

ca1aca6

gemini-code-assist bot reviewed Jan 27, 2026

View reviewed changes

Lortunate merged commit 43ed671 into master Jan 27, 2026
3 checks passed

Lortunate deleted the feat/ocr branch January 28, 2026 11:56

-    let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32;
+    let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0);
+    let max_width = (max_width + 31) / 32 * 32;
+    let max_width = max_width.max(1);

		let angle_rad = dy.atan2(dx);
		let angle_deg = angle_rad * 180.0 / PI;

OCR Support #4

OCR Support #4

Uh oh!

Conversation

Lortunate commented Jan 17, 2026

Uh oh!

gemini-code-assist bot commented Jan 17, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Lortunate commented Jan 27, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants