@Lortunate
Owner

No description provided.

@gemini-code-assist

Summary of Changes

Hello @Lortunate, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Optical Character Recognition (OCR) functionality into the project by adding a new 'ocr' plugin. The plugin uses ONNX Runtime for text detection and recognition on images. It includes model management that downloads and caches the required ONNX models and character dictionaries. The core logic covers image preprocessing, inference with separate detection and recognition models, and post-processing that extracts the recognized text along with its bounding box locations and confidence scores.

Highlights

  • New OCR Plugin: Added a new 'ocr' plugin to the project workspace, introducing Optical Character Recognition capabilities.
  • Text Detection Implementation: Implemented text detection using an ONNX model, including image preprocessing and bounding box extraction.
  • Text Recognition Implementation: Developed text recognition functionality with an ONNX model, utilizing CTC Greedy Decode for converting model outputs to text.
  • Model Management: Introduced a 'ModelManager' to handle automatic downloading and caching of OCR models (detection, recognition, and character keys).
  • Example Usage: Provided an example demonstrating how to initialize the OCR context, download models, and perform text recognition on an image.
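The CTC greedy decode step named in the highlights can be sketched as follows. This is a minimal illustration, not the plugin's code: the function name, the `[time_steps][num_classes]` layout, the choice of class 0 as the blank symbol, and the mean-of-kept-scores confidence are all assumptions made for the example.

```rust
/// Greedy CTC decoding sketch (hypothetical helper, not the PR's API).
/// For each time step take the highest-scoring class, collapse consecutive
/// repeats, and drop the blank symbol.
/// Assumptions: `scores` is laid out as [time_steps][num_classes], class 0 is
/// the blank, and `keys[i]` is the character for class `i + 1`.
pub fn ctc_greedy_decode(scores: &[Vec<f32>], keys: &[char]) -> (String, f32) {
    const BLANK: usize = 0;
    let mut text = String::new();
    let mut score_sum = 0.0f32;
    let mut kept = 0usize;
    let mut prev = BLANK;

    for step in scores {
        // Argmax over the classes of this time step; skip empty steps.
        let Some((best, &score)) = step
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
        else {
            continue;
        };

        // Emit a character only when the class is not blank and differs
        // from the class chosen at the previous step.
        if best != BLANK && best != prev {
            if let Some(&ch) = keys.get(best - 1) {
                text.push(ch);
                score_sum += score;
                kept += 1;
            }
        }
        prev = best;
    }

    // Confidence here is the mean score of the emitted characters (an assumption).
    let confidence = if kept > 0 { score_sum / kept as f32 } else { 0.0 };
    (text, confidence)
}
```

A blank between two identical classes keeps both characters, which is how a sequence such as "a, blank, a" decodes to "aa" rather than "a".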

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces OCR support as a new plugin. While the implementation is generally well-structured, a security audit identified two significant vulnerabilities: a high-severity Server-Side Request Forgery (SSRF) vulnerability due to improper validation of user-provided URLs for model downloads, and a medium-severity Time-of-check Time-of-use (TOCTOU) vulnerability in the model downloading logic that could allow local attackers to overwrite arbitrary files. Additionally, there are suggestions for improving API design, performance, and robustness.
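For the TOCTOU finding, one common mitigation is to let the operating system perform the existence check and the file creation in a single step. The sketch below assumes the download logic follows a "check whether the file exists, then write it" pattern, which the review does not show; `write_if_absent` is a hypothetical helper, and the SSRF finding is not addressed here.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `path` only if no file exists there yet (hypothetical helper).
/// `create_new(true)` asks the OS to fail if the path already exists, so there
/// is no window between a separate `exists()` check and the write.
/// Returns Ok(true) if the file was created, Ok(false) if it was already present.
pub fn write_if_absent(path: &Path, bytes: &[u8]) -> io::Result<bool> {
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            file.write_all(bytes)?;
            // Flush file contents to disk before reporting success.
            file.sync_all()?;
            Ok(true)
        }
        // Another process (or an earlier run) created the file first.
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```

A fuller design would probably download to a temporary file in the same directory and rename it into place, so that a partially written model is never visible under its final name.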

- Fix crash by removing static Clipboard instance in src/core/io/clipboard.rs
- Centralize clipboard logic and remove direct arboard dependency in bridge
- Apply cargo fmt and clippy fixes (collapsible ifs, casts, string stripping)
- Refactor hotkey registration return type for clarity
- Remove all comments as requested
@Lortunate
Owner Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces Optical Character Recognition (OCR) support, which is a significant new feature. The implementation is well-structured, leveraging Rust's concurrency features with tokio and rayon for efficient model downloading and text processing. The CXX-Qt bridge seamlessly integrates the OCR functionality into the QML frontend, providing a user-friendly interface for enabling OCR, managing models, and interacting with recognized text. The addition of an OCR settings page and toolbar button enhances usability. Overall, this is a robust and well-thought-out addition to the project.

})
.collect();

let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32;

critical

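The div_ceil method is stable only for unsigned integer types (since Rust 1.73); for signed integers it is still a nightly-only experimental API. Depending on the width type and the project's minimum supported Rust version, this line might not compile on stable Rust. Consider using an alternative implementation for ceiling division, such as (a + b - 1) / b for positive integers a and b.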

Suggested change
let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0).div_ceil(32).max(1) * 32;
let max_width = resized_images.iter().map(|(_, w)| *w).max().unwrap_or(0);
// Clamp the block count, not the final width, so an empty batch still yields 32 as in the original line.
let max_width = ((max_width + 31) / 32).max(1) * 32;
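The rounding in this suggestion can be isolated into a small helper that is easy to check by hand. This is a sketch under assumptions: `round_up_to_multiple` is a hypothetical name, and the width is assumed to be an unsigned integer.

```rust
/// Round `width` up to the next multiple of `align`, never returning less
/// than `align` (hypothetical helper; the PR computes this inline).
/// Uses the `(a + b - 1) / b` form of ceiling division, which works on any
/// stable toolchain. Note that `width + align - 1` can overflow for widths
/// close to `u32::MAX`.
pub fn round_up_to_multiple(width: u32, align: u32) -> u32 {
    assert!(align > 0, "alignment must be non-zero");
    let blocks = (width + align - 1) / align;
    // Clamp the block count so a zero width still produces one full block.
    blocks.max(1) * align
}
```

With `align = 32`, widths 0, 1, and 32 all map to 32, and 33 maps to 64. Applying `.max(1)` after the multiplication instead would turn a zero width into 1, which is not a multiple of 32.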

Comment on lines 6 to +15
pub fn copy_image_to_clipboard(image: &RgbaImage) -> bool {
info!("Copying image ({}x{}) to clipboard...", image.width(), image.height());

let mut guard = CLIPBOARD.lock().unwrap();

if guard.is_none() {
match Clipboard::new() {
Ok(c) => *guard = Some(c),
Err(e) => {
error!("Failed to initialize clipboard: {}", e);
return false;
}
let mut clipboard = match Clipboard::new() {
Ok(c) => c,
Err(e) => {
error!("Failed to initialize clipboard: {}", e);
return false;
}
};

medium

The previous implementation used a LazyLock<Mutex<Option<Clipboard>>> to manage the clipboard instance, which could potentially avoid repeated initialization overhead. Creating a new Clipboard instance for every copy operation might introduce a slight performance penalty, especially if these functions are called frequently. Was the change to remove the LazyLock intentional to address specific issues, or was it an oversight?
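The trade-off raised in this comment can be illustrated with a stand-in type. `FakeClipboard`, `INIT_COUNT`, `copy_cached`, and `copy_fresh` are all hypothetical names; the real clipboard type comes from the arboard crate and is not used here. The sketch uses a plain `static Mutex`, which is possible because `Mutex::new` is const, whereas the comment describes the earlier code as wrapping it in `LazyLock`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Counts how often the stand-in clipboard is constructed.
pub static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a real clipboard handle (assumption for illustration only).
pub struct FakeClipboard;

impl FakeClipboard {
    pub fn new() -> Result<Self, String> {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        Ok(FakeClipboard)
    }

    pub fn set_text(&mut self, _text: &str) -> Result<(), String> {
        Ok(())
    }
}

/// Cached variant: one process-wide handle, created on first use.
static CLIPBOARD: Mutex<Option<FakeClipboard>> = Mutex::new(None);

pub fn copy_cached(text: &str) -> bool {
    // Recover the guard even if a previous holder panicked.
    let mut guard = match CLIPBOARD.lock() {
        Ok(g) => g,
        Err(poisoned) => poisoned.into_inner(),
    };
    if guard.is_none() {
        match FakeClipboard::new() {
            Ok(c) => *guard = Some(c),
            Err(_) => return false,
        }
    }
    match &mut *guard {
        Some(c) => c.set_text(text).is_ok(),
        None => false,
    }
}

/// Per-call variant: a new handle for every copy, dropped when the call returns.
pub fn copy_fresh(text: &str) -> bool {
    match FakeClipboard::new() {
        Ok(mut c) => c.set_text(text).is_ok(),
        Err(_) => false,
    }
}
```

The cached variant pays the construction cost once but keeps a handle alive in a static for the life of the process, which the commit message identifies as the source of a crash. The per-call variant avoids that at the price of one construction per copy.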

Comment on lines +101 to +132
let mut min_x = i32::MAX;
let mut min_y = i32::MAX;
let mut max_x = i32::MIN;
let mut max_y = i32::MIN;
for (x, y) in pts {
    if *x < min_x {
        min_x = *x;
    }
    if *x > max_x {
        max_x = *x;
    }
    if *y < min_y {
        min_y = *y;
    }
    if *y > max_y {
        max_y = *y;
    }
}
let x = min_x as f64;
let y = min_y as f64;
let w = (max_x - min_x) as f64;
let h = (max_y - min_y) as f64;

return OcrBlock {
    text: res.text,
    cx: (x + w / 2.0) / img_w,
    cy: (y + h / 2.0) / img_h,
    width: w / img_w,
    height: h / img_h,
    angle: 0.0,
    percentage_coordinates: true,
};

medium

The current fallback logic for OcrBlock creation when pts.len() != 4 assumes an axis-aligned bounding box by calculating min_x, min_y, max_x, and max_y. This approach discards rotational information if the OCR engine provides it in a different format (e.g., a rotated rectangle whose points are not ordered so that pts[0] and pts[1] form the top edge). If the OCR engine can return rotated boxes, it would be more accurate to handle those cases so that the overlay preserves the original text orientation.
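The axis-aligned fallback described here can be written more compactly with iterators, and doing so makes the empty-input case explicit. In the loop above, an empty `pts` leaves the sentinels untouched, so `max_x - min_x` computes `i32::MIN - i32::MAX`, which overflows. The sketch below is an illustration only: `NormalizedBox` and `axis_aligned_box` are hypothetical names, not the PR's `OcrBlock` API.

```rust
/// Axis-aligned box in coordinates normalized to the image size
/// (hypothetical stand-in for the geometric fields of `OcrBlock`).
#[derive(Debug, Clone, PartialEq)]
pub struct NormalizedBox {
    pub cx: f64,
    pub cy: f64,
    pub width: f64,
    pub height: f64,
}

/// Compute the axis-aligned bounding box of `pts`, normalized by the image size.
/// Returns `None` for an empty point list or a non-positive image size.
pub fn axis_aligned_box(pts: &[(i32, i32)], img_w: f64, img_h: f64) -> Option<NormalizedBox> {
    if img_w <= 0.0 || img_h <= 0.0 {
        return None;
    }
    // `min()` / `max()` return `None` on an empty iterator, so `?` covers that case.
    let min_x = f64::from(pts.iter().map(|p| p.0).min()?);
    let max_x = f64::from(pts.iter().map(|p| p.0).max()?);
    let min_y = f64::from(pts.iter().map(|p| p.1).min()?);
    let max_y = f64::from(pts.iter().map(|p| p.1).max()?);

    // Subtract after converting to f64 so extreme coordinates cannot overflow.
    let w = max_x - min_x;
    let h = max_y - min_y;
    Some(NormalizedBox {
        cx: (min_x + w / 2.0) / img_w,
        cy: (min_y + h / 2.0) / img_h,
        width: w / img_w,
        height: h / img_h,
    })
}
```

Like the original fallback, this always reports an angle-free box, so the reviewer's concern about lost rotation applies to it as well.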

Comment on lines +153 to +154
let angle_rad = dy.atan2(dx);
let angle_deg = angle_rad * 180.0 / PI;

medium

The angle_rad calculation using atan2(dy, dx) correctly determines the angle of the line segment p0 to p1. However, for text recognition, the 'angle' typically refers to the baseline angle of the text. While p0 and p1 often define the top-left and top-right corners of a text box, this might not always accurately represent the text's baseline angle, especially for highly skewed text or if the points are not ordered consistently. Consider if this angle calculation is robust enough for all expected text orientations from the OCR engine.
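One way to make the angle less sensitive to a single noisy edge is to combine the top and bottom edges of the quadrilateral. The sketch below is an assumption-laden illustration: it presumes the four points arrive in the order top-left, top-right, bottom-right, bottom-left, which the review does not confirm, and both function names are hypothetical.

```rust
/// Angle in degrees of the segment from `p0` to `p1`, as in the reviewed code.
/// In image coordinates (y grows downward) a positive angle is clockwise on screen.
pub fn edge_angle_deg(p0: (f64, f64), p1: (f64, f64)) -> f64 {
    let dx = p1.0 - p0.0;
    let dy = p1.1 - p0.1;
    dy.atan2(dx).to_degrees()
}

/// Angle of the box estimated from both long edges (hypothetical alternative).
/// Assumes point order: top-left, top-right, bottom-right, bottom-left.
pub fn box_angle_deg(pts: &[(f64, f64); 4]) -> f64 {
    // Direction of the top edge (p0 -> p1) and of the bottom edge (p3 -> p2).
    let top = (pts[1].0 - pts[0].0, pts[1].1 - pts[0].1);
    let bottom = (pts[2].0 - pts[3].0, pts[2].1 - pts[3].1);
    // Summing the two vectors averages their directions before taking the angle.
    (top.1 + bottom.1).atan2(top.0 + bottom.0).to_degrees()
}
```

If the engine does not guarantee a point order, neither function is reliable on its own, and the points would need to be sorted into a known order first.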

@Lortunate Lortunate merged commit 43ed671 into master Jan 27, 2026
3 checks passed
@Lortunate Lortunate deleted the feat/ocr branch January 28, 2026 11:56


2 participants