

add polars support. #5

Open
aryehlev wants to merge 20 commits into main from polars

Conversation

@aryehlev
Owner

@aryehlev aryehlev commented Nov 17, 2025

Summary by CodeRabbit

  • New Features

    • Polars DataFrame support: predict directly from DataFrames.
    • Cross-platform CI: automated workflows testing multiple OSes, architectures, and XGBoost versions.
  • Documentation

    • Updated Windows guidance: automatic downloads not supported; added build/use alternatives and removed Windows from cross-platform list.
  • Chores

    • Polars added as an optional dependency/feature.
    • Build and packaging improvements: better platform detection, checksum validation, caching and extraction diagnostics.

@coderabbitai

coderabbitai bot commented Nov 17, 2025

Walkthrough

Adds a GitHub Actions CI workflow, an optional Polars integration (DataFrame prediction), large build.rs refactor for platform/URL/filename handling, README Windows guidance updates, minor formatting/linting tweaks, and new/modified example and core files.

Changes

Cohort / File(s) Summary
CI / Workflow
.github/workflows/ci.yml
New GitHub Actions workflow "CI" with matrix testing across macOS/Linux, architectures, and XGBoost versions; jobs for test, clippy, fmt, checksum validation, and caching tests; platform-specific install and verification steps.
Polars feature
Cargo.toml, src/lib.rs, src/polars_ext.rs
Adds optional polars dependency and feature; new polars_ext module exposing BoosterPolarsExt with predict_dataframe and predict_dataframe_with_columns, plus dataframe_to_dense helper converting DataFrame to row-major Vec<f32>.
Build system refactor
build.rs
Major refactor: introduces OS/arch helper functions, get_lib_filename, revised wheel filename logic and checksum handling, improved HTTP status checks and extraction/debug diagnostics, emits cfgs via cargo, and replaces string-based platform checks with #[cfg(...)] guarded code.
Documentation
README.md
Removes Windows from cross-platform badge/listing and adds Windows-specific guidance: no automatic download, missing MSVC import libs in wheels, recommend building from source or using WSL/MinGW.
Examples
examples/advanced_usage.rs, examples/basic_usage.rs
Formatting and minor code-flow tweaks: inlined vectors and reflowed code; advanced_usage maps file IO errors to descriptive XGBoostError; basic_usage comment alignment only.
Core / Linting / Small fixes
src/lib.rs, src/sys.rs, src/error.rs, src/model.rs
lib.rs exposes BoosterPolarsExt under polars feature; src/sys.rs adds #![allow(dead_code)]; src/error.rs deduplicates crate::sys import; src/model.rs minor formatting/refactorings without behaviour changes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant BoosterPolarsExt
    participant dataframe_to_dense as df_to_dense
    participant ColumnExtract
    participant Booster_predict as Booster.predict

    User->>BoosterPolarsExt: predict_dataframe(df, option_mask, training)
    BoosterPolarsExt->>df_to_dense: validate & convert df
    df_to_dense->>ColumnExtract: for each column value -> cast to Float32
    ColumnExtract-->>df_to_dense: f32 value (or error)
    df_to_dense-->>BoosterPolarsExt: (Vec<f32>, rows, cols)
    BoosterPolarsExt->>Booster_predict: predict(dense, rows, cols, option_mask, training)
    Booster_predict-->>BoosterPolarsExt: Vec<f32> predictions
    BoosterPolarsExt-->>User: XGBoostResult<Vec<f32>>

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review focus:
    • build.rs — platform detection, filename/wheel/tag logic, checksum and extraction error paths, and cfg emission.
    • src/polars_ext.rs — correctness of DataFrame->dense conversion, casting, null handling, error messages.
    • CI workflow — matrix, caching steps, and platform-specific install commands.

Possibly related PRs

Poem

🐰
I hopped through CI where matrices play,
Polars now helps DataFrames sway,
Build scripts tidy, checksums in hand,
Windows waits while mac/linux stand—
A carrot for tests, then off I stray.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: Passed. The title 'add polars support' directly and accurately captures the main change: introducing Polars integration with a new optional feature, module, and extension trait for DataFrame predictions.
  • Docstring Coverage: Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (9)
src/sys.rs (1)

4-4: dead_code allowance is reasonable for generated FFI

Allowing dead_code for this bindings module makes sense given bindgen tends to generate many unused items. If you ever add hand-written logic here, consider narrowing the scope (e.g., around the include!) so accidental unused code elsewhere still gets caught.

.github/workflows/ci.yml (2)

35-40: Double-check macos-15-intel runner availability

The Intel macOS job uses runs-on: macos-15-intel. Runner labels evolve over time, so it’s worth confirming that this label exists and is available for your repository; otherwise the job will fail to schedule.

If needed, switch to a documented Intel macOS label (or self-hosted runner) instead.

Please confirm against GitHub Actions’ latest runner matrix that macos-15-intel is a valid label, and adjust if necessary.


151-157: Thread-safety message check only asserts presence, not semantics

The “Verify thread safety detection” step greps for the substring "thread-safe", which matches both the positive and negative messages emitted in build.rs. This confirms that some thread-safety message was printed but doesn’t distinguish “supports thread-safe predictions” from “does NOT support thread-safe predictions”.

If you ever want the CI to enforce the boundary behavior more strictly (e.g., success only when the positive message is present for ≥1.4 and absent below), consider tightening the grep patterns or adding version-conditional expectations.
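If stricter enforcement is wanted, a tightened step might look like the sketch below; the message strings and the build-output.log path are assumptions, not the exact output build.rs emits:

```yaml
# Hypothetical stricter step: fail unless the positive message is present
# and the negative one is absent (log path and strings are assumed).
- name: Verify thread safety detection
  run: |
    grep -q "supports thread-safe predictions" build-output.log
    ! grep -q "does NOT support thread-safe" build-output.log
```

For full coverage this would also need a version-conditional branch expecting the negative message for XGBoost < 1.4.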

build.rs (4)

108-128: OS/arch detection in build script is host-based and may surprise cross-compiles

get_arch_from_target() / get_os_name() rely on #[cfg(target_arch = ...)] / #[cfg(target_os = ...)] inside the build script. For Cargo, build scripts are compiled for and run on the host by default, so these cfgs typically reflect the host, not a custom --target triple. That means the wheel and library name selection is host-driven, and cross-compiling to a different target won’t be honored here.

Given the current design (downloading wheels for the local platform) that’s probably acceptable, but if you ever want to support cross-compilation it would be safer to derive OS/arch from CARGO_CFG_TARGET_OS / CARGO_CFG_TARGET_ARCH (or fail fast with a clear error when they’re unsupported) instead of compile-time cfg alone.

Please confirm in the Rust/Cargo docs how cfg(target_arch) behaves for build scripts vs the TARGET triple and adjust if cross-compilation needs to be supported.

Also applies to: 130-145
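As a sketch of the target-driven alternative (not the current build.rs code), the build script could read Cargo's target environment variables instead of compile-time cfg; the supported-platform list below is illustrative only:

```rust
use std::env;

// Cargo sets CARGO_CFG_TARGET_OS / CARGO_CFG_TARGET_ARCH for the
// build-script process, so they follow the --target triple rather than
// the host the build script happens to run on.
fn target_os_arch() -> Result<(String, String), String> {
    let os = env::var("CARGO_CFG_TARGET_OS")
        .map_err(|_| "CARGO_CFG_TARGET_OS not set (not invoked by Cargo?)".to_string())?;
    let arch = env::var("CARGO_CFG_TARGET_ARCH")
        .map_err(|_| "CARGO_CFG_TARGET_ARCH not set".to_string())?;
    match (os.as_str(), arch.as_str()) {
        // Illustrative allow-list; extend as platforms gain wheel support.
        ("linux" | "macos", "x86_64" | "aarch64") => Ok((os, arch)),
        _ => Err(format!("unsupported target: {}/{}", os, arch)),
    }
}

fn main() {
    // Outside Cargo these variables are unset; simulate one target here.
    env::set_var("CARGO_CFG_TARGET_OS", "linux");
    env::set_var("CARGO_CFG_TARGET_ARCH", "aarch64");
    println!("{:?}", target_os_arch());
}
```

Failing fast with a clear error for unsupported triples keeps the wheel-download path honest even before cross-compilation is officially supported.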


236-278: Wheel download URL construction is brittle; consider deriving URLs from PyPI metadata

download_with_retry itself is fine, but the URL you pass into it is built as:

let download_url = format!(
    "https://files.pythonhosted.org/packages/py3/x/xgboost/{}",
    wheel_filename
);

PyPI’s direct wheel URLs for many packages (including xgboost-cpu and other common libs) generally live under hashed packages/<hash>/.../name-version-...whl paths rather than a stable packages/py3/x/<name>/... layout. Relying on a hard-coded subpath risks 404s if that alias doesn’t exist or ever changes, even if the wheel_filename itself is correct.

Since you already hard-code the exact wheel filenames per OS/arch/version, a more robust approach would be to:

  • Query https://pypi.org/pypi/xgboost/json (or xgboost-cpu if you prefer) and pick the matching file’s url, or
  • Delegate to pip download in a temp venv and read the wheel from disk.

That way, changes in PyPI’s internal storage paths won’t break your build script as long as the wheel name remains stable.

Please verify that the current download_url actually resolves for all versions you exercise in CI (1.4.2, 1.7.6, 2.1.4, 3.0.5, 3.1.1) and consider switching to a metadata-driven URL if not.

Also applies to: 362-384
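A std-only sketch of the selection step in the metadata-driven approach: the (filename, url) pairs below stand in for the "urls" array returned by the PyPI JSON API, with the HTTP fetch and JSON parsing omitted and the sample URL purely illustrative:

```rust
// Given file entries from the release metadata, pick the download URL
// whose filename matches the wheel we computed for this platform.
fn find_wheel_url<'a>(files: &'a [(String, String)], wheel_filename: &str) -> Option<&'a str> {
    files
        .iter()
        .find(|(name, _)| name.as_str() == wheel_filename)
        .map(|(_, url)| url.as_str())
}

fn main() {
    // Illustrative entry; real URLs come from the metadata response.
    let files = vec![(
        "xgboost-2.1.4-py3-none-manylinux_2_28_x86_64.whl".to_string(),
        "https://example.invalid/packages/ab/cd/xgboost-2.1.4.whl".to_string(),
    )];
    println!(
        "{:?}",
        find_wheel_url(&files, "xgboost-2.1.4-py3-none-manylinux_2_28_x86_64.whl")
    );
}
```

Because the match is on the wheel filename you already compute, PyPI can reshuffle its storage paths without breaking the build.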


280-332: Version/OS/arch-specific wheel naming logic is well thought out but tightly coupled to current PyPI naming

The match (os, arch) block and manylinux/macosx tag selection encode detailed knowledge of how XGBoost’s wheels are currently named across versions and platforms (e.g., manylinux_2_28 for 3.x, manylinux2010 for 1.4.x, multi-platform macOS tags for some 1.x/2.x releases). Combined with the improved diagnostics (searched_paths and the detailed error if the expected lib_filename isn’t found), this should make failures easier to debug.

The flip side is that this is quite policy-heavy; any future naming change (or introduction of new platforms/variants) will require touching this logic. That may be acceptable given you already gate on a finite list of supported versions, but it’s worth keeping in mind as you add new XGBoost versions or platforms.

Also applies to: 389-445


492-507: Library copy + rpath/soname handling looks consistent across platforms, but Windows remains effectively unsupported

  • Using get_lib_filename() for both the copy step and the macOS/Linux install_name_tool/patchelf calls keeps the library name consistent and avoids string duplication.
  • The macOS and Linux rpath blocks give executables multiple ways to find the library (loader path, executable path, OUT_DIR libs, and target/{debug,release}), which should ease local development and IDE integration.

Given the README explicitly says Windows auto-download isn’t currently supported, you might eventually want to:

  • Either guard the wheel extraction and link logic with #[cfg(not(target_os = "windows"))] and emit a clear panic!/error for Windows builds, or
  • Add the necessary .lib import-library handling once there’s a concrete plan for MSVC.

For now, the behavior matches the documentation: Linux/macOS are first-class, Windows is “best effort” and likely to fail at link time.

Also applies to: 509-538, 548-592, 594-595

src/polars_ext.rs (2)

101-108: Cache locality claim is backwards; consider column-major iteration.

The comment states "for better cache locality," but Polars stores data in column-major format (Arrow). Iterating row-by-row requires jumping between non-contiguous column memory, which has poor cache locality.

Consider iterating column-by-column instead:

// Convert column by column for better cache locality with columnar storage
for col in df.get_columns() {
    let series = col.as_materialized_series();
    for row_idx in 0..num_rows {
        let value = extract_f32_value(series, row_idx)?;
        data.push(value);
    }
}

However, note this produces column-major output. For row-major output (which XGBoost requires), you'd need to transpose. The current approach may be acceptable if DataFrames are typically small, but for large DataFrames, pre-allocating and filling with transposed column data could improve performance.
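The transposed fill described above can be sketched without Polars as follows; extracting values from actual Series is omitted, and plain Vec<f32> columns stand in for materialized columns:

```rust
// Read each source column sequentially (the cache-friendly direction for
// columnar storage) while writing into a preallocated row-major buffer
// at strided offsets, as XGBoost's dense input expects.
fn columns_to_row_major(columns: &[Vec<f32>], num_rows: usize) -> Vec<f32> {
    let num_cols = columns.len();
    let mut data = vec![0.0f32; num_rows * num_cols];
    for (col_idx, column) in columns.iter().enumerate() {
        for (row_idx, &value) in column.iter().enumerate() {
            data[row_idx * num_cols + col_idx] = value;
        }
    }
    data
}

fn main() {
    // Two columns of two rows -> row-major [r0c0, r0c1, r1c0, r1c1].
    let cols = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    println!("{:?}", columns_to_row_major(&cols, 2)); // [1.0, 3.0, 2.0, 4.0]
}
```

Reads are sequential and writes strided; the opposite trade-off of the current row-by-row loop.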


113-219: Document null handling behavior and potential precision loss.

The function rejects null values with errors rather than converting them to NaN (which XGBoost typically uses for missing values). Additionally, converting large i64/u64 values to f32 can cause precision loss, as f32 has only ~7 significant decimal digits and cannot exactly represent integers beyond 2²⁴.

Consider:

  1. Documenting that nulls are not supported and will error (or optionally, map them to f32::NAN)
  2. Adding a note about precision loss for large integer conversions

Example documentation addition:

/// Extract an f32 value from a Series at the given index
///
/// Supports conversion from all numeric types and booleans.
/// 
/// # Null Handling
/// Returns an error if the value at `idx` is null. Users should fill or drop nulls
/// before prediction, or use XGBoost's native missing value handling with NaN.
///
/// # Precision
/// Converting Int64/UInt64 to f32 may lose precision for values outside [-2²⁴, 2²⁴].
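The 2²⁴ cliff is easy to demonstrate with a small std-only helper (hypothetical, for illustration):

```rust
// f32 has a 24-bit significand, so every integer of magnitude up to
// 16_777_216 (2^24) round-trips through f32 exactly, but 2^24 + 1
// does not.
fn roundtrips_exactly(v: i64) -> bool {
    (v as f32) as i64 == v
}

fn main() {
    println!("{}", roundtrips_exactly(1 << 24));       // true
    println!("{}", roundtrips_exactly((1 << 24) + 1)); // false
}
```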
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31c0b11 and b9609ba.

📒 Files selected for processing (11)
  • .github/workflows/ci.yml (1 hunks)
  • Cargo.toml (2 hunks)
  • README.md (2 hunks)
  • build.rs (16 hunks)
  • examples/advanced_usage.rs (3 hunks)
  • examples/basic_usage.rs (1 hunks)
  • src/error.rs (1 hunks)
  • src/lib.rs (1 hunks)
  • src/model.rs (8 hunks)
  • src/polars_ext.rs (1 hunks)
  • src/sys.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/model.rs (1)
src/error.rs (1)
  • check_return_value (15-21)
src/polars_ext.rs (1)
src/model.rs (1)
  • num_features (254-262)
🔇 Additional comments (7)
src/error.rs (1)

1-1: crate::sys import placement looks good

Using a single use crate::sys; at the top matches the sys::XGBGetLastError() usage and removes duplication with no behavioral change.

examples/basic_usage.rs (1)

58-60: Formatting-only tweak is fine

Inline comment spacing around the example data is purely cosmetic and keeps the example clear.

README.md (1)

8-8: Windows support messaging is clearer and more accurate

Dropping Windows from the “Cross-Platform” bullet and adding the dedicated Windows section with limitations and workarounds sets correct expectations for users. When/if Windows auto-download is implemented, remember to update both this section and the CI matrix.

Also applies to: 36-39

examples/advanced_usage.rs (1)

25-25: Advanced example changes preserve behavior and improve error reporting

  • Compact data vector literal is equivalent and more readable.
  • The SHAP loop using iter().enumerate().take(num_features + 1) cleanly prints the first row’s contributions plus bias without manual indexing.
  • Wrapping std::fs::read in map_err to construct an XGBoostError ensures the example compiles cleanly with the crate’s error type and gives a more descriptive message on IO failure.

No issues from a correctness standpoint.

Please just re-run cargo run --example advanced_usage to confirm the mapped error message and SHAP output look as expected on your environment.

Also applies to: 34-44, 56-58

build.rs (1)

169-176: Header checksum table + failure mode look solid

The expanded get_header_checksums() map and the ok_or_else path in download_xgboost_headers give a clear error when a version isn’t covered, and the verify_checksum helper prints a consistent “✓ Verified SHA256” message that matches your CI grep.

Just be aware that every time you bump the default XGBOOST_VERSION or add CI matrix entries, you’ll need to add the corresponding header hashes here or builds will fail with the “No known SHA256 checksums” message—which is a good, explicit failure mode.

Also applies to: 181-204
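The lookup-or-fail pattern can be sketched as below; the version keys mirror the CI matrix, but the hash values are placeholders, not real XGBoost header checksums:

```rust
use std::collections::HashMap;

// Map each supported XGBoost version to its expected header checksum;
// values here are placeholders, not real hashes.
fn header_checksums() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("2.1.4", "<sha256-placeholder>"),
        ("3.1.1", "<sha256-placeholder>"),
    ])
}

// Fail explicitly for versions without a recorded checksum, so a bumped
// default version cannot silently skip verification.
fn checksum_for(version: &str) -> Result<&'static str, String> {
    header_checksums()
        .get(version)
        .copied()
        .ok_or_else(|| format!("No known SHA256 checksums for XGBoost {}", version))
}

fn main() {
    println!("{:?}", checksum_for("2.1.4"));
    println!("{:?}", checksum_for("9.9.9"));
}
```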

src/lib.rs (1)

10-13: LGTM!

The feature-gated module and conditional export are implemented correctly.

src/model.rs (1)

78-285: LGTM!

The formatting changes improve code compactness while preserving all logic and semantics.

Comment on lines +55 to +59
# Linux ARM64
- os: linux
arch: arm64
runner: ubuntu-latest
xgboost_version: "3.1.1"


⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

“Linux ARM64” job is not actually running on ARM hardware

The linux / arm64 matrix entry still uses runs-on: ubuntu-latest, which today is an x86_64 runner. The arch: arm64 value is only used in cache keys and the job name; it doesn’t change the underlying runner architecture, so this job effectively duplicates the x86_64 Linux coverage while being labeled as arm64.

Consider either:

  • Switching runner to an actual arm64 runner label (if available in GitHub Actions), or
  • Renaming/removing this matrix entry until genuine arm64 coverage is in place.

Please verify the current GitHub-hosted runner labels for Linux arm64 in the Actions documentation and adjust runner accordingly if you intend to exercise real arm64 binaries.


I'll verify the claims about GitHub Actions runner availability and ARM64 support by searching the current documentation.

Based on verification of current GitHub Actions documentation, the review comment's assessment is accurate. ARM64 runners are available with specific labels: ubuntu-24.04-arm and ubuntu-22.04-arm as of January 2025. The claim that runs-on: ubuntu-latest runs on x86_64 is correct—these specific labels must be used to access arm64 runners, and they will not work in private repositories.

The review comment's suggestion to switch to actual arm64 runner labels is valid and actionable.

Replace runner: ubuntu-latest with runner: ubuntu-24.04-arm (or ubuntu-22.04-arm for Ubuntu 22.04)

This will execute the job on actual ARM64 hardware. Note: arm64 runners are only available in public repositories, so ensure your repository is public.

🤖 Prompt for AI Agents
In .github/workflows/ci.yml around lines 55 to 59, the workflow is incorrectly
specifying runner: ubuntu-latest for an ARM64 job; replace runner: ubuntu-latest
with the explicit ARM64 runner label runner: ubuntu-24.04-arm (or runner:
ubuntu-22.04-arm if you need Ubuntu 22.04) so the job runs on real ARM64
hardware, and keep a note that these arm labels are only available for public
repositories.
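
A minimal sketch of the suggested matrix fix, assuming the ubuntu-24.04-arm hosted-runner label is available to the repository:

```yaml
# Linux ARM64 on genuine ARM64 hardware (public repositories only)
- os: linux
  arch: arm64
  runner: ubuntu-24.04-arm
  xgboost_version: "3.1.1"
```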

rust-version = "1.70"

[dependencies]
polars = { version = "0.45", optional = true, default-features = false, features = ["dtype-full"] }


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verify the Polars version for currency and security advisories.

Ensure that version 0.45 is current and has no known vulnerabilities.


🌐 Web query:

What is the latest version of the Polars Rust crate and are there any security advisories for version 0.45?

💡 Result:

Latest Polars (Rust) release: 0.52.0 (crate documentation / release metadata). [1]

Security advisories for 0.45:

  • There is no RustSec/GitHub advisory that lists polars itself (v0.45) as directly vulnerable. However, Polars v0.45 may pull transitive dependencies that have RustSec advisories (examples include fast-float RUSTSEC-2024-0379 and array-init-cursor RUSTSEC-2025-0019). You should check your dependency tree (cargo tree) and RustSec/GitHub Advisory DB for fixes or patched versions. [2][3]

Sources:

  1. Polars crate docs/release metadata (shows v0.52.0). [1]
  2. RustSec / advisory reporting for fast-float (RUSTSEC-2024-0379) — affects projects that depend on fast-float, including Polars. [2]
  3. RustSec advisory reporting for array-init-cursor (RUSTSEC-2025-0019) — reported transitive issue referenced by Polars. [3]

If you want, I can (a) confirm exactly which advisories affect your project by analyzing your Cargo.lock/Cargo.toml, or (b) fetch the latest crates.io release page and RustSec entries now.


I'll verify whether the transitive vulnerabilities actually affect this project's dependency tree.



🏁 Script executed:

#!/bin/bash
# Check if Cargo.lock exists and search for vulnerable dependencies
if [ -f "Cargo.lock" ]; then
  echo "=== Checking Cargo.lock for transitive dependencies ==="
  grep -A 2 "name = \"fast-float\"" Cargo.lock || echo "fast-float not found in Cargo.lock"
  echo ""
  grep -A 2 "name = \"array-init-cursor\"" Cargo.lock || echo "array-init-cursor not found in Cargo.lock"
else
  echo "Cargo.lock not found"
fi

# Also check Cargo.toml for polars dependency details
echo ""
echo "=== Polars dependency in Cargo.toml ==="
grep -A 2 "polars" Cargo.toml | head -20

Length of output: 223


Upgrade Polars to version 0.52.0.

Version 0.45 is outdated; the latest Polars release is 0.52.0. Additionally, Polars v0.45 may pull transitive dependencies with known vulnerabilities, including fast-float (RUSTSEC-2024-0379) and array-init-cursor (RUSTSEC-2025-0019). Update Cargo.toml to use version = "0.52.0" to address currency and reduce transitive security risk.

🤖 Prompt for AI Agents
In Cargo.toml around line 13, the polars dependency is pinned to version "0.45";
update the version to "0.52.0" by changing the version string to version =
"0.52.0" while keeping the existing optional, default-features, and features
settings intact, then run cargo update -p polars (or cargo update) and
rebuild/tests to ensure compatibility and resolve any required downstream
adjustments.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
Cargo.toml (1)

13-13: Polars version 0.45 is outdated.

This concern was already flagged in a previous review. Version 0.45 is outdated (latest is 0.52.0), and transitive dependencies may have known vulnerabilities.

🧹 Nitpick comments (1)
src/polars_ext.rs (1)

84-123: Cache locality claim is incomplete.

Line 86 states "Optimized column-by-column conversion for better cache locality on source data," which is accurate for reads from Polars' column-major layout. However, the writes to the output buffer use strided access (line 118: row_idx * num_features + col_idx), creating poor cache locality for the destination.

This trade-off is inherent when transposing from column-major to row-major layout. For very wide DataFrames (many features), the strided writes may impact performance. Consider adding a note about this trade-off in the comment.

Example clarification:

-/// Optimized column-by-column conversion for better cache locality on source data.
+/// Convert column-major (Polars) to row-major (XGBoost) layout.
+/// Optimized for sequential reads from source; writes are strided.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9609ba and a222d78.

📒 Files selected for processing (2)
  • Cargo.toml (2 hunks)
  • src/polars_ext.rs (1 hunks)
🔇 Additional comments (3)
src/polars_ext.rs (3)

5-54: Trait API design looks solid.

The trait provides a clean interface for Polars DataFrame predictions. Both methods appropriately use immutable borrows, and the separation between predicting all columns vs. selected columns gives users flexibility.


57-65: Clean delegation to helper and underlying predict method.

The implementation correctly converts the DataFrame and delegates to the existing predict method with proper error propagation.


67-81: Column selection logic is correct.

The method appropriately converts column names, handles selection errors with descriptive messages, and delegates to the conversion helper.

