Skip to content

Conversation

@Xuanwo
Copy link
Collaborator

@Xuanwo Xuanwo commented Jan 19, 2026

This PR fixes a bug where users could encode blob v1 even when blob v2 was enabled. However, our decoder only reads the dataset's configuration lance.blob.version, which can lead to decoding issues.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@Xuanwo Xuanwo requested a review from westonpace January 19, 2026 15:14
@github-actions github-actions bot added the bug Something isn't working label Jan 19, 2026
@github-actions
Copy link
Contributor

PR Review

Summary

This PR ensures that when blob v2 is enabled (file version >= 2.2), all blob-marked fields use the v2 encoder regardless of their input type (Binary/LargeBinary or Struct). The key addition is a normalize_blob_v2_input function that converts various input formats to the normalized v2 struct format.

Critical Issues (P0)

1. Potential panic from unchecked cast operations
In normalize_blob_v2_input, the code calls .as_binary::<i64>() and .as_string::<i32>() without checking if the actual array type matches. If the struct has a data field that is not LargeBinary or a uri field that is not Utf8, these will panic:

// blob.rs:455-457
let data_col = data_col.as_binary::<i64>();
let uri_col = uri_col.as_string::<i32>();

Consider using as_binary_opt and as_string_opt with proper error handling, similar to the pattern used in the Binary/LargeBinary branch.

Moderate Issues (P1)

2. Missing unit tests for normalize_blob_v2_input
The new normalize_blob_v2_input function handles several conversion paths (6-field struct passthrough, 2-field struct conversion, Binary, LargeBinary) but has no dedicated unit tests. Only the integration test in take.rs exercises one path. Add unit tests to cover:

  • Invalid struct with wrong number of fields
  • Struct with missing required fields
  • Binary vs LargeBinary input normalization
  • Null handling in various paths

3. Test assertion may be fragile
The test test_take_blob_v2_from_legacy_large_binary_on_v2_2 asserts exact field names and positions:

assert_eq!(struct_arr.fields()[0].name(), "kind");
assert_eq!(struct_arr.fields()[1].name(), "position");

Consider using column_by_name for field access instead of positional indexing, matching the pattern used in the encoder itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants