[Draft] feat: support fixed length range bitmap file index read and write#147
[Draft] feat: support fixed length range bitmap file index read and write#147fafacao86 wants to merge 8 commits intoalibaba:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This draft PR implements support for fixed-length RangeBitmap file index read and write operations in paimon-cpp, following the Java implementation's design as documented in PIP-33. The implementation provides an efficient range-based bitmap index structure for supporting equality and range queries on fixed-length data types (boolean, tinyint, smallint, int, bigint, float, date).
Changes:
- Implements core RangeBitmap data structures including BitSliceIndexBitmap for efficient range queries and ChunkedDictionary for key encoding
- Adds RangeBitmapFileIndex with reader/writer implementations following the existing FileIndexer pattern
- Provides comprehensive test coverage with unit tests for all supported fixed-length types and Java compatibility verification
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/common/file_index/rangebitmap/range_bitmap.h | Core RangeBitmap class with query operations (Eq, Lt, Gt, etc.) and Appender for serialization |
| src/paimon/common/file_index/rangebitmap/range_bitmap.cpp | Implementation of RangeBitmap query logic and serialization |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.h | FileIndexer interface implementation for RangeBitmap |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.cpp | RangeBitmapFileIndexReader and RangeBitmapFileIndexWriter implementations |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.h | Factory for creating RangeBitmapFileIndex instances |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.cpp | Factory implementation with registration |
| src/paimon/common/file_index/rangebitmap/bit_slice_index_bitmap.h | Bit-slice index structure for efficient range query operations |
| src/paimon/common/file_index/rangebitmap/bit_slice_index_bitmap.cpp | Bit-slice index implementation with serialization support |
| src/paimon/common/file_index/rangebitmap/dictionary/dictionary.h | Base dictionary interface for key-to-code mapping |
| src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h | Chunked dictionary implementation for scalable key storage |
| src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp | Chunked dictionary read/write logic with binary search |
| src/paimon/common/file_index/rangebitmap/dictionary/chunk.h | Base chunk interface for dictionary data organization |
| src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.h | Fixed-length chunk implementation for constant-size keys |
| src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.cpp | Fixed-length chunk serialization and key retrieval |
| src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h | Factory pattern for creating key serializers/deserializers per field type |
| src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp | KeyFactory implementations for all supported fixed-length types |
| src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h | Utilities for serializing/deserializing Literal values |
| src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.cpp | Literal serialization implementation for all supported types |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_test.cpp | Comprehensive unit tests for all fixed-length types with query operations |
| src/paimon/common/file_index/file_index_format_test.cpp | Java compatibility test verifying cross-language index format |
| src/paimon/common/file_index/CMakeLists.txt | Build configuration for RangeBitmap components |
| src/paimon/CMakeLists.txt | Test registration for RangeBitmap tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int32_t offset); | ||
|
|
||
| BitSliceIndexBitmap(const std::shared_ptr<MemoryPool>& pool, int32_t indexes_length, | ||
| pooled_unique_ptr<Bytes> indexes, int32_t ebm_length, int32_t slices_size, |
There was a problem hiding this comment.
Inconsistent type usage: use PAIMON_UNIQUE_PTR instead of pooled_unique_ptr directly, as seen throughout the rest of the codebase and in the implementation file at line 71. This macro is the standard convention used in other parts of the codebase.
| pooled_unique_ptr<Bytes> indexes, int32_t ebm_length, int32_t slices_size, | |
| PAIMON_UNIQUE_PTR<Bytes> indexes, int32_t ebm_length, int32_t slices_size, |
| #include <optional> | ||
| #include <vector> | ||
|
|
||
| #include "dictionary/dictionary.h" |
There was a problem hiding this comment.
Incorrect include path: should be "paimon/common/file_index/rangebitmap/dictionary/dictionary.h" to follow the absolute include path convention used elsewhere in the codebase.
| #include "dictionary/dictionary.h" | |
| #include "paimon/common/file_index/rangebitmap/dictionary/dictionary.h" |
| int32_t NumberOfLeadingZeros(const int64_t value) { | ||
| if (value == 0) { | ||
| return 64; | ||
| } | ||
| return __builtin_clzll(static_cast<uint64_t>(value)); | ||
| } | ||
|
|
||
| int32_t NumberOfTrailingZeros(const int64_t value) { | ||
| if (value == 0) { | ||
| return 64; | ||
| } | ||
| return __builtin_ctzll(static_cast<uint64_t>(value)); | ||
| } |
There was a problem hiding this comment.
These utility functions should be placed in an anonymous namespace or marked as static since they are only used within this translation unit. This follows C++ best practices to avoid potential ODR (One Definition Rule) violations and makes the scope clear.
| #include "paimon/common/utils/field_type_utils.h" | ||
| #include "paimon/io/data_input_stream.h" | ||
| #include "paimon/memory/bytes.h" | ||
| #include "utils/literal_serialization_utils.h" |
There was a problem hiding this comment.
Incorrect include path: should be "paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h" to follow the absolute include path convention used elsewhere in the codebase.
| #include "utils/literal_serialization_utils.h" | |
| #include "paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h" |
| ebm_length_(ebm_length), | ||
| indexes_length_(indexes_length) {} | ||
|
|
||
| Result<const RoaringBitmap32*> BitSliceIndexBitmap::GetEmtpyBitmap() { |
There was a problem hiding this comment.
Function name has a typo: "GetEmtpyBitmap" should be "GetEmptyBitmap". This should be corrected to match the fix in the header file.
| Result<const RoaringBitmap32*> BitSliceIndexBitmap::GetEmtpyBitmap() { | |
| Result<const RoaringBitmap32*> BitSliceIndexBitmap::GetEmptyBitmap() { |
| } | ||
|
|
||
| Result<RoaringBitmap32> BitSliceIndexBitmap::Eq(const int32_t code) { | ||
| PAIMON_ASSIGN_OR_RAISE(const auto empty_bitmap, GetEmtpyBitmap()); |
There was a problem hiding this comment.
Function call has a typo: "GetEmtpyBitmap" should be "GetEmptyBitmap" to match the corrected function name.
| PAIMON_ASSIGN_OR_RAISE(const auto empty_bitmap, GetEmtpyBitmap()); | |
| PAIMON_ASSIGN_OR_RAISE(const auto empty_bitmap, GetEmptyBitmap()); |
|
|
||
| #include <algorithm> | ||
|
|
||
| #include "dictionary/chunked_dictionary.h" |
There was a problem hiding this comment.
Incorrect include path: should be "paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h" to follow the absolute include path convention used elsewhere in the codebase.
| #include "dictionary/chunked_dictionary.h" | |
| #include "paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h" |
| } | ||
| PAIMON_ASSIGN_OR_RAISE(const auto min_compare, key.CompareTo(min_)); | ||
| PAIMON_ASSIGN_OR_RAISE(const auto max_compare, key.CompareTo(max_)); | ||
| PAIMON_ASSIGN_OR_RAISE(const auto bit_slice_ptr, this->GetBitSliceIndex()) |
There was a problem hiding this comment.
Missing semicolon at the end of the statement. This line should end with a semicolon after the closing parenthesis.
| PAIMON_ASSIGN_OR_RAISE(const auto bit_slice_ptr, this->GetBitSliceIndex()) | |
| PAIMON_ASSIGN_OR_RAISE(const auto bit_slice_ptr, this->GetBitSliceIndex()); |
| return static_cast<int32_t>(sizeof(int32_t) + literal.GetValue<std::string>().size()); | ||
| default: | ||
| return Status::Invalid( | ||
| fmt::format("Unsupported field type for GetFixedFieldSize: {}", |
There was a problem hiding this comment.
The error message references "GetFixedFieldSize" but this is in the GetSerializedSizeInBytes function. The error message should say "Unsupported field type for GetSerializedSizeInBytes" to accurately reflect which function encountered the error.
| fmt::format("Unsupported field type for GetFixedFieldSize: {}", | |
| fmt::format("Unsupported field type for GetSerializedSizeInBytes: {}", |
6941957 to
53f0308
Compare
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.h
Show resolved
Hide resolved
| PAIMON_UNIQUE_PTR<Bytes> keys_; | ||
|
|
||
| // For write path | ||
| std::optional<KeyFactory::KeySerializer> serializer_; |
There was a problem hiding this comment.
I recall that std::function is null by default when default-constructed. Since it already supports a 'not-set' state, we can probably drop the std::optional wrapper for simplicity.
| const std::string& KeyFactory::GetDefaultChunkSize() { | ||
| static const std::string kDefaultChunkSize = "16kb"; | ||
| return kDefaultChunkSize; | ||
| } |
There was a problem hiding this comment.
kDefaultChunkSize could be made a static member of KeyFactory, e.g., static constexpr char kDefaultChunkSize[].
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h
Show resolved
Hide resolved
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp
Show resolved
Hide resolved
|
Thanks for the contribution! Given the size of the change, we'll review in batches. |
|
|
||
| BitSliceIndexBitmap(const std::shared_ptr<MemoryPool>& pool, int32_t indexes_length, | ||
| PAIMON_UNIQUE_PTR<Bytes> indexes, int32_t ebm_length, int32_t slices_size, | ||
| const std::shared_ptr<InputStream>& input_stream, int32_t body_offset); |
There was a problem hiding this comment.
If you want BitSliceIndexBitmap to be constructed only via the Create function, you can make its constructor private to prevent upstream misuse.
| std::shared_ptr<MemoryPool> pool_; | ||
| bool initialized_; | ||
| std::vector<std::optional<RoaringBitmap32>> bit_slices_; | ||
| std::optional<RoaringBitmap32> ebm; |
There was a problem hiding this comment.
ebm maybe ebm_ for private member?
| PAIMON_RETURN_NOT_OK(input_stream_->Read(bytes->data(), ebm_length_)); | ||
| RoaringBitmap32 bitmap; | ||
| PAIMON_RETURN_NOT_OK(bitmap.Deserialize(bytes->data(), ebm_length_)); | ||
| ebm = bitmap; |
There was a problem hiding this comment.
ebm = std::move(bitmap);
bit_slices_[idx] = std::move(bitmap);
|
|
||
| /// Batch load slices from start to end to avoid unnecessary IO | ||
| Status BitSliceIndexBitmap::LoadSlices(const int32_t start, const int32_t end) { | ||
| if (initialized_) { |
There was a problem hiding this comment.
It looks like GetSliceBitmap and LoadSlices have significant code duplication. Consider having GetSliceBitmap delegate to LoadSlices(idx, idx + 1) instead of reimplementing similar logic. Could you evaluate if this refactoring improves clarity and simplify?
|
|
||
| Result<RoaringBitmap32> BitSliceIndexBitmap::Eq(const int32_t code) { | ||
| PAIMON_ASSIGN_OR_RAISE(const RoaringBitmap32* empty_bitmap, GetEmptyBitmap()); | ||
| auto state = RoaringBitmap32(*empty_bitmap); |
There was a problem hiding this comment.
GetEmptyBitmap() might be a bit confusing — the term "empty" could imply null, but from the code it looks like it returns a non-null bitmap. Would GetNotNullBitmap() be clearer?
| RoaringBitmap32 bitmap; | ||
| PAIMON_RETURN_NOT_OK(bitmap.Deserialize(bytes->data(), ebm_length_)); | ||
| ebm = bitmap; | ||
| } |
There was a problem hiding this comment.
It looks like IsNotNull and GetEmptyBitmatp have some code duplication. Refactor is better.
| BitSliceIndexBitmap::Appender::Appender(const std::shared_ptr<MemoryPool>& pool, const int32_t min, | ||
| const int32_t max) | ||
| : pool_(pool), min_(min), max_(max) { | ||
| ebm_ = RoaringBitmap32{}; |
There was a problem hiding this comment.
Generally, we place memory pool parameters at the end of function parameter lists, and memory pool member variables at the beginning of class member declarations (which helps ensure safe destruction order).
|
|
||
| BitSliceIndexBitmap::Appender::Appender(const std::shared_ptr<MemoryPool>& pool, const int32_t min, | ||
| const int32_t max) | ||
| : pool_(pool), min_(min), max_(max) { |
There was a problem hiding this comment.
Is const really necessary for the min and max variables?
| const Literal& max, const std::shared_ptr<KeyFactory>& key_factory, | ||
| const std::shared_ptr<InputStream>& input_stream); | ||
| Result<BitSliceIndexBitmap* const> GetBitSliceIndex(); | ||
| Result<Dictionary* const> GetDictionary(); |
There was a problem hiding this comment.
Did you mean to return a const T* instead?
There was a problem hiding this comment.
Most of the methods of Dictionary and BitSliceIndex are non-const, because they might do some lazy loading logic, which modifies the member variables. So I think the object that the pointer points to needs to be modifiable using a T * const?
There was a problem hiding this comment.
Got it. Maybe T* is enough?
| Result<RoaringBitmap32> IsNotNull(); | ||
|
|
||
| private: | ||
| Result<RoaringBitmap32> Not(RoaringBitmap32& bitmap); |
There was a problem hiding this comment.
As a common convention in paimon-cpp, output parameters usually uses ptr (e.g., T* out) and placed after input parameters.
| PAIMON_ASSIGN_OR_RAISE(parsed_chunk_size, MemorySize::ParseBytes(chunk_size_it->second)); | ||
| } | ||
| if (parsed_chunk_size > std::numeric_limits<int32_t>::max()) { | ||
| return Status::Invalid("Chunk size must be less than 4GB"); |
There was a problem hiding this comment.
std::numeric_limits<int32_t>::max() seems to be 2GB?
| const Literal& literal) { | ||
| return std::make_shared<BitmapIndexResult>( | ||
| [self = shared_from_this(), literal]() -> Result<RoaringBitmap32> { | ||
| if (!self->range_bitmap_) return RoaringBitmap32(); |
There was a problem hiding this comment.
Please add {} around single-line if statements.
| RangeBitmapFileIndexTest* test, const std::shared_ptr<arrow::DataType>& arrow_type, | ||
| const std::vector<ValueType>& test_data, PAIMON_UNIQUE_PTR<Bytes>* serialized_bytes_out) { | ||
| return CreateReaderForTest<ArrowBuilder, ValueType>(test, arrow_type, test_data, {}, | ||
| serialized_bytes_out); |
There was a problem hiding this comment.
CreateReaderForTest can be member function for RangeBitmapFileIndexTest to avoid test parameter.
| std::shared_ptr<arrow::Array> arrow_array; | ||
| status = builder->Finish(&arrow_array); | ||
| if (!status.ok()) { | ||
| return Status::Invalid(fmt::format("Failed to finish builder: {}", status.ToString())); |
There was a problem hiding this comment.
Consider use PAIMON_ASSIGN_OR_RAISE_FROM_ARROW and PAIMON_RETURN_NOT_OK_FROM_ARROW to avoid if statement.
| Status RangeBitmapFileIndexWriter::AddBatch(::ArrowArray* batch) { | ||
| PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::shared_ptr<arrow::Array> array, | ||
| arrow::ImportArray(batch, arrow_type_)); | ||
| PAIMON_ASSIGN_OR_RAISE(std::vector<Literal> array_values, |
There was a problem hiding this comment.
Input batch must be a struct array (with only one field in range bitmap). Please see bitmap implement as references.
| expected_gt_49.reserve(50); | ||
| for (int32_t i = 50; i < 100; ++i) { | ||
| expected_gt_49.push_back(i); | ||
| } |
There was a problem hiding this comment.
Consider using std::iota to simplify this initialization.
|
I've done a high-level review of this PR — overall, the implementation is solid and well-executed, great work! For smoother reviews in the future, could we consider splitting such changes into smaller, more focused PRs? It would help us dive deeper into the details during review. Additionally, for the files under the |
Yeah, that's also what I thought. I'll break down the tasks under #146 and address the code quality and test coverage feedback from this review. I'll submit smaller, more focused PRs going forward. Thank you for the thorough review — I've learned A LOT from it! I'm still relatively new to C++, sorry for any odd coding patterns/style you came across 😸 |
| this, arrow_type, test_data, &serialized_bytes))); | ||
| ASSERT_OK_AND_ASSIGN(auto eq_10_5_result, reader->VisitEqual(Literal(10.5))); | ||
| CheckResult(eq_10_5_result, {0, 2}); // positions with value 10.5 | ||
| ASSERT_OK_AND_ASSIGN(auto eq_20_3_result, reader->VisitEqual(Literal(20.3))); |
There was a problem hiding this comment.
BTW, I'm a bit curious — could floating-point precision errors (e.g., for float/double) lead to incorrect index lookup or recall results?
For example, when comparing keys or computing positions, small rounding differences might cause mismatches.
There was a problem hiding this comment.
This is indeed something which requires extra care. Java uses JDK provided java.util.Comparator class for float/double comparison. I'll do some investigation and experiment with Java implementation later for this concern. And see what we need to do in paimon-cpp to be compatible with it.
Purpose
Linked issue: close #146
Support fixed length rangebitmap file index Read and Write.
The code structure basically follows Java implementation's class structure.
Tests
API and Format
Documentation