Skip to content

[Draft] feat: support fixed length range bitmap file index read and write#147

Draft
fafacao86 wants to merge 8 commits intoalibaba:mainfrom
fafacao86:rangebitmap
Draft

[Draft] feat: support fixed length range bitmap file index read and write#147
fafacao86 wants to merge 8 commits intoalibaba:mainfrom
fafacao86:rangebitmap

Conversation

@fafacao86
Copy link

Purpose

Linked issue: close #146

Support fixed length rangebitmap file index Read and Write.
The code structure basically follows Java implementation's class structure.

Tests

API and Format

Documentation

@fafacao86 fafacao86 marked this pull request as draft February 24, 2026 02:07
@CLAassistant
Copy link

CLAassistant commented Feb 24, 2026

CLA assistant check
All committers have signed the CLA.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This draft PR implements support for fixed-length RangeBitmap file index read and write operations in paimon-cpp, following the Java implementation's design as documented in PIP-33. The implementation provides an efficient range-based bitmap index structure for supporting equality and range queries on fixed-length data types (boolean, tinyint, smallint, int, bigint, float, date).

Changes:

  • Implements core RangeBitmap data structures including BitSliceIndexBitmap for efficient range queries and ChunkedDictionary for key encoding
  • Adds RangeBitmapFileIndex with reader/writer implementations following the existing FileIndexer pattern
  • Provides comprehensive test coverage with unit tests for all supported fixed-length types and Java compatibility verification

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
src/paimon/common/file_index/rangebitmap/range_bitmap.h Core RangeBitmap class with query operations (Eq, Lt, Gt, etc.) and Appender for serialization
src/paimon/common/file_index/rangebitmap/range_bitmap.cpp Implementation of RangeBitmap query logic and serialization
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.h FileIndexer interface implementation for RangeBitmap
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.cpp RangeBitmapFileIndexReader and RangeBitmapFileIndexWriter implementations
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.h Factory for creating RangeBitmapFileIndex instances
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.cpp Factory implementation with registration
src/paimon/common/file_index/rangebitmap/bit_slice_index_bitmap.h Bit-slice index structure for efficient range query operations
src/paimon/common/file_index/rangebitmap/bit_slice_index_bitmap.cpp Bit-slice index implementation with serialization support
src/paimon/common/file_index/rangebitmap/dictionary/dictionary.h Base dictionary interface for key-to-code mapping
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h Chunked dictionary implementation for scalable key storage
src/paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.cpp Chunked dictionary read/write logic with binary search
src/paimon/common/file_index/rangebitmap/dictionary/chunk.h Base chunk interface for dictionary data organization
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.h Fixed-length chunk implementation for constant-size keys
src/paimon/common/file_index/rangebitmap/dictionary/fixed_length_chunk.cpp Fixed-length chunk serialization and key retrieval
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h Factory pattern for creating key serializers/deserializers per field type
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp KeyFactory implementations for all supported fixed-length types
src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h Utilities for serializing/deserializing Literal values
src/paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.cpp Literal serialization implementation for all supported types
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_test.cpp Comprehensive unit tests for all fixed-length types with query operations
src/paimon/common/file_index/file_index_format_test.cpp Java compatibility test verifying cross-language index format
src/paimon/common/file_index/CMakeLists.txt Build configuration for RangeBitmap components
src/paimon/CMakeLists.txt Test registration for RangeBitmap tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

int32_t offset);

BitSliceIndexBitmap(const std::shared_ptr<MemoryPool>& pool, int32_t indexes_length,
pooled_unique_ptr<Bytes> indexes, int32_t ebm_length, int32_t slices_size,
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent type usage: use PAIMON_UNIQUE_PTR instead of pooled_unique_ptr directly, as seen throughout the rest of the codebase and in the implementation file at line 71. This macro is the standard convention used in other parts of the codebase.

Suggested change
pooled_unique_ptr<Bytes> indexes, int32_t ebm_length, int32_t slices_size,
PAIMON_UNIQUE_PTR<Bytes> indexes, int32_t ebm_length, int32_t slices_size,

Copilot uses AI. Check for mistakes.
#include <optional>
#include <vector>

#include "dictionary/dictionary.h"
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect include path: should be "paimon/common/file_index/rangebitmap/dictionary/dictionary.h" to follow the absolute include path convention used elsewhere in the codebase.

Suggested change
#include "dictionary/dictionary.h"
#include "paimon/common/file_index/rangebitmap/dictionary/dictionary.h"

Copilot uses AI. Check for mistakes.
Comment on lines 55 to 67
int32_t NumberOfLeadingZeros(const int64_t value) {
if (value == 0) {
return 64;
}
return __builtin_clzll(static_cast<uint64_t>(value));
}

int32_t NumberOfTrailingZeros(const int64_t value) {
if (value == 0) {
return 64;
}
return __builtin_ctzll(static_cast<uint64_t>(value));
}
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These utility functions should be placed in an anonymous namespace or marked as static since they are only used within this translation unit. This follows C++ best practices to avoid potential ODR (One Definition Rule) violations and makes the scope clear.

Copilot uses AI. Check for mistakes.
#include "paimon/common/utils/field_type_utils.h"
#include "paimon/io/data_input_stream.h"
#include "paimon/memory/bytes.h"
#include "utils/literal_serialization_utils.h"
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect include path: should be "paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h" to follow the absolute include path convention used elsewhere in the codebase.

Suggested change
#include "utils/literal_serialization_utils.h"
#include "paimon/common/file_index/rangebitmap/utils/literal_serialization_utils.h"

Copilot uses AI. Check for mistakes.
ebm_length_(ebm_length),
indexes_length_(indexes_length) {}

Result<const RoaringBitmap32*> BitSliceIndexBitmap::GetEmtpyBitmap() {
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name has a typo: "GetEmtpyBitmap" should be "GetEmptyBitmap". This should be corrected to match the fix in the header file.

Suggested change
Result<const RoaringBitmap32*> BitSliceIndexBitmap::GetEmtpyBitmap() {
Result<const RoaringBitmap32*> BitSliceIndexBitmap::GetEmptyBitmap() {

Copilot uses AI. Check for mistakes.
}

Result<RoaringBitmap32> BitSliceIndexBitmap::Eq(const int32_t code) {
PAIMON_ASSIGN_OR_RAISE(const auto empty_bitmap, GetEmtpyBitmap());
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function call has a typo: "GetEmtpyBitmap" should be "GetEmptyBitmap" to match the corrected function name.

Suggested change
PAIMON_ASSIGN_OR_RAISE(const auto empty_bitmap, GetEmtpyBitmap());
PAIMON_ASSIGN_OR_RAISE(const auto empty_bitmap, GetEmptyBitmap());

Copilot uses AI. Check for mistakes.

#include <algorithm>

#include "dictionary/chunked_dictionary.h"
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect include path: should be "paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h" to follow the absolute include path convention used elsewhere in the codebase.

Suggested change
#include "dictionary/chunked_dictionary.h"
#include "paimon/common/file_index/rangebitmap/dictionary/chunked_dictionary.h"

Copilot uses AI. Check for mistakes.
}
PAIMON_ASSIGN_OR_RAISE(const auto min_compare, key.CompareTo(min_));
PAIMON_ASSIGN_OR_RAISE(const auto max_compare, key.CompareTo(max_));
PAIMON_ASSIGN_OR_RAISE(const auto bit_slice_ptr, this->GetBitSliceIndex())
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing semicolon at the end of the statement. This line should end with a semicolon after the closing parenthesis.

Suggested change
PAIMON_ASSIGN_OR_RAISE(const auto bit_slice_ptr, this->GetBitSliceIndex())
PAIMON_ASSIGN_OR_RAISE(const auto bit_slice_ptr, this->GetBitSliceIndex());

Copilot uses AI. Check for mistakes.
return static_cast<int32_t>(sizeof(int32_t) + literal.GetValue<std::string>().size());
default:
return Status::Invalid(
fmt::format("Unsupported field type for GetFixedFieldSize: {}",
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message references "GetFixedFieldSize" but this is in the GetSerializedSizeInBytes function. The error message should say "Unsupported field type for GetSerializedSizeInBytes" to accurately reflect which function encountered the error.

Suggested change
fmt::format("Unsupported field type for GetFixedFieldSize: {}",
fmt::format("Unsupported field type for GetSerializedSizeInBytes: {}",

Copilot uses AI. Check for mistakes.
PAIMON_UNIQUE_PTR<Bytes> keys_;

// For write path
std::optional<KeyFactory::KeySerializer> serializer_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall that std::function is null by default when default-constructed. Since it already supports a 'not-set' state, we can probably drop the std::optional wrapper for simplicity.

const std::string& KeyFactory::GetDefaultChunkSize() {
static const std::string kDefaultChunkSize = "16kb";
return kDefaultChunkSize;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kDefaultChunkSize could be made a static member of KeyFactory, e.g., static constexpr char kDefaultChunkSize[].

@lxy-9602
Copy link
Collaborator

Thanks for the contribution! Given the size of the change, we'll review in batches.
The structure looks good, but please adjust the code style to match paimon-cpp conventions where possible.
We'll follow up on the rest soon — really appreciate your work!


BitSliceIndexBitmap(const std::shared_ptr<MemoryPool>& pool, int32_t indexes_length,
PAIMON_UNIQUE_PTR<Bytes> indexes, int32_t ebm_length, int32_t slices_size,
const std::shared_ptr<InputStream>& input_stream, int32_t body_offset);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want BitSliceIndexBitmap to be constructed only via the Create function, you can make its constructor private to prevent upstream misuse.

std::shared_ptr<MemoryPool> pool_;
bool initialized_;
std::vector<std::optional<RoaringBitmap32>> bit_slices_;
std::optional<RoaringBitmap32> ebm;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ebm maybe ebm_ for private member?

PAIMON_RETURN_NOT_OK(input_stream_->Read(bytes->data(), ebm_length_));
RoaringBitmap32 bitmap;
PAIMON_RETURN_NOT_OK(bitmap.Deserialize(bytes->data(), ebm_length_));
ebm = bitmap;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ebm = std::move(bitmap);
bit_slices_[idx] = std::move(bitmap);


/// Batch load slices from start to end to avoid unnecessary IO
Status BitSliceIndexBitmap::LoadSlices(const int32_t start, const int32_t end) {
if (initialized_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like GetSliceBitmap and LoadSlices have significant code duplication. Consider having GetSliceBitmap delegate to LoadSlices(idx, idx + 1) instead of reimplementing similar logic. Could you evaluate if this refactoring improves clarity and simplify?


Result<RoaringBitmap32> BitSliceIndexBitmap::Eq(const int32_t code) {
PAIMON_ASSIGN_OR_RAISE(const RoaringBitmap32* empty_bitmap, GetEmptyBitmap());
auto state = RoaringBitmap32(*empty_bitmap);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetEmptyBitmap() might be a bit confusing — the term "empty" could imply null, but from the code it looks like it returns a non-null bitmap. Would GetNotNullBitmap() be clearer?

RoaringBitmap32 bitmap;
PAIMON_RETURN_NOT_OK(bitmap.Deserialize(bytes->data(), ebm_length_));
ebm = bitmap;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like IsNotNull and GetEmptyBitmatp have some code duplication. Refactor is better.

BitSliceIndexBitmap::Appender::Appender(const std::shared_ptr<MemoryPool>& pool, const int32_t min,
const int32_t max)
: pool_(pool), min_(min), max_(max) {
ebm_ = RoaringBitmap32{};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we place memory pool parameters at the end of function parameter lists, and memory pool member variables at the beginning of class member declarations (which helps ensure safe destruction order).


BitSliceIndexBitmap::Appender::Appender(const std::shared_ptr<MemoryPool>& pool, const int32_t min,
const int32_t max)
: pool_(pool), min_(min), max_(max) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is const really necessary for the min and max variables?

const Literal& max, const std::shared_ptr<KeyFactory>& key_factory,
const std::shared_ptr<InputStream>& input_stream);
Result<BitSliceIndexBitmap* const> GetBitSliceIndex();
Result<Dictionary* const> GetDictionary();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to return a const T* instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the methods of Dictionary and BitSliceIndex are non-const, because they might do some lazy loading logic, which modifies the member variables. So I think the object that the pointer points to needs to be modifiable using a T * const?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Maybe T* is enough?

Result<RoaringBitmap32> IsNotNull();

private:
Result<RoaringBitmap32> Not(RoaringBitmap32& bitmap);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a common convention in paimon-cpp, output parameters usually uses ptr (e.g., T* out) and placed after input parameters.

PAIMON_ASSIGN_OR_RAISE(parsed_chunk_size, MemorySize::ParseBytes(chunk_size_it->second));
}
if (parsed_chunk_size > std::numeric_limits<int32_t>::max()) {
return Status::Invalid("Chunk size must be less than 4GB");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::numeric_limits<int32_t>::max() seems to be 2GB?

const Literal& literal) {
return std::make_shared<BitmapIndexResult>(
[self = shared_from_this(), literal]() -> Result<RoaringBitmap32> {
if (!self->range_bitmap_) return RoaringBitmap32();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add {} around single-line if statements.

RangeBitmapFileIndexTest* test, const std::shared_ptr<arrow::DataType>& arrow_type,
const std::vector<ValueType>& test_data, PAIMON_UNIQUE_PTR<Bytes>* serialized_bytes_out) {
return CreateReaderForTest<ArrowBuilder, ValueType>(test, arrow_type, test_data, {},
serialized_bytes_out);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateReaderForTest can be member function for RangeBitmapFileIndexTest to avoid test parameter.

std::shared_ptr<arrow::Array> arrow_array;
status = builder->Finish(&arrow_array);
if (!status.ok()) {
return Status::Invalid(fmt::format("Failed to finish builder: {}", status.ToString()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider use PAIMON_ASSIGN_OR_RAISE_FROM_ARROW and PAIMON_RETURN_NOT_OK_FROM_ARROW to avoid if statement.

Status RangeBitmapFileIndexWriter::AddBatch(::ArrowArray* batch) {
PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::shared_ptr<arrow::Array> array,
arrow::ImportArray(batch, arrow_type_));
PAIMON_ASSIGN_OR_RAISE(std::vector<Literal> array_values,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input batch must be a struct array (with only one field in range bitmap). Please see bitmap implement as references.

expected_gt_49.reserve(50);
for (int32_t i = 50; i < 100; ++i) {
expected_gt_49.push_back(i);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using std::iota to simplify this initialization.

@lxy-9602
Copy link
Collaborator

I've done a high-level review of this PR — overall, the implementation is solid and well-executed, great work!

For smoother reviews in the future, could we consider splitting such changes into smaller, more focused PRs? It would help us dive deeper into the details during review.

Additionally, for the files under the dictionary/ directory — especially those involving binary search or similar logic — it would be good to add more test coverage. You might refer to the test cases in the Java Paimon's test/ directory for inspiration.

@fafacao86
Copy link
Author

For smoother reviews in the future, could we consider splitting such changes into smaller, more focused PRs? It would help us dive deeper into the details during review.

Yeah, that's also what I thought. I'll break down the tasks under #146 and address the code quality and test coverage feedback from this review. I'll submit smaller, more focused PRs going forward.

Thank you for the thorough review — I've learned A LOT from it! I'm still relatively new to C++, sorry for any odd coding patterns/style you came across 😸

this, arrow_type, test_data, &serialized_bytes)));
ASSERT_OK_AND_ASSIGN(auto eq_10_5_result, reader->VisitEqual(Literal(10.5)));
CheckResult(eq_10_5_result, {0, 2}); // positions with value 10.5
ASSERT_OK_AND_ASSIGN(auto eq_20_3_result, reader->VisitEqual(Literal(20.3)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I'm a bit curious — could floating-point precision errors (e.g., for float/double) lead to incorrect index lookup or recall results?
For example, when comparing keys or computing positions, small rounding differences might cause mismatches.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed something which requires extra care. Java uses JDK provided java.util.Comparator class for float/double comparison. I'll do some investigation and experiment with Java implementation later for this concern. And see what we need to do in paimon-cpp to be compatible with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support RangeBitmap File Index

4 participants