docs: add cpp build guides with examples#376
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces C++ “system/build” dependency-mode guidance and wires up Bazel/CMake configuration to support those flows for the Fluss C++ bindings.
Changes:
- Add Bazel and CMake usage guides for C++ (system/build modes) plus runnable Bazel consumer examples.
- Introduce a Bazel bzlmod extension to provision Arrow C++ (build/system) and a stable
arrow_cpp_depalias target. - Add a
protocbootstrap script and update CMake/Bazel builds to require/propagateprotocfor Rustprost-buildduring C++ builds.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/cpp-cmake-usage.md | Adds CMake system/build usage documentation and validation commands. |
| docs/cpp-bazel-usage.md | Adds Bazel system/build usage documentation, examples, and upgrade notes. |
| bindings/cpp/scripts/ensure_protoc.sh | Adds a helper script intended to download/use a baseline protoc. |
| bindings/cpp/examples/bazel-consumer/system/* | Adds runnable Bazel consumer example for system-mode dependencies. |
| bindings/cpp/examples/bazel-consumer/build/* | Adds runnable Bazel consumer example for build-mode dependencies. |
| bazel/cpp/deps.bzl | Adds the bzlmod extension that provisions Arrow C++ for build/system modes. |
| bazel/cpp/BUILD.bazel | Adds a stable alias arrow_cpp_dep to decouple internal repo naming. |
| bindings/cpp/bazel/cpp/deps.bzl | Duplicates the bzlmod extension under bindings/cpp/ (currently unused). |
| bindings/cpp/bazel/cpp/BUILD.bazel | Duplicates the arrow_cpp_dep alias under bindings/cpp/ (currently unused). |
| bindings/cpp/MODULE.bazel | Registers toolchains and configures the new cpp_sdk extension for local builds. |
| MODULE.bazel | Adds repo-level bzlmod configuration mirroring bindings/cpp/MODULE.bazel. |
| bindings/cpp/CMakeLists.txt | Adds CMake system/build dependency mode selection, tool detection, and Arrow build-mode provisioning. |
| bindings/cpp/BUILD.bazel | Ensures PROTOC is set for cargo genrules; wires in //bazel/cpp:arrow_cpp_dep. |
| bindings/cpp/.gitignore | Adjusts ignore rules to keep Bazel consumer examples while ignoring their artifacts. |
| BUILD.bazel | Adds top-level alias targets (//:fluss_cpp, //:consume_table, etc.) for Bazel consumers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| sed -i \ | ||
| -e 's|path = "../../../../../"|path = "/home/admin/mh/fluss-r2/fluss-rust"|' \ | ||
| -e 's|system_arrow_prefix = "/usr"|system_arrow_prefix = "/tmp/fluss-system-arrow-19.0.1"|' \ | ||
| -e 's|system_arrow_shared_library = "lib/x86_64-linux-gnu/libarrow.so"|system_arrow_shared_library = "lib/libarrow.so"|' \ | ||
| -e 's|system_arrow_runtime_glob = "lib/x86_64-linux-gnu/libarrow.so\\*"|system_arrow_runtime_glob = "lib/libarrow.so*"|' \ | ||
| "$tmp_dir/MODULE.bazel" |
There was a problem hiding this comment.
The troubleshooting example uses sed -i (GNU sed syntax). On macOS/BSD sed, -i requires a backup suffix (e.g. -i ''), so the command as written will fail for macOS users. Consider either using a portable approach (e.g. perl -pi -e ...) or documenting the macOS variant alongside the Linux command.
| sed -i \ | |
| -e 's|path = "../../../../../"|path = "/home/admin/mh/fluss-r2/fluss-rust"|' \ | |
| -e 's|system_arrow_prefix = "/usr"|system_arrow_prefix = "/tmp/fluss-system-arrow-19.0.1"|' \ | |
| -e 's|system_arrow_shared_library = "lib/x86_64-linux-gnu/libarrow.so"|system_arrow_shared_library = "lib/libarrow.so"|' \ | |
| -e 's|system_arrow_runtime_glob = "lib/x86_64-linux-gnu/libarrow.so\\*"|system_arrow_runtime_glob = "lib/libarrow.so*"|' \ | |
| "$tmp_dir/MODULE.bazel" | |
| perl -pi -e ' | |
| s|path = "../../../../../"|path = "/home/admin/mh/fluss-r2/fluss-rust"|; | |
| s|system_arrow_prefix = "/usr"|system_arrow_prefix = "/tmp/fluss-system-arrow-19.0.1"|; | |
| s|system_arrow_shared_library = "lib/x86_64-linux-gnu/libarrow.so"|system_arrow_shared_library = "lib/libarrow.so"|; | |
| s|system_arrow_runtime_glob = "lib/x86_64-linux-gnu/libarrow.so\*"|system_arrow_runtime_glob = "lib/libarrow.so\*|; | |
| ' "$tmp_dir/MODULE.bazel" |
| normalize_version_for_protoc_release() { | ||
| local v="$1" | ||
| # Protobuf release packaging switched from v3.x.y to vX.Y for newer versions. | ||
| # For our current agreed baseline (3.25.5), the protoc archive/tag is 25.5. | ||
| if [[ "$v" =~ ^3\.([0-9]+\.[0-9]+)$ ]]; then | ||
| local stripped="${BASH_REMATCH[1]}" | ||
| local major="${stripped%%.*}" | ||
| if [[ "$major" -ge 21 ]]; then | ||
| echo "$stripped" | ||
| return 0 | ||
| fi | ||
| fi | ||
| echo "$v" | ||
| } |
There was a problem hiding this comment.
normalize_version_for_protoc_release() doesn’t handle baseline versions like 3.25.5 correctly. With the current regex it will return 3.25.5, so the script will try to download .../releases/download/v3.25.5/protoc-3.25.5-...zip, which doesn’t exist for modern protobuf releases (they use tags like v25.5 / archives like protoc-25.5-...). Please update the normalization to map 3.<major>.<patch> to <major>.<patch> for the affected ranges (e.g. 3.25.5 -> 25.5) and keep version_matches_baseline() consistent with that mapping.
| if command -v wget >/dev/null 2>&1; then | ||
| local wget_args=() | ||
| if [[ -n "${https_proxy:-}" || -n "${http_proxy:-}" ]]; then | ||
| wget_args+=(--no-check-certificate -e use_proxy=yes) | ||
| if [[ -n "${https_proxy:-}" ]]; then | ||
| wget_args+=(-e "https_proxy=${https_proxy}") | ||
| fi | ||
| if [[ -n "${http_proxy:-}" ]]; then | ||
| wget_args+=(-e "http_proxy=${http_proxy}") | ||
| fi |
There was a problem hiding this comment.
Similar to the curl path, the wget proxy path disables TLS verification (--no-check-certificate). This should not be enabled automatically just because a proxy is configured; keep verification on by default and require an explicit opt-in for insecure downloads.
| download_file() { | ||
| local url="$1" | ||
| local out="$2" | ||
|
|
||
| if command -v curl >/dev/null 2>&1; then | ||
| if [[ -n "${https_proxy:-}" || -n "${http_proxy:-}" ]]; then | ||
| curl -fLk "$url" -o "$out" | ||
| else | ||
| curl -fL "$url" -o "$out" | ||
| fi | ||
| return 0 | ||
| fi | ||
|
|
||
| if command -v wget >/dev/null 2>&1; then | ||
| local wget_args=() | ||
| if [[ -n "${https_proxy:-}" || -n "${http_proxy:-}" ]]; then | ||
| wget_args+=(--no-check-certificate -e use_proxy=yes) | ||
| if [[ -n "${https_proxy:-}" ]]; then | ||
| wget_args+=(-e "https_proxy=${https_proxy}") | ||
| fi | ||
| if [[ -n "${http_proxy:-}" ]]; then | ||
| wget_args+=(-e "http_proxy=${http_proxy}") | ||
| fi | ||
| fi | ||
| wget "${wget_args[@]}" -O "$out" "$url" | ||
| return 0 | ||
| fi | ||
|
|
||
| echo "ERROR: neither curl nor wget is available for downloading protoc." >&2 | ||
| return 1 | ||
| } | ||
|
|
||
| ensure_zip_tools() { | ||
| command -v unzip >/dev/null 2>&1 || { | ||
| echo "ERROR: unzip not found." >&2 | ||
| exit 1 | ||
| } | ||
| } | ||
|
|
||
| setup_proxy_env | ||
| ensure_zip_tools | ||
|
|
||
| if command -v protoc >/dev/null 2>&1; then | ||
| existing_out="$(protoc --version 2>/dev/null || true)" | ||
| if [[ "$existing_out" =~ ([0-9]+\.[0-9]+\.[0-9]+) ]]; then | ||
| existing_ver="${BASH_REMATCH[1]}" | ||
| if version_matches_baseline "$existing_ver" "$PROTOBUF_BASELINE_VERSION"; then | ||
| command -v protoc | ||
| exit 0 | ||
| fi | ||
| fi | ||
| fi | ||
|
|
||
| PROTOC_RELEASE_VERSION="$(normalize_version_for_protoc_release "$PROTOBUF_BASELINE_VERSION")" | ||
| PROTOC_ARCHIVE="protoc-${PROTOC_RELEASE_VERSION}-${PROTOC_OS}-${PROTOC_ARCH}.zip" | ||
| PROTOC_URL="https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_RELEASE_VERSION}/${PROTOC_ARCHIVE}" | ||
| PROTOC_PREFIX="${PROTOC_INSTALL_ROOT}/protoc-${PROTOC_RELEASE_VERSION}-${PROTOC_OS}-${PROTOC_ARCH}" |
There was a problem hiding this comment.
The download_file helper disables TLS certificate verification (curl -fLk and wget --no-check-certificate) when proxy variables are set and also downloads the protoc binary from GitHub without any checksum or signature verification. This makes it possible for a network attacker (e.g., via a malicious or compromised proxy or MITM) or a compromised distribution channel to serve a tampered protoc archive that will be executed in your build environment. Harden this by enforcing TLS certificate verification and validating the downloaded archive against a pinned checksum (or signature) before extracting and using the binary.
| PROTOBUF_BASELINE_VERSION="${PROTOBUF_BASELINE_VERSION:-3.25.5}" | ||
| PROTOC_INSTALL_ROOT="${PROTOC_INSTALL_ROOT:-/tmp/fluss-cpp-tools}" |
There was a problem hiding this comment.
PROTOC_INSTALL_ROOT defaults to a shared /tmp/fluss-cpp-tools directory, and the script trusts any existing protoc binary found there without validating its origin or integrity. On multi-user systems another local user can pre-populate that cache path with a malicious executable that will then be used by your build, enabling code execution in the victim’s context. Use a cache location under the invoking user’s home directory (or another non-world-writable path) and/or validate ownership and contents before trusting an existing binary.
bazel/cpp/deps.bzl
Outdated
| http_archive( | ||
| name = "apache_arrow_cpp", | ||
| urls = arrow_version["urls"], | ||
| strip_prefix = arrow_version["strip_prefix"], | ||
| # TODO: Pin sha256/integrity once release packaging is finalized. | ||
| patch_cmds = _ARROW_PATCH_CMDS, | ||
| build_file_content = _render_arrow_build_file(tag), | ||
| ) |
There was a problem hiding this comment.
The Arrow C++ source is fetched via http_archive from GitHub without a pinned sha256 or other integrity check. If the remote tarball or the network path is compromised, Bazel will transparently build and link against attacker-controlled code in apache_arrow_cpp. Add a sha256 (or equivalent integrity mechanism) for the Arrow archive to ensure the fetched contents match the expected release.
| http_archive( | ||
| name = "apache_arrow_cpp", | ||
| urls = arrow_version["urls"], | ||
| strip_prefix = arrow_version["strip_prefix"], | ||
| # TODO: Pin sha256/integrity once release packaging is finalized. | ||
| patch_cmds = _ARROW_PATCH_CMDS, | ||
| build_file_content = _render_arrow_build_file(tag), | ||
| ) |
There was a problem hiding this comment.
The Arrow C++ source is fetched via http_archive from GitHub without a pinned sha256 or other integrity check. If the remote tarball or the network path is compromised, Bazel will transparently build and link against attacker-controlled code in apache_arrow_cpp. Add a sha256 (or equivalent integrity mechanism) for the Arrow archive to ensure the fetched contents match the expected release.
| FetchContent_Declare( | ||
| apache_arrow_src | ||
| URL ${FLUSS_CPP_ARROW_SOURCE_URL} | ||
| SOURCE_SUBDIR cpp | ||
| ) | ||
| FetchContent_MakeAvailable(apache_arrow_src) |
There was a problem hiding this comment.
FetchContent_Declare(apache_arrow_src ...) pulls the Arrow C++ source from ${FLUSS_CPP_ARROW_SOURCE_URL} without specifying a hash or other integrity verification. If the download is tampered with (e.g., compromised mirror, DNS poisoning, or MITM), CMake will build and link against attacker-controlled Arrow code in your C++ bindings. Add an integrity check (such as URL_HASH with a pinned digest) for the Arrow archive to ensure the downloaded contents are exactly the expected release.
|
@luoyuxia @fresh-borzoni @leekeiabstraction PTAL if u have time. |
There was a problem hiding this comment.
@zhaohaidao On my Mac M1 bazel build failed, let me check if it builds in docker linux
UPD
worked in docker in linux(arm)
fresh-borzoni
left a comment
There was a problem hiding this comment.
@zhaohaidao Ty, looks good overall, left a couple of comments, PTAL
| @@ -1 +1 @@ | |||
| build/ | |||
There was a problem hiding this comment.
wouldn't it match examples/bazel-consumer/build ?
There was a problem hiding this comment.
Good catch. build/ at the top can be confusing here.
!examples/bazel-consumer/** already re-includes this path, but I added explicit exceptions for clarity and safety:
!examples/bazel-consumer/build/!examples/bazel-consumer/build/**
This makes it unambiguous that examples/bazel-consumer/build is intentionally tracked.
1afb46f to
72bf7af
Compare
Thanks, Comments ares addessed. PTAL @fresh-borzoni |
luoyuxia
left a comment
There was a problem hiding this comment.
@zhaohaidao Thanks for the pr. Left minor comments.
one more question, do you think we need to publish fluss-cpp to bazel registry?
| module(name = "fluss_cpp_consumer_build") | ||
|
|
||
| bazel_dep(name = "rules_cc", version = "0.2.14") | ||
| bazel_dep(name = "fluss-cpp", version = "0.0.0") |
There was a problem hiding this comment.
the next release version is 0.1.0? Do you need to change to 0.1.0 in here?
There was a problem hiding this comment.
Here both examples use local_path_override(...), so the source is resolved from the local repository path, not from the registry version. In that setup, 0.0.0 vs 0.1.0 does not change build behavior.
I updated both examples to 0.1.0 for consistency.
| module(name = "fluss_cpp_consumer_system") | ||
|
|
||
| bazel_dep(name = "rules_cc", version = "0.2.14") | ||
| bazel_dep(name = "fluss-cpp", version = "0.0.0") |
There was a problem hiding this comment.
dito:
whether the version should be changed?
| @@ -0,0 +1,284 @@ | |||
| # Fluss C++ Bazel Usage Guide (System / Build Modes) | |||
There was a problem hiding this comment.
we move document to fluss-rust/website/docs/user-guide/cpp/installation.md for cpp. Can we also move these two document to the website. I guess we may need to merge some comments with installation.md.
Maybe we can do in another pr if you find it'll cost some efforts.
There was a problem hiding this comment.
Makes sense. We should move/merge these docs into website/docs/user-guide/cpp/installation.md. To keep this PR focused on Bazel system/build fixes, I suggest doing the website-doc consolidation in a follow-up PR
fresh-borzoni
left a comment
There was a problem hiding this comment.
@zhaohaidao LGTM, TY
Given the current scope, where Arrow is the main external dependency, it may be more practical to keep the source-based build path as the primary approach for now, and consider Bazel registry-first adoption once Bazel is most valuable when:
Bazel also adds clear maintenance overhead for us:
Recommendation now:
|
|
@fresh-borzoni @luoyuxia Comments are addressed. Thanks so much for your partient and carefull reviews. PTAL if u have time. |
Purpose
Linked issue: close #xxx
Brief change log
Tests
API and Format
Documentation