


@bharatnc commented on Apr 17, 2025

Follow-up to the previous PR: #69.

Remove the debug validation checks on the table. These checks mainly fail in debug builds, and what they do is validate the metadata of a table. For dictionary indices, it looks like we accept signed or unsigned 32- or 64-bit integers for compatibility. However, the Arrow dictionary builder's `AppendIndices` only accepts signed integers, so we end up appending signed indices while the dictionary schema can still declare unsigned ones. When the validation runs, it checks whether the dictionary schema and the actual column data have the same type. Since it is possible to have a uint64 schema with int64 column data, this leads to errors like the following (and a similar error for uint32):

```
Column data for field 0 with type dictionary<values=string, indices=int64,
ordered=0> is inconsistent with schema dictionary<values=string,
indices=uint64, ordered=0>
```
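For context, here is a minimal sketch (not part of this PR) of how the mismatch can arise through the Arrow C++ public API: the column carries int64 indices, the schema declares uint64 indices, and table validation rejects the combination. The values, field name, and the direct call to `Table::Validate()` are illustrative; in the writer the equivalent check is reached through the `ValidateMeta()` call mentioned below.

```cpp
#include <arrow/api.h>

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Sketch: build a dictionary column with signed (int64) indices, the only kind
// the dictionary builder's AppendIndices path produces, but declare unsigned
// (uint64) indices in the schema, as the compatibility setting allows.
arrow::Status Repro() {
  // Dictionary values and the signed indices that reference them.
  arrow::StringBuilder values_builder;
  std::vector<std::string> dict_values = {"a", "b"};
  ARROW_RETURN_NOT_OK(values_builder.AppendValues(dict_values));
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(values_builder.Finish(&values));

  arrow::Int64Builder indices_builder;
  std::vector<int64_t> signed_indices = {0, 1, 0};
  ARROW_RETURN_NOT_OK(indices_builder.AppendValues(signed_indices));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(indices_builder.Finish(&indices));

  // The column's actual type has int64 indices.
  auto column_type = arrow::dictionary(arrow::int64(), arrow::utf8());
  ARROW_ASSIGN_OR_RAISE(
      auto column,
      arrow::DictionaryArray::FromArrays(column_type, indices, values));

  // The schema, however, advertises uint64 indices.
  auto schema = arrow::schema(
      {arrow::field("f0", arrow::dictionary(arrow::uint64(), arrow::utf8()))});

  auto table = arrow::Table::Make(schema, {column});
  // Returns Invalid: "Column data for field 0 ... is inconsistent with schema ...".
  return table->Validate();
}

int main() {
  std::cerr << Repro().ToString() << std::endl;
  return 0;
}
```

If the schema also uses int64 indices, the same `Validate()` call passes; it is only the mixed signed/unsigned case that trips the debug check.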

Unfortunately, I couldn't find a sane way to preserve this behavior, since it looks like the current behavior exists for compatibility with the pandas format. The setting to support both signed and unsigned dictionary indices was added in ClickHouse/ClickHouse#58519.

This is where the current assert fails:

```
RETURN_NOT_OK(ValidateMeta());
```

which is ultimately checked here:

```
return Status::Invalid("Column data for field ", i, " with type ",
```
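In essence, that check compares each column's type, including the dictionary index type, against the corresponding schema field's type. A paraphrased sketch of what it boils down to (the function name is mine, not Arrow's):

```cpp
#include <arrow/api.h>

// Paraphrased sketch of the metadata validation (not the exact Arrow source):
// every column type must equal the schema field type exactly, so
// dictionary<..., indices=int64> vs dictionary<..., indices=uint64> is
// reported as inconsistent.
arrow::Status ValidateColumnTypes(const arrow::Table& table) {
  for (int i = 0; i < table.num_columns(); ++i) {
    const auto& column_type = *table.column(i)->type();
    const auto& field_type = *table.schema()->field(i)->type();
    if (!column_type.Equals(field_type)) {
      return arrow::Status::Invalid("Column data for field ", i, " with type ",
                                    column_type.ToString(),
                                    " is inconsistent with schema ",
                                    field_type.ToString());
    }
  }
  return arrow::Status::OK();
}
```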

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@bharatnc merged commit 210f686 into ClickHouse/release/16.1.0 on Apr 17, 2025
7 of 66 checks passed
@bharatnc deleted the ncb/remove-asserts branch on April 17, 2025 at 20:12