-
Notifications
You must be signed in to change notification settings - Fork 47
Description
Not sure whether the existing row index implementation is correct. The documentation is slightly hard to interpret. Particularly these sections from https://orc.apache.org/docs/spec-index.html:
To record positions, each stream needs a sequence of numbers. For uncompressed streams, the position is the byte offset of the RLE run’s start location followed by the number of values that need to be consumed from the run. In compressed streams, the first number is the start of the compression chunk in the stream, followed by the number of decompressed bytes that need to be consumed, and finally the number of values consumed in the RLE.
For columns with multiple streams, the sequences of positions in each stream are concatenated. That was an unfortunate decision on my part that we should fix at some point, because it makes code that uses the indexes error-prone.