GH-41560: [C++] ChunkResolver: Implement ResolveMany and add unit tests #41561
There seems to be some (unwarranted? gratuitous?) complexity here. Also, it's not obvious why this would be better than calling Resolve multiple times in a row.
cpp/src/arrow/chunk_resolver.h
Outdated
int64_t index_in_chunk = 0;

/// \brief Create a ChunkLocation without asserting any preconditions.
static ChunkLocation Forge(int64_t chunk_index, int64_t index_in_chunk) {
Can we avoid inventing terminology that's currently not used in the project?
What's your actionable suggestion? I'm trying to make this API less error-prone. `MakeUnsafe`?
Is there an actual need for a checked constructor?
It adds a layer of safety, and by poisoning the `index_in_chunk` value when `chunk_index` is invalid, I (and other users of the library) are more likely to detect unguarded use of `chunk_index` this way.
I don't understand how such unguarded uses could occur. The primary use case for `ChunkLocation` is as a return value for `ChunkResolver`. Unless the user has passed an invalid logical index, `ChunkResolver` should always return valid `ChunkLocation`s.
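To make the point above concrete, here is a minimal sketch of how a resolver can map a logical index to a chunk location. It is not Arrow's implementation: `ChunkLocationSketch`, `MakeOffsets`, and `ResolveSketch` are hypothetical names, and the only behavior taken from this thread is that an out-of-range logical index produces a chunk index equal to the number of chunks.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the ChunkLocation discussed in this thread.
struct ChunkLocationSketch {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

// Build cumulative offsets: offsets[i] is the logical index of the first
// element of chunk i, and offsets.back() is the total logical length.
inline std::vector<int64_t> MakeOffsets(const std::vector<int64_t>& chunk_lengths) {
  std::vector<int64_t> offsets(chunk_lengths.size() + 1, 0);
  for (std::size_t i = 0; i < chunk_lengths.size(); ++i) {
    offsets[i + 1] = offsets[i] + chunk_lengths[i];
  }
  return offsets;
}

// Resolve one logical index with a binary search over the offsets.
// Assumption: an out-of-range index yields chunk_index == number of chunks.
inline ChunkLocationSketch ResolveSketch(const std::vector<int64_t>& offsets,
                                         int64_t logical_index) {
  const int64_t num_chunks = static_cast<int64_t>(offsets.size()) - 1;
  if (logical_index < 0 || logical_index >= offsets.back()) {
    return {num_chunks, 0};
  }
  // upper_bound finds the first offset strictly greater than the index;
  // the chunk that contains the index starts one position before it.
  const auto it = std::upper_bound(offsets.begin(), offsets.end(), logical_index);
  const int64_t chunk_index = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk_index, logical_index - offsets[chunk_index]};
}
```

With chunk lengths `{3, 0, 5}`, every logical index in `[0, 8)` resolves to a non-empty chunk, and only indices outside that range produce the out-of-bounds chunk index `3`.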
You can comment on what you think is "unwarranted" and "gratuitous" complexity. This kind of feedback is demotivating and not actionable. I try to simplify the code before opening a PR instead of pushing the first thing that works.
I tried that in the
In addition to that,
I did post specific comments :-)
I should have done this renaming in a previous PR. Correcting it now.
After learning that GCC can't compile the previous code. Well... this is simpler now.
Sorry for the number of back-and-forths. Here are a number of followup questions and comments, but feel free to merge in any case.
cpp/src/arrow/chunk_resolver.h
Outdated
/// \brief Resolve `n` logical indices to chunk indices.
///
/// \pre 0 <= logical_index_vec[i] < n (for well-defined and valid chunk index results)
You mean `logical_array_length()` rather than `n`?
cpp/src/arrow/chunk_resolver.h
Outdated
///
/// \pre 0 <= logical_index_vec[i] < n (for well-defined and valid chunk index results)
/// \pre out_chunk_index_vec has space for `n_indices`
/// \post chunk_hint in [0, chunks.size()]
Isn't it rather a pre-condition?
/// \param n_indices The number of logical indices to resolve
/// \param logical_index_vec The logical indices to resolve
/// \param out_chunk_index_vec The output array where the chunk indices will be written
/// \param chunk_hint 0 or the last chunk_index produced by ResolveMany
Meaning the caller is supposed to pass the value of `out_chunk_index_vec[n_indices - 1]` from the previous call to `ResolveMany`?
Yes. That's the plan. Or 0 if `n_indices == 0`. That's why I didn't want to write the formula.
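The hint protocol described in this exchange can be sketched as follows. This is a simplified illustration, not the PR's code: `ResolveManySketch` and `NextHint` are hypothetical names, the signature differs from the real `ResolveMany`, and the fallback search strategy is an assumption. The only parts taken from the discussion are that the next hint is the last chunk index written by the previous call, or 0 if that call resolved no indices, and that the hint lies in `[0, chunks.size()]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified resolver. offsets[i] is the first logical index of
// chunk i and offsets.back() is the total logical length.
inline void ResolveManySketch(const std::vector<int64_t>& offsets,
                              const std::vector<int64_t>& logical_indices,
                              int32_t chunk_hint,
                              std::vector<int32_t>* out_chunk_index) {
  const int32_t num_chunks = static_cast<int32_t>(offsets.size()) - 1;
  assert(chunk_hint >= 0 && chunk_hint <= num_chunks);
  for (const int64_t index : logical_indices) {
    int32_t chunk = num_chunks;  // out-of-bounds indices map to num_chunks
    if (index >= 0 && index < offsets.back()) {
      // Cheap check first: does the index fall in the hinted chunk?
      const bool hint_hit = chunk_hint < num_chunks &&
                            index >= offsets[chunk_hint] &&
                            index < offsets[chunk_hint + 1];
      if (hint_hit) {
        chunk = chunk_hint;
      } else {
        // Otherwise fall back to a binary search over all offsets.
        const auto it = std::upper_bound(offsets.begin(), offsets.end(), index);
        chunk = static_cast<int32_t>(it - offsets.begin()) - 1;
      }
    }
    out_chunk_index->push_back(chunk);
    chunk_hint = chunk;  // the latest result becomes the hint for the next index
  }
}

// Hint for the next call: the last chunk index produced, or 0 if the previous
// call resolved no indices.
inline int32_t NextHint(const std::vector<int32_t>& chunk_indices) {
  return chunk_indices.empty() ? 0 : chunk_indices.back();
}
```

A caller processing indices in batches would pass `NextHint(previous_output)` as the `chunk_hint` of the following call, which pays off when consecutive logical indices tend to land in the same chunk.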
/// \param chunk_hint 0 or the last chunk_index produced by ResolveMany
/// \param out_index_in_chunk_vec If not NULLPTR, the output array where the
/// within-chunk indices will be written
/// \return false iff chunks.size() > std::numeric_limits<IndexType>::max()
I'm assuming `ResolveMany` will be invoked in batches? This condition doesn't need to be checked in each `ResolveMany` call, only once for each call to the larger operation (such as Take). That's not necessarily a problem.
> I'm assuming ResolveMany will be invoked in batches?

It could be. I'm not doing it like that for `Take` at the moment. It's a very predictable branch though and I'm allowing it to be inlined at the caller.
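For reference, the condition in the `\return` doc line above amounts to a check along these lines. This is a sketch only: `ChunkCountFits` is a hypothetical helper name and the PR may structure the check differently. It returns true exactly when the documented failure condition `chunks.size() > std::numeric_limits<IndexType>::max()` does not hold.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when every chunk index, including the value
// num_chunks itself (used for out-of-bounds results), fits in IndexType.
template <typename IndexType>
constexpr bool ChunkCountFits(std::size_t num_chunks) {
  return static_cast<uint64_t>(num_chunks) <=
         static_cast<uint64_t>(std::numeric_limits<IndexType>::max());
}

static_assert(ChunkCountFits<uint8_t>(255), "255 chunks fit in uint8_t");
static_assert(!ChunkCountFits<uint8_t>(256), "256 chunks do not fit in uint8_t");
```

Because the result depends only on the chunk count and the index type, a caller that invokes the resolver in batches could evaluate it once per larger operation, as suggested above.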
cpp/src/arrow/chunk_resolver.h
Outdated
//
// Negative logical indices can become large values when cast to unsigned, but
// they are gracefully handled by ResolveManyImpl. Although both the chunk index
// and the index in chunk values will be undefined in these cases.
That doesn't seem to be the case: if a logical index is negative, its unsigned counterpart will be out of bounds and the corresponding `out_chunk_index_vec` value will therefore be equal to `chunks_.size()`.
You're right. The overflow check guarantees it's impossible for negative logical indices to become valid indices.
Not really. `INT8_MIN` becomes 128 when cast to `uint8_t`, `-1` becomes 255, so depending on the chunks, they won't necessarily be out-of-bounds logical indices.
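The arithmetic behind this correction can be checked with a short sketch. `ToUnsigned8` and `InBoundsAfterCast` are hypothetical helpers written for illustration and are not part of the PR; the logical lengths 200 and 100 are made-up example values.

```cpp
#include <cstdint>

// Conversion to an unsigned type is modular: the result is the value mod 2^8.
constexpr uint8_t ToUnsigned8(int8_t value) { return static_cast<uint8_t>(value); }

// Hypothetical bounds check performed after the cast, as a resolver working
// on unsigned indices would see it.
constexpr bool InBoundsAfterCast(int8_t logical_index, uint8_t logical_length) {
  return ToUnsigned8(logical_index) < logical_length;
}

static_assert(ToUnsigned8(INT8_MIN) == 128, "-128 maps to 128");
static_assert(ToUnsigned8(-1) == 255, "-1 maps to 255");
// With a logical length of 200, the cast of INT8_MIN looks like a valid index.
static_assert(InBoundsAfterCast(INT8_MIN, 200), "128 < 200");
// With a logical length of 100, the same value is out of bounds.
static_assert(!InBoundsAfterCast(INT8_MIN, 100), "128 >= 100");
```

So whether a negative index is caught as out of bounds depends on the total logical length relative to the cast value, which is why the comment in the header needed tweaking.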
I'm tweaking the comment here and improving the tests.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit e04f5b4. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 16 possible false positives for unstable benchmarks that are known to sometimes produce them.
…it tests (apache#41561)

### Rationale for this change

I want `ResolveMany` to support me in the implementation of `Take` that doesn't `Concatenate` all the chunks from a `ChunkedArray` `values` parameter.

### What changes are included in this PR?

- Implementation of `ChunkResolver::ResolveMany()`
- Addition of missing unit tests for `ChunkResolver`

### Are these changes tested?

Yes. By new unit tests.

### Are there any user-facing changes?

No. `ChunkResolver` is an internal API at the moment (see apache#34535 for future plans).

* GitHub Issue: apache#41560

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>