[C++] Request to move ChunkResolver to public API #34535
Comments
If it's going to be a public API then I think it would be good to have some tests for it. Otherwise, this seems like a reasonable utility to make available. |
Hey @anjakefala and @westonpace! Is this something that would still be reasonable to work on? I'd like to work on this if so, and my plan would be:
|
Oh thanks for letting me know @SChakravorti21! It absolutely would be. @felipecrv has recently been putting love into the Please give us a poke here when you have a PR open. =) |
@SChakravorti21 I had an idea since my changes to the class interface that you could add maybe? I added inline ChunkLocation ResolveWithChunkIndexHint(int64_t index,
int64_t cached_chunk_index) const { I was planning to add a inline ChunkLocation ResolveWithHint(int64_t index, ChunkLocation hint) const {
assert(hint.chunk_index < static_cast<int64_t>(offsets_.size()));
const auto chunk_index =
ResolveChunkIndex</*StoreCachedChunk=*/false>(index, hint.chunk_index);
return {chunk_index, index - offsets_[chunk_index]};
} |
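For readers following along, here is a minimal standalone sketch of what a hinted lookup does, assuming the resolver stores cumulative chunk offsets as in the snippet above. `OffsetResolver` and its constructor are hypothetical stand-ins written against the standard library only; this is not Arrow's implementation, which also caches the last resolved chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

// Hypothetical stand-in, not Arrow's class. offsets_[i] is the logical index
// of the first element of chunk i; offsets_.back() is the total length.
class OffsetResolver {
 public:
  explicit OffsetResolver(const std::vector<int64_t>& chunk_lengths) {
    offsets_.reserve(chunk_lengths.size() + 1);
    int64_t total = 0;
    offsets_.push_back(total);
    for (int64_t len : chunk_lengths) {
      total += len;
      offsets_.push_back(total);
    }
  }

  // Plain lookup: binary search for the last offset that is <= index.
  ChunkLocation Resolve(int64_t index) const {
    assert(index >= 0 && index < offsets_.back());
    auto it = std::upper_bound(offsets_.begin(), offsets_.end(), index);
    const int64_t chunk_index = static_cast<int64_t>(it - offsets_.begin()) - 1;
    return {chunk_index, index - offsets_[chunk_index]};
  }

  // Hinted lookup: when index falls inside the hinted chunk, skip the search.
  ChunkLocation ResolveWithHint(int64_t index, ChunkLocation hint) const {
    const int64_t num_chunks = static_cast<int64_t>(offsets_.size()) - 1;
    assert(hint.chunk_index >= 0 && hint.chunk_index < num_chunks);
    const int64_t c = hint.chunk_index;
    if (index >= offsets_[c] && index < offsets_[c + 1]) {
      return {c, index - offsets_[c]};
    }
    return Resolve(index);
  }

 private:
  std::vector<int64_t> offsets_;
};
```

The hint pays off when consecutive lookups tend to land in the same chunk; otherwise the call degrades to the plain binary search.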
@felipecrv Sounds like a reasonable change, I agree that it looks easier to use! Will let you folks know when I've opened a PR 👍🏼 |
Hey folks, just wanted to give a heads up that I've created a draft PR at #40226 for the work on this so far. I have a question: I noticed that there is also a |
Hmm, it's better to keep That and the fact that you would have to type the functions based on all the types that are used to represent an array of array chunks. |
Sorry if I wasn't clear - the /// BEGIN: Copied from `chunked_internal.h`
template <typename ArrayType>
struct ResolvedChunk {
using ViewType = GetViewType<typename ArrayType::TypeClass>;
using LogicalValueType = typename ViewType::T;
const ArrayType* array;
const int64_t index;
ResolvedChunk(const ArrayType* array, int64_t index) : array(array), index(index) {}
bool IsNull() const { return array->IsNull(index); }
LogicalValueType Value() const { return ViewType::LogicalValue(array->GetView(index)); }
};
template <>
struct ResolvedChunk<Array> {
const Array* array;
const int64_t index;
ResolvedChunk(const Array* array, int64_t index) : array(array), index(index) {}
bool IsNull() const { return array->IsNull(index); }
};
/// END: Copied from `chunked_internal.h`
struct ChunkLocation {
int64_t chunk_index;
int64_t index_in_chunk;
};
template <typename ArrayType = Array>
class ChunkResolver {
public:
inline ResolvedChunk<ArrayType> Resolve(int64_t index) const;
inline ChunkLocation ResolveLocation(int64_t index) const;
inline ChunkLocation ResolveLocationWithHint(ChunkLocation hint) const;
private:
std::vector<int64_t> offsets_;
mutable std::atomic<int64_t> cached_chunk_;
// This contains `shared_ptr`s to simplify usage for some applications (to avoid
// having to hold onto the original Arrays), but we would only return raw pointers
// to avoid the reference counting overhead of copying `shared_ptr`s.
std::vector<std::shared_ptr<Array>> chunks_;
}; I agree that maintaining the performance of
I'm sorry, I'm not quite sure I understood this part. Could you clarify? |
The resolution depends on what chunked thing you're targeting: explicit ChunkResolver(const ArrayVector& chunks);
explicit ChunkResolver(const std::vector<const Array*>& chunks);
explicit ChunkResolver(const RecordBatchVector& batches); This is the problem (code from your proposal): template <typename ArrayType = Array>
class ChunkResolver { Now it becomes impossible to have a non-inline function [1] in the [1] I have a plan to add batched/vectorized versions of |
Edit: on second read, I think your concern might be more targeted towards compile times rather than performance, is that right? In that case, we can have the class ChunkResolver {
public:
template <typename ArrayType = Array>
inline ResolvedChunk<ArrayType> Resolve(int64_t index) const;
inline ChunkLocation ResolveLocation(int64_t index) const;
inline ChunkLocation ResolveLocationWithHint(ChunkLocation hint) const;
private:
std::vector<int64_t> offsets_;
mutable std::atomic<int64_t> cached_chunk_;
std::vector<std::shared_ptr<Array>> chunks_;
}; This way we can add other methods in the future with definitions in |
Not primarily, binary size is the primary concern. As it is now, we don't have to inline the
You can have crashes if the user passes the wrong type parameter to the method call and in your solution, you're adding There is a way here based on composition that you will probably like. We can bring most of I recommend you to do that on a separate PR as template <typename ArrayType = std::shared_ptr<Array>>
class ArrayChunkResolver {
public:
using ArrayPtr = /* template thingie that gives you the raw-est pointer type from `ArrayType` */;
private:
ChunkResolver _resolver; // owns a vector of chunk offsets and assumes chunk sizes are immutable
std::vector<ArrayType> &chunks_; // the resolver doesn't own the vector of chunks
public:
// ctor, move-ctor, and move-assign.
inline ArrayPtr Resolve(int64_t index) const;
inline ChunkLocation ResolveLocation(int64_t index) const; // delegate to _resolver
inline ChunkLocation ResolveLocationWithHint(ChunkLocation hint) const; // delegate to _resolver
}; |
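A rough standalone sketch of the composition idea above, assuming the point is to keep the offset search in one non-template class so that only a thin wrapper is instantiated per chunk type. `OffsetsOnlyResolver` and `ChunksView` are hypothetical names, and the chunk type only needs a `size()` method here; none of this is Arrow's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

// Non-template core: knows only about cumulative offsets, so its code is
// compiled once no matter how many chunk types wrap it.
class OffsetsOnlyResolver {
 public:
  explicit OffsetsOnlyResolver(std::vector<int64_t> offsets)
      : offsets_(std::move(offsets)) {}

  ChunkLocation Resolve(int64_t index) const {
    assert(index >= 0 && index < offsets_.back());
    auto it = std::upper_bound(offsets_.begin(), offsets_.end(), index);
    const int64_t chunk_index = static_cast<int64_t>(it - offsets_.begin()) - 1;
    return {chunk_index, index - offsets_[chunk_index]};
  }

 private:
  std::vector<int64_t> offsets_;
};

// Thin typed wrapper: does not own the chunks, delegates all index math.
template <typename Chunk>
class ChunksView {
 public:
  explicit ChunksView(const std::vector<std::shared_ptr<Chunk>>& chunks)
      : resolver_(MakeOffsets(chunks)), chunks_(chunks) {}

  ChunkLocation ResolveLocation(int64_t index) const {
    return resolver_.Resolve(index);
  }

  // Returns a raw pointer to avoid copying a shared_ptr on every lookup.
  const Chunk* ResolveChunk(int64_t index) const {
    return chunks_[resolver_.Resolve(index).chunk_index].get();
  }

 private:
  static std::vector<int64_t> MakeOffsets(
      const std::vector<std::shared_ptr<Chunk>>& chunks) {
    std::vector<int64_t> offsets{0};
    for (const auto& chunk : chunks) {
      offsets.push_back(offsets.back() + static_cast<int64_t>(chunk->size()));
    }
    return offsets;
  }

  OffsetsOnlyResolver resolver_;
  const std::vector<std::shared_ptr<Chunk>>& chunks_;
};
```

Because the wrapper holds a reference, the vector of chunks has to outlive it and keep its chunk sizes unchanged, which matches the assumption stated in the comment above.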
@SChakravorti21 scratch that. I had a better idea (even simpler) and will be pushing a PR soon. I have the code ready on my machine. |
First step: #40281 In a following step, I might add a template param to |
Hey @felipecrv! Sorry I have been unresponsive the last few days, was busy with work and couldn't find much free time outside of work. Everything you have said makes sense to me. I took a look at #40281 and the changes there look great to me.
So just to clarify, is the plan to move both The std::shared_ptr<arrow::Table> table;
arrow::ChunkedArrayResolver resolver_a(table->GetColumnByName("a"));
arrow::ChunkedArrayResolver resolver_b(table->GetColumnByName("b"));
for (int i = 0; i < table->num_rows(); ++i)
{
std::int64_t a = resolver_a.Resolve(i).Value<arrow::Int64Type>();
std::string_view b = resolver_b.Resolve(i).Value<arrow::StringType>();
do_business_logic(a, b);
} |
Not my plan, but after you mentioned you wanted to unify If it were to go public, that would be done in another issue/PR pair, so for now it's better to focus just on making About the code: for (int i = 0; i < table->num_rows(); ++i)
{
std::int64_t a = resolver_a.Resolve(i).Value<arrow::Int64Type>();
std::string_view b = resolver_b.Resolve(i).Value<arrow::StringType>();
do_business_logic(a, b);
} It's important to delay the re-construction of row-by-row values as much as possible to preserve the benefits of columnar layouts. So there is a danger [1] in exposing too many APIs that work value-by-value instead of array-by-array. Array-by-array is not compatible with how most people think about programming. The loop above is performing sequential access, so using the chunk resolver is not the best solution regarding locality of the memory accesses [2]. It's better to keep one [1] Apache Arrow, as a columnar data library, is built to keep most computation on top of the columnar representation. You're of course allowed to do whatever you need to solve your application problem, but Arrow itself exposing row-by-row APIs would send the wrong message |
Gotcha. Yes, I agree, we can focus on just
I agree 100% that we don't want to send the wrong message, and that people should always try to frame their logic in terms of vectorized operations. That said, there are practical use-cases where there is no way of getting around row-major processing of the data. I've been thinking about this and came up with an alternative way that may be better, and would be interested to hear your thoughts on it (pseudocode): for (auto maybe_batch : arrow::TableBatchReader(*table))
{
std::shared_ptr<arrow::RecordBatch> batch = maybe_batch.ValueOrDie();
// User decides whether they want a safe or unsafe cast
std::shared_ptr<arrow::Int64Array> a = std::dynamic_pointer_cast<arrow::Int64Array>(batch->GetColumnByName("a"));
std::shared_ptr<arrow::StringArray> b = std::dynamic_pointer_cast<arrow::StringArray>(batch->GetColumnByName("b"));
for (int i = 0; i < batch->num_rows(); ++i) {
do_business_logic(a->Value(i), b->Value(i));
}
} This is still sequential but (I think) avoids a lot of unnecessary overhead. In that case, it might be good enough to make |
There is a case for making Without knowing what But my main problem with your code (and let me be more direct this time with what I mean by "random access") is that you're using To that, we can add these two to /// \pre loc.chunk_index >= 0
/// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
inline bool Valid(ChunkLocation loc) const {
const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 1;
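// offsets_ is cumulative, so the last chunk's length is a difference of offsets
return loc.chunk_index + 1 < last_chunk_index ||
(loc.chunk_index + 1 == last_chunk_index &&
loc.index_in_chunk < offsets_[last_chunk_index] - offsets_[loc.chunk_index]);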
}
/// \pre Valid(loc)
inline ChunkLocation Next(ChunkLocation loc) const {
const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
return (next_index_in_chunk < offsets_[loc.chunk_index + 1] - offsets_[loc.chunk_index])
? ChunkLocation{loc.chunk_index, next_index_in_chunk}
: ChunkLocation{loc.chunk_index + 1, 0};
} Then your loops can be: ChunkResolver resolver(batches);
for (ChunkLocation loc{0, 0}; resolver.Valid(loc); loc = resolver.Next(loc)) {
// re-use loc for all the typed columns since they are split on the same offsets
} |
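As a self-contained illustration of the `Valid`/`Next` idea, here is a sketch over plain `std::vector` chunks, assuming cumulative offsets (so a chunk's length is the difference of two adjacent offsets) and two columns split on the same boundaries. `SequentialResolver`, `Begin`, `SkipEmpty`, and `JoinRows` are hypothetical names for this example; skipping empty chunks is an addition that the snippet above does not address.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
};

// Hypothetical stand-in. offsets must start at 0 and be cumulative.
class SequentialResolver {
 public:
  explicit SequentialResolver(std::vector<int64_t> offsets)
      : offsets_(std::move(offsets)) {}

  // First valid location, skipping any leading empty chunks.
  ChunkLocation Begin() const { return SkipEmpty(ChunkLocation{0, 0}); }

  bool Valid(ChunkLocation loc) const {
    return loc.chunk_index < num_chunks() &&
           loc.index_in_chunk < chunk_length(loc.chunk_index);
  }

  // Precondition: Valid(loc). Advances by one row without any search.
  ChunkLocation Next(ChunkLocation loc) const {
    if (loc.index_in_chunk + 1 < chunk_length(loc.chunk_index)) {
      return {loc.chunk_index, loc.index_in_chunk + 1};
    }
    return SkipEmpty(ChunkLocation{loc.chunk_index + 1, 0});
  }

 private:
  int64_t num_chunks() const { return static_cast<int64_t>(offsets_.size()) - 1; }
  int64_t chunk_length(int64_t c) const { return offsets_[c + 1] - offsets_[c]; }

  ChunkLocation SkipEmpty(ChunkLocation loc) const {
    while (loc.chunk_index < num_chunks() && chunk_length(loc.chunk_index) == 0) {
      ++loc.chunk_index;
    }
    return loc;
  }

  std::vector<int64_t> offsets_;
};

// Two columns chunked on the same boundaries share one location per row.
inline std::vector<std::string> JoinRows(
    const std::vector<std::vector<int64_t>>& a,
    const std::vector<std::vector<std::string>>& b) {
  std::vector<int64_t> offsets{0};
  for (const auto& chunk : a) {
    offsets.push_back(offsets.back() + static_cast<int64_t>(chunk.size()));
  }
  SequentialResolver resolver(std::move(offsets));
  std::vector<std::string> rows;
  for (ChunkLocation loc = resolver.Begin(); resolver.Valid(loc);
       loc = resolver.Next(loc)) {
    rows.push_back(std::to_string(a[loc.chunk_index][loc.index_in_chunk]) + ":" +
                   b[loc.chunk_index][loc.index_in_chunk]);
  }
  return rows;
}
```

Each step is a comparison and an increment, so a sequential walk never pays for the binary search that a per-index `Resolve` call would do.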
For sure, I can make a separate issue for
That makes sense, I didn't fully understand what you meant previously. I think the API additions you're suggesting make sense, but I'm confused how someone would use them to iterate over multiple columns simultaneously. Is there such a thing as a "typed ChunkResolver resolver(batches);
for (ChunkLocation loc{0, 0}; resolver.Valid(loc); loc = resolver.Next(loc)) {
// what is the most efficient way to access the values for each column here?
} The benefit of just iterating over the batches themselves is that we only perform the cast from untyped |
for (ChunkLocation loc{0, 0}; resolver.Valid(loc); loc = resolver.Next(loc)) {
// what is the most efficient way to access the values for each column here?
// I'm not sure this is the most efficient, but it's certainly better than using ChunkedArrayResolver for every array.
// don't forget the null checks (missing here)
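// a_chunks / b_chunks are placeholder names for each column's vector of chunks
int64_t a = checked_cast<const Int64Array&>(*a_chunks[loc.chunk_index]).Value(loc.index_in_chunk);
std::string_view b = checked_cast<const StringArray&>(*b_chunks[loc.chunk_index]).Value(loc.index_in_chunk);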
process_row_oh_no(a, b);
}
Your approach works even better and doesn't need the
Well... documenting won't deter people from using it. See how much we had to discuss here so this nuance could be understood. The better plan is to implement the things that need random access inside Arrow (sort, ranking, take, filter, joins...) and have users use those instead of reaching for random access directly. |
…ts (#41561) ### Rationale for this change I want `ResolveMany` to support me in the implementation of `Take` that doesn't `Concatenate` all the chunks from a `ChunkedArray` `values` parameter. ### What changes are included in this PR? - Implementation of `ChunkResolver::ResolveMany()` - Addition of missing unit tests for `ChunkResolver` ### Are these changes tested? Yes. By new unit tests. ### Are there any user-facing changes? No. `ChunkResolver` is an internal API at the moment (see #34535 for future plans). * GitHub Issue: #41560 Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
### Rationale for this change
Adopting #40226. The creation and return of a shared_ptr does result in some performance overhead, which makes a difference for a performance-sensitive application. If someone could use ChunkResolver to learn the indices, they could then access the data directly.
### What changes are included in this PR?
- [X] Updates to documentation (thanks to @SChakravorti21)
- [X] Moving `ChunkResolver` to public API, and updating all references to it in the code
### Are these changes tested?
There seemed to be comprehensive tests already: https://github.com/apache/arrow/blob/main/cpp/src/arrow/chunked_array_test.cc#L324 If an edge case is missing, I'd be happy to add it.
### Are there any user-facing changes?
`ChunkResolver` and `TypedChunkLocation` are now in the public API.
* GitHub Issue: #34535
Lead-authored-by: Anja Kefala <anja.kefala@gmail.com> Co-authored-by: anjakefala <anja.kefala@gmail.com> Co-authored-by: Bryce Mecum <petridish@gmail.com> Co-authored-by: SChakravorti21 <schakravorti@bloomberg.net> Co-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Issue resolved by pull request 44357 |
Sorry, moving this to 19.0.0 because we might not do more RCs for 18.0.0. I've added the backport-candidate label in case we do one new RC, but it doesn't seem we will. |
Describe the enhancement requested
ChunkResolver.Resolve is used by ChunkedArray.GetScalar() in order to identify which chunk, and which index into that chunk, correspond to a given index into the whole array. GetScalar then does additional work to wrap the value into a Result<std::shared_ptr<Scalar>>. The creation and return of a shared_ptr does result in some performance overhead, which makes a difference for a performance-sensitive application.
If someone could use ChunkResolver to learn the indices, they could then access the data directly. This would provide a more efficient way of indexing into a ChunkedArray, should an application benefit from it.
Would it be possible to move ChunkResolver to public API?
Component(s)
C++