Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41692: [Python] Improve substrait extended expressions support #41693

Merged
merged 10 commits into from
Oct 16, 2024

Conversation

amol-
Copy link
Member

@amol- amol- commented May 16, 2024

Addresses some missing features and usability issues when using PyArrow with Substrait ExtendedExpressions

  • Allow passing BoundExpressions for Scanner(columns=X) instead of a dict of expressions.
  • Allow passing BoundExpressions for Scanner(filter=X) so that user doesn't have to distinguish between Expression and BoundExpressions and can always just use pyarrow.substrait.deserialize_expressions
  • Allow decoding pyarrow.BoundExpressions directly from protobuf.Message, thus allowing to use substrait-python objects.
  • Return memoryview from methods encoding substrait, so that those can be directly passed to substrait-python (or more in general other python libraries) without a copy being involved.
  • Allow decoding messages from memoryview so that the output of encoding functions can be sent back to dencoding functions.
  • Allow to encode and decode schemas from substrait
  • When encoding schemas return the extension types required for a substrait consumer to decode the schema
  • Handle arrow extension types when decoding a schema
  • Update docstrings and documentation

Copy link

⚠️ GitHub issue #41692 has been automatically assigned in GitHub to PR creator.

@@ -56,7 +56,7 @@ Status ParseFromBufferImpl(const Buffer& buf, const std::string& full_name,
if (message->ParseFromZeroCopyStream(&buf_stream)) {
return Status::OK();
}
return Status::IOError("ParseFromZeroCopyStream failed for ", full_name);
return Status::Invalid("ParseFromZeroCopyStream failed for ", full_name);
Copy link
Member Author

@amol- amol- Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed odd that it failed with an IOError given that the ArrayInputStream was built from already in-memory Buffer, so no IO was involved. If it fails, it actually means that the Substrait data is invalid

@github-actions github-actions bot added awaiting committer review Awaiting committer review Component: Documentation and removed awaiting review Awaiting review labels Jul 3, 2024
@amol-
Copy link
Member Author

amol- commented Jul 3, 2024

The failures seem to be related to #42149 and #43134

@amol-
Copy link
Member Author

amol- commented Jul 4, 2024

Marking as ready for review as the failures seem to be unrelated, @jorisvandenbossche would you mind reviewing this when you have the chance?

@amol- amol- marked this pull request as ready for review July 4, 2024 13:14
@amol- amol- requested a review from westonpace as a code owner July 4, 2024 13:14
Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though my knowledge of this is limited I'll take some time to review tomorrow. Would you mind rebasing to see if the failing CI issues are solved (I haven't seen a bunch of those jobs failing for a while).
I plan to start 18.0.0 next Wednesday and would love to ensure this won't break CI.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Oct 6, 2024
@amol-
Copy link
Member Author

amol- commented Oct 7, 2024

👍 rebased @raulcd

@github-actions github-actions bot added the awaiting change review Awaiting change review label Oct 7, 2024
@raulcd
Copy link
Member

raulcd commented Oct 7, 2024

@github-actions crossbow submit preview-docs

Copy link

github-actions bot commented Oct 7, 2024

Revision: d24f0f0

Submitted crossbow builds: ursacomputing/crossbow @ actions-1e09098ee5

Task Status
preview-docs GitHub Actions

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though my knowledge of this is limited I'll take some time to review tomorrow.

I've tried to review but to be honest my knowledge around this is pretty limited so in general the change LGTM but would let you / others decide whether to merge this.

CI and docs look good to me

docs/source/python/integration/substrait.rst Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 7, 2024
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 7, 2024
@amol-
Copy link
Member Author

amol- commented Oct 16, 2024

going to merge this as it shouldn't impact user base at the moment and it didn´t get further comments. We can revise if there are concerns as there is time before v19

@amol- amol- merged commit e0ab40d into apache:main Oct 16, 2024
37 of 39 checks passed
@amol- amol- deleted the GH-41692 branch October 16, 2024 12:58
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit e0ab40d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants