Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44232: [Python] Validate __arrow_c_array__ length for scalar construction #44434

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Lukedowling444
Copy link

@Lukedowling444 Lukedowling444 commented Oct 16, 2024

Rationale for this change

In the current implementation of the pyarrow.scalar constructor, when an input has the arrow_c_array interface, there is no check to ensure the length of the array is exactly 1. This can lead to unexpected behavior or invalid data being accepted for scalar construction.

This change introduces a validation step that checks whether the input array has a length of 1 when using the arrow_c_array interface, ensuring that only valid inputs are processed. This improves the reliability of Arrow when handling data from multiple libraries.

An existing workaround involves calling pa.array(other_scalar).slice(0, 1) to achieve the same result, but this change eliminates the need for such a workaround.

Looking to close #44232

What changes are included in this PR?

A validation check was added inside the pyarrow.scalar constructor to ensure that when an input with the arrow_c_array interface is provided, it must be a length-1 array.
The constructor now raises a ValueError if this condition is not met.

Are these changes tested?

Yes, the project builds successfully, and all existing test cases pass without any issues.

Additionally, I have included new test cases to specifically check the behavior of this change in python/scripts/test_scalar.py. These tests ensure that the pyarrow.scalar constructor raises a ValueError when a non-length-1 array is provided and continues to work correctly for valid inputs.

Are there any user-facing changes?

No user-facing changes are introduced, and this validation should not affect any workflows where inputs to pyarrow.scalar are already correct.

GitHub Issue: #44232

@Isaac7777-cpu-school
Copy link

Hi @lukedowling44. This looks awsome. However, I think you might have accidentally commited your build folder (\dist) and your virtual environment (\pyarrow-dev) as well. I think these files are better not get committed to the git repository.

@piratepanda805
Copy link

piratepanda805 commented Oct 16, 2024

Hi @lukedowling44. This looks awsome. However, I think you might have accidentally commited your build folder (\dist) and your virtual environment (\pyarrow-dev) as well. I think these files are better not get committed to the git repository.

To clarify, make sure to only commit your tests and scalar.pxi changes.

@Lukedowling444 Lukedowling444 force-pushed the 44232-python-pycapsule-pyarrow-scalar-constructor branch from 3ff3c08 to 9a67c88 Compare October 18, 2024 05:30
@github-actions github-actions bot added the awaiting review Awaiting review label Oct 18, 2024
@Lukedowling444 Lukedowling444 force-pushed the 44232-python-pycapsule-pyarrow-scalar-constructor branch from 9a67c88 to 4ee815f Compare October 18, 2024 05:40
@Lukedowling444 Lukedowling444 force-pushed the 44232-python-pycapsule-pyarrow-scalar-constructor branch from 4ee815f to 45b3697 Compare October 18, 2024 05:44
@Isaac7777-cpu-school
Copy link

@lukedowling44 This looks perfect now and have fixed the previous issue. I would like to ask why most of the statements inside the if-clause is quite similar to the execution statements in the following.

@Lukedowling444
Copy link
Author

@Isaac7777-cpu-school You are right, I was repeating code unnecessarily. I have made changes and pushed again.

@Isaac7777-cpu-school
Copy link

@Lukedowling444 Nice, all the dead codes are removed. I think it is pretty much ready for the PR. However, just a minor thing and this may not make sense as much - why do we need another local variable for the carray rather than directly put it in the if condition? However, since this is really just a styling issue, please also consult with the maintainer for this as well?

@Lukedowling444
Copy link
Author

@Isaac7777-cpu-school Thanks for the feedback. Cython doesn’t allow cdef statements directly within conditional logic or non-function scopes, so I had to define it before the condition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Allow PyCapsule Interface in pyarrow.scalar constructor?
4 participants