[C++] Implement basic functionality for new Binary/Utf8View types #39634

Open
3 of 6 tasks
jorisvandenbossche opened this issue Jan 16, 2024 · 7 comments

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 16, 2024

PR #37792 added a basic implementation of the new BinaryView and Utf8View (StringView) formats, mostly the core objects (Array, DataType, Builder) and IPC support.
To handle input that uses these new formats, we will need to support some additional features (e.g. receiving a dataset, doing some filtering/projection, and then converting to the standard string type).
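
For readers unfamiliar with the view formats, here is a minimal pure-Python sketch of the 16-byte "view" layout described in the Arrow columnar format: values of up to 12 bytes are stored inline, while longer values store a 4-byte prefix plus a buffer index and offset. The helper names `encode_view` and `decode_view` are hypothetical and only serve the illustration; this is not how the C++ implementation is written.

```python
import struct

INLINE_MAX = 12  # values of up to 12 bytes live inside the 16-byte view itself


def encode_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    """Pack one 16-byte view (hypothetical helper, for illustration only).

    buffer_index and offset locate the value in a separate data buffer;
    they are ignored when the value is short enough to be inlined.
    """
    length = len(value)
    if length <= INLINE_MAX:
        # int32 length, then the data zero-padded to 12 bytes
        return struct.pack("<i12s", length, value)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)


def decode_view(view: bytes, buffers: list[bytes]) -> bytes:
    """Recover the value a view refers to (hypothetical helper)."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    buffer_index, offset = struct.unpack_from("<ii", view, 8)
    return buffers[buffer_index][offset:offset + length]


data = b"a string longer than twelve bytes"
short_view = encode_view(b"arrow", 0, 0)  # inlined, needs no data buffer
long_view = encode_view(data, 0, 0)       # points into buffers[0] at offset 0
```

Because every view has the same fixed width regardless of the value's length, kernels such as take and filter can in principle copy views without touching the data buffers, whereas converting to the standard string type has to rebuild a contiguous offsets-plus-data layout.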

@mapleFU
Member

mapleFU commented Oct 28, 2024

https://github.com/apache/arrow/pull/43302/files

This cast is already implemented in C++, so I'm not sure what else should be added 🤔

@jorisvandenbossche
Member Author

Updated the top post to indicate cast is implemented now.

@kevinli1993

kevinli1993 commented Oct 28, 2024

Hi! I found this issue after creating #44540 above - thanks Joris for organizing/triaging.

I'm wondering if the cast functionality will be made available in pyarrow in a later release? Or are we focusing on polishing up the remaining C++ tasks first?

On another note, one more piece of missing functionality I'd like to see added is a count_unique kernel for Utf8View. (Hopefully not a big lift, as the dictionary implementation already exists.)

@mapleFU
Member

mapleFU commented Oct 28, 2024

> I'm wondering if the cast functionality will be made available in pyarrow in a later release? Or are we focusing on polishing up the remaining C++ tasks first?

Which version are you using here? I'll try this with 18.0 later, and add cast support if it doesn't work there.

@WillAyd
Contributor

WillAyd commented Oct 29, 2024

I opened #44336 to track the development of compute kernels. Should we just roll that into this issue or keep it separate?

@jorisvandenbossche
Member Author

Either way is fine with me. In addition to the "basic" kernels (cast, take, filter, etc.), there are also all the binary- or string-specific kernels (binary_../ascii_../utf8_..).

@kevinli1993

> > I'm wondering if the cast functionality will be made available in pyarrow in a later release? Or are we focusing on polishing up the remaining C++ tasks first?

> Which version are you using here? I'll try this with 18.0 later, and add cast support if it doesn't work there.

I can confirm that casting to pa.string() works very well now on pyarrow 18.0!
