Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex #11663

Open
wants to merge 7 commits into
base: branch-24.12

Conversation

SurajAralihalli
Collaborator

@SurajAralihalli SurajAralihalli commented Oct 25, 2024

Resolves #11554, #7585

In cuDF, support for multiple newline characters was expanded from NEW_LINE (\n) to include the following:

  • NEXT_LINE (\u0085)
  • LINE_SEPARATOR (\u2028)
  • PARAGRAPH_SEPARATOR (\u2029)
  • CARRIAGE_RETURN (\r)
  • NEW_LINE (\n)

PR #17139 introduced this change to the cuDF JNI as RegexFlag::EXT_LINE. This PR simplifies the transpilation of $ by changing the pattern from (?:\r|\u0085|\u2028|\u2029|\r\n)?$ to the simpler (?:\r\n)?$, and updates every function where this transpilation occurs to use RegexFlag::EXT_LINE.
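
For reference, the CPU behavior that the transpiled pattern has to reproduce can be checked with plain java.util.regex. The sketch below is only an illustration of Java's handling of $ without MULTILINE; it does not call cuDF or the plugin's transpiler, and the class and method names (DollarAnchorDemo, matchesAtEnd, escape) are hypothetical.

```java
import java.util.regex.Pattern;

final class DollarAnchorDemo {
    // Plain `$` with no flags: matches at end of input or just before
    // a single trailing line terminator.
    private static final Pattern A_AT_END = Pattern.compile("a$");

    static boolean matchesAtEnd(String input) {
        return A_AT_END.matcher(input).find();
    }

    // Render control and non-ASCII characters as \\uXXXX so the output is readable.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029",
            "a\n\n", "a\n\r"
        };
        for (String s : inputs) {
            System.out.println(escape(s) + " -> " + matchesAtEnd(s));
        }
    }
}
```

In Java, the first seven inputs match and the last two do not: a single trailing terminator (including the two-character \r\n) is tolerated, but two separate terminators are not. That \r\n case is the reason the simplified pattern still keeps (?:\r\n)? in front of $.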

This PR also drops support for $\z because \z is not supported by cuDF. Alternatively, we could transpile $\z to $(?![\r\n\u0085\u2028\u2029]), but cuDF does not support negative lookahead either.
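
The equivalence between $\z and the lookahead form can be sanity-checked on the CPU with java.util.regex. This is a minimal sketch, assuming Java semantics only; it says nothing about what cuDF accepts, and the names DollarZDemo, dollarZ and lookahead are hypothetical.

```java
import java.util.regex.Pattern;

final class DollarZDemo {
    // `$` followed by `\z`: only the absolute end of input can satisfy both.
    private static final Pattern DOLLAR_Z = Pattern.compile("a$\\z");

    // `$` followed by a negative lookahead that rejects any line terminator.
    private static final Pattern LOOKAHEAD =
        Pattern.compile("a$(?![\\r\\n\\u0085\\u2028\\u2029])");

    static boolean dollarZ(String input) {
        return DOLLAR_Z.matcher(input).find();
    }

    static boolean lookahead(String input) {
        return LOOKAHEAD.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029"};
        for (String s : inputs) {
            // Print the input length instead of the raw string to keep output readable.
            System.out.println("len=" + s.length()
                + " dollarZ=" + dollarZ(s)
                + " lookahead=" + lookahead(s));
        }
    }
}
```

Both patterns match only the bare "a" and reject every input that ends in a line terminator, which is why the lookahead form would be a valid rewrite if the engine supported it.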

Signed-off-by: Suraj Aralihalli <suraj.ara16@gmail.com>
@NVnavkumar
Collaborator

Can we confirm some of the behavior described in compatibility.md and update accordingly?

@SurajAralihalli
Collaborator Author

> Can we confirm some of the behavior described in compatibility.md and update accordingly?

Thank you for pointing that out. While checking, I found another issue that this PR resolves, and I've updated the guide and tests to reflect it.
As part of the process, we also reviewed the feasibility of solving #10641 and #10764 in this PR and updated those issues with the status.

@SurajAralihalli SurajAralihalli marked this pull request as ready for review October 28, 2024 22:31
@SurajAralihalli
Collaborator Author

Build

Development

Successfully merging this pull request may close these issues.

[FEA] Reimplement $ transpilation using cuDF new line terminator support
2 participants