Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-902943]: Add support for pd.NamedAgg in DataFrame and Series.agg #1652

Merged
merged 20 commits into from
May 29, 2024

Conversation

sfc-gh-rdurrani
Copy link
Contributor

@sfc-gh-rdurrani sfc-gh-rdurrani commented May 22, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-902943

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

Adds support for NamedAggregations for df and series.agg

@sfc-gh-rdurrani sfc-gh-rdurrani requested a review from a team as a code owner May 22, 2024 01:22
@sfc-gh-rdurrani sfc-gh-rdurrani marked this pull request as draft May 22, 2024 01:22
@sfc-gh-rdurrani sfc-gh-rdurrani added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label May 22, 2024
@sfc-gh-rdurrani sfc-gh-rdurrani marked this pull request as ready for review May 23, 2024 00:13
@sfc-gh-rdurrani sfc-gh-rdurrani enabled auto-merge (squash) May 23, 2024 00:19
CHANGELOG.md Outdated Show resolved Hide resolved
tests/integ/modin/frame/test_aggregate.py Show resolved Hide resolved
src/snowflake/snowpark/modin/pandas/base.py Show resolved Hide resolved
src/snowflake/snowpark/modin/pandas/base.py Show resolved Hide resolved
self, allow_duplication=False, axis=axis, is_from_agg=True, **kwargs
)
else:
func = validate_and_try_convert_agg_func_arg_func_to_str(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-joshi with the modin upstreaming work, I think we probably should start moving many of those conversion to backend, just leave basic checking at frontend

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though our frontend implementation of aggregate already diverges pretty heavily from the modin version. We might want to use the extension API to overwrite aggregate when we start the migration.

src/snowflake/snowpark/modin/pandas/groupby.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/modin/pandas/utils.py Show resolved Hide resolved
Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed most of the frontend/argument parsing logic, and haven't looked at the query compiler changes yet since I left a few questions about column ordering that I'd like to understand first.

self, allow_duplication=False, axis=axis, is_from_agg=True, **kwargs
)
else:
func = validate_and_try_convert_agg_func_arg_func_to_str(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though our frontend implementation of aggregate already diverges pretty heavily from the modin version. We might want to use the extension API to overwrite aggregate when we start the migration.

src/snowflake/snowpark/modin/pandas/base.py Outdated Show resolved Hide resolved
tests/integ/modin/frame/test_aggregate.py Show resolved Hide resolved
Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I left a few more minor comments/suggestions.

tests/integ/modin/frame/test_aggregate.py Outdated Show resolved Hide resolved
tests/integ/modin/groupby/test_groupby_named_agg.py Outdated Show resolved Hide resolved
tests/integ/modin/groupby/test_groupby_named_agg.py Outdated Show resolved Hide resolved
tests/integ/modin/groupby/test_groupby_named_agg.py Outdated Show resolved Hide resolved
tests/integ/modin/groupby/test_groupby_named_agg.py Outdated Show resolved Hide resolved
…ompiler.py

Co-authored-by: Jonathan Shi <149419494+sfc-gh-joshi@users.noreply.github.com>
CHANGELOG.md Outdated Show resolved Hide resolved
src/snowflake/snowpark/modin/pandas/base.py Show resolved Hide resolved
# and if not, reorder them.
if uses_named_aggs:
correct_ordering = list(agg_kwargs.keys())
if correct_ordering != new_data_column_pandas_labels:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine for now since most of the code is done in this way now. Ideally, i think we should only do function string conversion at frontend, and then pass them directly to query compiler (the str conversion probably should also go to query backend directly also) when we are getting the column_to_agg_func, it should be in correct order based on the input. pass the order as part of agg_kwargs seems hacky to me.
@sfc-gh-joshi when we do the modin upstreaming support for this, let's talk about this.

src/snowflake/snowpark/modin/pandas/base.py Show resolved Hide resolved
src/snowflake/snowpark/modin/pandas/base.py Show resolved Hide resolved
tests/integ/modin/frame/test_aggregate.py Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@sfc-gh-rdurrani sfc-gh-rdurrani merged commit 6a1bd53 into main May 29, 2024
37 checks passed
@sfc-gh-rdurrani sfc-gh-rdurrani deleted the rdurrani-agg-named branch May 29, 2024 23:08
@github-actions github-actions bot locked and limited conversation to collaborators May 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants