SNOW-859943: Add basic support for functions.window #2545
base: main
Conversation
@@ -27,6 +27,7 @@
- Added support for `Index.to_numpy`.
- Added support for `DataFrame.align` and `Series.align` for `axis=0`.
- Added support for `size` in `GroupBy.aggregate`, `DataFrame.aggregate`, and `Series.aggregate`.
- Added partial support for `snowflake.snowpark.functions.window`
move it up to 1.25.0
Should we call it partial support? Let's document its capabilities in the function description rather than wording it as partial support.
"year": {"year", "y", "yy", "yyy", "yyyy", "yr", "years", "yrs"}, | ||
"quarter": {"quarter", "q", "qtr", "qtrs", "quarters"}, | ||
"month": {"month", "mm", "mon", "mons", "months"}, | ||
"week": {"week", "w", "wk", "weekofyear", "woy", "wy"}, | ||
"day": {"day", "d", "dd", "days", "dayofmonth"}, | ||
"hour": {"hour", "h", "hh", "hr", "hours", "hrs"}, | ||
"minute": {"minute", "m", "mi", "min", "minutes", "mins"}, | ||
"second": {"second", "s", "sec", "seconds", "secs"}, | ||
"millisecond": {"millisecond", "ms", "msec", "milliseconds"}, | ||
"microsecond": {"microsecond", "us", "usec", "microseconds"}, | ||
"nanosecond": { | ||
"nanosecond", | ||
"ns", | ||
"nsec", | ||
"nanosec", | ||
"nsecond", | ||
"nanoseconds", | ||
"nanosecs", | ||
"nseconds", | ||
}, | ||
"dayofweek": {"dayofweek", "weekday", "dow", "dw"}, | ||
"dayofweekiso": {"dayofweekiso", "weekday_iso", "dow_iso", "dw_iso"}, | ||
"dayofyear": {"dayofyear", "yearday", "doy", "dy"}, | ||
"weekiso": {"weekiso", "week_iso", "weekofyeariso", "weekofyear_iso"}, | ||
"yearofweek": {"yearofweek"}, | ||
"yearofweekiso": {"yearofweekiso"}, | ||
"epoch_second": {"epoch_second", "epoch", "epoch_seconds"}, | ||
"epoch_millisecond": {"epoch_millisecond", "epoch_milliseconds"}, | ||
"epoch_microsecond": {"epoch_microsecond", "epoch_microseconds"}, | ||
"epoch_nanosecond": {"epoch_nanosecond", "epoch_nanoseconds"}, | ||
"timezone_hour": {"timezone_hour", "tzh"}, | ||
"timezone_minute": {"timezone_minute", "tzm"}, | ||
} |
did you do a lot of trial and error for this? Are these aliases documented somewhere?
There's a handy chart here: https://docs.snowflake.com/en/sql-reference/functions-date-time#label-supported-date-time-parts
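For context, a minimal sketch of how an alias table like the one quoted above is typically consumed: invert it into a single alias-to-canonical-part lookup. The name `DATETIME_PART_ALIASES` is hypothetical here; the PR may bind the dict to a different name.

```python
# Hypothetical name for the alias table quoted above (trimmed for brevity).
DATETIME_PART_ALIASES = {
    "year": {"year", "y", "yy", "yyy", "yyyy", "yr", "years", "yrs"},
    "week": {"week", "w", "wk", "weekofyear", "woy", "wy"},
    # ... remaining parts elided
}

# Reverse lookup: any accepted alias -> its canonical date/time part.
CANONICAL_PART = {
    alias: part
    for part, aliases in DATETIME_PART_ALIASES.items()
    for alias in aliases
}

assert CANONICAL_PART["yy"] == "year"
assert CANONICAL_PART["woy"] == "week"
```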
timeColumn: ColumnOrName,
windowDuration: str,
slideDuration: Optional[str] = None,
startTime: Optional[str] = None,
camelCase is generally not Pythonic. Can we convert this into snake_case?
I agree, but to maintain compatibility with the PySpark implementation I think camelCase might be required. I could modify it so that either version is accepted.
IMHO, API style consistency within the Snowpark lib outweighs consistency with PySpark.
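If both spellings were to be accepted, one minimal sketch is a small keyword-normalizing decorator. This is not the PR's implementation; the decorator and all names below are illustrative only.

```python
import functools

# Hypothetical mapping from the PySpark-style names to snake_case.
_CAMEL_TO_SNAKE = {
    "timeColumn": "time_column",
    "windowDuration": "window_duration",
    "slideDuration": "slide_duration",
    "startTime": "start_time",
}

def accept_camel_case(func):
    """Rewrite camelCase keyword arguments to their snake_case equivalents."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for camel, snake in _CAMEL_TO_SNAKE.items():
            if camel in kwargs:
                kwargs[snake] = kwargs.pop(camel)
        return func(*args, **kwargs)
    return wrapper

@accept_camel_case
def window(time_column, window_duration, slide_duration=None, start_time=None):
    ...  # body elided
```

This keeps snake_case as the documented form while quietly tolerating callers ported from PySpark.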
"snowflake.snowpark.functions.window does not support slideDuration parameter yet." | ||
) | ||
|
||
epoch = lit("1970-01-01 00:00:00").cast(TimestampType()) |
do we need to care about timezone here?
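For context on that question, here is the tumbling-window arithmetic the quoted epoch constant feeds into, sketched in plain Python with naive (timezone-unaware) datetimes. The PR expresses this with Snowpark column expressions; the helper below is illustrative only.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # naive, i.e. no timezone attached

def window_start(ts: datetime, window: timedelta) -> datetime:
    # Bucket each timestamp by whole multiples of the window duration
    # counted from the fixed epoch.
    buckets = (ts - EPOCH) // window
    return EPOCH + buckets * window

assert window_start(datetime(2024, 1, 1, 0, 1, 10),
                    timedelta(minutes=1)) == datetime(2024, 1, 1, 0, 1)
```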
# SNOW-1063685: slideDuration changes this function from a 1:1 mapping to a 1:N mapping. That
# currently would require a udtf which may have significantly different performance.
raise NotImplementedError(
    "snowflake.snowpark.functions.window does not support slideDuration parameter yet."
Since @sfc-gh-qding volunteered as a SQL expert for our team, we can discuss whether doing this is possible without a UDTF.
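For a concrete sense of why slideDuration makes this 1:N: with a 60-second window sliding every 30 seconds, each row belongs to two windows, so one input row must yield several output rows. A pure-Python sketch (illustrative only, not Snowpark code):

```python
from datetime import datetime, timedelta

def sliding_windows(ts, window, slide, epoch=datetime(1970, 1, 1)):
    """Yield the start of every window that contains ts."""
    # Start of the last window that begins at or before ts.
    start = epoch + ((ts - epoch) // slide) * slide
    # Walk backwards while the window [start, start + window) still covers ts.
    while start + window > ts:
        yield start
        start -= slide

starts = list(sliding_windows(datetime(2024, 1, 1, 0, 0, 45),
                              timedelta(seconds=60), timedelta(seconds=30)))
# One input row maps to two output windows:
assert starts == [datetime(2024, 1, 1, 0, 0, 30), datetime(2024, 1, 1, 0, 0, 0)]
```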
Co-authored-by: Afroz Alam <afroz.alam@snowflake.com>
# Only function expressions that are a mapping of existing columns can be aggregated on.
# Any increase or reduction in number of rows is an invalid function expression.
if len(materialized_column) == len(child_rf):
    child_rf[column_name] = materialized_column
Is this related to window function support in local testing as a groupby expression?
Also, I don't understand why we want materialized_column here -- what's the case it's addressing? Is there an example?
This is related to having any function expression in the groupby. For example:
df2 = df.group_by(upper(df.a)).agg(max_(df.b))
This is perfectly valid in live mode, but does not work in local testing at the moment.
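For reference, a self-contained version of that example, assuming an existing Snowpark session object named `session`:

```python
from snowflake.snowpark.functions import upper, max as max_

df = session.create_dataframe([("a", 1), ("A", 2), ("b", 3)], schema=["a", "b"])

# Group by a function expression over an existing column, then aggregate.
df2 = df.group_by(upper(df.a)).agg(max_(df.b))
df2.show()
# Rows "a" and "A" land in the same group because the key is UPPER(a).
```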
Do we have a test for the new window function?
I saw that PySpark has an example of using groupby + the window function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.window.html
I'll add a doctest example that shows it being used as a groupby expression.
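A sketch of what that doctest might look like, assuming the `window()` signature from this PR, PySpark-style duration strings, and an existing `session` (data and column names are illustrative):

```python
from snowflake.snowpark.functions import window, sum as sum_, to_timestamp

df = session.create_dataframe(
    [("2024-01-01 00:00:20", 1),
     ("2024-01-01 00:00:50", 2),
     ("2024-01-01 00:01:10", 3)],
    schema=["ts", "v"],
)

# Use window() as the groupby expression, aggregating within each bucket.
df.group_by(window(to_timestamp(df.ts), "1 minute")).agg(sum_(df.v)).show()
# Expected: rows 1 and 2 fall into the 00:00 window, row 3 into 00:01.
```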
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-859943
Please describe how your code solves the related issue.
This PR adds partial support for functions.window. This involved a few notable changes:

- The columns produced by `window` are aliased. These columns are often used as aggregate keys, though, which cannot be aliased. In order to support this use case I've modified the aggregation logic to remove the alias if present. This has the side effect of also allowing users to write a statement like `df.groupby(upper(col("cat")).alias("cat")).agg(...)`. The resulting aggregate key column would have the name `cat` instead of `UPPER("CAT")`.
- Modified `Column` so that if you alias an already aliased column, it replaces the alias instead of trying to alias twice. This allows a statement like `col("cat").alias("a1").alias("a2")`, which results in a column named `a2`.
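For illustration, the alias-replacement behavior in a minimal, hedged example (assuming an existing `session`):

```python
from snowflake.snowpark.functions import col

df = session.create_dataframe([(1,)], schema=["cat"])

# With this PR, the second alias replaces the first rather than stacking,
# so the selected column comes back named A2.
df.select(col("cat").alias("a1").alias("a2")).show()
```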