-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-1752856: Implement DataFrame/Series align for axis = 1 and None #2541
base: main
Are you sure you want to change the base?
SNOW-1752856: Implement DataFrame/Series align for axis = 1 and None #2541
Conversation
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
return frame1, frame2 | ||
|
||
|
||
def _select_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a unit test for _select_columns ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually I believe internal util functions don't have unit tests, such as concat_utils._select_columns but I can add one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, just left some questions + style comments.
# try to match as many columns as possible from the frames. | ||
label_count_map: dict[Hashable, int] = {} | ||
for label, id_tuple in zip(data_column_labels, snowflake_ids): | ||
if len(id_tuple) <= label_count_map.get(label, 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by what this check is doing. My understanding is:
id_tuple
is all the SF quoted identifiers corresponding to the pandas labellabel_count_map
maps pandas labels to the number of times we've encountered it within a frame
Is len(id_tuple) <= label_count_map.get(label, 0)
just checking whether this is the first time we've encountered this label in the loop? Why are we choosing to add a NaN column when that's the case, and adding the original column of any duplicates we encounter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len(id_tuple) <= label_count_map.get(label, 0)
is true when a label does not exist in the original frame (for example when frame has columns A, B, B, D and data_column_labels=[A, B, C, D]. The snowflake_ids is [("A",), ("B", "B_7wi3"), ("B", "B_7wi3"), (), ("D",)]
So the id_tuple for C would be () with len 0 which is <= label_count_map.get(C, 0). In this case, we would add a NaN column for label C.
"Snowpark pandas doesn't support `align` with MultiIndex" | ||
) | ||
if axis is not None: | ||
if self.is_multiindex(axis=axis) or other._query_compiler.is_multiindex( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we raise this error when axis=None
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Also, do we have tests for multi index with axis=None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
frame.is_multiindex has a check in num_index_levels
that asserts axis = 0 or 1. I'm not sure if this is intended or not, but I would run into the ValueError when axis=None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfc-gh-helmeleegy Oh not yet, I'll add the neg multiindex tests with axis 1 and None
if join == "outer": | ||
left_frame, right_frame = align_axis_1(frame, other_frame, join, True) | ||
else: | ||
left_frame, right_frame = align_axis_1(frame, other_frame, join, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if join == "outer": | |
left_frame, right_frame = align_axis_1(frame, other_frame, join, True) | |
else: | |
left_frame, right_frame = align_axis_1(frame, other_frame, join, False) | |
should_sort = join == "outer" | |
left_frame, right_frame = align_axis_1(frame, other_frame, join, should_sort) |
A little bit shorter and easier to read at a glance. Might also want to add a comment explaining that we preserve key order for non-outer joins.
if join == "outer": | ||
left_frame_1, right_frame_1 = align_axis_1( | ||
frame, other_frame, join, True | ||
) | ||
else: | ||
left_frame_1, right_frame_1 = align_axis_1( | ||
frame, other_frame, join, False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if join == "outer": | |
left_frame_1, right_frame_1 = align_axis_1( | |
frame, other_frame, join, True | |
) | |
else: | |
left_frame_1, right_frame_1 = align_axis_1( | |
frame, other_frame, join, False | |
) | |
should_sort = join == "outer" | |
left_frame_1, right_frame_1 = align_axis_1(frame, other_frame, join, should_sort) |
right_frame_data_ids, | ||
right_index_ids, | ||
) = align_axis_0_right(left_frame_1, right_frame_1, join) | ||
left_qc = SnowflakeQueryCompiler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question not really in the scope of this PR: why don't align_axis_0_left
/right
return an InternalFrame
directly?
is_lhs_dataframe_and_rhs_series = ( | ||
True | ||
if isinstance(self, pd.DataFrame) and isinstance(other, pd.Series) | ||
else False | ||
) | ||
is_lhs_series_and_rhs_dataframe = ( | ||
True | ||
if isinstance(self, pd.Series) and isinstance(other, pd.DataFrame) | ||
else False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_lhs_dataframe_and_rhs_series = ( | |
True | |
if isinstance(self, pd.DataFrame) and isinstance(other, pd.Series) | |
else False | |
) | |
is_lhs_series_and_rhs_dataframe = ( | |
True | |
if isinstance(self, pd.Series) and isinstance(other, pd.DataFrame) | |
else False | |
) | |
is_lhs_dataframe_and_rhs_series = isinstance(self, pd.DataFrame) and isinstance(other, pd.Series) | |
is_lhs_series_and_rhs_dataframe = isinstance(self, pd.Series) and isinstance(other, pd.DataFrame) |
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
sort: Optional[bool] = False, | ||
) -> tuple[InternalFrame, InternalFrame]: | ||
""" | ||
Concatenate frames on index axis by taking using UNION operator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concatenate frames on index axis by taking using UNION operator. | |
Concatenate frames on index axis using UNION operator. |
data_column_labels: A list of pandas labels. | ||
|
||
Returns: | ||
New InternalFrame after only with given data columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New InternalFrame after only with given data columns. | |
New InternalFrame after selecting only the given data columns. |
) | ||
# Add data columns | ||
data_column_snowflake_identifiers = [] | ||
# A map to keep track number of times a label is already seen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# A map to keep track number of times a label is already seen. | |
# A map to keep track of the number of times a label is already seen. |
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1752856
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Adding support for DataFrame/Series align for axis = 1 and None.