Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1752856: Implement DataFrame/Series align for axis = 1 and None #2541

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

sfc-gh-lmukhopadhyay
Copy link
Contributor

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1752856

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

Adding support for DataFrame/Series align for axis = 1 and None.

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
return frame1, frame2


def _select_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a unit test for _select_columns ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually I believe internal util functions don't have unit tests, such as concat_utils._select_columns but I can add one!

Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, just left some questions + style comments.

# try to match as many columns as possible from the frames.
label_count_map: dict[Hashable, int] = {}
for label, id_tuple in zip(data_column_labels, snowflake_ids):
if len(id_tuple) <= label_count_map.get(label, 0):
Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by what this check is doing. My understanding is:

  • id_tuple is all the SF quoted identifiers corresponding to the pandas label
  • label_count_map maps pandas labels to the number of times we've encountered it within a frame

Is len(id_tuple) <= label_count_map.get(label, 0) just checking whether this is the first time we've encountered this label in the loop? Why are we choosing to add a NaN column when that's the case, and adding the original column of any duplicates we encounter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(id_tuple) <= label_count_map.get(label, 0) is true when a label does not exist in the original frame (for example when frame has columns A, B, B, D and data_column_labels=[A, B, C, D]. The snowflake_ids is [("A",), ("B", "B_7wi3"), ("B", "B_7wi3"), (), ("D",)]

So the id_tuple for C would be () with len 0 which is <= label_count_map.get(C, 0). In this case, we would add a NaN column for label C.

"Snowpark pandas doesn't support `align` with MultiIndex"
)
if axis is not None:
if self.is_multiindex(axis=axis) or other._query_compiler.is_multiindex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we raise this error when axis=None as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Also, do we have tests for multi index with axis=None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frame.is_multiindex has a check in num_index_levels that asserts axis = 0 or 1. I'm not sure if this is intended or not, but I would run into the ValueError when axis=None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-helmeleegy Oh not yet, I'll add the neg multiindex tests with axis 1 and None

Comment on lines +8515 to +8518
if join == "outer":
left_frame, right_frame = align_axis_1(frame, other_frame, join, True)
else:
left_frame, right_frame = align_axis_1(frame, other_frame, join, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if join == "outer":
left_frame, right_frame = align_axis_1(frame, other_frame, join, True)
else:
left_frame, right_frame = align_axis_1(frame, other_frame, join, False)
should_sort = join == "outer"
left_frame, right_frame = align_axis_1(frame, other_frame, join, should_sort)

A little bit shorter and easier to read at a glance. Might also want to add a comment explaining that we preserve key order for non-outer joins.

Comment on lines +8523 to +8530
if join == "outer":
left_frame_1, right_frame_1 = align_axis_1(
frame, other_frame, join, True
)
else:
left_frame_1, right_frame_1 = align_axis_1(
frame, other_frame, join, False
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if join == "outer":
left_frame_1, right_frame_1 = align_axis_1(
frame, other_frame, join, True
)
else:
left_frame_1, right_frame_1 = align_axis_1(
frame, other_frame, join, False
)
should_sort = join == "outer"
left_frame_1, right_frame_1 = align_axis_1(frame, other_frame, join, should_sort)

right_frame_data_ids,
right_index_ids,
) = align_axis_0_right(left_frame_1, right_frame_1, join)
left_qc = SnowflakeQueryCompiler(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question not really in the scope of this PR: why don't align_axis_0_left/right return an InternalFrame directly?

Comment on lines +898 to +907
is_lhs_dataframe_and_rhs_series = (
True
if isinstance(self, pd.DataFrame) and isinstance(other, pd.Series)
else False
)
is_lhs_series_and_rhs_dataframe = (
True
if isinstance(self, pd.Series) and isinstance(other, pd.DataFrame)
else False
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_lhs_dataframe_and_rhs_series = (
True
if isinstance(self, pd.DataFrame) and isinstance(other, pd.Series)
else False
)
is_lhs_series_and_rhs_dataframe = (
True
if isinstance(self, pd.Series) and isinstance(other, pd.DataFrame)
else False
)
is_lhs_dataframe_and_rhs_series = isinstance(self, pd.DataFrame) and isinstance(other, pd.Series)
is_lhs_series_and_rhs_dataframe = isinstance(self, pd.Series) and isinstance(other, pd.DataFrame)

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
sort: Optional[bool] = False,
) -> tuple[InternalFrame, InternalFrame]:
"""
Concatenate frames on index axis by taking using UNION operator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Concatenate frames on index axis by taking using UNION operator.
Concatenate frames on index axis using UNION operator.

data_column_labels: A list of pandas labels.

Returns:
New InternalFrame after only with given data columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
New InternalFrame after only with given data columns.
New InternalFrame after selecting only the given data columns.

)
# Add data columns
data_column_snowflake_identifiers = []
# A map to keep track number of times a label is already seen.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# A map to keep track number of times a label is already seen.
# A map to keep track of the number of times a label is already seen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants