SNOW-1752856: Implement DataFrame/Series align for axis = 1 and None #2541

sfc-gh-lmukhopadhyay · 2024-10-30T19:21:55Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-1752856
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
Please describe how your code solves the related issue.

Adding support for DataFrame/Series align for axis = 1 and None.

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

sfc-gh-jjiao · 2024-10-30T23:51:17Z

src/snowflake/snowpark/modin/plugin/_internal/align_utils.py

+    return frame1, frame2
+
+
+def _select_columns(


Should we add a unit test for _select_columns?

Usually, I believe, internal util functions don't have unit tests (concat_utils._select_columns, for example), but I can add one!

sfc-gh-joshi

Nice work, just left some questions + style comments.

sfc-gh-joshi · 2024-10-31T18:42:39Z

src/snowflake/snowpark/modin/plugin/_internal/align_utils.py

+    # try to match as many columns as possible from the frames.
+    label_count_map: dict[Hashable, int] = {}
+    for label, id_tuple in zip(data_column_labels, snowflake_ids):
+        if len(id_tuple) <= label_count_map.get(label, 0):


I'm a little confused by what this check is doing. My understanding is:

id_tuple is all the SF quoted identifiers corresponding to the pandas label

label_count_map maps pandas labels to the number of times we've encountered it within a frame

Is len(id_tuple) <= label_count_map.get(label, 0) just checking whether this is the first time we've encountered this label in the loop? Why are we choosing to add a NaN column when that's the case, and adding the original column of any duplicates we encounter?

len(id_tuple) <= label_count_map.get(label, 0) is true when a label does not exist in the original frame. For example, when the frame has columns A, B, B, D and data_column_labels=[A, B, B, C, D], snowflake_ids is [("A",), ("B", "B_7wi3"), ("B", "B_7wi3"), (), ("D",)].

So the id_tuple for C would be () with len 0 which is <= label_count_map.get(C, 0). In this case, we would add a NaN column for label C.
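
To make the check concrete, here is a minimal pure-Python sketch of the matching loop. It is not the code from align_utils.py: `pick_columns` is a hypothetical name, `None` stands in for "add a NaN column", and the rule that each match consumes the next unused identifier from `id_tuple` is an assumption inferred from this discussion.

```python
from collections.abc import Hashable
from typing import Optional


def pick_columns(
    data_column_labels: list[Hashable],
    snowflake_ids: list[tuple[str, ...]],
) -> list[Optional[str]]:
    """Hypothetical helper: pick one identifier per requested label, or None."""
    picked: list[Optional[str]] = []
    # Number of identifiers already consumed for each label.
    label_count_map: dict[Hashable, int] = {}
    for label, id_tuple in zip(data_column_labels, snowflake_ids):
        seen = label_count_map.get(label, 0)
        if len(id_tuple) <= seen:
            # No unused identifier is left for this label, so the real code
            # would add a NaN column here; None is only a stand-in.
            picked.append(None)
        else:
            # Assumption: take the next unused identifier for this label.
            picked.append(id_tuple[seen])
            label_count_map[label] = seen + 1
    return picked


# One label per id tuple, reusing the identifiers quoted in the thread.
labels = ["A", "B", "B", "C", "D"]
ids = [("A",), ("B", "B_7wi3"), ("B", "B_7wi3"), (), ("D",)]
result = pick_columns(labels, ids)
print(result)  # ['A', 'B', 'B_7wi3', None, 'D']
```

Under these assumptions, the check is less about "first time seen" and more about "has this label run out of matching columns in the frame".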

sfc-gh-joshi · 2024-10-31T18:43:35Z

src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py

-                "Snowpark pandas doesn't support `align` with MultiIndex"
-            )
+        if axis is not None:
+            if self.is_multiindex(axis=axis) or other._query_compiler.is_multiindex(


Should we raise this error when axis=None as well?

+1. Also, do we have tests for multi index with axis=None?

frame.is_multiindex has a check in num_index_levels that asserts axis = 0 or 1. I'm not sure if this is intended or not, but I would run into the ValueError when axis=None.

@sfc-gh-helmeleegy Oh, not yet. I'll add the negative MultiIndex tests with axis=1 and axis=None.
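
One possible way to cover axis=None without tripping the axis check is to expand None into both axes before calling is_multiindex. The sketch below only illustrates that idea. `FakeFrame` and `has_multiindex` are hypothetical stand-ins, not Snowpark pandas classes or the change made in this PR.

```python
from typing import Optional


class FakeFrame:
    """Stand-in for an internal frame; not the Snowpark pandas class."""

    def __init__(self, index_levels: int, column_levels: int) -> None:
        self._levels = {0: index_levels, 1: column_levels}

    def is_multiindex(self, *, axis: int) -> bool:
        # Mirrors the behavior described in the thread: only 0 or 1 is accepted.
        if axis not in (0, 1):
            raise ValueError("axis must be 0 or 1")
        return self._levels[axis] > 1


def has_multiindex(frame: FakeFrame, axis: Optional[int]) -> bool:
    """Hypothetical helper: treat axis=None as 'check both axes'."""
    axes = (0, 1) if axis is None else (axis,)
    return any(frame.is_multiindex(axis=a) for a in axes)


flat = FakeFrame(index_levels=1, column_levels=1)
multi_columns = FakeFrame(index_levels=1, column_levels=2)

print(has_multiindex(flat, None))           # False
print(has_multiindex(multi_columns, None))  # True
print(has_multiindex(multi_columns, 0))     # False
```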

sfc-gh-joshi · 2024-10-31T18:45:26Z

src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py

+            if join == "outer":
+                left_frame, right_frame = align_axis_1(frame, other_frame, join, True)
+            else:
+                left_frame, right_frame = align_axis_1(frame, other_frame, join, False)


Suggested change

-            if join == "outer":
-                left_frame, right_frame = align_axis_1(frame, other_frame, join, True)
-            else:
-                left_frame, right_frame = align_axis_1(frame, other_frame, join, False)
+            should_sort = join == "outer"
+            left_frame, right_frame = align_axis_1(frame, other_frame, join, should_sort)

A little bit shorter and easier to read at a glance. Might also want to add a comment explaining that we preserve key order for non-outer joins.
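
For readers unfamiliar with the ordering rule, the following pure-Python sketch shows the intent of `should_sort` on plain label lists. `aligned_labels` is a hypothetical helper that assumes unique labels. It does not reproduce align_axis_1, and the inner-join ordering (left order preserved) is an assumption.

```python
def aligned_labels(left: list[str], right: list[str], join: str) -> list[str]:
    """Hypothetical helper: column labels after aligning two frames on axis 1."""
    if join == "left":
        labels = list(left)
    elif join == "right":
        labels = list(right)
    elif join == "inner":
        # Assumption: keep the left frame's order for shared labels.
        labels = [label for label in left if label in right]
    elif join == "outer":
        labels = left + [label for label in right if label not in left]
    else:
        raise ValueError(f"unsupported join: {join!r}")
    # Only the outer join sorts; every other join preserves key order.
    should_sort = join == "outer"
    return sorted(labels) if should_sort else labels


print(aligned_labels(["b", "a"], ["c", "a"], "outer"))  # ['a', 'b', 'c']
print(aligned_labels(["b", "a"], ["c", "a"], "left"))   # ['b', 'a']
```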

sfc-gh-joshi · 2024-10-31T18:47:15Z

src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py

+            if join == "outer":
+                left_frame_1, right_frame_1 = align_axis_1(
+                    frame, other_frame, join, True
+                )
+            else:
+                left_frame_1, right_frame_1 = align_axis_1(
+                    frame, other_frame, join, False
+                )


Suggested change

-            if join == "outer":
-                left_frame_1, right_frame_1 = align_axis_1(
-                    frame, other_frame, join, True
-                )
-            else:
-                left_frame_1, right_frame_1 = align_axis_1(
-                    frame, other_frame, join, False
-                )
+            should_sort = join == "outer"
+            left_frame_1, right_frame_1 = align_axis_1(frame, other_frame, join, should_sort)

sfc-gh-joshi · 2024-10-31T18:48:49Z

src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py

+                right_frame_data_ids,
+                right_index_ids,
+            ) = align_axis_0_right(left_frame_1, right_frame_1, join)
+            left_qc = SnowflakeQueryCompiler(


Question not really in the scope of this PR: why don't align_axis_0_left/right return an InternalFrame directly?

sfc-gh-joshi · 2024-10-31T18:49:56Z

src/snowflake/snowpark/modin/plugin/extensions/base_overrides.py

+    is_lhs_dataframe_and_rhs_series = (
+        True
+        if isinstance(self, pd.DataFrame) and isinstance(other, pd.Series)
+        else False
+    )
+    is_lhs_series_and_rhs_dataframe = (
+        True
+        if isinstance(self, pd.Series) and isinstance(other, pd.DataFrame)
+        else False
+    )


Suggested change

-    is_lhs_dataframe_and_rhs_series = (
-        True
-        if isinstance(self, pd.DataFrame) and isinstance(other, pd.Series)
-        else False
-    )
-    is_lhs_series_and_rhs_dataframe = (
-        True
-        if isinstance(self, pd.Series) and isinstance(other, pd.DataFrame)
-        else False
-    )
+    is_lhs_dataframe_and_rhs_series = isinstance(self, pd.DataFrame) and isinstance(other, pd.Series)
+    is_lhs_series_and_rhs_dataframe = isinstance(self, pd.Series) and isinstance(other, pd.DataFrame)
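
The suggestion is behavior-preserving because isinstance always returns a bool, so `True if cond else False` adds nothing. A small sketch with stand-in classes (the `DataFrame` and `Series` classes here are empty placeholders, not pandas or Snowpark pandas types):

```python
from itertools import product


class DataFrame:
    """Empty placeholder, not a real DataFrame."""


class Series:
    """Empty placeholder, not a real Series."""


def verbose(lhs: object, rhs: object) -> bool:
    return True if isinstance(lhs, DataFrame) and isinstance(rhs, Series) else False


def direct(lhs: object, rhs: object) -> bool:
    return isinstance(lhs, DataFrame) and isinstance(rhs, Series)


# Compare both forms over every combination of operand types.
operands = [DataFrame(), Series()]
same = all(
    verbose(lhs, rhs) is direct(lhs, rhs) for lhs, rhs in product(operands, repeat=2)
)
print(same)  # True
```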

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

sfc-gh-helmeleegy · 2024-10-31T20:44:34Z

src/snowflake/snowpark/modin/plugin/_internal/align_utils.py

+    sort: Optional[bool] = False,
+) -> tuple[InternalFrame, InternalFrame]:
+    """
+    Concatenate frames on index axis by taking using UNION operator.


Suggested change

-    Concatenate frames on index axis by taking using UNION operator.
+    Concatenate frames on index axis using UNION operator.

sfc-gh-helmeleegy · 2024-10-31T20:47:57Z

src/snowflake/snowpark/modin/plugin/_internal/align_utils.py

+        data_column_labels: A list of pandas labels.
+
+    Returns:
+        New InternalFrame after only with given data columns.


Suggested change

-        New InternalFrame after only with given data columns.
+        New InternalFrame after selecting only the given data columns.

sfc-gh-helmeleegy · 2024-10-31T20:49:46Z

src/snowflake/snowpark/modin/plugin/_internal/align_utils.py

+    )
+    # Add data columns
+    data_column_snowflake_identifiers = []
+    # A map to keep track number of times a label is already seen.


Suggested change

-    # A map to keep track number of times a label is already seen.
+    # A map to keep track of the number of times a label is already seen.

sfc-gh-lmukhopadhyay added 3 commits October 30, 2024 12:17

SNOW-1752856: Implement DataFrame/Series align for axis = 1 and None

d921038

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

Merge branch 'main' into lmukhopadhyay-SNOW-1752856-align-axis-1-none

ce94ca3

changelog

9bb64ce

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

github-actions bot added the snowpark-pandas label Oct 30, 2024

sfc-gh-lmukhopadhyay added 2 commits October 30, 2024 13:03

fix doctest

e8f1c65

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

Merge branch 'main' into lmukhopadhyay-SNOW-1752856-align-axis-1-none

2772842

sfc-gh-lmukhopadhyay marked this pull request as ready for review October 30, 2024 20:38

sfc-gh-lmukhopadhyay requested a review from a team as a code owner October 30, 2024 20:38

sfc-gh-lmukhopadhyay requested review from sfc-gh-evandenberg and sfc-gh-rdurrani October 30, 2024 20:38

sfc-gh-jjiao requested review from sfc-gh-helmeleegy, sfc-gh-joshi and sfc-gh-nkrishna October 30, 2024 23:41

sfc-gh-jjiao reviewed Oct 30, 2024

sfc-gh-joshi requested changes Oct 31, 2024

sfc-gh-lmukhopadhyay added 2 commits October 31, 2024 12:21

add select_columns util test

23fc257

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

resolve merge conf

4708b92

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>

sfc-gh-helmeleegy reviewed Oct 31, 2024
