
feat: Add minimal Pyspark support #908

Open · wants to merge 55 commits into main

Conversation

@EdAbati (Contributor) commented Sep 3, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

As mentioned in the latest call, I've started working on support for PySpark.

The goal of this PR is to have a minimal initial implementation as a starting point. As we did for Dask, we can implement individual methods in follow-up PRs!

⚠️ This is not ready for review: a lot of tests are failing and the code is ugly. :) I'm just opening the PR for visibility and to have a place to comment/ask questions on specific points.

@github-actions github-actions bot added the enhancement New feature or request label Sep 3, 2024
@EdAbati EdAbati changed the title feat: Add Pyspark support feat: Add minimal Pyspark support Sep 3, 2024
@EdAbati (Contributor, Author) commented Sep 12, 2024

This PR diff is getting big because of all the xfails in the tests. 😕
@MarcoGorelli @FBruzzesi do you have a better idea on how to make it more "reviewable", or do you think it is fine?

@FBruzzesi (Member) commented

> This PR diff is getting big because of all the xfails in the tests. 😕 @MarcoGorelli @FBruzzesi do you have a better idea on how to make it more "reviewable", or do you think it is fine?

For dask we started with its own test file, so that we didn't have to modify every other file.
Once we had a few methods implemented, we shifted the constructor into the conftest list of constructors and added the xfails.

Would that be a good strategy again?
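For illustration, a rough sketch of that approach, assuming the test suite keeps a shared list of lazy constructors in conftest.py (the names below are hypothetical, not the repo's actual ones):

# conftest.py (hypothetical layout)
from typing import Any


def pyspark_lazy_constructor(data: dict[str, list[Any]]) -> Any:  # hypothetical name
    from pyspark.sql import SparkSession

    session = SparkSession.builder.getOrCreate()
    rows = list(zip(*data.values()))
    return session.createDataFrame(rows, schema=list(data.keys()))


# Start by exercising this constructor in a dedicated pyspark test file;
# once enough methods are implemented, append it to the shared list used by
# the whole suite and mark still-unsupported cases with pytest.mark.xfail.
LAZY_CONSTRUCTORS = [pyspark_lazy_constructor]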

@EdAbati EdAbati marked this pull request as ready for review October 13, 2024 17:31
@EdAbati (Contributor, Author) commented Oct 13, 2024

This is finally ready for review 🥵

I still have to fix the tests on Windows (I think I may need to set up Java or something similar) and the tests with the old version of pandas (which is not compatible with PySpark).
But the rest should be ready!

I tried my best to keep this as small as possible while still implementing the main functionality. Let me know what you think.

Comment on lines +204 to +211
if self._backend_version < (3, 4) or parse_version(np.__version__) > (2, 0):
    from pyspark.sql.functions import stddev

    _ = ddof
    return stddev(_input)
from pyspark.pandas.spark.functions import stddev

return stddev(_input, ddof=ddof)
@EdAbati (Contributor, Author) commented Oct 14, 2024

Not sure this is ideal. Unfortunately stddev in PySpark SQL does not support ddof.
From 3.4 there is a function in the pandas namespace that supports it (but that one is not available with numpy 2.0.0).

Depending on which versions one has installed, std may return a different result :( Any ideas?

@FBruzzesi (Member) replied:

(The correct numbers will depend on whether Spark's default ddof changes, but for now it seems to be ddof=1.)

The formula for adjusting should be fairly easy, something along the lines of:

import pyspark.pandas.spark.functions as F
N = F.length(_input)
return F.stddev(_input) * F.sqrt((N-1)/(N-ddof))
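For what it's worth, a self-contained sketch of that adjustment using only pyspark.sql.functions (so it avoids the pyspark.pandas helpers entirely); the function name here is made up for illustration:

from pyspark.sql import Column, functions as F


def stddev_with_ddof(_input: Column, ddof: int) -> Column:
    # Spark's stddev is the sample standard deviation (ddof=1);
    # rescale it to the requested ddof: std_ddof = std_1 * sqrt((n - 1) / (n - ddof))
    n = F.count(_input)
    return F.stddev(_input) * F.sqrt((n - F.lit(1)) / (n - F.lit(ddof)))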

@EdAbati (Contributor, Author) commented Oct 14, 2024

@MarcoGorelli @FBruzzesi what criteria did we use to decide the minimum supported versions? Popularity? Time of release?

@MarcoGorelli (Member) commented

I'd suggest the lowest one that's not too difficult to support 😄 I think it'd be OK to set it quite high here; we can always work on lowering it later if there's demand.

@EdAbati (Contributor, Author) commented Oct 15, 2024

Because of pyspark's current requirements, making all the tests pass is a bit tricky.

I decided to make 3.3.0 the minimum supported version; the most recent version is 3.5.2.
None of these is technically compatible* with Python 3.12. Is there a way to exclude pyspark from the 3.12 tests without affecting coverage?

* technically compatible: the Ubuntu tests on 3.12 seem fine, but Windows complains. Python 3.12 will be officially supported starting from 4.0.0: https://issues.apache.org/jira/browse/SPARK-44120
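One possible way to do that (a hedged sketch, reusing the hypothetical names from the conftest sketch earlier in this thread) is to gate the constructor registration on the interpreter version, so 3.12 jobs never import pyspark at all:

import sys

# pyspark will officially support Python 3.12 only from 4.0.0 (SPARK-44120)
if sys.version_info < (3, 12):
    LAZY_CONSTRUCTORS.append(pyspark_lazy_constructor)  # hypothetical shared list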

@FBruzzesi (Member) left a comment

Hey @EdAbati, this is an awesome effort! And the fact that the CI is failing only because of coverage, with no failing tests, is even more incredible!

I haven't worked with pyspark for some time, but I tried to look at it with a critical eye! I hope it helps.

Comment on lines +36 to +39
from narwhals._pyspark.dataframe import PySparkLazyFrame
from narwhals._pyspark.expr import PySparkExpr
from narwhals._pyspark.namespace import PySparkNamespace
from narwhals._pyspark.typing import IntoPySparkExpr
@FBruzzesi (Member):

Commenting here as the first encounter, and this is very nitpicky/opinionated: I would rather go with SparkLazyFrame and so on, since for pyarrow everything is just Arrow<class-name> and not PyArrow<class-name>.

Comment on lines +66 to +76
def collect(self) -> Any:
    import pandas as pd  # ignore-banned-import()

    from narwhals._pandas_like.dataframe import PandasLikeDataFrame

    return PandasLikeDataFrame(
        native_dataframe=self._native_frame.toPandas(),
        implementation=Implementation.PANDAS,
        backend_version=parse_version(pd.__version__),
        dtypes=self._dtypes,
    )
@FBruzzesi (Member):

A similar discussion happened when I opened #1042, with Marco's concern about how to collect for duckdb.
My opinion is that we should let the user decide which eager backend to collect to (maybe with one as the default).

I haven't used pyspark in a couple of years, but if pandas is not a dependency, then this collect may also fail.
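For illustration, a hedged sketch of what letting the caller choose the eager backend could look like, building on the snippet quoted above (the backend parameter and the dispatch are hypothetical, not part of this PR):

def collect(self, backend: str = "pandas") -> Any:
    # hypothetical: dispatch on the requested eager backend instead of
    # hard-coding pandas; unsupported backends fail loudly
    if backend != "pandas":
        msg = f"Collecting a PySpark frame to {backend!r} is not supported yet"
        raise NotImplementedError(msg)

    import pandas as pd  # ignore-banned-import()

    from narwhals._pandas_like.dataframe import PandasLikeDataFrame

    return PandasLikeDataFrame(
        native_dataframe=self._native_frame.toPandas(),
        implementation=Implementation.PANDAS,
        backend_version=parse_version(pd.__version__),
        dtypes=self._dtypes,
    )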

Comment on lines +93 to +98
if self._backend_version >= (3, 3, 0):
    spark_session = self._native_frame.sparkSession
else:  # pragma: no cover
    from pyspark.sql import SparkSession

    spark_session = SparkSession.builder.getOrCreate()
@FBruzzesi (Member):

My understanding is that 3.3 is the minimum we want/can support (?), so this could get cleaner?

Comment on lines +104 to +105
new_columns_list = [col.alias(col_name) for col_name, col in new_columns.items()]
return self._from_native_frame(self._native_frame.select(*new_columns_list))
@FBruzzesi (Member):

Nice 👌 how do aggregations/reductions behave?
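For context on that question: in Spark itself, selecting an aggregate expression collapses the frame to a single row, e.g.:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], schema=["a"])

# select with an aggregate expression returns a one-row DataFrame
df.select(F.sum("a").alias("a")).show()  # a single row with a = 6

(If I remember correctly, mixing aggregate and non-aggregate columns in one select without a groupBy raises an AnalysisException, so reductions may need separate handling.)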

Comment on lines +62 to +66
def func(df: PySparkLazyFrame) -> list[Column]:
    from pyspark.sql import functions as F  # noqa: N812

    _ = df
    return [F.col(col_name) for col_name in column_names]
@FBruzzesi (Member):

I am assuming that _ = df is just there to avoid complaints from linters. You can do:

Suggested change (replace the first version with the second):

def func(df: PySparkLazyFrame) -> list[Column]:
    from pyspark.sql import functions as F  # noqa: N812

    _ = df
    return [F.col(col_name) for col_name in column_names]

def func(_: PySparkLazyFrame) -> list[Column]:
    from pyspark.sql import functions as F  # noqa: N812

    return [F.col(col_name) for col_name in column_names]


def func(df: PySparkLazyFrame) -> list[Column]:
    cols = [c for _expr in parsed_exprs for c in _expr._call(df)]
    col_name = get_column_name(df, cols[0])
    return [reduce(operator.and_, cols).alias(col_name)]
@FBruzzesi (Member):

I was not able to find any docs on Column.<__and__|__or__>, happy to see they just work
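For reference, PySpark's Column overloads the Python bitwise operators (& and |, plus ~ for negation) for boolean logic, which is exactly what reduce(operator.and_, cols) relies on; a tiny example:

import operator
from functools import reduce

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, "x"), (5, 1, None)], schema=["a", "b", "c"])

conditions = [F.col("a") > 1, F.col("b") < 3, F.col("c").isNotNull()]
# builds (a > 1) AND (b < 3) AND (c IS NOT NULL)
combined = reduce(operator.and_, conditions)
df.filter(combined).show()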

dtypes=self._dtypes,
)

def __add__(self, other: PySparkExpr) -> Self:
@FBruzzesi (Member):

Would other need to be parsed into a column (e.g. if it is an expr)? The same goes for all the other dunder methods.
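A hedged illustration of the concern (the helper below is hypothetical, not the PR's code): before delegating to the Spark Column operator, other would need to be evaluated against the frame when it is itself an expression, and wrapped as a literal otherwise:

from typing import Any

from pyspark.sql import Column


def _as_column(df: "PySparkLazyFrame", other: Any) -> Column:  # hypothetical helper
    if hasattr(other, "_call"):  # another PySparkExpr: evaluate it against df
        return other._call(df)[0]
    from pyspark.sql import functions as F  # noqa: N812

    return F.lit(other)  # plain Python scalar: treat it as a literal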

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

[Enh]: Add Support For PySpark
3 participants