Update Spark-RAPIDS-ML PCA #440
Conversation
Signed-off-by: Rishi Chandra <rishic@nvidia.com>
Ideally, this PR should also delete the legacy PCA-related code.
We should retain/review the instructions/pointers for starting the cluster, running the notebook, and installing dependencies.
Keep for standalone startup?
See the updated README - it follows the Spark-DL instructions to launch the standalone cluster from the CLI rather than having separate scripts. Let me know how it looks.
Keep for standalone startup?
```sh
${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-worker.sh -c ${CORES_PER_WORKER} -m 16G ${MASTER}

# start jupyter with pyspark
${SPARK_HOME}/bin/pyspark --master ${MASTER} \
```
If Spark is started like this, you'll likely have to add many (if not all) of the configs now in the notebook cell to this command. You should verify (e.g. enabling the ETL plugin, GPU resources per executor, etc.).
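For illustration, a sketch of the kinds of configs in question if the session were instead built in Python (the master URL and values are assumptions; the plugin class and GPU resource keys are the standard spark-rapids ones):

```python
from pyspark.sql import SparkSession

# Sketch only: illustrative values for the configs that would otherwise
# have to move onto the pyspark launch command.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")                       # assumed standalone master URL
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # enable the RAPIDS ETL plugin
    .config("spark.executor.resource.gpu.amount", "1")      # one GPU per executor
    .config("spark.task.resource.gpu.amount", "1")          # GPU share per task (often fractional)
    .getOrCreate()
)
```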
The most recent commit sets up the standalone cluster and all the configs in a shell script. For the CI folks, I have a cell that conditionally creates the session if it's not already initialized - verified this works with jupyter nbconvert. Will poke around more though to see if we can avoid some of this code repetition.
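Roughly, such a conditional cell might look like the sketch below (the app name and builder args are assumptions, not the exact cell in this PR):

```python
from pyspark.sql import SparkSession

# If the notebook was launched via `pyspark`, a `spark` variable is already
# injected; only build a new session (e.g. under CI/nbconvert) if it isn't.
if 'spark' not in globals():
    spark = (
        SparkSession.builder
        .appName("spark-rapids-ml-pca")  # hypothetical app name
        .getOrCreate()
    )
```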
The complicating factor is that the README instructions start Jupyter with a Spark context, so some configs need to be set at the time the Spark context is created, while CI needs the Spark context to be started in the notebook. So some duplication is needed, unless the README just starts a normal Jupyter server (without Spark). But better to keep the current instructions for now.
" return spark\n", | ||
"\n", | ||
"# Check if Spark session is already active, if not, initialize it\n", | ||
"if 'spark' not in globals():\n", |
This is fine, but it's probably OK to just run the above even if Spark is already initialized (e.g. if following the README instructions).
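That works because `SparkSession.builder.getOrCreate()` returns the active session when one exists; a minimal sketch:

```python
from pyspark.sql import SparkSession

# Safe to run unconditionally: returns the existing session if pyspark
# already started one (note: builder configs are ignored in that case).
spark = SparkSession.builder.getOrCreate()
```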
👍
Update the outdated Scala PCA example to use Python-based Spark-RAPIDS-ML.
Minor changes to the Scala example's dataset for the speedup demonstration (100k rows vs. 50k rows; float32 vs. float64).
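For context, Spark-RAPIDS-ML's PCA mirrors the pyspark.ml estimator API; a minimal sketch of the updated usage (column names and `k` are illustrative, not taken from the notebook):

```python
from spark_rapids_ml.feature import PCA

# GPU-accelerated PCA with a pyspark.ml-style fit/transform API.
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(df)              # df: DataFrame with a "features" vector/array column
projected = model.transform(df)  # adds the projected "pca_features" column
```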