Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Spark-RAPIDS-ML PCA #440

Merged
merged 8 commits into from
Oct 8, 2024

Conversation

rishic3
Copy link
Contributor

@rishic3 rishic3 commented Sep 30, 2024

Update the outdated Scala PCA example to use Python-based Spark-RAPIDS-ML.
Minor changes to Scala dataset for speedup demonstration (using 100k rows vs. 50k rows, using float32 vs. float64).

Signed-off-by: Rishi Chandra <rishic@nvidia.com>
@rishic3 rishic3 marked this pull request as ready for review September 30, 2024 20:21
@eordentlich
Copy link
Collaborator

Ideally, this PR should also delete the legacy pca related code.

Copy link
Collaborator

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should retain/review instructions/pointers to starting cluster, running the notebook, and installing dependencies.

examples/ML+DL-Examples/Spark-cuML/pca/README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep for standalone startup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See updated README - following the Spark-DL instructions to launch the standalone cluster from CLI rather than having separate scripts. Let me know how it looks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep for standalone startup?

${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-worker.sh -c ${CORES_PER_WORKER} -m 16G ${MASTER}

# start jupyter with pyspark
${SPARK_HOME}/bin/pyspark --master ${MASTER} \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If spark is started like this, likely have to add (many of if not all of) the configs now in the notebook cell to this command. You should verify (e.g. enabling the etl plugin, gpu resource per executor, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most recent commit sets up the standalone cluster and all the configs in a shell script. For CI folks I have a cell that conditionally creates the session if not already initialized - verified this works with jupyter nbconvert.

Will poke around more though to see if we can avoid some of this code repetition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complicating factor is that the readme instructions start jupyter with a spark context so some configs need to be set up at the time the spark context is created. CI needs the spark context to be started in the notebook. So some duplication is needed, unless the readme just starts normal jupyter server (without spark). But better to keep the current instructions for now.

" return spark\n",
"\n",
"# Check if Spark session is already active, if not, initialize it\n",
"if 'spark' not in globals():\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but it probably is ok to just run the above even if spark is initialized (e.g. if following README instructions).

Copy link
Collaborator

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@eordentlich eordentlich merged commit 8bc8f9e into NVIDIA:branch-24.10 Oct 8, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants