Update Spark-RAPIDS-ML PCA #440
Conversation
Signed-off-by: Rishi Chandra <rishic@nvidia.com>
Ideally, this PR should also delete the legacy PCA-related code.
We should retain/review the instructions/pointers for starting the cluster, running the notebook, and installing dependencies.
Keep for standalone startup?
See the updated README - it follows the Spark-DL instructions to launch the standalone cluster from the CLI rather than having separate scripts. Let me know how it looks.
Keep for standalone startup?
```sh
${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-worker.sh -c ${CORES_PER_WORKER} -m 16G ${MASTER}

# start jupyter with pyspark
${SPARK_HOME}/bin/pyspark --master ${MASTER} \
```
If Spark is started like this, you'll likely have to add many (if not all) of the configs now in the notebook cell to this command. You should verify (e.g. enabling the ETL plugin, GPU resources per executor, etc.).
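For illustration, a sketch of the kinds of configs in question if the session were instead built in Python (the master URL and values are assumptions; the plugin class and GPU resource keys are the standard spark-rapids ones):

```python
from pyspark.sql import SparkSession

# Sketch only: illustrative values for the configs that would otherwise
# have to move onto the pyspark launch command.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")                       # assumed standalone master URL
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  # enable the RAPIDS ETL plugin
    .config("spark.executor.resource.gpu.amount", "1")      # one GPU per executor
    .config("spark.task.resource.gpu.amount", "1")          # GPU share per task (often fractional)
    .getOrCreate()
)
```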
The most recent commit sets up the standalone cluster and all the configs in a shell script. For the CI folks, I have a cell that conditionally creates the session if it's not already initialized - verified this works with jupyter nbconvert. Will poke around more though to see if we can avoid some of this code repetition.
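Roughly, such a conditional cell might look like the sketch below (the app name and builder args are assumptions, not the exact cell in this PR):

```python
from pyspark.sql import SparkSession

# If the notebook was launched via `pyspark`, a `spark` variable is already
# injected; only build a new session (e.g. under CI/nbconvert) if it isn't.
if 'spark' not in globals():
    spark = (
        SparkSession.builder
        .appName("spark-rapids-ml-pca")  # hypothetical app name
        .getOrCreate()
    )
```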
The complicating factor is that the README instructions start Jupyter with a Spark context, so some configs need to be set at the time the Spark context is created, while CI needs the Spark context to be started in the notebook. So some duplication is needed, unless the README just starts a normal Jupyter server (without Spark). But better to keep the current instructions for now.
" return spark\n", | ||
"\n", | ||
"# Check if Spark session is already active, if not, initialize it\n", | ||
"if 'spark' not in globals():\n", |
This is fine, but it's probably OK to just run the above even if Spark is already initialized (e.g. if following the README instructions).
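That works because `SparkSession.builder.getOrCreate()` returns the active session when one exists; a minimal sketch:

```python
from pyspark.sql import SparkSession

# Safe to run unconditionally: returns the existing session if pyspark
# already started one (note: builder configs are ignored in that case).
spark = SparkSession.builder.getOrCreate()
```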
👍
Update the outdated Scala PCA example to use Python-based Spark-RAPIDS-ML.
Minor changes to the Scala example's dataset for the speedup demonstration (100k rows vs. 50k rows; float32 vs. float64).
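For context, Spark-RAPIDS-ML's PCA mirrors the pyspark.ml estimator API; a minimal sketch of the updated usage (column names and `k` are illustrative, not taken from the notebook):

```python
from spark_rapids_ml.feature import PCA

# GPU-accelerated PCA with a pyspark.ml-style fit/transform API.
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(df)              # df: DataFrame with a "features" vector/array column
projected = model.transform(df)  # adds the projected "pca_features" column
```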