I have been trying to run the RAPIDS ML benchmarks since last year with moderate success. I am using RAPIDS ML 24.08 with Spark 3.5.1.
I can run 4 workloads without any issues (i.e., linear regression, logistic regression*, random_forest_classifier, random_forest_regressor), but I am having issues running the rest of the workloads.
For 3 workloads (i.e., KMeans, PCA, UMAP), I can run datasets up to ~5GB, but they start to fail for >=8GB datasets when using a 16GB GPU with UVM enabled, due to "GPU OutOfMemory: could not split inputs and retry" and occasionally CPU OoM. This was mentioned in #680 and NVIDIA/spark-rapids#10567. I tried setting the number of concurrent tasks to just 1, but that did not help.
For KNN, I get an error when I use multiple GPUs, mainly "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:", even when I set the shuffle manager mode to multi-threaded. This does not happen with the other workloads despite the same configuration.
In 24.06, I was able to run KNN on 2~4 GPUs, but I had issues running datasets that require the use of UVM for 3 and 4 GPUs. When using 3 GPUs, I got "ucp._libs.exceptions.UCXUnreachable: <stream_recv>:" only when UVM is required, but it was able to finish running the application after multiple retries. When using 4 GPUs, I got KeyError: '.' when UVM is required.
For DBSCAN, I am getting java.lang.ClassCastException: class com.nvidia.spark.rapids.SerializedTableColumn cannot be cast to class com.nvidia.spark.rapids.GpuColumnVector (com.nvidia.spark.rapids.SerializedTableColumn and com.nvidia.spark.rapids.GpuColumnVector are in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @2fb266a)
I haven't tried out ANN yet.
My config file depends on the size of the dataset and the workload, but this is the config file I used for one of the workloads.
Spark conf
# spark.master yarn
spark.master spark://master:7077
# spark.rapids.sql.concurrentGpuTasks 1
spark.rapids.sql.concurrentGpuTasks 2
spark.driver.memory 100g
spark.executor.memory 50g
spark.executor.cores 4
spark.executor.resource.gpu.amount 1
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
spark.rapids.memory.pinnedPool.size 50G
# spark.sql.files.maxPartitionBytes 128m
spark.sql.files.maxPartitionBytes 512m
spark.plugins com.nvidia.spark.SQLPlugin
spark.executor.resource.gpu.discoveryScript ./get_gpus_resources.rb
spark.executorEnv.PYTHONPATH /home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.jars /home/ysan/fr/jars/rapids-4-spark_2.12-24.08.1.jar
spark.rapids.ml.uvm.enabled true
spark.dynamicAllocation.enabled false
spark.executor.extraJavaOptions "-Duser.timezone=UTC"
spark.driver.extraJavaOptions "-Duser.timezone=UTC"
spark.sql.cache.serializer com.nvidia.spark.ParquetCachedBatchSerializer
spark.rapids.memory.gpu.pool NONE
spark.sql.execution.sortBeforeRepartition false
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 50g
spark.rapids.sql.batchSizeBytes 128m
spark.sql.adaptive.enabled false
spark.executorEnv.UCX_ERROR_SIGNALS ""
spark.executorEnv.UCX_MEMTYPE_CACHE n
spark.executorEnv.UCX_IB_RX_QUEUE_LEN 1024
spark.executorEnv.UCX_TLS cuda_copy,cuda_ipc,tcp,rc
spark.executorEnv.UCX_RNDV_SCHEME put_zcopy
spark.executorEnv.UCX_MAX_RNDV_RAILS 1
spark.executorEnv.NCCL_DEBUG INFO
spark.rapids.shuffle.manager com.nvidia.spark.rapids.spark351.RapidsShuffleManager
spark.rapids.shuffle.mode UCX
spark.shuffle.service.enabled false
spark.executorEnv.CUPY_CACHE_DIR /tmp/cupy_cache
spark.executorEnv.NCCL_DEBUG INFO
spark.submit.deployMode client
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE docker # does not do anything, leftover from attempting to use containers with YARN without success
spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9 # does not do anything, leftover from attempting to use containers with YARN without success
spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE docker # does not do anything, leftover from attempting to use containers with YARN without success
spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE nvidia/cuda:12.6.1-devel-ubi9 # does not do anything, leftover from attempting to use containers with YARN without success
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.endpoint [ENDPOINT]
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.attempts.maximum 1
spark.hadoop.fs.s3a.connection.establish.timeout 1000000
spark.hadoop.fs.s3a.connection.timeout 1000000
spark.hadoop.fs.s3a.connection.request.timeout 0
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory file:///tmp/spark-events
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
spark.logConf true
spark.executor.heartbeatInterval 1000000s
spark.network.timeout 10000001s
spark.sql.broadcastTimeout 1000000
spark.executorEnv.NCCL_DEBUG WARN # does not really do anything
spark.driverEnv.NCCL_DEBUG WARN # does not really do anything
spark.pyspark.python [PATH]/.venv/bin/python
spark.pyspark.driver.python [PATH]/.venv/bin/python
spark.sql.execution.arrow.maxRecordsPerBatch <value> # value depends on dataset
spark.driver.maxResultSize 0 # only for blob datasets
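For reference, the number of tasks Spark can schedule concurrently on each executor (and therefore on each GPU, with one GPU per executor) follows from the resource settings above. The snippet below is a minimal sketch of that arithmetic, not part of the benchmark scripts: `parse_conf` and `task_slots_per_executor` are hypothetical helper names, and `SAMPLE` copies only the four relevant lines from the conf. It assumes standard Spark scheduling, where an executor's task slots are the smaller of the CPU-derived and GPU-derived limits.

```python
import math

# Subset of the conf above that determines task slots per executor.
SAMPLE = """
spark.executor.cores 4
spark.executor.resource.gpu.amount 1
spark.task.cpus 1
spark.task.resource.gpu.amount 0.25
"""


def parse_conf(text):
    """Parse spark-defaults.conf-style text into a dict.

    Blank lines and lines starting with '#' are skipped. If a key
    appears more than once, the last occurrence wins.
    """
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def task_slots_per_executor(conf):
    """Return how many tasks can run at once on a single executor."""
    cpu_slots = int(conf["spark.executor.cores"]) // int(conf["spark.task.cpus"])
    gpu_ratio = float(conf["spark.executor.resource.gpu.amount"]) / float(
        conf["spark.task.resource.gpu.amount"]
    )
    # Small epsilon guards against floating-point error in the division.
    gpu_slots = math.floor(gpu_ratio + 1e-9)
    return min(cpu_slots, gpu_slots)


print(task_slots_per_executor(parse_conf(SAMPLE)))  # 4
```

With these values, up to 4 tasks can share one GPU. This limit is separate from spark.rapids.sql.concurrentGpuTasks, and whether either setting is related to the failures described here has not been established.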
* - I mentioned this on Slack: the number of iterations varies only when UVM is required, which does not happen with CPU-based Spark MLlib. According to the answers I got on Slack, this is inevitable due to floating-point approximations, so I'll consider this workload as having no issues.
EDIT 1: I realized that the final Spark conf used by Spark is different since RAPIDS and Spark add some internal configs. This is what the final conf looks like for the failed application running KNN with 3 GPUs.
EDITS 2-4: I am currently trying to see whether upgrading to Apache Spark 3.5.3 (EDIT 3: Spark 3.5.2, since 24.10 does not support Spark 3.5.3) and RAPIDS 24.10 resolves some of the issues. I just tried it on KNN and DBSCAN. KNN still gives me "ucp._libs.exceptions.UCXUnreachable: <stream_recv>: ", but DBSCAN now gives a different error (java.lang.OutOfMemoryError: Java heap space). Two of the three workloads that failed for 8GB datasets seem to work now, since they can handle 18GB datasets, but I still have to confirm.
I also forgot to mention this before: for PCA, generating the dataset can fail because the dimensions do not match for certain numbers of rows. An error message stating that the failure is due to the number of rows might be a good idea.
hi @an-ys, thank you for trying out Spark Rapids ML and sharing the issues you encountered. Please let us know anytime if the issues persist after disabling GPU pooling.