Skip to content
This repository has been archived by the owner on Jun 3, 2021. It is now read-only.

Latest commit

 

History

History
44 lines (32 loc) · 4.57 KB

readme.md

File metadata and controls

44 lines (32 loc) · 4.57 KB

Autorelease

Spark Benchmark Runner

This repo contains tools to run 2 industry standard benchmark suites:

  1. TPC-DS is an open benchmark suite for structured data systems. This utility aims to make it easy to generate TPC-DS data, and to run TPC-DS benchmarks against different versions of Spark. The main use case for this utility is to test performance at scale when evaluating changes to Spark or to its underlying infrastructure. The benchmarks run SQL queries against structured datasets. This utility is thus not useful for running tests in streaming workflows.

  2. Sort Benchmark: This is a single benchmark that sorts a large amount of data generated by the gensort program.

The benchmark suite can be run on MacOS or CentOS 6+. It does not currently support running on Windows.

Usage

The benchmark suite requires a storage layer, distributed (such as HDFS/S3/Azure Blob Storage) or local to store the generated test data, as well as the computation results. This tool supports running the benchmarks either in local spark mode on a single JVM, or with a cluster manager, such as YARN when running distributed benchmarks on several machines.

  • Download the latest version of the distribution from https://bintray.com/palantir/releases/spark-tpcds-benchmark.
  • Upload and unpack the distribution to a node in the cluster.
  • In the distribution, edit var/conf/config.yml to match the benchmarking environment you will run with. Documentation for the various configurable options are described in the config.yml file.
    • Storage Layer:
      • This tool supports any Hadoop compatible storage layer (eg S3/ABS/HDFS). Once that is setup, the credentials and account details can be updated in the hadoop configuration section. Placeholder configuration blocks are provided for S3, ABS and HDFS.
    • Compute Layer:
      • When running with local spark, the spark configuration section in config.yml should work out of the box.
      • When running on a cluster manager, the cluster first needs to be installed and configured. If you use YARN, this and this are good places to start. Once that is done, the spark and hadoop configuration sections need to be changed to point to the cluster manager.
    • Ephemeral Disks
      • We recommend setting hadoop.tmp.dir to a fast SSD drive for each machine. It is set to a subfolder in /scratch by default.
      • On AWS, we typically use m5d/r5d instance types, which already come with NVMe SSD ephemeral disks, but are not mounted anywhere. We use this script to mount it to scratch. These already come with hardware level encryption, so no LUKS encryption is necessary.
      • On Azure, we typically use hc44rs or d48ds_v4 instance types. These come with SSD ephemeral disks and aren't mounted either. They also do not have hardware level encryption as of the time of writing (July 2020). We use this script to mount and LUKS encrypt them.
  • Set the JAVA_HOME environment variable to point to Java 11.
  • Run service/bin/init.sh start. The benchmarks will begin running in the background. The driver exits upon completing the benchmark suite.

The performance results of running the benchmarks can be found in JSON files located under benchmark_results/ in the specified metrics filesystem. You may use a Spark shell to load these JSON files into DataFrames for additional analysis, or download these JSON files for processing by other tools. Results are grouped by data scale defined by the configuration's dataScalesGb, described below.

For TPC-DS, the correctness of the computation is checked against the results of previous executions of the benchmark against the same set of data. If the source data is regenerated and the previous source data is overwritten, the computation results from previous runs are also invalidated.