README

/* DEDISbench
 * (c) 2010 U. Minho. Written by J. Paulo
 */
 
I/O benchmark:
The benchmark is written in C and can simulate several processes concurrently writing and reading fixed-size 
blocks in multiple files or in a raw device. The location of each write operation is 
given by the NURand function from the TPC-C benchmark, which generates hotspots for I/O 
operations. A realistic distribution extracted from real storage systems is 
used to generate the blocks' content. It is also possible to 
use other distributions by loading them from a custom file generated with DEDISgen. DEDISbench provides two 
workload modes. The first reproduces a fully or peak loaded system, as Bonnie++ does, and 
performs as many write I/O operations per second as possible. The second reproduces 
a system under a reasonable nominal load, and can be useful for understanding the 
behavior of aliasing operations in a stable system with a limited I/O throughput.
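
As an illustration of the hotspot behavior described above, the following Python sketch implements NURand as
defined in the TPC-C specification. How DEDISbench maps the result onto block offsets, and which A and C
constants it uses, is not stated in this README: A = 1023, C = 0 and the helper name pick_block are
assumptions chosen only for this example.

```python
import random
from collections import Counter

def nurand(rng, a, x, y, c=0):
    """TPC-C NURand(A, x, y): non-uniform random integer in [x, y]."""
    return (((rng.randint(0, a) | rng.randint(x, y)) + c) % (y - x + 1)) + x

def pick_block(rng, total_blocks, a=1023):
    """Hypothetical helper: choose a block index in [0, total_blocks - 1]."""
    return nurand(rng, a, 0, total_blocks - 1)

rng = random.Random(42)  # fixed seed so that runs are repeatable
total_blocks = 4096
samples = 20000
hits = Counter(pick_block(rng, total_blocks) for _ in range(samples))

# Share of accesses absorbed by the 10% most frequently chosen blocks.
top = sum(n for _, n in hits.most_common(total_blocks // 10))
hot_share = top / samples
print(f"top 10% of blocks receive {hot_share:.0%} of accesses")
```

The bitwise OR of two uniform values favors indexes with many bits set, which is what concentrates
accesses on a subset of blocks.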

For more information regarding the DEDISbench algorithm, see these two published papers:
- "DEDISbench: A Benchmark for Deduplicated Storage Systems"
- "Towards an Accurate Evaluation of Deduplicated Storage Systems"


DEDISbench distributions:

DEDISbench comes with three distinct distributions extracted from real storage systems with different requirements and assumptions:

- An Archival storage where most files have a write-once policy, with non-significant updates, but with sporadic data deletion.
  This distribution is called dist_archival.
- A Personal Files storage where some files are updated and deleted frequently and the I/O request latency is expected to be lower 
  than the one found in archival storages. This distribution is called dist_personalfiles.
- A High Performance storage where most files are updated and deleted frequently and I/O latency is expected to be as low as possible.
  This distribution is called dist_highperf.
  
All these distributions are available and can be simulated by DEDISbench with the option -g<distribution name>. The same option 
also accepts a custom distribution file as input.

NOTE: by default DEDISbench uses the Personal Files Storage distribution.


Duplicate distribution generator: DEDISgen

The binary DEDISgen analyses the distribution of duplicates found
in a specific folder and its subfolders, or in a disk device. If specified in the arguments (with option -o), it generates 
an output distribution file that can be consumed by DEDISbench to simulate that duplicate distribution.

DEDISgen is packaged together with DEDISgenutils, which may be needed for more complex analysis of duplicated data.
For instance, if several datasets must be analysed separately and then merged together to generate the output distribution file for
DEDISbench, these tools can be used as explained below.


Installation

The only required libraries should be libc6-dev, libdb-dev and libssl-dev.

Run ./configure, make and make install in the DEDISbench folder.

Running the Benchmark

Required flags

 -p or -n<value>		Peak or Nominal Load with throughput rate of N operations per 
 						second. If a mixed benchmark of reads and writes is selected (-m), use -nr<value> and -nw<value>
 						for the nominal rate of reads and writes respectively.
 
 -w or -r or -m			Write or Read Benchmark or a mix of write and read operations.
 
 -t<value> or -s<value> Benchmark duration (-t) in Minutes or amount of data to write (-s) in MB
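
The difference between the two load modes can be pictured as a pacing decision. The sketch below is an
assumption about how a nominal rate could be enforced, not a description of DEDISbench's actual scheduler:
next_deadlines and total_ops are hypothetical helpers. In nominal mode operations are spaced 1/N seconds
apart, whereas in peak mode the next operation would simply be issued as soon as the previous one completes.

```python
def next_deadlines(rate_per_second, n_ops, start=0.0):
    """Return the scheduled start time (in seconds) of each of n_ops operations."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    interval = 1.0 / rate_per_second
    return [start + i * interval for i in range(n_ops)]

def total_ops(rate_per_second, minutes):
    """Number of operations issued over a run of the given duration."""
    return rate_per_second * minutes * 60

# Corresponds to -n300 -t10: 300 operations/second for 10 minutes.
schedule = next_deadlines(300, 5)
ops = total_ops(300, 10)
print(schedule, ops)
```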

Optional Flags
 -a<value>			Access pattern for I/O operations: 0-Sequential, 1-Random Uniform, 2-TPCC random 
 					default:2
	
 -c<value>			Number of concurrent processes default:4. Each process has an 
 					independent file associated or a common device if -i flag is used
 
 -f<value>			Size of the file of each process in MB. If -i flag is used, this parameter defines
 					the size of the raw device. default:2048 MB
 -d<value>			choose the directory where DEDISbench writes data
 
 -i<value>			Processes write/read from a raw device instead of having an independent file.
 					If more than one process is defined, each process is assigned an independent portion of the raw device,
 					depending on the raw device size. By default, if this flag is not set, each process writes to an individual file. 
	
 -l					I/O latency results are written to a log file to extract additional 
 					statistics. Each process writes these values to a file called result(processid), where
 					each line corresponds to a single I/O operation and has the format: 
 					(latency of I/O operation in microseconds) (current time in seconds) \n
 
 -e or -y			Enable or disable the population of process files before running 
 					DEDISbench. The test(processid) files in the DEDISbench directory are the files used by 
 					each process. By default, empty files are created for write tests and 
 					files populated with content are created for read tests.
 					It is possible to enable this feature for write tests to assess rewrite 
 					operations, or to disable it and use a custom file that respects the naming convention.
 
 -b<value>			Size of blocks for I/O operations in Bytes default: 4096
 
 -g<value>			Input file with duplicate distribution. default: internal file called dist_personalfiles
 					DEDISbench can simulate three real distributions extracted respectively from an Archival, Personal Files and High Performance Storage.
 					To choose one of these distributions, <value> must be dist_archival, dist_personalfiles or dist_highperf respectively.
 					The input file details the number of blocks with a certain number of duplicates
 					and the format is: <number_duplicates> <number_blocks>
 					See below for more info on customizing distribution files and above for info on the default distributions.
 
 -v<value>			Seed for random generator default:current time. Useful for repeating
 					tests with the same settings.
 -o<value>			Generate an output log with the distribution actually generated by the benchmark
 -k<value>			Generate an output log with the access pattern generated by the benchmark
 
 -z					Disable the destruction of process temporary files generated by the benchmark
 -j<value>			Write the output of DEDISbench to the given file path. This feature also writes two additional files with the same name
 					as given in <value> and a snaplat and snapthr suffix, which show the latency and throughput average values 
 					for 30 seconds intervals
 -o					The output of DEDISbench gives the exact number of unique and duplicate blocks written.
 					(This option requires using additional shared memory)
 -x<value>			I/O Operations synchronization: 0-without fsync and O_DIRECT, 1-O_DIRECT, 2-fsync, 3-both. default: 0
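
As a sketch of how the per-process log files written with -l might be post-processed, the Python snippet
below parses lines of the form "(latency in microseconds) (current time in seconds)" and reports basic
statistics. The exact number formatting of the log is not specified in this README, so whitespace-separated
numeric fields are an assumption, and summarize_latency is a hypothetical helper name.

```python
import statistics

def summarize_latency(lines):
    """Parse '<latency_us> <time_s>' lines and return simple statistics."""
    latencies = []
    timestamps = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        latencies.append(float(fields[0]))
        timestamps.append(float(fields[1]))
    if not latencies:
        return None
    return {
        "operations": len(latencies),
        "mean_us": statistics.mean(latencies),
        "max_us": max(latencies),
        "span_s": max(timestamps) - min(timestamps),
    }

# Made-up sample in the documented format, for illustration only.
sample = ["120 1000", "80 1000", "100 1001", ""]
stats = summarize_latency(sample)
print(stats)
```

To process a real log, an open file object for one of the result(processid) files can be passed in place of
the sample list.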

Examples:

run write benchmark in peak mode for 5 minutes 
./DEDISbench -p -w -t5

run read benchmark in nominal mode (300 reads/second) for 10 minutes
./DEDISbench -n300 -r -t10

run write benchmark in peak mode for 5 minutes. Enable the population of process files
before actually running the benchmark load and enable log output for results.
Use files with 4GB for each process and 8 processes.
./DEDISbench -p -w -t5 -e -l -f4096 -c8

run read benchmark in nominal mode (300 reads/second) for 8 minutes.
Disable the pre-population of files. Careful: the files test(processid) must be 
present and have content, otherwise no content will be available for reading, resulting in
I/O errors. 

./DEDISbench -n300 -r -t8 -y

run an 8-minute mixed benchmark with a nominal load (100 reads/s and 50 writes/s) on a raw device with 4GB.

./DEDISbench -m -nr100 -nw50 -t8 -f4096 -ipath/to/rawdevice
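
To make the raw-device layout used with -i more concrete, the following sketch splits a device of the size
given with -f into one region per process. That the regions are equal-sized, contiguous and block-aligned is
an assumption made for illustration: this README only states that each process gets an independent portion
that depends on the device size. device_regions is a hypothetical helper.

```python
def device_regions(device_mb, processes, block_size=4096):
    """Return (start_offset, length) in bytes for each process, block-aligned."""
    total_blocks = (device_mb * 1024 * 1024) // block_size
    blocks_each = total_blocks // processes
    return [(p * blocks_each * block_size, blocks_each * block_size)
            for p in range(processes)]

# 4GB device (-f4096), default of 4 processes (-c4), 4096-byte blocks (-b4096).
regions = device_regions(4096, 4)
for pid, (start, length) in enumerate(regions):
    print(f"process {pid}: offset {start}, {length // (1024 * 1024)} MB")
```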

Running the generator DEDISgen:

 -f or -d       (Find duplicates in folder -f or in a Disk Device -d)
 -p<value>      (Path for the folder or disk device)
 -o<value>		(Path for the output distribution file. 
 				 This is only necessary if distribution file of duplicates is going to be generated)

 Optional Parameters

 -z<value>		(Path for the folder where duplicates databases are created default: ./gendbs/)
 				- the duplicate db stores the duplicates information as hash->number_dups
 -b<value>      (Size of blocks for I/O operations in Bytes default: 4096)


Examples

//generate duplicate distribution for files in folder /dir/duplicates/ and its subfolders
./DEDISgen -f -p/dir/duplicates/

//generate duplicate distribution for device /path/device. Use a block size of 8192 Bytes (8KB)
./DEDISgen -d -p/path/device -b8192 
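
Conceptually, finding duplicates amounts to hashing each fixed-size block and counting how often each hash
recurs. The sketch below shows that idea in Python on an in-memory buffer. The choice of SHA-1, the handling
of a trailing partial block, and the helper names count_duplicates and to_distribution are assumptions, not
a description of DEDISgen's internals.

```python
import hashlib
from collections import Counter

def count_duplicates(data, block_size=4096):
    """Map block hash -> number of duplicates (occurrences beyond the first)."""
    occurrences = Counter()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        occurrences[hashlib.sha1(block).hexdigest()] += 1
    return {digest: n - 1 for digest, n in occurrences.items()}

def to_distribution(dups_by_hash):
    """Map number of duplicates -> number of distinct blocks with that count."""
    return dict(sorted(Counter(dups_by_hash.values()).items()))

# Three identical blocks of 'a' and one block of 'b', using 8192-byte blocks (-b8192).
data = b"a" * 8192 * 3 + b"b" * 8192
dups = count_duplicates(data, block_size=8192)
dist = to_distribution(dups)
print(dist)
```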


Deduplication distribution FILE:

This file describes the number of blocks with a specific number of duplicates and has the following format:
<number duplicates> <number blocks>

Example file:

0 1000
1 500
5 10

There are 1000 blocks without any duplicate, 500 blocks with one duplicate and 10 blocks with 5 duplicates.
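
A distribution file in this format is easy to inspect with a few lines of Python. The sketch below parses the
example above and derives two totals. It assumes that each line counts distinct blocks and that a block with
d duplicates appears d + 1 times in the data set; parse_distribution and totals are hypothetical helper names.

```python
def parse_distribution(text):
    """Parse '<number duplicates> <number blocks>' lines into a list of pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs

def totals(pairs):
    """Return (distinct blocks, total blocks including duplicate copies)."""
    distinct = sum(blocks for _, blocks in pairs)
    written = sum((dups + 1) * blocks for dups, blocks in pairs)
    return distinct, written

example = "0 1000\n1 500\n5 10\n"
pairs = parse_distribution(example)
distinct, written = totals(pairs)
print(distinct, written)
```

Under these assumptions the example file describes 1510 distinct blocks and 2060 blocks in total.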

****** NEW DEDISgenutils ****************

UTILS for DEDISgen:

 -m and -o can be used together or individually if only one operation is intended
 -m<db1 folder path> <db2 folder path>		Merges hashes and duplicates of db1 into db2. 
 -o<db folder path> <outputfile>			Generates the output distribution file for the duplicates db, written to <outputfile>.

WARNING: these are meant to be used on duplicates databases generated by DEDISgen that map (hash->number-duplicates). 
They are not intended for distribution dbs that generate the distribution outputs. The duplicates databases are in the 
database folder, which can be specified by the user with the option -z, and by default are in ./gendbs/duplicatedb

Examples:

- Merging the duplicates found separately, with DEDISgen, for dataset 1 (in db1) and dataset 2 (in db2)

//Merge the two duplicates databases. Merge db1 into db2 - WARNING db2 info will be updated!!!!
./DEDISgenutils -mdb1 db2

//generate output with distribution for db1+db2
./DEDISgenutils -odb2 output_dist
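
The two steps above can be pictured with plain dictionaries standing in for the duplicates databases. This is
only a sketch: it assumes each database maps a hash to its number of duplicates (occurrences minus one), so a
hash present in both databases gains one extra duplicate when merged. The real on-disk format and merge rule
are not described in this README, and merge_dbs and distribution are hypothetical helper names.

```python
from collections import Counter

def merge_dbs(db1, db2):
    """Merge db1 into db2 in place (hash -> number of duplicates)."""
    for digest, dups in db1.items():
        if digest in db2:
            # The first copy counted in db1 is itself a duplicate of the copy in db2.
            db2[digest] += dups + 1
        else:
            db2[digest] = dups
    return db2

def distribution(db):
    """Map number of duplicates -> number of distinct blocks with that count."""
    return dict(sorted(Counter(db.values()).items()))

db1 = {"h1": 0, "h2": 1}
db2 = {"h2": 0, "h3": 2}
merge_dbs(db1, db2)  # db2 is updated, as with -mdb1 db2
merged_dist = distribution(db2)
print(merged_dist)
```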


For more information please contact:
jtpaulo at di.uminho.pt