This guide outlines how to create a profile and contains information on the syntax of the DataHelix schema.
- If you are new to DataHelix, please read the Getting Started page.
- If you would like information on how to contribute to the project, as well as a technical overview of the key concepts and structure of DataHelix, then see the Developer Guide.
This section will walk you through creating basic profiles with which you can generate data.
Profiles are JSON documents consisting of two sections: the list of fields and the constraints.
- List of Fields - An array of column headings is defined with unique "name" keys.
"fields": [
{
"name": "Column 1"
},
{
"name": "Column 2"
}
]
- List of Constraints - An array of constraints that reduce the data in each column from the universal set to the desired range of values. They are formatted as JSON objects. There are two types of constraints:
- Predicate Constraints - predicates that define any given value as being valid or invalid
- Grammatical Constraints - used to combine or modify other constraints
"constraints": [
{
"field": "Column 1",
"equalTo": "foo"
},
{
"field": "Column 2",
"equalTo": "bar"
}
]
These sections are combined to form the complete profile.
{
"fields": [
{
"name": "Column 1",
"type": "string"
},
{
"name": "Column 2",
"type": "integer"
}
],
"constraints": [
{
"field": "Column 1",
"equalTo": "foo"
}
]
}
- See the datahelix playground to run and edit this profile online.
- For a larger profile example, see the schema documentation
- Further examples can be found in the Examples folder
Fields are the "slots" of data that can take values. Typical fields might be email_address or user_id. By default, any piece of data is valid for a field. This is an example field object for the profile:
{
"name": "field1",
"type": "decimal",
"nullable": false,
"formatting": "%.5s",
"unique": true
}
Each of the field properties is outlined below:
The name of the field in the output, which must be unique within the fields array. If generating into a CSV file or database, the name is used as the column name. If generating into JSON, the name is used as the key for object properties.
This is a required property.
The data type of the field. See Data types for more on how types work within DataHelix. Valid options are:
- decimal
- integer
- string
- date
- datetime
- time
- ISIN
- SEDOL
- CUSIP
- RIC
- firstname
- lastname
- fullname
- boolean
- faker.<method>.<type>
This is a required property.
Sets the field as nullable. When set to false it is the same as a not null constraint for the field.
This is an optional property of the field and defaults to false.
Used by output serialisers where string output is required.
For the formatting to be applied, the generated data must be applicable, and the value must be:
- a string recognised by Java's String.format method
- appropriate for the data type of the field
- not null (formatting will not be applied for null values)
Formatting will not be applied if not applicable to the field's value.
Note that currently integer datatypes must be formatted as if they were decimals. For example, to format an integer field to be used as part of a string, a value like "number: %.0f" should be used. See this example for another example of formatting integers, as well as examples of formatting other datatypes.
This is an optional property of the field object and will default to use no formatting.
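A minimal sketch of an integer field using this style of formatting (the field name is illustrative):
{
    "name": "count",
    "type": "integer",
    "formatting": "number: %.0f"
}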
Sets the field as unique. Unique fields cannot be used within grammatical constraints.
This is an optional property of the field object and will default to false.
Within a profile, users can specify two numeric data types: integer and decimal. Under the hood both of these data types are considered numeric from the point of view of generation, but the integer type enforces a granularity of 1; see below for more information on granularity.
Decimals and integers have a maximum value of 1E20, and a minimum value of -1E20.
In profile files, numbers must be expressed as JSON numbers, without quotation marks.
The granularity of a numeric field is a measure of how small the distinctions in that field can be; it is the smallest positive number of which every valid value is a multiple. For instance:
- if a numeric field has a granularity of 1, it can only be satisfied with multiples of 1; the integer data type adds this constraint by default
- if a decimal field has a granularity of 0.1, it can be satisfied by (for example) 1, 2, 1.1 or 1.2, but not 1.11 or 1.12
Granularities must be powers of ten no greater than 1 (1, 0.1, 0.01, etc.). Note that it is possible to specify these granularities in scientific format, e.g. 1E-10 or 1E-15, where the 10 and 15 indicate that these numbers can have up to 10 or 15 decimal places respectively. Granularities outside these restrictions could potentially be useful (e.g. a granularity of 2 would permit only even numbers) but are not currently supported.
Decimal fields currently default to the maximum granularity of 1E-20 (0.00000000000000000001) which means that numbers can be produced with up to 20 decimal places. This numeric granularity also dictates the smallest possible step between two numeric values, for example the next biggest decimal than 10 is 10.00000000000000000001. A user is able to add a granularTo
constraint for a decimal value with coarser granularity (1, 0.1, 0.01...1E-18, 1E-19) but no finer granularity than 1E-20 is allowed.
Note that granularity concerns which values are valid, not how they're presented. If the goal is to enforce a certain number of decimal places in text output, the formattedAs
operator is required. See: What's the difference between formattedAs and granularTo?
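For example, a decimal field can be limited to two decimal places by pairing it with a granularTo constraint. This is a minimal sketch; the field name is illustrative:
{
    "fields": [
        { "name": "price", "type": "decimal" }
    ],
    "constraints": [
        { "field": "price", "granularTo": 0.01 }
    ]
}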
Strings are sequences of Unicode characters with a maximum length of 1000 characters. Currently, only basic Latin characters (Unicode 002c - 007e) are supported.
DateTimes represent specific moments in time, and are specified in profiles through specialised strings:
"2000-01-01T09:00:00.000"
The format is a subset of ISO-8601; the date and time must be fully specified as above, with precisely 3 digits of sub-second precision, plus an optional offset specifier of "Z". All datetimes are treated as UTC.
DateTimes can be in the range 0001-01-01T00:00:00.000Z to 9999-12-31T23:59:59.999Z, that is, from midnight on 1 January 0001 to one millisecond before midnight on 31 December 9999.
The granularity of a DateTime field is a measure of how small the distinctions in that field can be; it is the smallest positive unit of which every valid value is a multiple. For instance:
- if a DateTime field has a granularity of years, it can only be satisfied by dates that are complete years (e.g.
2018-01-01T00:00:00.000Z
)
Granularities must be one of the units: millis, seconds, minutes, hours, days, months, years.
DateTime fields currently default to the most precise granularity of milliseconds. A user is able to add a granularTo
constraint for a DateTime value with coarser granularity (seconds, minutes...years) but no finer granularity than milliseconds is currently allowed.
Note that granularity concerns which values are valid, not how they're presented. All values will be output with the full format defined by ISO-8601, so that a value granular to years will still be output as (e.g.) 0001-01-01T00:00:00.000Z
, rather than 0001
or 0001-01-01
.
The date type can be used as a shorthand to create a datetime with a granularity and formatting of days. Dates should be specified in profiles as "YYYY-MM-DD"
. For example:
"2001-01-01"
Time represents a specific time in a day. Currently times down to milliseconds are supported. Similarly to datetime, time is specified by a string; either hh:mm:ss
or hh:mm:ss.ms
.
Before/after etc. constraints compare times by treating 00:00:00 as the starting time.
Supported granularities are 'half_days', 'hours', 'minutes', 'seconds' and 'millis', with the default being 'millis'.
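For example, a time field could be restricted to whole minutes with a granularTo constraint. This is a sketch assuming the temporal granularTo constraint accepts the units listed above; the field name is illustrative:
{
    "fields": [
        { "name": "opening_time", "type": "time" }
    ],
    "constraints": [
        { "field": "opening_time", "granularTo": "minutes" }
    ]
}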
Users can specify boolean data types which will take the values true
and false
.
Currently these types are only supported with the equalTo
and equalToField
constraints.
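A minimal sketch of a boolean field fixed to true (the field name is illustrative):
{
    "fields": [
        { "name": "is_active", "type": "boolean" }
    ],
    "constraints": [
        { "field": "is_active", "equalTo": true }
    ]
}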
Users can invoke the Faker custom data generators to create values.
All of the types supplied on the Faker class are accessible. Methods are invoked by entering the method signature chain. For example, if we want to generate job titles, we have to find the job class in the Java Faker API documentation. From this we can see that one of the methods we can invoke on job is title. To use this in the profile we would supply the field type as faker.job.title, as shown in the following profile.
{
"fields": [{"name": "Job-Title", "type": "faker.job.title"}]
}
- A predicate constraint defines any given value as being valid or invalid
- The universal set contains all values that can be generated (null, any string, any date, any number, etc)
- The denotation of a constraint is the subset of the universal set that it defines as valid
See set restriction and generation for an in depth explanation of how the constraints are merged and data generated from them.
If no constraints are defined over a field, then it can accept any member of the universal set. Each constraint added to that field progressively limits the universal set.
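For example, adding both of the following constraints to a numeric field (a sketch; the field name is illustrative) narrows the valid values first to numbers above 0, and then to numbers between 0 and 100:
"constraints": [
    { "field": "price", "greaterThan": 0 },
    { "field": "price", "lessThan": 100 }
]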
The grammatical not constraint inverts a constraint's denotation; in other words, it produces the complement of the constraint's denotation within the universal set.
equalTo
(field, value)
{ "field": "type", "equalTo": "X_092" }
OR
{ "field": "type", "equalTo": 23 }
OR
{ "field": "type", "equalTo": "2001-02-03T04:05:06.007" }
OR
{ "field": "type", "equalTo": "03:02:59" }
OR
{ "field": "type", "equalTo": true }
Is satisfied if field
's value is equal to value
inSet
(field, values)
{ "field": "type", "inSet": [ "X_092", "2001-02-03T04:05:06.007" ] }
Is satisfied if field
's value is in the set values
Alternatively, sets can be populated from files.
{ "field": "country", "inSet": "countries.csv" }
Populates a set from the new-line delimited file (with suffix .csv
), where each line represents a string value to load.
The file should be located in the same directory as the jar, or in the directory explicitly specified using the command line argument --set-from-file-directory
, and the name should match the value
with .csv
appended.
Alternatively an absolute path can be used which does not have any relation to the jar location.
In the above example, this would be countries.csv
.
Example countries.csv
excerpt:
...
England
Wales
Scotland
...
Additionally, weights can be included in the source file; each element will then be generated with a frequency proportional to its weight.
Example countries_weighted.csv
excerpt:
...
England, 2
Wales, 1
Scotland, 3
...
After loading the set from the file, this constraint behaves identically to the inSet constraint. This includes its behaviour when negated. See the inSet example for an example showing the inSet
constraint being used with a file.
inMap
(field, file, key)
{
"field": "country",
"inMap": "countries.csv",
"key": "Country"
}
Is satisfied if field
's value is in the map with the key Country
.
When multiple fields use the same map, a single row is picked from the map and all of those fields take their values from that row.
It populates the map from a new-line delimited file (with suffix .csv
), where each line represents a value to load. A header is required in the file to identify which column relates to which key.
The file should be located in the same directory as the jar, or in the directory explicitly specified using the command line argument --set-from-file-directory
, and the name should match the value
with .csv
appended.
Alternatively an absolute path can be used which does not have any relation to the jar location.
In the above example, this would be countries.csv
.
Example countries.csv
excerpt:
Country, Capital
England, London
Wales, Cardiff
Scotland, Edinburgh
...
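As a sketch of the shared-row behaviour described above (field names are illustrative), two fields drawing on the same map always come from the same row, so the generated capital matches the generated country:
{
    "field": "country",
    "inMap": "countries.csv",
    "key": "Country"
},
{
    "field": "capital",
    "inMap": "countries.csv",
    "key": "Capital"
}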
isNull
(field)
{ "field": "price", "isNull" : true }
Is satisfied if field
is null or absent.
matchingRegex
(field, value)
{ "field": "name", "matchingRegex": "[a-z]{0, 10}" }
Is satisfied if field
is a string matching the regular expression expressed in value
. The regular expression must match the entire string in field; start and end anchors ^ and $ are ignored.
The following non-capturing groups are unsupported:
- Negative look ahead/behind, e.g. (?!xxx) and (?<!xxx)
- Positive look ahead/behind, e.g. (?=xxx) and (?<=xxx)
containingRegex
(field, value)
{ "field": "name", "containingRegex": "[a-z]{0, 10}" }
Is satisfied if field
is a string containing the regular expression expressed in value
. Using both start and end anchors ^ and $ makes the constraint behave like matchingRegex.
The following non-capturing groups are unsupported:
- Negative look ahead/behind, e.g. (?!xxx) and (?<!xxx)
- Positive look ahead/behind, e.g. (?=xxx) and (?<=xxx)
ofLength
(field, value)
{ "field": "name", "ofLength": 5 }
Is satisfied if field
is a string whose length exactly matches value
, which must be a whole number between 0 and 1000.
longerThan
(field, value)
{ "field": "name", "longerThan": 3 }
Is satisfied if field
is a string with length greater than value
, which must be a whole number between -1 and 999.
shorterThan
(field, value)
{ "field": "name", "shorterThan": 3 }
Is satisfied if field
is a string with length less than value
, which must be a whole number between 1 and 1001.
greaterThan
(field, value)
{ "field": "price", "greaterThan": 0 }
Is satisfied if field
is a number greater than value
.
greaterThanOrEqualTo
(field, value)
{ "field": "price", "greaterThanOrEqualTo": 0 }
Is satisfied if field
is a number greater than or equal to value
.
lessThan
(field, value)
{ "field": "price", "lessThan": 0 }
Is satisfied if field
is a number less than value
.
lessThanOrEqualTo
(field, value)
{ "field": "price", "lessThanOrEqualTo": 0 }
Is satisfied if field
is a number less than or equal to value
.
granularTo
(field, value)
{ "field": "price", "granularTo": 0.1 }
Is satisfied if field
has at least the granularity specified in value
.
Time and datetime constraints share the same operators, but the value must be of the same type as the field. For example, an equalTo constraint on a time field must be given a time value.
after
(field, value)
{ "field": "date", "after": "2018-09-01T00:00:00.000" }
Is satisfied if field
is a time or datetime occurring after value
.
afterOrAt
(field, value)
{ "field": "time", "afterOrAt": "00:00:00" }
Is satisfied if field
is a time or datetime occurring after or simultaneously with value
.
before
(field, value)
{ "field": "date", "before": "2018-09-01T00:00:00.000" }
Is satisfied if field
is a time or datetime occurring before value
.
beforeOrAt
(field, value)
{ "field": "date", "beforeOrAt": "2018-09-01T00:00:00.000" }
Is satisfied if field
is a time or datetime occurring before or simultaneously with value
.
granularTo
(field, value)
{ "field": "date", "granularTo": "days" }
Is satisfied if field
has at least the granularity specified in value
. Note that in the case where you want to give a datetime a granularity of days, the date type can be used as a shorthand.
Allows a time/datetime field to be dependent on the output of another time/datetime field.
{ "field": "laterDateField", "after": "previousDateField" }
Allows a numeric field to be dependent on the output of another numeric field.
{ "field": "laterNumericField", "greaterThanField": "previousNumericField" }
Allows a dependent time/datetime/numeric field to always be a certain offset away from another time/datetime/numeric field.
The syntax is slightly different depending on the type.
{ "field": "field1", "equalToField": "field2", "offset": 3}
{ "field": "field1", "equalToField": "field2", "offset": 3, "offsetUnit": "days" }
Note that offsetUnit can be any of the granularities supported by DataHelix.
Additionally, in the case that the field is a datetime, the working days offsetUnit can be used to specify an offset of working days. See an example of this in the online playground.
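For instance, a hypothetical pair of datetime fields where settlement always falls two working days after the trade date could be sketched as follows (field names are illustrative, and the offsetUnit value is assumed to be written as shown):
{ "field": "settlementDate", "equalToField": "tradeDate", "offset": 2, "offsetUnit": "working days" }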
Grammatical constraints combine or modify other constraints. They are fully recursive; any grammatical constraint is a valid input to any other grammatical constraint.
See set restriction and generation for an in depth explanation of how the constraints are merged and data generated from them.
{ "not": { "field": "foo", "equalTo": "bar" } }
Wraps a constraint. Is satisfied if, and only if, its inner constraint is not satisfied.
{ "anyOf": [
{ "field": "foo", "isNull": true },
{ "field": "foo", "equalTo": 0 }
]}
Contains a number of sub-constraints. Is satisfied if any of the inner constraints are satisfied.
{ "allOf": [
{ "field": "foo", "greaterThan": 15 },
{ "field": "foo", "lessThan": 100 }
]}
Contains a number of sub-constraints. Is satisfied if all of the inner constraints are satisfied.
{
"if": { "field": "foo", "lessThan": 100 },
"then": { "field": "bar", "greaterThan": 0 },
"else": { "field": "bar", "equalTo": "N/A" }
}
Is satisfied if either:
- Both the if and then constraints are satisfied
- The if constraint is not satisfied, and the else constraint is
While it's not prohibited, wrapping conditional constraints in any other kind of constraint (e.g. a not) may cause unintuitive results.
You can add your own custom java generators to the project with the following instructions.
To add a custom generator you will need to
- clone the datahelix source code
- go to the "custom" package
- either
- implement the CustomGenerator.java interface
- use the CustomGeneratorBuilder.java to build a custom generator
- add your custom generator to the list in the CustomGeneratorList.java class
There is an example folder in the "custom" package which shows an example using the CustomGeneratorBuilder to build a generator called "lorem ipsum".
To use your custom generator, you add it to the field definition in your profile like this:
{
"name": "field1",
"type": "string",
"generator": "lorem ipsum"
}
This will use the "lorem ipsum" example custom generator.
To use your own, put the name of your generator instead of "lorem ipsum".
You can also use custom generators as constraints:
{ "field": "field1", "generator": "lorem ipsum" }
Custom generators can be used in "anyOf" grammatical constraints, as well as in the "then" and "else" parts of conditional constraints
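For example, a custom generator can be applied only when a condition holds. This is a sketch reusing the "lorem ipsum" example generator; the field names are illustrative:
{
    "if":   { "field": "country", "equalTo": "GB" },
    "then": { "field": "description", "generator": "lorem ipsum" },
    "else": { "field": "description", "isNull": true }
}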
To combine generators with sets and equalTo, you will need to create a 'matchingFunction' when building the custom generator, which should be a function that returns true if a value is one the custom generator could produce.
To be able to negate the custom generator, or use it in the 'if' section of an if-then statement, you must define the 'negated Generator' when building the custom generator, which should return values that the custom generator should not produce.
Profiles can be run against a jar using the command line.
An example command would be something like:
java -jar datahelix.jar --max-rows=100 --replace --profile-file=profile.json --output-path=output.csv
It is also possible to execute the generator using a wrapper script.
On Windows:
datahelix --max-rows=100 --replace --profile-file=profile.json --output-path=output.csv
and on Linux:
datahelix.sh --max-rows=100 --replace --profile-file=profile.json --output-path=output.csv
These presume that the scripts (datahelix.zip\datahelix\bin) are in the path, or you're currently working in the bin directory.
Option switches are case-sensitive; arguments are case-insensitive.
- --version (or -V) - Displays generator version information.
- --profile-file=<PATH> (or -p <PATH>) - Path to the input profile file.
- --output-path=<PATH> (or -o <PATH>) - Path to the output file. If not specified, output will be to standard output.
- --replace=<true|false> - Overwrite/replace existing output files. Defaults to false.
- -n <ROWS> or --max-rows=<ROWS> - Emit at most <ROWS> rows to the output file; if not specified, the limit is 10,000,000 rows. Mandatory in RANDOM mode.
- --generation-type=<GENERATION_TYPE> - Determines the type of data generation performed, where <GENERATION_TYPE> can be one of FULL_SEQUENTIAL or RANDOM (default).
- --combination-strategy=<COMBINATION_STRATEGY> - Determines the combination strategy used in full sequential mode, where <COMBINATION_STRATEGY> can be one of MINIMAL (default), EXHAUSTIVE or PINNING.
- --output-format=<OUTPUT_FORMAT> - Determines the output format, where <OUTPUT_FORMAT> can be one of csv (default) or json. If no output-path is provided, the JSON data will be streamed in ndjson format.
- --visualiser-level=<VISUAL_LEVEL> - Determines the level of visualisation used, where <VISUAL_LEVEL> can be one of OFF (default), STANDARD or DETAILED.
- --visualiser-output-folder=<PATH> - The path to the folder to write the generated visualiser files to (defaults to the current directory, .). Only used if visualiser-level is not set to OFF.
By default the generator will report how much data has been generated over time; the other options are below:
- --verbose - Will report in-depth detail of data generation.
- --quiet - Will disable velocity reporting.
--quiet will be ignored if --verbose is supplied.
The generator supports the following data generation types
- Random (default)
- Full Sequential
Examples:
Constraint | Emitted valid data |
---|---|
Field 1 > 10 AND Field 1 < 20 | (any values > 10 & < 20) |
Field 1 in set [A, B, C] | (A, B or C in any order, repeated as needed) |
Notes:
- Random generation of data is infinite and is limited to 1000 rows by default; use --max-rows to enable generation of more data.
Examples:
Constraint | Emitted valid data |
---|---|
Field 1 > 0 AND Field 1 < 5 | (null, 1, 2, 3, 4) |
Field 1 in set [A, B, C] | (null, A, B, C) |
- Note that null will only be produced if the properties of Field 1 allow it.
There are a few different combination strategies which can be used in FULL_SEQUENTIAL
mode with minimal being the default. In modes other than full sequential, combination strategy will have no effect.
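For example, a command along these lines (file names are illustrative) selects full sequential generation with the exhaustive combination strategy:
java -jar datahelix.jar --generation-type=FULL_SEQUENTIAL --combination-strategy=EXHAUSTIVE --profile-file=profile.json --output-path=output.csv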
It is simplest to see how the different combination strategies work by looking at the effect on a simple example profile. The following profile contains two fields:
- field1 - has values in set [ "A", "B" ]
- field2 - has values in set [ 1, 2, 3 ]
The minimal strategy outputs the minimum data required to exemplify each value at least once. Per the example, the output would be:
- "A",1
- "B",2
- "B",3
Note that minimal is the default combination strategy.
The exhaustive strategy outputs all possible combinations. Given the fields as defined above, possible outputs would be:
- "A",1
- "B",1
- "A",2
- "B",2
- "A",3
- "B",3
The pinning strategy establishes a baseline for each field (generally by picking the first available value for that field) and then creates outputs such that either:
- All values equal the baseline for the respective field
- All values except one equal the baseline for the respective field
To generate these outputs, we first output the first case (all values from baseline) and then iterate through each field, F, fixing all other fields at their baseline and generating the full range of values for F. For the example, the output would be:
- "A",1
- "A",2
- "A",3
- "B",1
This is an alpha feature. Please do not rely on it. If you find issues with it, please report them.
This feature generates a DOT compliant representation of the decision tree, for manual inspection, in the form of a DOT formatted file.
If you use the --visualiser-level and --visualiser-output-folder command line options when generating data, then you can get visualisations of the decision tree output as graphs in DOT files.
- See Developer Guide for more information on the decision tree structure.
- See Command Line Arguments for more information on the command line arguments.
The visualiser levels can have the following values:
- OFF - the default; no graphs are output
- STANDARD - graphs of the decision tree are output after each pre-walking stage is done
- DETAILED - the standard decision trees and the various decision trees created during the walking stage are output
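For example, a command along the following lines (paths are illustrative) generates data and writes STANDARD-level graphs to a chosen folder:
java -jar datahelix.jar --profile-file=profile.json --output-path=output.csv --visualiser-level=STANDARD --visualiser-output-folder=visualisations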
- You may read a DOT file with any text editor
- You can also use this representation with a visualiser such as Graphviz.
There may be other visualisers that are suitable to use. The current requirements for a visualiser are:
- DOT files are encoded with UTF-8, visualisers must support this encoding.
- DOT files can include HTML encoded entities, visualisers should support this feature.