Create unified fake datasets #1324

norberttech · 2025-01-04T17:26:55Z

I think we can create some very generic fake extractors and define a schema for each of them, for example:

- df()->read(from_flow_orders(limit:1_000))
- flow_orders_schema() : Schema
- do the same for products
- do the same for customers
- do the same for inventory

We should make sure all of those datasets keep a consistent schema and that they are using all possible entry types. Those virtual datasets would need to follow a very strict backward compatibility policy and proper schema evolution.

This would make it much easier and more realistic to test not only the entire pipeline but also stand-alone scalar functions, as we could also create helpers that return just one row (and use them inside the fake extractors).

I would put those into src/core/etl/tests/Flow/ETL/Tests/Double/Fake/Dataset


norberttech · 2025-01-04T17:28:37Z

The important part here is that those datasets can't be totally random; they need to be fully predictable.

For example, orders can't start from a random point in time. It should also be possible to configure, at the extractor level, how many orders there are per day, the time period, the % of cancelled orders, etc.
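A minimal sketch of what "predictable but configurable" could look like, in plain PHP with no Flow or Faker dependency. Everything here is hypothetical and not part of Flow: the `fake_orders()` name, its parameters, and the column names are illustrative only. It assumes PHP 8.2+ for `Random\Randomizer`; a fixed seed and a fixed start date make every run produce the same rows.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical generator of predictable fake orders (not a Flow API).
 *
 * @return \Generator<int, array<string, mixed>>
 */
function fake_orders(
    int $days = 7,
    int $ordersPerDay = 10,
    int $cancelledPercent = 20,
    string $startDate = '2025-01-01',
    int $seed = 1324
): \Generator {
    // A seeded engine owned by this generator keeps the sequence repeatable
    // and unaffected by other code that uses random numbers.
    $rng = new \Random\Randomizer(new \Random\Engine\Mt19937($seed));
    $start = new \DateTimeImmutable($startDate . ' 00:00:00', new \DateTimeZone('UTC'));

    // Exact number of cancelled orders per day, so the ratio is stable.
    $cancelledPerDay = intdiv($ordersPerDay * $cancelledPercent, 100);
    $id = 1;

    for ($day = 0; $day < $days; $day++) {
        $midnight = $start->modify("+{$day} days");

        for ($i = 0; $i < $ordersPerDay; $i++) {
            $secondOfDay = $rng->getInt(0, 86399);

            yield [
                'order_id' => $id++,
                'created_at' => $midnight->modify("+{$secondOfDay} seconds"),
                // Simplification: the first N orders of each day are the cancelled ones.
                'status' => $i < $cancelledPerDay ? 'cancelled' : 'completed',
                'total_cents' => $rng->getInt(100, 100000),
            ];
        }
    }
}
```

Calling `fake_orders(days: 2)` twice yields identical rows, which is the property a fake extractor would need in order to stay stable across test runs.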

Bellangelo · 2025-01-04T17:46:43Z

Do we want the datasets to be mainly static files that we manipulate, or could we use a library such as https://fakerphp.org/ to bring some controllable randomness into play?

For example, we could provide a "schema" to Faker, and Faker would then fill in the data for us.

What do you think @norberttech ?

norberttech · 2025-01-04T17:49:54Z

Great question!
IMO the data should be 100% generated by Faker, but we should expose some options, as I explained above, to make those datasets more predictable.

Tests using those datasets should not rely on the values but rather on the shape and size of the data.
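As a rough illustration of asserting on shape and size rather than values, here is a small plain-PHP check. The `shape_violations()` helper and the column names are hypothetical and not part of Flow; type names are the strings returned by PHP's `get_debug_type()` (PHP 8.0+).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: checks row count and per-column types, ignoring values.
 *
 * @param iterable<array<string, mixed>> $rows
 * @param array<string, string> $expectedTypes column name => get_debug_type() result
 * @return list<string> problems found; empty when shape and size match
 */
function shape_violations(iterable $rows, array $expectedTypes, int $expectedCount): array
{
    $problems = [];
    $count = 0;

    foreach ($rows as $row) {
        $count++;

        // Columns must match by name and order.
        if (array_keys($row) !== array_keys($expectedTypes)) {
            $problems[] = "row {$count}: unexpected columns";
            continue;
        }

        foreach ($expectedTypes as $column => $type) {
            if (get_debug_type($row[$column]) !== $type) {
                $problems[] = "row {$count}: {$column} is not {$type}";
            }
        }
    }

    if ($count !== $expectedCount) {
        $problems[] = "expected {$expectedCount} rows, got {$count}";
    }

    return $problems;
}

$rows = [
    ['order_id' => 1, 'status' => 'completed', 'created_at' => new \DateTimeImmutable('2025-01-01')],
    ['order_id' => 2, 'status' => 'cancelled', 'created_at' => new \DateTimeImmutable('2025-01-02')],
];
$types = ['order_id' => 'int', 'status' => 'string', 'created_at' => 'DateTimeImmutable'];

var_dump(shape_violations($rows, $types, 2) === []);
```

A test written this way keeps passing when the generated values change, as long as the columns, their types, and the row count stay the same.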

norberttech added this to Roadmap Jan 4, 2025

norberttech moved this to Todo in Roadmap Jan 4, 2025

norberttech added this to the 0.11.0 milestone Jan 4, 2025

norberttech added the developer experience and testing labels Jan 4, 2025

norberttech modified the milestones: 0.11.0, 0.12.0 Jan 6, 2025
