Create unified fake datasets #1324

Open
norberttech opened this issue Jan 4, 2025 · 3 comments
Labels
developer experience Resolving this issue should improve development experience for the library users. testing

@norberttech
Member

I think we can create some very generic fake extractors and define a schema for each of them, for example:

- df()->read(from_flow_orders(limit:1_000))
- flow_orders_schema() : Schema
- do the same for products
- do the same for customers
- do the same for inventory

We should make sure all of those datasets keep a consistent schema and use all possible entry types. Those virtual datasets would need to follow a very strict backward compatibility policy and proper schema evolution.

This would make it much easier and more realistic to test not only the entire pipeline but also stand-alone scalar functions. We can also create helpers that return a single row and reuse them inside the fake extractors.

I would put those into src/core/etl/tests/Flow/ETL/Tests/Double/Fake/Dataset

@norberttech norberttech moved this to Todo in Roadmap Jan 4, 2025
@norberttech norberttech added this to the 0.11.0 milestone Jan 4, 2025
@norberttech
Member Author

The important part here is that those datasets can't be totally random; they need to be fully predictable.

For example, Orders can't start from a random point in time. It should also be possible to configure at the extractor level how many orders are generated per day, the time period, the percentage of cancelled orders, etc.
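A minimal sketch of what "predictable but configurable" could look like, in plain PHP with no Flow or Faker dependency. The function name `fake_orders`, its parameters, and the column names are all hypothetical and only illustrate the idea: a fixed start date, a fixed number of orders per day, a target cancellation percentage, and a seed so that every run yields the same rows.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of Flow: yields predictable order rows.
 *
 * @return \Generator<int, array{order_id: int, created_at: string, status: string}>
 */
function fake_orders(
    string $startDate = '2025-01-01',
    int $days = 7,
    int $ordersPerDay = 10,
    int $cancelledPercent = 20,
    int $seed = 1324
): \Generator {
    // Seeding makes mt_rand() produce the same sequence on every run.
    // Note: this is global state, so two generators must not be interleaved.
    mt_srand($seed);

    $day = new \DateTimeImmutable($startDate, new \DateTimeZone('UTC'));
    $id = 1;

    for ($d = 0; $d < $days; $d++) {
        for ($n = 0; $n < $ordersPerDay; $n++) {
            yield [
                'order_id' => $id++,
                // The date is fixed by configuration; only the time of day varies.
                'created_at' => $day->setTime(mt_rand(0, 23), mt_rand(0, 59))->format(\DATE_ATOM),
                // Roughly $cancelledPercent of orders end up cancelled.
                'status' => mt_rand(1, 100) <= $cancelledPercent ? 'cancelled' : 'completed',
            ];
        }

        $day = $day->modify('+1 day');
    }
}
```

With the defaults this produces 70 rows covering 2025-01-01 through 2025-01-07, and calling it twice gives identical output. A real fake extractor would presumably wrap a generator like this and expose the same options.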

@norberttech norberttech added developer experience Resolving this issue should improve development experience for the library users. testing labels Jan 4, 2025
@Bellangelo
Contributor

Do we want the datasets to be mainly static files that we manipulate, or could we use a library such as https://fakerphp.org/ to bring some controllable randomness into play?

For example, we could provide a "schema" to Faker and let it fill in the data for us.

What do you think, @norberttech?

@norberttech
Member Author

Great question!
IMO the data should be 100% generated by Faker, but we should expose some options, as I explained above, to make those datasets more predictable.

Tests using those datasets should not rely on the values but more on the shape and size of the data.
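As a rough illustration of asserting on shape and size rather than on values, here is a small plain-PHP sketch. The helper `rows_match_shape` and the sample columns are hypothetical and not part of Flow; the rows are a stand-in for whatever a fake dataset would produce.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: checks row count and column types, ignoring values.
 *
 * @param iterable<array<string, mixed>> $rows
 * @param array<string, string> $expectedTypes column name => get_debug_type() name
 */
function rows_match_shape(iterable $rows, array $expectedTypes, int $expectedCount): bool
{
    $count = 0;

    foreach ($rows as $row) {
        // Every row must have exactly the expected columns, in any order.
        if (array_diff_key($row, $expectedTypes) !== [] || array_diff_key($expectedTypes, $row) !== []) {
            return false;
        }

        // Compare types only; the concrete values are deliberately ignored.
        foreach ($expectedTypes as $column => $type) {
            if (get_debug_type($row[$column]) !== $type) {
                return false;
            }
        }

        $count++;
    }

    return $count === $expectedCount;
}

// Stand-in rows; in a real test these would come from a fake dataset.
$rows = [
    ['order_id' => 1, 'status' => 'completed', 'total' => 19.99],
    ['order_id' => 2, 'status' => 'cancelled', 'total' => 5.00],
];
```

A test would then call something like `rows_match_shape($rows, ['order_id' => 'int', 'status' => 'string', 'total' => 'float'], 2)` and expect `true`, so it keeps passing if the generated values change but the schema and row count stay the same.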

@norberttech norberttech modified the milestones: 0.11.0, 0.12.0 Jan 6, 2025
Projects
Status: Todo
Development

No branches or pull requests

2 participants