This CLI tool converts OSCAR's JSONL files into Parquet. It takes Ungoliant's output as input and writes the Parquet files to the destination folder. It is intended to replace the splitting and compression steps of OSCAR generation that were previously performed by oscar-tools.
Planned work:
- Add Python bindings
- Add tests
- Add option to control the maximum number of rows per parquet file
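
The last item above, a cap on the number of rows per Parquet file, is not implemented yet. As a rough illustration of the idea only, and not of how oscar2parquet works internally, the following Python sketch parses JSONL lines and groups the records into batches of at most `max_rows`. The function names, the `max_rows` parameter, and the sample records are all hypothetical, and the step that would write each batch to its own Parquet file is left out.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON document per non-empty line (JSONL)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(records: Iterable[dict], max_rows: int) -> Iterator[list]:
    """Yield lists of at most max_rows records.

    In a real converter, each list would be written to one Parquet file.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, max_rows))
        if not batch:
            return
        yield batch


# Hypothetical sample input: three records and one blank line.
sample = io.StringIO('{"content": "a"}\n{"content": "b"}\n\n{"content": "c"}\n')
batches = list(batched(iter_records(sample), max_rows=2))
print([len(b) for b in batches])  # prints [2, 1]
```

Because both functions are generators, only one batch is held in memory at a time, which matters for inputs that do not fit in RAM.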
oscar2parquet -h
Converts OSCAR's jsonl files into parquet.

Usage: oscar2parquet [OPTIONS] <INPUT FOLDER> <DESTINATION FOLDER>

Arguments:
  <INPUT FOLDER>        Folder containing the indices
  <DESTINATION FOLDER>  Parquet file to write

Options:
  -t, --threads <NUMBER OF THREADS>  Number of threads to use [default: 10]
  -h, --help                         Print help
  -V, --version                      Print version
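
Until Python bindings exist, one way to drive the tool from a Python script is to call the binary as a subprocess. The sketch below is a minimal example under stated assumptions: it uses only the `--threads` option and the two positional arguments shown in the help text above, it assumes `oscar2parquet` is installed and on `PATH`, and the helper names and folder names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(input_dir, dest_dir, threads=10):
    """Build the argument list for one oscar2parquet run.

    threads defaults to 10, matching the default shown in the help text.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return ["oscar2parquet", "--threads", str(threads), str(input_dir), str(dest_dir)]


def convert(input_dir, dest_dir, threads=10):
    """Run the conversion; raises if the binary is missing or exits non-zero."""
    if shutil.which("oscar2parquet") is None:
        raise FileNotFoundError("oscar2parquet was not found on PATH")
    subprocess.run(build_command(input_dir, dest_dir, threads), check=True)


# Hypothetical folder names, shown only to illustrate the argument order.
print(build_command(Path("ungoliant-output"), Path("parquet-output"), threads=4))
# prints ['oscar2parquet', '--threads', '4', 'ungoliant-output', 'parquet-output']
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when folder names contain spaces.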