You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently a data format (npc, tfrecord, etc) maps to a single sample type (image, text, etc).
We will need a method for the various container formats to support different sample types. I would recommend adding an abstraction for SampleType that describes how to create or process a type of record. A reader would use the SampleType instead of using hard coded methods.
For example: TFRecord only supports images but it should support samples used for DLRM as well.
The text was updated successfully, but these errors were encountered:
For any format, what the bytes represent is non-consequential. Sure there is an in-memory encoding into images but the samples themselves on disk are just bytes.
So if u have text u can store the bytes, it would not change the access pattern within TFRecord. Just the decoding of protobuf would change which is essentially not I/O.
In general, I'd like to agree but I'm looking at DLRM and it's sort of a unique case.
One popular data format for DLRM data is parquet. Since it's columnar the size of each data type determines data layout. Also, since DLRM is model parallel different accelerators will read different columns (instead of a single accelerator reading from every column for a single data sample). I don't know how to support this functionality in the current framework without hardcoding a bunch of values.
For parquet reading, the granularity of reading is generally row-groups. So we can emulate different column types using bytes dtype. So we would need what does the columns represent?
so we can have a new Param in dataset called types: [size of each column]. Add a validation to use this for parquet only for now. And then emulate it using Bytes type in parquet.
Currently a data format (npc, tfrecord, etc) maps to a single sample type (image, text, etc).
We will need a method for the various container formats to support different sample types. I would recommend adding an abstraction for SampleType that describes how to create or process a type of record. A reader would use the SampleType instead of using hard coded methods.
For example: TFRecord only supports images but it should support samples used for DLRM as well.
The text was updated successfully, but these errors were encountered: