CSV::from_file does not work with remote URLs #239
-
This is literally one of the last problems that need to be solved before we can release version 0.1.0.

Read/Write Streams

Long story short: at this point all adapters expect files (except the log, elasticsearch and doctrine dbal adapters).

Passing a resource:

$extractor = CSV::from_resource(
    resource: fopen('s3://my-bucket-name/test.csv', 'r'),
    header_offset: 0,
);

would work for CSV, but not for the XML adapter, since internally we use XMLReader::open there, which only accepts a string argument, not a resource (I believe XMLReader initializes the stream under the hood).
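For illustration, the limitation on the XML side is that XMLReader opens the URI itself (minimal sketch, not Flow code; the s3:// path assumes a registered stream wrapper):

$reader = new XMLReader();

// XMLReader::open() takes a string URI and opens the underlying stream internally,
// so a resource returned by fopen() cannot be handed to it.
$reader->open('s3://my-bucket-name/test.xml');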
Using file_name to pass remote streams:

(new Flow())
    ->extract(CSV::from_file(
        file_name: 's3://my-bucket-name/test.csv',
        header_offset: 0,
    ))
    // ...
    ->run();

This one would be a bit more flexible, but as you noticed we would need to hack the file existence validation a bit.

I came up with another solution for that problem: basically the ETL core would need to define a very generic Stream class.

final class Stream
{
    public function __construct(
        private readonly string $path,
        private readonly Mode $mode,
        private readonly array $options = []
    ) {
    }

    public static function local(string $file_path, Mode $mode, array $options = []) : self
    {
        // validation
        return new self('file://' . $file_path, $mode, $options);
    }

    public static function remote(string $path, Mode $mode, array $options = []) : self
    {
        // validation
        return new self($path, $mode, $options);
    }

    public function path() : string { return $this->path; }

    public function mode() : Mode { return $this->mode; }

    public function options() : array { return $this->options; }
}

We could use it as an argument for all Loaders/Extractors; it would literally just be a DTO passing $path, $mode and $options down to the adapter implementation. Then, on the DSL layer, we could simply enforce some unified function names, for example:

CSV::from(Stream::read_local("/path/to/file.csv"), ...);
CSV::from(Stream::read_remote("abfs://path/to/file.csv"), ...);
CSV::to(Stream::write_local("/path/to/file.csv"), ...);
CSV::to(Stream::write_remote("abfs://path/to/file.csv"), ...);

That would obviously be a BC break, but it would also open all adapters up to streams, which brings us to the second idea.

Seekable Remote Streams

We could try to create a custom stream for each adapter provided by Flysystem. So at the end of the day, in your case the CSV DSL would look something like this:

CSV::from(Stream::read_remote("s3://path/to/file.csv", $awsClientOptions), ...);

Under the hood, the DSL would pass the Stream instance to the selected adapter. I still need to validate across all adapters whether the libraries they use can work with such streams.
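To make the DTO idea concrete, an adapter could consume such a Stream roughly like this (a sketch only; the extractor class and open() method below are hypothetical, and Mode is assumed to be a string-backed enum such as 'r'/'w'):

final class CSVExtractor
{
    public function __construct(private readonly Stream $stream)
    {
    }

    /**
     * @return resource
     */
    public function open()
    {
        // The adapter does not care whether the path is local or remote;
        // it simply opens whatever the Stream DTO describes.
        $handle = \fopen(
            $this->stream->path(),
            $this->stream->mode()->value,
            false,
            \stream_context_create($this->stream->options())
        );

        if ($handle === false) {
            throw new \RuntimeException("Can't open stream: " . $this->stream->path());
        }

        return $handle;
    }
}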
I just got stuck at writing to single/multiple files. It would need to be something like:
… and I'm not sure if I like it.
-
Couple of things:
-
It seems to me that the parallel/async features have created a lot of complexity throughout the app. Personally, I do not think the DSL "readability" is a strong enough reason to use a different method naming convention. It feels (and looks) awkward with surrounding code.
-
Great progress!
-
I'm happy to announce that the FileStream abstraction was added to the core ETL library. It also seems that we finally have a unified and pretty straightforward DSL for reading/writing; just take a look at the JSON and Stream DSLs. Just two methods:

JSON::from(FileStream|array<FileStream>|string)
JSON::to(FileStream|array<FileStream>|string)

which work great with:

Stream::local_file() : LocalFile
Stream::local_directory() : array<LocalFile>
Stream::aws_s3_file() : RemoteFile
Stream::aws_s3_directory() : array<RemoteFile>
...

Now we just need to adjust the remaining file format adapters to follow this convention ✌️
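Put together, a pipeline using the new DSL could look roughly like this (a sketch only; the exact arguments of Stream::local_file() and Stream::aws_s3_file() are simplified here):

(new Flow())
    ->extract(JSON::from(Stream::local_file('/path/to/orders.json')))
    // ...
    ->load(JSON::to(Stream::aws_s3_file('my-bucket/orders.json', $s3Options)))
    ->run();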
-
And the CSV Adapter was updated (will merge it later today):
-
I want to be able to extract documents stored on S3, so I have registered the stream wrapper:
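With the AWS SDK for PHP that registration is roughly the following (region, credentials and bucket are placeholders):

use Aws\S3\S3Client;

$client = new S3Client([
    'version' => 'latest',
    'region'  => 'eu-west-1',
]);

// Makes s3:// paths usable by fopen(), file_get_contents(), etc.
$client->registerStreamWrapper();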
And then I attempt to load the file:
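Roughly like this (simplified; the real bucket and path are omitted):

(new Flow())
    ->extract(CSV::from_file(
        file_name: 's3://my-bucket-name/documents.csv',
        header_offset: 0,
    ))
    ->run();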
But the following exception is thrown:
I think the check for file existence should be modified to allow remote URLs:
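One possible relaxation (a sketch, not the actual Flow validation code): run the strict local check only when the path has no scheme or a file:// scheme, and let stream wrappers handle everything else.

$scheme = \parse_url($file_name, PHP_URL_SCHEME);

if ($scheme === null || $scheme === 'file') {
    // Local path: keep the strict existence check.
    if (!\file_exists($file_name)) {
        throw new \InvalidArgumentException("File {$file_name} not found.");
    }
}

// Remote URLs (s3://, abfs://, ...) pass through and fail later if the stream can't be opened.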
OR add another method to handle resources, eg:
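Along the lines of the from_resource() call discussed at the top of this thread:

$extractor = CSV::from_resource(
    resource: fopen('s3://my-bucket-name/test.csv', 'r'),
    header_offset: 0,
);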