[Data Liberation] Add EPub to Blocks converter #2097
Merged
Adds WP_EPub_Entity_Reader to parse EPub files into WordPress posts and post meta.
A part of #1894
Implementation details
EPub files are ZIP archives containing content represented as XHTML. They may also include other assets, e.g., CSS, images, a table of contents, and metadata.
This PR glues together WP_Zip_Filesystem with WP_HTML_To_Blocks to find all the XHTML files in the zip and convert them to block markup.
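To illustrate the "find all the XHTML files in the zip" step, here is a minimal, language-neutral sketch in Python. It is not the code from this PR and does not use WP_Zip_Filesystem; the helper names (list_xhtml_entries, build_sample_epub) and the extension list are assumptions made for the example.

```python
import io
import zipfile

# Assumption: XHTML documents are recognized purely by file extension.
XHTML_EXTENSIONS = (".xhtml", ".html", ".htm")


def list_xhtml_entries(epub_bytes: bytes) -> list[str]:
    """Return the sorted names of ZIP entries that look like XHTML documents."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith(XHTML_EXTENSIONS)
        )


def build_sample_epub() -> bytes:
    """Build a tiny in-memory archive that mimics an EPub layout (hypothetical data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "OEBPS/chapter1.xhtml",
            "<html xmlns='http://www.w3.org/1999/xhtml'><body><p>Hi</p></body></html>",
        )
        archive.writestr("OEBPS/style.css", "p { margin: 0; }")
    return buffer.getvalue()


if __name__ == "__main__":
    print(list_xhtml_entries(build_sample_epub()))  # ['OEBPS/chapter1.xhtml']
```

Matching on extension is the "guessing" approach: it needs no knowledge of the package layout, but it says nothing about reading order and can pick up files the book does not actually reference.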
Since XHTML uses XML syntax that cannot be parsed via WP_HTML_Processor, we use WP_XML_Processor instead. To support XHTML, this PR adds support for parsing simple <!DOCTYPE html> declarations in WP_XML_Processor.
This PR also enables swapping WP_HTML_Processor for WP_XML_Processor in WP_HTML_To_Blocks by adding a WP_XML_Processor::expects_closer() method. It doesn't have exactly the same semantics as the WP_HTML_Processor one, but it's close enough.
Remaining work
Right now, we're guessing the location of all the XML files. It works for the test example above, but to support all the epub files out there, we'd need to:
- Parse the META-INF/container.xml config file to get the root file path.
Open questions
- Should we add a WP_Markup_Processor interface to represent a subset of methods shared between the HTML processor and the XML processor?
- Should we add a class WP_XHTML_Processor extends WP_XML_Processor to align the semantics of expects_closer() and other overlapping methods?
- Is it safe to assume that XHTML files live in the OEBPS and EPUB directories inside the epub file? Or can XHTML be stored under another path? How would we know?
cc @ellatrix @dmsnell @zaerl @brandonpayton @sirreal
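On the "How would we know?" question and the META-INF/container.xml item under Remaining work: in the EPUB container format, META-INF/container.xml names the package document through a rootfile element with a full-path attribute, and that package document's manifest lists every resource with its href and media-type. The sketch below shows one way to follow that chain. It is written in Python with only the standard library, it is not this PR's code, and the function names and sample data are hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_xhtml_paths(epub_bytes: bytes) -> list[str]:
    """Resolve XHTML entry paths via container.xml and the package manifest."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile element")
        opf_path = rootfile.attrib["full-path"]

        package = ET.fromstring(archive.read(opf_path))
        base_dir = posixpath.dirname(opf_path)  # hrefs are relative to the package file
        paths = []
        for item in package.findall("opf:manifest/opf:item", OPF_NS):
            if item.get("media-type") == "application/xhtml+xml":
                # Simplification: percent-encoded hrefs are not decoded here.
                paths.append(posixpath.normpath(posixpath.join(base_dir, item.get("href"))))
        return paths


def build_sample_epub_with_manifest() -> bytes:
    """Hypothetical archive whose content lives outside OEBPS/ and EPUB/."""
    container_xml = (
        '<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="content/book.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    package_opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"><manifest>'
        '<item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="css" href="style.css" media-type="text/css"/>'
        "</manifest></package>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/container.xml", container_xml)
        archive.writestr("content/book.opf", package_opf)
        archive.writestr("content/text/ch1.xhtml", "<html xmlns='http://www.w3.org/1999/xhtml'/>")
        archive.writestr("content/style.css", "p { margin: 0; }")
    return buffer.getvalue()


if __name__ == "__main__":
    print(find_xhtml_paths(build_sample_epub_with_manifest()))  # ['content/text/ch1.xhtml']
```

The sketch returns paths in manifest order. Reading order is defined separately by the package's spine element, which this example ignores.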