[Data Liberation] Add EPub to Blocks converter #2097

Merged: 2 commits, Jan 9, 2025

@adamziel (Collaborator) commented Dec 17, 2024

Adds WP_EPub_Entity_Reader to parse EPub files into WordPress posts and post meta:

$reader = new WP_EPub_Entity_Reader(
	new WP_Zip_Filesystem(
		WP_Remote_File_Ranged_Reader::create(
			'https://github.com/IDPF/epub3-samples/releases/download/20230704/childrens-literature.epub'
		)
	)
);

foreach ( $reader as $entity ) {
	print_r( $entity );
}
// prints three arrays representing WordPress posts with content represented as block markup

Part of #1894.

Implementation details

EPub files are ZIP archives containing content represented as XHTML. They may include other assets, too, e.g., CSS, images, a table of contents, metadata, etc.

This PR glues together WP_Zip_Filesystem with WP_HTML_To_Blocks to find all the XHTML files in the zip and convert them to block markup.

Since XHTML uses XML syntax that cannot be parsed via WP_HTML_Processor, we use WP_XML_Processor instead. To support XHTML, this PR adds support for parsing simple <!DOCTYPE html> declarations in WP_XML_Processor.

This PR also enables swapping WP_HTML_Processor for WP_XML_Processor in WP_HTML_To_Blocks by adding a WP_XML_Processor::expects_closer() method. It doesn't have exactly the same semantics as the WP_HTML_Processor one, but it's close enough.
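To illustrate the difference in semantics, here is a minimal sketch that uses PHP's bundled XMLReader as a stand-in for WP_XML_Processor. The function name and the sample markup are made up for illustration and are not part of this PR. In XML, whether a tag expects a closer follows from the syntax alone (the empty-element form `<br/>`), whereas in HTML it depends on the tag name (void elements), and a trailing slash on a non-void element such as `<p/>` is ignored.

```php
<?php
// Sketch only: XMLReader stands in for WP_XML_Processor here.

/**
 * Lists every opening tag in an XML/XHTML string together with a flag
 * saying whether a separate closing tag is expected for it.
 *
 * @return array<int, array{0: string, 1: bool}>
 */
function xml_tags_expecting_closers( string $xhtml ): array {
	$reader = new XMLReader();
	$reader->XML( $xhtml );

	$result = array();
	while ( $reader->read() ) {
		if ( XMLReader::ELEMENT === $reader->nodeType ) {
			// Empty-element syntax (<br/>, <p/>) never has a matching closer in XML.
			$result[] = array( $reader->localName, ! $reader->isEmptyElement );
		}
	}
	$reader->close();

	return $result;
}

// Made-up sample document with a simple DOCTYPE declaration.
$sample_xhtml = '<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><body>'
	. '<p>One<br/>Two</p><p/><p></p></body></html>';

$sample_tags = xml_tags_expecting_closers( $sample_xhtml );
```

Here both `<br/>` and `<p/>` are reported as not expecting a closer, while `<p></p>` is. An HTML parser would instead treat `<p/>` as a regular opening tag, which is one place where the two `expects_closer()` implementations can disagree.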

Remaining work

Right now, we're guessing the location of all the XML files. It works for the test example above, but to support all the epub files out there, we'd need to:

  • Parse the META-INF/container.xml config file to get the root file path.
  • Parse the root file to extract the paths of the content XHTML files, and potentially metadata such as authors, titles, pages, etc.
  • Discuss mapping the EPub structure into WordPress entities. We have files, chapters, and content pages. What should one WordPress page represent once the import is finished?
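The first two steps could look roughly like the sketch below. It uses PHP's bundled DOM extension rather than WP_XML_Processor, and the function names, file paths, and sample XML are assumptions made up for illustration. It also ignores details a real implementation would need to handle, such as `../` segments or percent-encoding in manifest hrefs.

```php
<?php
// Sketch only: DOMDocument stands in for the XML processor used in this PR.

/** Returns the package document path listed in META-INF/container.xml, or null. */
function epub_find_rootfile( string $container_xml ): ?string {
	$doc = new DOMDocument();
	if ( ! $doc->loadXML( $container_xml ) ) {
		return null;
	}
	$xpath = new DOMXPath( $doc );
	$xpath->registerNamespace( 'c', 'urn:oasis:names:tc:opendocument:xmlns:container' );
	$node = $xpath->query( '//c:rootfile/@full-path' )->item( 0 );

	return $node ? $node->nodeValue : null;
}

/** Returns content document paths in spine (reading) order, relative to the zip root. */
function epub_list_spine_paths( string $opf_xml, string $opf_path ): array {
	$doc = new DOMDocument();
	if ( ! $doc->loadXML( $opf_xml ) ) {
		return array();
	}
	$xpath = new DOMXPath( $doc );
	$xpath->registerNamespace( 'opf', 'http://www.idpf.org/2007/opf' );

	// Map manifest item ids to their hrefs.
	$hrefs = array();
	foreach ( $xpath->query( '//opf:manifest/opf:item' ) as $item ) {
		$hrefs[ $item->getAttribute( 'id' ) ] = $item->getAttribute( 'href' );
	}

	// Manifest hrefs are relative to the directory of the package document.
	$base   = dirname( $opf_path );
	$prefix = '.' === $base ? '' : $base . '/';

	$paths = array();
	foreach ( $xpath->query( '//opf:spine/opf:itemref' ) as $itemref ) {
		$idref = $itemref->getAttribute( 'idref' );
		if ( isset( $hrefs[ $idref ] ) ) {
			$paths[] = $prefix . $hrefs[ $idref ];
		}
	}

	return $paths;
}

// Made-up sample data, not taken from any real EPub file.
$sample_container = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
XML;

$sample_opf = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="css" href="style.css" media-type="text/css"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="cover"/>
    <itemref idref="ch1"/>
  </spine>
</package>
XML;

$sample_rootfile = epub_find_rootfile( $sample_container );
$sample_spine    = epub_list_spine_paths( $sample_opf, $sample_rootfile );
```

Following the spine instead of scanning directories would also answer the question of where XHTML may live, since the package document lists every content file explicitly.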

Open questions

  • Should we introduce a common WP_Markup_Processor interface to represent a subset of methods shared between the HTML processor and the XML processor?
  • Should we introduce class WP_XHTML_Processor extends WP_XML_Processor to align the semantics of expects_closer() and other overlapping methods?
  • Should we only consider the OEBPS and EPUB directories inside the epub file? Or can XHTML be stored under another path? How would we know?
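To make the first question more concrete, here is a purely hypothetical sketch of what such an interface might contain and how a consumer could program against it. Neither `WP_Markup_Processor` nor the array-backed fake below exists in this PR; the method names are borrowed from the existing processors, but the exact subset and signatures are assumptions.

```php
<?php
// Hypothetical: this interface is a discussion aid, not an existing API.
interface WP_Markup_Processor {
	public function next_tag(): bool;
	public function get_tag(): ?string;
	/** @return string|true|null */
	public function get_attribute( string $name );
	public function expects_closer(): bool;
}

/** Toy array-backed implementation, only here so the sketch can run. */
final class Fake_Markup_Processor implements WP_Markup_Processor {
	private array $tags;
	private int $index = -1;

	public function __construct( array $tags ) {
		$this->tags = $tags;
	}

	public function next_tag(): bool {
		return ++$this->index < count( $this->tags );
	}

	public function get_tag(): ?string {
		return $this->tags[ $this->index ]['tag'] ?? null;
	}

	public function get_attribute( string $name ) {
		return $this->tags[ $this->index ]['attributes'][ $name ] ?? null;
	}

	public function expects_closer(): bool {
		return $this->tags[ $this->index ]['expects_closer'] ?? true;
	}
}

/** A consumer that only depends on the shared interface. */
function collect_image_sources( WP_Markup_Processor $processor ): array {
	$sources = array();
	while ( $processor->next_tag() ) {
		// Tag name casing may differ between processors, so normalize before comparing.
		if ( 'img' === strtolower( (string) $processor->get_tag() ) ) {
			$src = $processor->get_attribute( 'src' );
			if ( is_string( $src ) ) {
				$sources[] = $src;
			}
		}
	}
	return $sources;
}

$sample_sources = collect_image_sources(
	new Fake_Markup_Processor(
		array(
			array( 'tag' => 'P' ),
			array( 'tag' => 'IMG', 'attributes' => array( 'src' => 'cover.png' ), 'expects_closer' => false ),
			array( 'tag' => 'img', 'attributes' => array( 'src' => 'map.png' ), 'expects_closer' => false ),
		)
	)
);
```

Even in this toy version, the consumer has to normalize tag name casing itself, which hints at the kind of subtle differences a shared interface would need to paper over or document.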

cc @ellatrix @dmsnell @zaerl @brandonpayton @sirreal

@sirreal (Member) commented Dec 18, 2024

> Should we introduce a common WP_Markup_Processor interface to represent a subset of methods shared between the HTML processor and the XML processor?

On the surface this seems nice, since folks could then program against a common interface. I'd like to have a good understanding of what the common interface would be and how it would be used. An example might be the selectors work, which could be used to navigate documents through this common interface. Would the tag processor also implement this interface? Is the tag processor expected to be the base class for many of these other processors?

I also wonder if there are enough subtle differences that the common interface would be cumbersome without much tangible benefit. Again from the selectors work, when matching the ID selector, it needs to know if the document is in quirks mode to determine whether the match is case sensitive or insensitive. That's a purely HTML concept. Maybe XML documents are documents that are never in quirks mode, but it's something to think about.

@adamziel (Collaborator, Author) commented

@sirreal all good points! Maybe we'd need a separate XHTML processor to integrate the selectors work, then, and the common interface would apply to XHTML and HTML, not XML and HTML.

@sirreal (Member) commented Dec 18, 2024

Selectors are a great interface for navigating trees, and they're a great fit for XML. I just wanted to mention a quirk (😉) I noticed with selector matching in the HTML Processor. I'd love for something like select and select_all to be part of the interface if we decide to implement one.

Base automatically changed from refactor-readers to trunk January 8, 2025 21:18
@zaerl requested a review from a team as a code owner January 8, 2025 21:18
Description TBD

Use WP_XML_Reader for EPubs, support simple DOCTYPE declarations in XML

Parse EPubs as XHTML
@adamziel merged commit b9b7fcc into trunk Jan 9, 2025
10 checks passed
@adamziel deleted the add-epub-reader branch January 9, 2025 08:41