Is the ISO specification of PDF complete? #1073

Kopsky · 2023-11-22T14:30:04Z

Hello, I have a question for programmers who create software that works with PDF. I would like to know whether you used the ISO specification of PDF to create the software, whether it was useful, and ultimately whether it is possible to say that the ISO specification is COMPLETE, in the sense that it (along with programming skills independent of the format) contains sufficient information to create a program that can READ PDF. Comments on both ISO 32000 and ISO 19005 are welcome. Thank you for your time.

moritzfl · 2024-01-21T08:17:45Z

Well ... PDF is not just PDF - there is PDF 1 and PDF 2. Then there is PDF/A-3, which borrows some features from PDF 2 but is otherwise based on PDF 1. The PDF/A-3 standard itself is a short document that cannot be fully understood without a PDF 1 specification at hand.

Depending on what you want to achieve, you may also not want to reject slightly incorrect PDF documents, as most readers will open them. A good example of how tolerant readers are, which I always use to convey this to people, is the following: https://stackoverflow.com/questions/17279712/what-is-the-smallest-possible-valid-pdf

In that regard, OpenPDF is fairly strict. It rejects many PDF files that other libraries will still read. PDFBox, on the other hand, is very tolerant of corrupt PDF files. Usually, when PDFBox cannot open a PDF file, most viewers will not open it either. This can be a design decision: either you want to ensure that input and output are absolutely correct and no content is misrepresented, or you are fine with guessing a bit and doing a best-effort interpretation of the input.
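To make the strict versus best-effort trade-off concrete, here is a minimal sketch in plain Java. It is not OpenPDF or PDFBox code; the class `HeaderCheck`, the method `findHeader`, and the 1024-byte search window are assumptions chosen for illustration. Strict mode accepts the `%PDF-` marker only at offset 0, while lenient mode also accepts it after some leading junk.

```java
import java.nio.charset.StandardCharsets;

public class HeaderCheck {
    // The marker every PDF file is expected to start with.
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Assumption: lenient mode searches only the first 1024 bytes.
    private static final int LENIENT_WINDOW = 1024;

    /**
     * Returns the offset of "%PDF-" in data, or -1 if it is not found.
     * Strict mode accepts the marker only at offset 0.
     */
    static int findHeader(byte[] data, boolean strict) {
        int lastStart = strict
                ? 0
                : Math.min(data.length, LENIENT_WINDOW) - MARKER.length;
        for (int start = 0; start <= lastStart; start++) {
            if (matchesAt(data, start)) {
                return start;
            }
        }
        return -1;
    }

    // True if the marker occurs in data at exactly this offset.
    private static boolean matchesAt(byte[] data, int start) {
        if (start + MARKER.length > data.length) {
            return false;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (data[start + i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] clean = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] junk = "garbage\n%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(findHeader(clean, true));  // strict, clean file
        System.out.println(findHeader(junk, true));   // strict, leading junk
        System.out.println(findHeader(junk, false));  // lenient, leading junk
    }
}
```

The same file is rejected in one mode and accepted in the other. A real reader faces this choice at many more points than the header (cross-reference tables, stream lengths, object syntax), which is where the two design philosophies diverge.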

If you are in a commercial setting, you will hear this a lot: "But Adobe opens it - why can't you read it?" Sometimes you have to implement a fallback similar to what Adobe does, and at other times such a fallback might not be viable for your production setting. Often, depending on the actual usage scenario, I would consider Adobe's fallbacks too risky (e.g. when reading page content from a font with missing ToUnicode mappings).

And then of course there are faulty PDFs that can cause your software to throw unexpected errors, null pointer exceptions, etc. You need to handle situations where you expect a PDF string but get a dictionary object.
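As an illustration of that last point, here is a minimal sketch in plain Java. The types `PdfValue`, `PdfText`, and `PdfDict` and the helper `getText` are hypothetical stand-ins, not the OpenPDF or PDFBox object model. The idea is to check the runtime type before using a value, so a missing or wrongly typed entry becomes an empty result rather than a `ClassCastException` or `NullPointerException`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SafeAccess {
    // Hypothetical, heavily simplified object model for illustration only.
    interface PdfValue {}

    static final class PdfText implements PdfValue {
        final String value;
        PdfText(String value) { this.value = value; }
    }

    static final class PdfDict implements PdfValue {
        final Map<String, PdfValue> entries = new HashMap<>();
    }

    /**
     * Returns the string stored under key, or an empty Optional if the
     * entry is missing or holds a different kind of object.
     */
    static Optional<String> getText(PdfDict dict, String key) {
        PdfValue value = dict.entries.get(key);
        if (value instanceof PdfText) {
            return Optional.of(((PdfText) value).value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        PdfDict info = new PdfDict();
        info.entries.put("Title", new PdfText("Report"));
        info.entries.put("Author", new PdfDict()); // faulty file: dictionary where a string belongs

        System.out.println(getText(info, "Title").orElse("(none)"));
        System.out.println(getText(info, "Author").orElse("(none)"));
        System.out.println(getText(info, "Subject").orElse("(none)"));
    }
}
```

Whether the caller then substitutes a default, logs a warning, or rejects the file is the same strict versus tolerant decision discussed above; the sketch only ensures that the decision is made deliberately instead of by an unhandled exception.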

tl;dr

You need a good overview of all PDF standards, starting with PDF 1 (all versions - for reading you also need to take care of backwards compatibility with deprecated features) up to PDF 2 - and then you need to cover the corner cases of the PDF/A standards. And that is the bare minimum.

Long story short: writing a PDF reader from scratch is not easy at all. OpenHub estimates that OpenPDF took 27 years of effort (COCOMO model).


asturio (Maintainer) · 2024-03-04T14:06:13Z

Closing as a duplicate of #994.
