[MGNLDAM-559] DAM Preview module cannot create thumbnail for PDF files version 1.4+ Created: 12/Mar/15  Updated: 11/Oct/16  Resolved: 19/Feb/16

Status: Closed
Project: Magnolia DAM Module
Component/s: None
Affects Version/s: 2.1.3
Fix Version/s: 2.1.4

Type: Bug Priority: Neutral
Reporter: Jordie Diepeveen Assignee: Ngoc Nguyenthanh
Resolution: Fixed Votes: 1
Labels: support
Remaining Estimate: 1d
Time Spent: 2d
Original Estimate: 3d

Issue Links:
Cloners
is cloned by MAGNOLIA-6555 Manage version of pdfbox in main Closed
Relates
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Release notes required:
Yes
Date of First Response:
Epic Link: Update 3rd-party libraries for 5.5
Sprint: Saigon 31
Story Points: 5

 Description   

The dam-preview module uses the PDF renderer from Sun (com.sun.pdfview), which cannot parse PDF files newer than version 1.4.

The following error will be thrown:
com.sun.pdfview.PDFParseException: Expected 'xref' at start of table
in class info.magnolia.imaging.operations.load#loadSource()

By switching to a newer library (such as Apache PDFBox) and updating the ViaPDFRenderer, a preview can be created within the dam-app browser, e.g.:

pom.xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.1</version>
</dependency>
Bar.java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

import info.magnolia.imaging.ImagingException;
import info.magnolia.imaging.operations.load.FromBinaryNode;

public class ViaPdfRenderer extends FromBinaryNode {

    @Override
    protected BufferedImage doReadAndClose(InputStream inputStream) throws IOException, ImagingException {
        // Parse the PDF
        PDFParser pdfParser = new PDFParser(inputStream);
        pdfParser.parse();

        // Get the document; the finally block closes it even if rendering fails
        PDDocument document = pdfParser.getPDDocument();
        try {
            // Get all pages
            @SuppressWarnings("unchecked")
            List<PDPage> pages = document.getDocumentCatalog().getAllPages();
            if (pages == null || pages.isEmpty()) {
                return null;
            }

            // Render the first page as the preview image
            return pages.get(0).convertToImage();
        } finally {
            document.close();
        }
    }
}


 Comments   
Comment by Ngoc Nguyenthanh [ 02/Feb/16 ]

Resolved by:

  • Replacing the Swinglabs Pdf-renderer library with Apache PDFBox version 1.8.11
  • Testing with PDF sample files, versions 1.3 to 1.7, using the files provided by Apache Tika
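
When testing across PDF versions, it can help to confirm which version each sample file actually declares. The following is a minimal sketch, not part of the fix: the class name PdfHeaderVersion and its read method are hypothetical, and it only inspects the "%PDF-x.y" header at the start of the file. A PDF 1.4+ file may override that value in its document catalog, which this sketch does not check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the version declared in a PDF file header.
public class PdfHeaderVersion {

    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d+\\.\\d+)");

    /**
     * Returns the declared version (e.g. "1.4"), or null if no header
     * is found within the first 1024 bytes of the stream.
     */
    public static String read(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        int total = 0;
        int n;
        // Fill the buffer until it is full or the stream ends
        while (total < buffer.length
                && (n = in.read(buffer, total, buffer.length - total)) > 0) {
            total += n;
        }
        // ISO-8859-1 maps every byte to one char, so binary content is safe
        String head = new String(buffer, 0, total, StandardCharsets.ISO_8859_1);
        Matcher matcher = HEADER.matcher(head);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Calling PdfHeaderVersion.read on a FileInputStream for each sample would give the header version to record alongside the preview result.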
Comment by Mikaël Geljić [ 18/Feb/16 ]

Requires release notes:

  • Optional dam-preview module now uses Apache PDFBox to render pdf previews, instead of Pdf-renderer.
Comment by Mikaël Geljić [ 18/Feb/16 ]

Actually, just noticed we had pdfbox in webapps before already (1.8.6), brought transitively by tika-parsers. When using dam-preview, version 1.8.11 here will "shadow" the other one and could potentially have runtime incompatibilities. And there are such incompatibilities, as reported by clirr:

ERROR: 7005: org.apache.pdfbox.pdfparser.NonSequentialPDFParser: Parameter 1 of 'protected void decrypt(org.apache.pdfbox.cos.COSString, long, long)' has changed its type to org.apache.pdfbox.cos.COSBase
ERROR: 7005: org.apache.pdfbox.pdfparser.NonSequentialPDFParser: Parameter 2 of 'protected void decrypt(org.apache.pdfbox.cos.COSString, long, long)' has changed its type to int
ERROR: 7005: org.apache.pdfbox.pdfparser.NonSequentialPDFParser: Parameter 3 of 'protected void decrypt(org.apache.pdfbox.cos.COSString, long, long)' has changed its type to int
ERROR: 7005: org.apache.pdfbox.pdfwriter.COSWriter: Parameter 2 of 'public COSWriter(java.io.OutputStream, java.io.FileInputStream)' has changed its type to java.io.InputStream
ERROR: 7005: org.apache.pdfbox.pdmodel.PDDocument: Parameter 1 of 'public void saveIncremental(java.io.FileInputStream, java.io.OutputStream)' has changed its type to java.io.InputStream
ERROR: 7013: org.apache.pdfbox.pdmodel.encryption.SecurityHandler: Abstract method 'public boolean hasProtectionPolicy()' has been added
ERROR: 7005: org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler: Parameter 4 of 'public byte[] getUserPassword(byte[], byte[], int, long)' has changed its type to int
ERROR: 7002: org.apache.pdfbox.pdmodel.graphics.PDExtendedGraphicsState: Method 'public java.lang.Float getNonStrokingAlpaConstant()' has been removed
ERROR: 7002: org.apache.pdfbox.pdmodel.graphics.PDExtendedGraphicsState: Method 'public java.lang.Float getStrokingAlpaConstant()' has been removed
ERROR: 7004: org.apache.pdfbox.pdmodel.graphics.shading.AxialShadingContext: In method 'public AxialShadingContext(org.apache.pdfbox.pdmodel.graphics.shading.PDShadingType2, java.awt.image.ColorModel, java.awt.geom.AffineTransform, org.apache.pdfbox.util.Matrix, int)' the number of arguments has changed
ERROR: 1001: org.apache.pdfbox.pdmodel.graphics.shading.GouraudShadingContext: Decreased visibility of class from public to package
ERROR: 8001: org.apache.pdfbox.pdmodel.graphics.shading.GouraudTriangle: Class org.apache.pdfbox.pdmodel.graphics.shading.GouraudTriangle removed
ERROR: 7004: org.apache.pdfbox.pdmodel.graphics.shading.RadialShadingContext: In method 'public RadialShadingContext(org.apache.pdfbox.pdmodel.graphics.shading.PDShadingType3, java.awt.image.ColorModel, java.awt.geom.AffineTransform, org.apache.pdfbox.util.Matrix, int)' the number of arguments has changed
ERROR: 1001: org.apache.pdfbox.pdmodel.graphics.shading.Type5ShadingContext: Decreased visibility of class from public to package
ERROR: 7006: org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageXYZDestination: Return type of method 'public int getZoom()' has been changed to float
ERROR: 7005: org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageXYZDestination: Parameter 1 of 'public void setZoom(int)' has changed its type to float
  • We should probably manage the pdfbox dependency like other apache libs, i.e. in main parent pom
  • We also need to check whether or not we stick to 1.8.6 to stay in line with tika dependency
    • if it handles PDF 1.4+ as well
    • or upgrade tika as well (more uncertain, would jump from 1.6 to 1.11, still to have *only* pdfbox 1.8.10), more incompatibilities around the corner too...
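
To see which pdfbox jar actually wins at runtime when two versions are on the classpath, one option is to ask the JVM where a PDFBox class was loaded from. This is only a diagnostic sketch using standard JDK calls; the class name JarLocator and its locate method are hypothetical, and the result depends on the classloader that runs it (inside a webapp it would have to run in the webapp's context).

```java
import java.net.URL;
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical diagnostic: reports the jar or directory a class was loaded from.
public class JarLocator {

    /**
     * Returns the code source location of the named class, or empty if the
     * class is not on the classpath or has no code source (e.g. JDK classes).
     */
    public static Optional<String> locate(String className) {
        try {
            // Load without initializing, so static initializers do not run
            Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            URL location = source == null ? null : source.getLocation();
            return location == null ? Optional.empty() : Optional.of(location.toString());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(name + " -> " + locate(name).orElse("not found on classpath"));
    }
}
```

If the printed location points at the 1.8.11 jar while tika-parsers was built against 1.8.6, the incompatibilities listed by clirr above are the ones to watch for.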
Comment by Mikaël Geljić [ 19/Feb/16 ]

Alright, unless strong blocker or regression, we stick with pdfbox 1.8.6. We will manage the version in main's parent pom along with a comment to keep it in sync with tika.
This will be complemented by upgrades of third-party libraries for 5.5.

Comment by Sang Ngo Huu [ 22/Feb/16 ]

QA: tested with some PDF files from the internet and the files provided by Apache Tika.

  • Most cases work smoothly, but the rendering quality is not good; this comes from the library
  • Previews are slow for big files

For me, this is acceptable; the limitation comes from the library. I will close this issue. Please reopen it if the library needs updating or a new issue comes up.

Generated at Mon Feb 12 05:01:02 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.