[MAGNOLIA-7123] Full text search in documents (pdf, doc, docx) does not work anymore Created: 28/Aug/17  Updated: 06/Jan/20  Resolved: 03/Apr/18

Status: Closed
Project: Magnolia
Component/s: core
Affects Version/s: 5.4.13, 5.5.4
Fix Version/s: 5.5.10, 5.6.3

Type: Bug Priority: Major
Reporter: Federico Grilli Assignee: Leah Staniorski
Resolution: Fixed Votes: 4
Labels: regression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Cloners
is cloned by MAGNOLIA-7398 Full text search in documents (pdf, ... Closed
causality
dependency
is depended upon by MGNLDAM-737 Fix Assets app search to handle searc... Closed
is depended upon by MGNLCE-153 Create Test for PDF Search Closed
relation
is related to MGNLDAM-667 Fix full text dam search (Assets app) Closed
is related to MGNLDAM-442 Allow full-text search for common/pop... Closed
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Release notes required:
Yes
Date of First Response:
Sprint: Basel 135, Basel 136, Basel 137, Basel 138, Basel 141
Story Points: 5

 Description   

As reported by Tom Wespi in the User forum (and verified by yours truly): indexing and searching of PDF files (and possibly other formats too) no longer works in Magnolia. It used to work out of the box at least up to versions 5.3.11 and 5.4.6.
Since PDF indexing should happen automatically, being a feature of Jackrabbit (JR) that doesn't require any special configuration, we should figure out what caused this regression.

Here is the related forum thread https://groups.google.com/a/magnolia-cms.com/forum/#!searchin/user-list/search$20for$20pdf%7Csort:date/user-list/Jt0u3ihhh9w/JHQck-QKAwAJ



 Comments   
Comment by Frank Sommer [ 30/Aug/17 ]

In M5.5.2 (dam 2.2.2) it works.

Comment by Federico Grilli [ 30/Aug/17 ]

Thank you, frank.sommer. I've just confirmed that the feature stopped working in 5.5.4, which, as rgange had already mentioned, points to the following possible culprit: https://documentation.magnolia-cms.com/display/DOCS/Release+notes+for+Magnolia+CORE+5.5.4#ReleasenotesforMagnoliaCORE5.5.4-ApacheTika1.14

Comment by Richard Gange [ 30/Aug/17 ]

Yes, I think that Tika update also triggered an update of PDFBox. That could also be playing a role here.

We went from Tika 1.6 to 1.14 and PDFBox 1.8.6 to 2.0.3.

Comment by Richard Gange [ 30/Aug/17 ]

Migration guide: https://pdfbox.apache.org/2.0/migration.html

Comment by Richard Gange [ 01/Sep/17 ]

Further information about this issue: JR 2.12.4 ships with Tika 1.7. See http://central.maven.org/maven2/org/apache/jackrabbit/jackrabbit-parent/2.12.4/jackrabbit-parent-2.12.4.pom

Most likely this issue can be fixed by overriding the managed version of tika to be:

<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.7</version>
</dependency>
Comment by Richard Gange [ 01/Sep/17 ]

Possibly related: https://issues.apache.org/jira/browse/TIKA-1285

Comment by Tom Wespi [ 01/Sep/17 ]

After adding or overwriting the following jars with the versions used in 5.5.3, PDF and TXT files are indexed again in a 5.5.5 instance. After the index is deleted, files in the DAM are parsed and indexed again.

<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.6</version>
</dependency>
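
For anyone applying this workaround through Maven rather than by swapping jars, one possible way to pin these versions is the `<dependencyManagement>` section of the webapp's own pom, which also governs the versions of transitive dependencies. The fragment below is only a sketch: the placement in `<dependencyManagement>` is an assumption not stated in this thread, while the coordinates and versions are the ones listed above.

```xml
<!-- Sketch only: pinning via dependencyManagement is an assumption, not taken from this thread. -->
<dependencyManagement>
  <dependencies>
    <!-- Versions below are the ones reported above as used in 5.5.3. -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>1.8.6</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

The issue lists 5.5.10 and 5.6.3 as fix versions, so pins like these would only be relevant on releases before those.
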
Comment by Federico Grilli [ 22/Sep/17 ]

Thank you tomwespi and rgange for your investigations and contributions.

Comment by Pietro Pagani [ 29/Nov/17 ]

Hi,
we have also faced this issue with version 5.5.6 of Magnolia.
By applying the pom changes suggested by Tom Wespi, we have partially solved the problem, and full text search on PDF is now working fine.
I say "partially" because full text search over .doc and .docx is still not working.

We uploaded three documents with the same content in three formats (pdf, doc and docx), and the following query returned only the pdf.

SELECT * FROM [nt:base] AS p WHERE p.[jcr:primaryType] = 'mgnl:resource' AND contains(p.*, 'xxx')

Regards,
Pietro

Comment by Matjaz Trcek [ 16/Mar/18 ]

QA'd both 5.5 and 5.6 

Works as expected

Comment by Federico Grilli [ 29/Mar/18 ]

lstaniorski: while doing QA on the soon-to-be-released 5.6.4 bundle, czimmermann found out that this does not work in the community web app, while it works on the demo bundle. This means the issue is only partially solved. We can keep the changes made so far, as they are needed anyway. We are probably missing the needed libraries in magnolia-community-webapp. Anyhow, we can't advertise this as fixed now.

Comment by Federico Grilli [ 03/Apr/18 ]

After further testing, it turned out to work perfectly well. czimmermann and I were fooled while doing QA by the mgnl jumpstart -s command, which for some reason downloaded a DEBUG-SNAPSHOT version of the web app that has the wrong library versions. See NPMCLI-177.
