[MAGNOLIA-7123] Full text search in documents (pdf, doc, docx) does not work anymore Created: 28/Aug/17 Updated: 06/Jan/20 Resolved: 03/Apr/18 |
|
| Status: | Closed |
| Project: | Magnolia |
| Component/s: | core |
| Affects Version/s: | 5.4.13, 5.5.4 |
| Fix Version/s: | 5.5.10, 5.6.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Federico Grilli | Assignee: | Leah Staniorski |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | regression |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Template: | |
| Acceptance criteria: | Empty |
| Task DoD: |
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ] Architecture Decision Record (ADR)
|
| Bug DoR: |
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
|
| Release notes required: | Yes |
| Date of First Response: | |
| Sprint: | Basel 135, Basel 136, Basel 137, Basel 138, Basel 141 |
| Story Points: | 5 |
| Description |
|
As reported by Tom Wespi in the User forum (and verified by yours truly), indexing and searching of PDF files (and possibly other formats too) no longer works in Magnolia. It used to work out of the box at least until versions 5.3.11 and 5.4.6 respectively. Related forum thread: https://groups.google.com/a/magnolia-cms.com/forum/#!searchin/user-list/search$20for$20pdf%7Csort:date/user-list/Jt0u3ihhh9w/JHQck-QKAwAJ |
| Comments |
| Comment by Frank Sommer [ 30/Aug/17 ] |
|
In M5.5.2 (dam 2.2.2) it works. |
| Comment by Federico Grilli [ 30/Aug/17 ] |
|
Thank you frank.sommer. I've just ascertained that the feature stopped working in 5.5.4, which, as rgange had already mentioned, points to the following possible culprit: https://documentation.magnolia-cms.com/display/DOCS/Release+notes+for+Magnolia+CORE+5.5.4#ReleasenotesforMagnoliaCORE5.5.4-ApacheTika1.14 |
| Comment by Richard Gange [ 30/Aug/17 ] |
|
Yes, I think that Tika update also triggered an update of PDFBox, which could also be playing a role here. We went from Tika 1.6 to 1.14 and from PDFBox 1.8.6 to 2.0.3. |
| Comment by Richard Gange [ 30/Aug/17 ] |
|
Migration guide: https://pdfbox.apache.org/2.0/migration.html |
| Comment by Richard Gange [ 01/Sep/17 ] |
|
Further information about this issue: Jackrabbit (JR) 2.12.4 ships with Tika 1.7. See http://central.maven.org/maven2/org/apache/jackrabbit/jackrabbit-parent/2.12.4/jackrabbit-parent-2.12.4.pom Most likely this issue can be fixed by overriding the managed version of Tika to be:
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.7</version>
</dependency>
|
| Comment by Richard Gange [ 01/Sep/17 ] |
|
Possibly related: https://issues.apache.org/jira/browse/TIKA-1285 |
| Comment by Tom Wespi [ 01/Sep/17 ] |
|
After adding/overwriting the following jars with the versions used in 5.5.3, PDF and TXT files are indexed again in a 5.5.5 instance. After deleting the index, files in DAM get parsed and indexed again.
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.6</version>
</dependency>
|
| Comment by Federico Grilli [ 22/Sep/17 ] |
|
Thank you tomwespi and rgange for your investigations and contributions. |
| Comment by Pietro Pagani [ 29/Nov/17 ] |
|
Hi, we have uploaded three documents containing the same content in three formats (pdf, doc and docx), and the following query returned only the pdf:
SELECT * FROM [nt:base] AS p WHERE p.[jcr:primaryType] = 'mgnl:resource' AND contains(p.*, 'xxx')
Regards, |
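For anyone who wants to repeat this check with other search terms, here is a minimal sketch that builds the same JCR-SQL2 statement for an arbitrary term. The class `FullTextQuery` and its methods are hypothetical helpers, not part of Magnolia or the JCR API. It only doubles single quotes so the term stays inside the SQL2 string literal; it does not handle characters that are special in the full-text search expression itself.

```java
// Hypothetical helper for illustration only; not part of Magnolia or the JCR API.
final class FullTextQuery {

    private FullTextQuery() {
    }

    /** Doubles single quotes so the value can sit inside a single-quoted JCR-SQL2 literal. */
    static String escapeLiteral(String value) {
        return value.replace("'", "''");
    }

    /** Builds the full-text query from the comment above for the given search term. */
    static String build(String term) {
        return "SELECT * FROM [nt:base] AS p"
                + " WHERE p.[jcr:primaryType] = 'mgnl:resource'"
                + " AND contains(p.*, '" + escapeLiteral(term) + "')";
    }

    public static void main(String[] args) {
        // Prints the statement for the term used in the comment above.
        System.out.println(build("xxx"));
    }
}
```

The resulting string could then be handed to the standard JCR `QueryManager.createQuery(statement, Query.JCR_SQL2)` call, assuming a session on the workspace that holds the uploaded documents.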
| Comment by Matjaz Trcek [ 16/Mar/18 ] |
|
QA'd both 5.5 and 5.6. Works as expected. |
| Comment by Federico Grilli [ 29/Mar/18 ] |
|
lstaniorski: while doing QA on the soon-to-be-released 5.6.4 bundle, czimmermann found out that this does not work in the community web app, while it works on the demo bundle. This means the issue is only partially solved. We can keep the changes made so far, as they are needed anyway. We are probably missing the needed libraries in magnolia-community-webapp. Anyhow, we can't advertise this as fixed now. |
| Comment by Federico Grilli [ 03/Apr/18 ] |
|
After further testing, it turned out to work perfectly well. czimmermann and I were fooled while doing QA by the mgnl jumpstart -s command, which for some reason downloaded a DEBUG-SNAPSHOT version of the web app that has the wrong library versions. See NPMCLI-177 |