[MAGNOLIA-6188] Provide index aggregate for mgnl:page in JR indexing configuration Created: 30/Apr/15  Updated: 15/Jul/20  Resolved: 03/Jun/15

Status: Closed
Project: Magnolia
Component/s: core
Affects Version/s: None
Fix Version/s: 5.4

Type: Improvement Priority: Major
Reporter: Federico Grilli Assignee: Federico Grilli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to MAGNOLIA-6240 Add integration test for new JR's ind... Open
causality
is causing MAGNOLIA-7231 Website specific rules for indexing a... Closed
is causing MAGNOLIA-7241 Prevent saving JCR session in version... Closed
dependency
is depended upon by MTE-19 Provide search functionality for temp... Closed
relation
is related to MAGNOLIA-5244 Use JR indexing configuration to fine... Closed
is related to MGNLFORUM-135 Provide search Closed
is related to MAGNOLIA-7847 Targeted indexing configurations Open
is related to DOCU-561 Document new features and improvement... Closed
is related to MAGNOLIA-6189 Provide highlighting option in search... Closed
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Date of First Response:

 Description   

Contents in a page are found not only in the node whose primaryType is mgnl:page but also in its sub-nodes mgnl:area and mgnl:component. Adding the latter two to the JR index aggregate would allow for simpler, more accurate and possibly faster searches by simply doing e.g.

select * from [mgnl:page] as p where contains(p.*, 'foo')

rather than

select * from [nt:base] as p where contains(p.*, 'foo')

and then post-processing the results in Java by getting for each node the ancestor of type mgnl:page and finally discarding duplicates.



 Comments   
Comment by Magnolia International [ 05/May/15 ]

Patch is good, but test needs work. Will commit some examples shortly that show how to use matchers to create less verbose and more useful tests.
An assertion like assertTrue(result.getRows().nextRow().getValue("rep:excerpt()").getString().contains("<strong>Article</strong>")); is not very useful when it fails (you need to re-run the test and step in with a debugger, or print things out, to see what happened).

I would also be tempted to merge this and MAGNOLIA-6189 into a single issue (and perhaps commit), since the excerpt won't work without the changes to the indexing configuration. (the test will because tests use their own specific config file in src/test/resources/, but still ...)

Comment by Magnolia International [ 05/May/15 ]

Actually, I'm not sure the patch is good after all – when stepping in with the debugger, I see that the line above (where we get rep:excerpt) yields

<div><span>Articles dms 41828606-775b-4e60-999f-1cb30809d5bb Section: <strong>Article</strong> This is the banner text for the &quot;<strong>Article</strong>&quot; section. It is inherited by all child pages unless these are configured</span><span>above 1ffeccb6-bbaa-4d07-a89a-55cff9c18b71 Teaser to an <strong>Article</strong> dms 395a677e-5ed4-4cff-86c4-fd0122dd0c5e Another Teaser to Another <strong>Article</strong> d1b4b57d-3b5d-41ae-85f2-8ed2e4c186ea</span></div>

It doesn't seem like "dms" and UUIDs should be in a search excerpt ?

One more thing have a look at what I did a while ago for forum: MGNLFORUM-135. It was very ad-hoc, so not everything is valuable, and stuff is probably outdated, but the one thing I'd suggest we consider is the custom excerptProviderClass, which would probably help with styling.

Comment by Magnolia International [ 05/May/15 ]

Please see commit 1f21fbe. I modified the previously existing testTitlePropertyIsBoosted() to show how to advantageously use matchers (hopefully I'm not too biased). Left as an exercise to similarly improve the other tests

I also amended the new tests and make them fail (i.e by showing some flaws that I think should be fixed (uuids in excerpt) or simply by letting you complete the test (testIndexConfigurationAggregatesAreasAndComponentsIntoMgnlPageFullIndex should do more than just check the size of the result set, it should definitely check its content, otherwise how do you know aggregation works as you think it does ?)

Comment by Federico Grilli [ 08/May/15 ]

Pls use MAGNOLIA-6188_sq2 for reviewing. I split th job in two parts

  • first the index aggregate to improve full text searching in pages
  • then the custom excerpt part
    Regarding the latter: looking at excerpts obtained with a JCR query we realised that they may contain html tags, escaped (such as &quote; &lt; and &gt;) and not (such as <p>) which would mess with the existing markup of the page where they're used possibly screwing up the final outcome. Turns out that org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt is using DefaultHighlighter.highlight(..) which by default does the XML escaping of the tags it encounters. Hence the need for a custom ExcerptProvider stripping all unwanted tags and html entities.
Comment by Federico Grilli [ 10/May/15 ]

commit e17623a920e43 proposes an alternative solution not using indexing configuration for JCR identifiers. Basically it will strip them from the excerpt by default regardless of whether they're configured to be used or not in excerpts. One loses some flexibility but to me it looks like we're not losing much (I guess not many ppl want jcr ids in their search excerpts). This would also allow for an indexing config working sensibly out-of-the-box, rather than having to explain in docu that if you want to get rid of jcr ids you need to specify exactly the property names containing them in config.

Comment by Magnolia International [ 18/May/15 ]

Review comments:

  • What about "styling" the excerpt output with some css classes rather than div/span/strong ? It feels a bit invasive to do so, but at the same time I'm pretty sure front-enders will be thankful.
  • Why do we reinstantiate NoXMLEscapeHighlighter for every call ? (I'm not sure why Jackrabbit does it in its default impl, that thing has no state) - regardless that's the kind of behavior i like to see documented, to avoid someone tries to be smarter and breaks it later.
  • Why do we do no XML escape of the excerpt ? (I have no clue what the use-case is, but it just begs for a justification in its doc)
  • While I agree with your statement about not needing to exclude certain properties from indexing if we strip out UUIDs in the excerpt provider, we still need to
    • Make sure we properly document how-to-if-needed (e.g I'm not sure we want "superuser" in the excerpts either)
    • Make sure this is somewhat portable to other JCR implementations
Comment by Federico Grilli [ 19/May/15 ]

Thanks for the review.

What about "styling" the excerpt output with some css classes rather than div/span/strong ? It feels a bit invasive to do so, but at the same time I'm pretty sure front-enders will be thankful.

I actually added an excerpt class to the surrounding div. Agree on removing the span s. If I get it right you would like the excerpt string to be something like?

<div class="excerpt">blah blah blah <span class="highlight">foo</span> blah blah...</div>

This could be done but I guess a strong tag surrounding the highlighted term would be better, out of the box, cause you wouldn’t need to do anything to do the highlighting and you could still create your css class to modify the appearance of it.

Why do we reinstantiate NoXMLEscapeHighlighter for every call ?

You’re right, no need for that apparently

Why do we do no XML escape of the excerpt ?

Will add a comment in the code. Basically the default implementation does XML escaping by encoding illegal XML characters. This is a source of troubles when we want to strip all those unwanted HTML entities and tags. In particular the double quote char (“ escaped as &quot;) which, if you try it removing it before passing the text onto the doHighlight(..) method (as we do for other illegal chars) will mess up the positioning of the highlighting tag, e.g. Ar<strong>ticle</strong> instead of <strong>Article</strong>.

Make sure we properly document how-to-if-needed (e.g I'm not sure we want "superuser" in the excerpts either)

Sure, there’s a linked docu issue for that

Make sure this is somewhat portable to other JCR implementations

I’m afraid this won’t be possible - “This feature is Jackrabbit specific (introduced in version 1.3) and will not work with other JCR implementations.” http://wiki.apache.org/jackrabbit/ExcerptProvider

Comment by Magnolia International [ 02/Jun/15 ]

All good; I agree with your span vs strong observation, good point. Let's keep the span (with a class), since the <div#excerpt> can contain several such <span> fragments.

Re portability, it's clear this config and excerpt provider etc is Jackrabbit-only, what i meant is, ideally, we have pointers to how to do with other containers. IIRC I had an open issue on the Modeshape issue tracker about it (edit: found it - it also points to ExO's implementation of the same, so that might be useful for docu)

Comment by Magnolia International [ 03/Jun/15 ]

QueryUtilTest is broken

Comment by Magnolia International [ 03/Jun/15 ]

Fixed and reworked QueryUtilTest

Generated at Mon Feb 12 04:12:08 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.