[MAGNOLIA-6548] SearchHTMLExcerpt: Should also check for HTML end tags and unclosed opening tags and remove them Created: 15/Feb/16  Updated: 30/Aug/21

Status: Accepted
Project: Magnolia
Component/s: None
Affects Version/s: 6.2
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Christian Ringele Assignee: Unassigned
Resolution: Unresolved Votes: 3
Labels: support
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File broken-highlight.png    
Issue Links:
Relates
causality
relation
is related to MGNLDEMO-144 Make the w3 validation proof: Mainly ... Closed
is related to MGNLDEMO-135 Search result page might brake layout... Closed
Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Date of First Response:

 Description   

When running a JCR query, each result row contains an excerpt snippet with the search term highlighted, plus the text before and after the found term.
The excerpt is generated and processed by our class info.magnolia.jackrabbit.lucene.SearchHTMLExcerpt.

Problem:
If pages contain rich text content with many li and ul tags, the excerpt snippet is likely to include a cut-off </li> or </ul> tag from the content.
As a result, the displayed search results contain stray </li> or </ul> markup.
It can also happen that the end of the excerpt snippet picks up a leading <li> or <ul> from the content that is never closed.

Solution:
The class SearchHTMLExcerpt should check for stray closing tags and for opening tags that are never closed, and remove them from the excerpt snippet.
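
The proposed check could be sketched as follows. This is only an illustration under stated assumptions, not the actual SearchHTMLExcerpt code: the class name ExcerptTagBalancer and the method removeUnbalancedTags are hypothetical, and the sketch works on the finished excerpt string with a simple regular expression instead of a real HTML parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Magnolia or Jackrabbit.
final class ExcerptTagBalancer {

    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Elements that never have a closing tag and are therefore left alone.
    private static final List<String> VOID_ELEMENTS =
            Arrays.asList("br", "hr", "img", "input", "meta", "link");

    static String removeUnbalancedTags(String excerpt) {
        List<int[]> toRemove = new ArrayList<>();      // [start, end) of each tag to drop
        Deque<String> openNames = new ArrayDeque<>();  // names of currently open tags
        Deque<int[]> openSpans = new ArrayDeque<>();   // positions of those open tags

        Matcher m = TAG.matcher(excerpt);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            int[] span = {m.start(), m.end()};
            if (VOID_ELEMENTS.contains(name) || m.group().endsWith("/>")) {
                continue; // void or self-closing: nothing to balance
            }
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it until its closing tag shows up.
                openNames.push(name);
                openSpans.push(span);
            } else if (openNames.contains(name)) {
                // Closing tag with a partner: anything opened after that
                // partner and still open was never closed, so drop it.
                while (!openNames.peek().equals(name)) {
                    openNames.pop();
                    toRemove.add(openSpans.pop());
                }
                openNames.pop();
                openSpans.pop();
            } else {
                // Closing tag whose opening tag was cut off by the excerpt.
                toRemove.add(span);
            }
        }
        // Whatever is still open at the end of the excerpt is never closed.
        toRemove.addAll(openSpans);

        // Delete from back to front so earlier offsets stay valid.
        toRemove.sort((a, b) -> Integer.compare(b[0], a[0]));
        StringBuilder result = new StringBuilder(excerpt);
        for (int[] span : toRemove) {
            result.delete(span[0], span[1]);
        }
        return result.toString();
    }
}
```

For the input "item one</li><li>two <strong>term</strong> three</li><li>four" the sketch returns "item one<li>two <strong>term</strong> three</li>four": the leading </li> and the trailing <li> are dropped, while balanced tags, including the highlighting, are kept. A balanced <li> without its surrounding <ul> still remains, so a real implementation would probably need additional rules.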



 Comments   
Comment by Richard Gange [ 19/Apr/16 ]

The best way I can see to change this behavior would be to identify the names of all the properties which store or might store HTML. For example, the HTML component usually stores its content in a property called editHTML.

Next you need to create a custom indexing configuration file. In that file, specify a custom analyzer for the properties identified in the first step. See http://wiki.apache.org/jackrabbit/IndexingConfiguration. It's probably a good idea to start with a copy of the one provided in the core module and then add this configuration as many times as needed, once for each property that needs the special analyzer.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
    <analyzer class="info.magnolia.jackrabbit.lucene.analysis.HTMLStripCharAnalyzer">
      <property>editHtml</property>
    </analyzer>
  </analyzers>
</configuration>

With this configuration part, you define how a property should be analyzed. If a property has an analyzer configured, this analyzer is used for indexing and searching this property. In the workspace.xml for the website workspace (or whichever workspace needs the special configuration), set the indexingConfiguration parameter to point to your file. If you put the custom configuration file in the workspace home directory, it would look like this:

<param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>

Then you need to create an analyzer which uses the HTML strip filter (HTMLStripCharFilter). Here I overrode initReader() to wrap the reader in that filter.

package info.magnolia.jackrabbit.lucene.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class HTMLStripCharAnalyzer extends ReusableAnalyzerBase {
	
	private final Version matchVersion;

	public HTMLStripCharAnalyzer(Version matchVersion) {
		this.matchVersion = matchVersion;
	}
	
	@Override
	protected Reader initReader(Reader reader) {
		return super.initReader(new HTMLStripCharFilter(CharReader.get(reader)));
	}

	@Override
	protected TokenStreamComponents createComponents(String fieldName,
			Reader reader) {
		// The reader has already been HTML-stripped by initReader(); split the remaining text on whitespace.
		return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
	}
}

Be sure to reindex all involved workspaces.

Comment by Richard Gange [ 19/Apr/16 ]

A side effect of the workaround is that any property that was filtered will now have the highlighting thrown off. Other fields are fine.

Comment by Richard Gange [ 19/Apr/16 ]

IMO there is no reason to fix this issue. The problem at its core is storing HTML in the content. This is not something every customer does. Those that choose to do it have the option to configure how the data is indexed. But we should not add additional overhead to the search process to look for unclosed HTML that may or may not exist.

Options:

  1. The method described above. Special indexing configuration for fields which store HTML.
  2. Extend SearchHTMLExcerpt and use JSoup to remove HTML from the excerpt. This would remove the highlighting as well.
  3. Don't store HTML in content.

Comment by Christian Ringele [ 13/Sep/16 ]

Re-opened issue because:
CKEditor always stores HTML in the content.
And the problem in the search results was caused by CKEditor-generated content.

CKEditor stores ULs and LIs, which is valid, default vanilla Magnolia behavior.

Comment by Richard Gange [ 30/May/17 ]

There was a nice workaround mentioned for this issue on: https://documentation.magnolia-cms.com/display/DOCS/Search

We are using the following FreeMarker code to filter out HTML tags in the excerpt. Otherwise the HTML can break due to unclosed tags.

${item.excerpt?replace('<[^>]*>', '', 'r')!}
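
The same regular expression can also be applied in Java, for example in a model class that prepares the excerpt before it reaches the template. The sketch below is an illustration only: the class ExcerptTagStripper and its methods are hypothetical names, and the second method assumes that the highlighting is rendered as <strong>...</strong>, which should be verified against the actual excerpt output.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Magnolia.
final class ExcerptTagStripper {

    // Same pattern as in the FreeMarker expression: any "<...>" run.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // Variant that skips <strong> and </strong>. ASSUMPTION: the search
    // term is highlighted with <strong>; adjust if the markup differs.
    private static final Pattern NON_HIGHLIGHT_TAG =
            Pattern.compile("<(?!/?strong\\s*>)[^>]*>", Pattern.CASE_INSENSITIVE);

    // Removes every tag, including the highlighting.
    static String stripAllTags(String excerpt) {
        return excerpt == null ? "" : ANY_TAG.matcher(excerpt).replaceAll("");
    }

    // Removes every tag except the assumed highlight markup.
    static String stripTagsKeepHighlight(String excerpt) {
        return excerpt == null ? "" : NON_HIGHLIGHT_TAG.matcher(excerpt).replaceAll("");
    }
}
```

With the input "</li><li>find the <strong>term</strong> here<ul>", stripAllTags returns "find the term here" and stripTagsKeepHighlight returns "find the <strong>term</strong> here". Like the FreeMarker version, this is a plain pattern match and not an HTML parser, so a ">" character inside an attribute value would not be handled correctly.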
Generated at Mon Feb 12 04:15:34 CET 2024 using Jira 9.4.2#940002-sha1:46d1a51de284217efdcb32434eab47a99af2938b.