[MAGNOLIA-6548] SearchHTMLExcerpt: Should also check for HTML end tags and unclosed opening tags and remove them Created: 15/Feb/16 Updated: 30/Aug/21 |
|
| Status: | Accepted |
| Project: | Magnolia |
| Component/s: | None |
| Affects Version/s: | 6.2 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Christian Ringele | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | support |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Acceptance criteria: |
Empty
|
| Task DoD: |
[ ]*
Doc/release notes changes? Comment present?
[ ]*
Downstream builds green?
[ ]*
Solution information and context easily available?
[ ]*
Tests
[ ]*
FixVersion filled and not yet released
[ ] 
Architecture Decision Record (ADR)
|
| Date of First Response: | |
| Description |
|
When running a JCR query, each row found contains an excerpt snippet with the searched term highlighted, plus the text before and after the found term. Problem: Solution: |
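The cleanup asked for in the issue title can be sketched as below. This is only an illustration under stated assumptions, not the actual SearchHTMLExcerpt code: the class name ExcerptTagCleaner and the method clean() are hypothetical, the excerpt is assumed to be a plain string, and only complete tags are handled (a tag cut in half at the excerpt boundary is left alone). End tags with no matching opening tag, and opening tags that are never closed, are removed; balanced pairs such as the highlighting tags are kept.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: strips unbalanced HTML tags from an excerpt snippet. */
public final class ExcerptTagCleaner {

    // A complete tag: group 1 is "/" for end tags, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>");

    // Elements that never have an end tag (deliberately short list for the sketch).
    private static final Set<String> VOID_TAGS =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input"));

    private ExcerptTagCleaner() {
    }

    public static String clean(String excerpt) {
        Deque<int[]> openSpans = new ArrayDeque<>();   // {start, end} of open tags
        Deque<String> openNames = new ArrayDeque<>();  // names of the same tags
        List<int[]> toRemove = new ArrayList<>();

        Matcher m = TAG.matcher(excerpt);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean selfClosing = excerpt.charAt(m.end() - 2) == '/';
            if (VOID_TAGS.contains(name) || selfClosing) {
                continue; // nothing to balance
            }
            int[] span = {m.start(), m.end()};
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it until its end tag shows up.
                openSpans.push(span);
                openNames.push(name);
            } else if (name.equals(openNames.peek())) {
                // End tag closing the innermost open tag: a balanced pair.
                openSpans.pop();
                openNames.pop();
            } else {
                // End tag that does not close the innermost open tag.
                toRemove.add(span);
            }
        }
        // Whatever is still open was never closed inside the excerpt.
        toRemove.addAll(openSpans);

        // Delete from the back so earlier offsets stay valid.
        toRemove.sort((a, b) -> Integer.compare(b[0], a[0]));
        StringBuilder result = new StringBuilder(excerpt);
        for (int[] span : toRemove) {
            result.delete(span[0], span[1]);
        }
        return result.toString();
    }
}
```

For an excerpt cut out of the middle of a list, such as one</li><li>two <strong>term</strong> three</li><li>four, the leading </li> and the trailing <li> are dropped while the complete <li> element and the highlighting stay intact. The matching is intentionally simple: an end tag that does not close the innermost open tag is treated as unmatched, so badly nested markup loses more tags than strictly necessary.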
| Comments |
| Comment by Richard Gange [ 19/Apr/16 ] |
|
The best way I can see to change this behavior is to identify the names of all the properties that store, or might store, HTML. For example, the HTML component usually stores its content in a property called editHTML.

Next, create a custom indexing configuration file. In that file, specify a custom analyzer for the properties identified in the first step. See http://wiki.apache.org/jackrabbit/IndexingConfiguration. It's probably a good idea to start with a copy of the one provided in the core module and then add this configuration as many times as needed, one for each property that needs the special analyzer.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
    <analyzer class="info.magnolia.jackrabbit.lucene.analysis.HTMLStripCharAnalyzer">
      <property>editHtml</property>
    </analyzer>
  </analyzers>
</configuration>

This part of the configuration defines how a property is analyzed. If a property has an analyzer configured, that analyzer is used both for indexing and for searching the property.

In the workspace.xml for the website workspace (or whichever workspace needs the special configuration), set indexingConfiguration to point to your file. If you put the custom configuration in the workspace, it looks like this:

<param name="indexingConfiguration" value="${wsp.home}/indexing_configuration.xml"/>

Then create an analyzer that uses the HTML strip filter. Here I overrode initReader() to use a filter.

package info.magnolia.jackrabbit.lucene.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class HTMLStripCharAnalyzer extends ReusableAnalyzerBase {

    private final Version matchVersion;

    public HTMLStripCharAnalyzer(Version matchVersion) {
        this.matchVersion = matchVersion;
    }

    // Wrap the incoming reader so HTML markup is stripped before tokenizing.
    @Override
    protected Reader initReader(Reader reader) {
        return super.initReader(new HTMLStripCharFilter(CharReader.get(reader)));
    }

    // Tokenize the stripped text on whitespace; no further token filters are applied.
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return new TokenStreamComponents(new WhitespaceTokenizer(matchVersion, reader));
    }
}

Be sure to reindex all involved workspaces. |
| Comment by Richard Gange [ 19/Apr/16 ] |
|
A side effect of the workaround is that any property that was filtered will now have the highlighting thrown off. Other fields are fine.
|
| Comment by Richard Gange [ 19/Apr/16 ] |
|
IMO there is no reason to fix this issue. The problem at its core is storing HTML in the content. This is not something every customer does. Those who choose to do it have the option to configure how the data is indexed. But we should not add overhead to the search process to look for unclosed HTML that may or may not exist. Options:
|
| Comment by Christian Ringele [ 13/Sep/16 ] |
|
Re-opened issue because: CKEditor stores ULs and LIs, which is valid and default vanilla Magnolia behavior. |
| Comment by Richard Gange [ 30/May/17 ] |
|
There was a nice workaround mentioned for this issue on: https://documentation.magnolia-cms.com/display/DOCS/Search
|