[MGNLEESOLR-51] Crawler breaks if fieldMapping selector using attr(0,content) cannot find any content Created: 11/Mar/15  Updated: 24/Apr/15  Resolved: 14/Apr/15

Status: Closed
Project: Solr Search Provider
Component/s: None
Affects Version/s: 2.0
Fix Version/s: 2.2

Type: Bug Priority: Neutral
Reporter: Edgar Vonk Assignee: Milan Divilek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Template:
Acceptance criteria:
Empty
Task DoD:
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ]  Architecture Decision Record (ADR)
Bug DoR:
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
Date of First Response:

 Description   

We ran into the following issue while using a website crawler in the new Solr module with a field mapping that uses the selector:

meta[name=description] attr(0,content)

The issue is: the crawler breaks if the HTML page in question has no 'description' meta tag, failing with the following error in the logs:

2015-03-10 14:00:00,427 ERROR edu.uci.ics.crawler4j.crawler.WebCrawler          : Index: 0, Size: 0, while processing: [..]
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.rangeCheck(ArrayList.java:653)
	at java.util.ArrayList.get(ArrayList.java:429)
	at org.jsoup.select.Elements.get(Elements.java:523)
	at info.magnolia.module.indexer.crawler.MgnlCrawler.treatFieldMappings(MgnlCrawler.java:142)
	at info.magnolia.module.indexer.crawler.MgnlCrawler.visit(MgnlCrawler.java:109)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:306)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:189)
	at java.lang.Thread.run(Thread.java:745)

As a result, no content at all is indexed for the page in question.

We do want to use the 'description' meta tag but cannot guarantee that it is available on all pages.

Is it possible to make the MgnlCrawler more robust so that it does not break in this scenario?
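
For illustration, here is a minimal, self-contained jsoup sketch (hypothetical code, not taken from MgnlCrawler) that reproduces the failure when the 'description' meta tag is missing:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MissingMetaRepro {
    public static void main(String[] args) {
        // A page without a 'description' meta tag
        Document document = Jsoup.parse("<html><head><title>t</title></head><body></body></html>");
        Elements elements = document.select("meta[name=description]");
        // elements is empty, so get(0) throws java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        String text = elements.get(0).attr("content");
        System.out.println(text);
    }
}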



 Comments   
Comment by Edgar Vonk [ 12/Mar/15 ]

The fix should be quite easy. There should be a check to see if there are any elements before doing the '.get'. Something like:

elements = document.select(entry.getValue().substring(0, matcher.start() - 1));
int index = Integer.parseInt(matcher.group(1));
// Only call get() when the selector matched an element at the requested index
if (index < elements.size()) {
    text = elements.get(index).attr(matcher.group(2));
}
Comment by Edgar Vonk [ 12/Mar/15 ]

A little more code would be required, I guess: check whether the elements list contains an element at the index in question before retrieving it, or catch the IndexOutOfBoundsException.
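
For example, a minimal sketch of the catch variant (the variable names and surrounding context are assumed, not copied from MgnlCrawler):

String text = "";
try {
    text = elements.get(Integer.parseInt(matcher.group(1))).attr(matcher.group(2));
} catch (IndexOutOfBoundsException e) {
    // The selector matched no element at the requested index (e.g. missing 'description' meta tag);
    // leave the field empty instead of aborting indexing of the whole page.
}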

Comment by Jan Haderka [ 13/Apr/15 ]

Test?
