[MGNLEESOLR-51] Crawler breaks if fieldMapping selector using attr(0,content) cannot find any content Created: 11/Mar/15 Updated: 24/Apr/15 Resolved: 14/Apr/15 |
|
| Status: | Closed |
| Project: | Solr Search Provider |
| Component/s: | None |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.2 |
| Type: | Bug | Priority: | Neutral |
| Reporter: | Edgar Vonk | Assignee: | Milan Divilek |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Template: |
|
| Acceptance criteria: |
Empty
|
| Task DoD: |
[ ]* Doc/release notes changes? Comment present?
[ ]* Downstream builds green?
[ ]* Solution information and context easily available?
[ ]* Tests
[ ]* FixVersion filled and not yet released
[ ] Architecture Decision Record (ADR)
|
| Bug DoR: |
[ ]* Steps to reproduce, expected, and actual results filled
[ ]* Affected version filled
|
| Date of First Response: |
| Description |
|
We ran into the following issue while using a website crawler in the new Solr module with a field mapping that uses a selector: meta[name=description] attr(0,content)

The crawler breaks if the HTML page in question has no 'description' meta tag, with the following error in the logs:

2015-03-10 14:00:00,427 ERROR edu.uci.ics.crawler4j.crawler.WebCrawler : Index: 0, Size: 0, while processing: [..]
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.jsoup.select.Elements.get(Elements.java:523)
at info.magnolia.module.indexer.crawler.MgnlCrawler.treatFieldMappings(MgnlCrawler.java:142)
at info.magnolia.module.indexer.crawler.MgnlCrawler.visit(MgnlCrawler.java:109)
at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:306)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:189)
at java.lang.Thread.run(Thread.java:745)

As a result, no content at all is indexed for that page. We do want to use the 'description' meta tag but cannot guarantee that it is present on all pages. Is it possible to make the MgnlCrawler more robust so that it does not break in this scenario? |
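For context, a minimal standalone sketch that reproduces the underlying jsoup behaviour behind this stack trace. It is not part of the module; the class name and sample HTML are made up for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MissingDescriptionRepro {

    public static void main(String[] args) {
        // A page without any <meta name="description"> tag.
        Document document = Jsoup.parse(
                "<html><head><title>Some page</title></head><body>Hello</body></html>");

        // The selector part of the field mapping: meta[name=description]
        Elements elements = document.select("meta[name=description]");

        // attr(0,content) translates into get(0).attr("content"); with an empty
        // result set this throws java.lang.IndexOutOfBoundsException: Index: 0, Size: 0,
        // which is the exception seen in the crawler log.
        String description = elements.get(0).attr("content");
        System.out.println(description);
    }
}
```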
| Comments |
| Comment by Edgar Vonk [ 12/Mar/15 ] |
|
The fix should be quite easy: check whether the selector matched any elements before calling '.get'. Something like:

elements = document.select(entry.getValue().substring(0, matcher.start() - 1));
if (elements.size() > 0) {
    text = elements.get(Integer.parseInt(matcher.group(1))).attr(matcher.group(2));
} |
| Comment by Edgar Vonk [ 12/Mar/15 ] |
|
A little more code would be required, I guess: check whether the elements list contains an element at the index in question before retrieving it, or catch the IndexOutOfBoundsException. |
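A minimal sketch of that idea, assuming a hypothetical helper around the selection logic (the class name, method name, and signature are made up; only the select/get/attr calls mirror the snippet above):

```java
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public final class FieldMappingExtractor {

    /**
     * Hypothetical helper: returns the attribute of the element at the given index,
     * or null when the selector matches fewer elements than the index requires.
     */
    public static String extractAttr(Document document, String cssQuery, int index, String attrName) {
        Elements elements = document.select(cssQuery);
        if (index < elements.size()) {
            return elements.get(index).attr(attrName);
        }
        // Element missing (e.g. no <meta name="description"> on the page):
        // skip this field mapping instead of letting IndexOutOfBoundsException
        // abort indexing of the whole page.
        return null;
    }
}
```

With a guard like this the crawler could still index the rest of the page content even when the 'description' meta tag is absent.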
| Comment by Jan Haderka [ 13/Apr/15 ] |
|
Test? |