  1. Solr Search Provider
  2. MGNLEESOLR-51

Crawler breaks if fieldMapping selector using attr(0,content) cannot find any content

Details

    • Bug
    • Resolution: Fixed
    • Neutral
    • 2.2
    • 2.0
    • None

    Description

      We ran into the following issue while using a website crawler in the new Solr module with a field mapping that uses this selector:

      meta[name=description] attr(0,content)

      The crawler breaks if the HTML page in question has no 'description' meta tag, and the following error appears in the logs:

      2015-03-10 14:00:00,427 ERROR edu.uci.ics.crawler4j.crawler.WebCrawler          : Index: 0, Size: 0, while processing: [..]
      java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
      	at java.util.ArrayList.rangeCheck(ArrayList.java:653)
      	at java.util.ArrayList.get(ArrayList.java:429)
      	at org.jsoup.select.Elements.get(Elements.java:523)
      	at info.magnolia.module.indexer.crawler.MgnlCrawler.treatFieldMappings(MgnlCrawler.java:142)
      	at info.magnolia.module.indexer.crawler.MgnlCrawler.visit(MgnlCrawler.java:109)
      	at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:306)
      	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:189)
      	at java.lang.Thread.run(Thread.java:745)
      

      As a result, no content at all is indexed for the page in question.

      We do want to use the 'description' meta tag but cannot guarantee that it is available on all pages.

      Is it possible to make MgnlCrawler more robust so that it does not break in this scenario?
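
      To illustrate the kind of guard we have in mind, here is a minimal sketch. It is not the actual MgnlCrawler code: the class SafeFieldMapping and the method attrAt are hypothetical names, and each matched element is modelled as a plain Map of attribute names to values so that the example needs no jsoup dependency. In the real crawler the list would be the jsoup Elements returned by the selector. The point is only to check the index against the list size before calling get(), which is where the stack trace above fails.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of MgnlCrawler or jsoup.
final class SafeFieldMapping {

    private SafeFieldMapping() {}

    /**
     * Returns the value of attribute {@code attr} on the match at {@code index},
     * or an empty Optional if the selector matched nothing at that index or the
     * attribute is absent. Each match is modelled as a Map of attribute name to
     * value (an assumption made to keep the sketch self-contained).
     */
    static Optional<String> attrAt(List<Map<String, String>> matches, int index, String attr) {
        // Guard against the IndexOutOfBoundsException seen in the logs:
        // only call get(index) when the index is within the list bounds.
        if (matches == null || index < 0 || index >= matches.size()) {
            return Optional.empty();
        }
        Map<String, String> match = matches.get(index);
        if (match == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(match.get(attr));
    }
}
```

      With a guard like this, a page without the 'description' meta tag would simply yield no value for that field, and the remaining field mappings and the page content could still be indexed.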

            People

              mdivilek Milan Divilek
              edgar Edgar Vonk